Taking LabVIEW beyond the CPU to FPGAs and GPUs
LabVIEW developers often hit performance walls when their applications grow beyond what traditional CPU-based processing can handle. Whether it’s real-time signal processing that needs sub-microsecond response times, image analysis requiring massive parallel computation, or a data acquisition system that must process streaming data faster than it arrives - the solution isn’t always obvious. Expanding your application beyond the CPU may be appropriate if you are running into limitations with:
- Timing: Your application requires more precise and reliable timing than Windows can offer
- Throughput: A CPU can process your data, but it is prohibitively slow for your required sample rates or real-time constraints
- Efficiency: Datasets could be filtered, compressed, or preprocessed if only you had the computational resources to perform these operations in real time
- Scalability: Your algorithm works perfectly for small datasets but cannot be scaled to production volumes
If any of these sound familiar, accelerated computing with FPGAs or GPUs might be the solution - but choosing the right approach requires understanding your application’s specific requirements, the strengths of each technology, and where you are in the development process.
Decision Framework: CPU vs GPU vs FPGA
When deciding how and where to offload work from the CPU to achieve better performance, first consider your application’s critical performance requirements.
The sections below compare FPGA, GPU, and CPU across the requirements that drive the choice: timing, data characteristics, and algorithmic complexity.
Timing
Timing requirements vary widely, from nothing more than keeping the user experience smooth while post-processing data to meeting safety-critical real-time deadlines.
- FPGA: The algorithm is programmed into the very fabric of the FPGA, resulting in hard real-time performance with deterministic timing and nanosecond precision that surpasses even LabVIEW Real-Time.
- GPU: Latency can vary due to driver layers and the overhead of transferring data and scheduling tasks.
- CPU: Varies by operating system. Windows cannot guarantee sub-100 ms timing, while NI’s Linux Real-Time achieves millisecond precision when properly implemented.
Data Characteristics
At each stage of your data processing, consider the structure, size, and representation of the data.
- FPGA: Excels at processing continuous streams. The data in flight at any given moment is small, but throughput can reach several gigabytes per second. FPGAs execute logic in parallel, allowing for multitasking or for pipelining data to increase execution rate. FPGAs work best when signal magnitudes are well bounded, allowing integer and fixed-point math; floating-point logic quickly exhausts FPGA fabric.
- GPU: Excels at processing large batches of data or manipulating vast arrays, as in image processing or machine learning tasks. GPUs can handle integer, fixed-point, or single-precision floating-point data. Using the default LabVIEW datatype of double-precision floating point (DBL or float64) will incur a huge performance penalty.
- CPU: Can handle mixed or unpredictable data patterns, and any data representation. Naturally, this is the best place to start developing, then offload processing steps as appropriate.
Algorithmic Complexity
At each stage of your data processing, look at the types of operations being performed.
- FPGA: Simple, repetitive operations such as digital filters, peak or edge detection, and basic math applied to relatively small chunks of data, especially single samples. Avoid complex branching logic and large lookup tables.
- GPU: Look for what are known as “embarrassingly parallel” problems: moderately complex operations performed across many datasets, where the processing of one dataset does not depend on the output of another. In other words, the datasets can be processed in parallel (see the sketch after this list).
- CPU: Sequential algorithms, complex decision trees, or operations with memory access patterns that don’t benefit from parallelization.
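To make “embarrassingly parallel” concrete, here is a minimal, self-contained CUDA sketch (the kernel, threshold level, and synthetic test image are illustrative, not from any particular application). Every pixel gets its own thread, and no thread depends on another’s result:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each thread thresholds one pixel independently - the defining trait of an
// embarrassingly parallel problem.
__global__ void Threshold(const unsigned char* in, unsigned char* out,
                          int n, unsigned char level)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                          // guard threads past the end of the image
        out[i] = (in[i] > level) ? 255 : 0;
}

int main()
{
    const int n = 1 << 20;              // one megapixel of 8-bit data
    unsigned char* h = new unsigned char[n];
    for (int i = 0; i < n; ++i) h[i] = (unsigned char)(i % 256);  // synthetic image

    unsigned char *d_in, *d_out;
    cudaMalloc((void**)&d_in, n);
    cudaMalloc((void**)&d_out, n);
    cudaMemcpy(d_in, h, n, cudaMemcpyHostToDevice);

    // Launch roughly one thread per pixel; they all run concurrently.
    Threshold<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 128);

    cudaMemcpy(h, d_out, n, cudaMemcpyDeviceToHost);
    printf("pixels 0, 127, 200 -> %d %d %d\n", h[0], h[127], h[200]);  // 0 0 255

    cudaFree(d_in);
    cudaFree(d_out);
    delete[] h;
    return 0;
}
```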
FPGA Implementation
If sections of the data processing are candidates for moving to an FPGA, the next step is to port the logic to code that can be deployed to an FPGA, and then select the appropriate hardware. Of course, an experienced FPGA developer could write Verilog or VHDL that runs on an off-the-shelf FPGA system-on-module, and allow LabVIEW to communicate through a TCP connection. However, developing within the NI/LabVIEW ecosystem offers rapid development and time to delivery without needing to hire specialists.
Software
Moving your LabVIEW code onto an FPGA is as simple as installing the LabVIEW FPGA Module and learning a few basics about how to write FPGA code.
- Install the LabVIEW FPGA Module (which requires a license fee) and the necessary NI hardware drivers, such as NI-RIO.
- Work from template projects to write and deploy FPGA code within the LabVIEW development environment. FPGA code can be compiled locally or on the NI compile farm, taking 10 minutes to several hours depending on complexity.
- Code can be tested in simulation if you don’t have hardware available.
Hardware
LabVIEW FPGA is only supported on NI hardware, but the rapid, easy development is often worth the price of admission, particularly for low-volume or R&D applications. NI hardware options vary depending on price and performance needs.
- Single-Board RIO (sbRIO): An affordable, compact board that combines a Linux-RT processor, an FPGA, and limited I/O - all programmable through the LabVIEW FPGA Module. A range of models is available, but sbRIOs typically support 4-32 I/O channels at rates under 500 kS/s.
- CompactRIO (cRIO): A rugged, modular I/O platform that runs Linux-RT and has an integrated FPGA that links easily to modular I/O cards plugged into the chassis slots. cRIO modules typically support 4-32 channels each at acquisition rates below 1 MS/s, and a variety of specialized cards is available for reading temperature, pressure, strain, and more.
- PXI: PXI platforms offer the highest performance but require dedicated FPGA cards, including FlexRIO boards, reconfigurable oscilloscopes, and high-speed digitizers. Some high-performance reconfigurable I/O cards are also available in PCIe or standalone USB form factors. Channel counts run 4-32 per card, with acquisition rates beyond 1 GS/s.
GPU Implementation
GPU Software
From LabVIEW, there are a variety of ways to offload CPU processing to a GPU - you can run a separate service in Python, or use a cloud service and send work via REST APIs. Here, we’ll look at two approaches to accelerating your application with a GPU: a quick, convenient approach that stays within the LabVIEW IDE, and an advanced, comprehensive approach.
LabVIEW GPU Libraries
All LabVIEW users are familiar with the built-in LabVIEW primitive functions for array manipulation and basic mathematics, which execute on a CPU. JKI can help you identify libraries that let you code intuitively in LabVIEW while the work is deployed behind the scenes to a GPU. This approach works best if your algorithm can be broken down into a few common operations of array manipulation, linear algebra, and arithmetic.
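Although no specific toolkit is named here, libraries of this kind typically work by wrapping NVIDIA’s GPU math libraries behind ordinary LabVIEW VIs. As a rough illustration (the function name AxpyOnGpu is hypothetical), here is the kind of thin cuBLAS wrapper such a library hides from you - you drop a VI on the diagram, and code like this runs underneath:

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical wrapper: computes y = alpha*x + y on the GPU via cuBLAS.
// A GPU toolkit's VI would make a call like this on your behalf.
extern "C" __declspec(dllexport)
int AxpyOnGpu(float alpha, const float* x, float* y, int n)
{
    float *d_x = nullptr, *d_y = nullptr;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void**)&d_x, bytes);
    cudaMalloc((void**)&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // y = alpha*x + y
    cublasDestroy(handle);

    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return (int)cudaGetLastError();                  // 0 on success
}
```

The pattern is always the same: copy data to the GPU, call a tuned library routine, copy results back.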
Custom CUDA Code
To have full, unbridled use of a GPU, a developer can write CUDA code, deploy it to the GPU, and communicate with the compiled DLL through a LabVIEW Call Library Function Node. In this way, anything that can be thought of can be computed on the GPU and accessed through LabVIEW. This approach is recommended if your application requires operations that are not supported by existing LabVIEW toolkits, or if additional performance gains are required. This route takes more time and experience, but no license fees apply.
- First, install the CUDA Toolkit and an NVIDIA driver compatible with your GPU and CUDA Toolkit version.
- Install Visual Studio, which, in addition to being a development environment for CUDA code, provides the host compiler the CUDA Toolkit requires.
- Develop and build your CUDA code as a DLL.
- Use the LabVIEW Call Library Function Node to execute your code on the GPU (a sketch follows below).
You can think of CUDA code as C/C++ with some extra CUDA-specific syntax and functions. Your CUDA instructions are deployed simultaneously to thousands of small processing units called CUDA cores. Each core executes the same instructions on a different subset of the data, much as a for loop iterates over a range of data - except in CUDA, many iterations happen simultaneously.
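Putting those steps together, here is a minimal sketch of such a DLL (the file name, kernel, and ScaleArray entry point are illustrative placeholders; a real application would substitute its own processing):

```cpp
// gpu_scale.cu - build on Windows with, e.g.:  nvcc -shared -o gpu_scale.dll gpu_scale.cu
#include <cuda_runtime.h>

// Each CUDA core runs this same function on a different array element, like
// one iteration of a LabVIEW for loop - but thousands execute at once.
__global__ void ScaleKernel(const float* in, float* out, float gain, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                                     // guard threads past the end of the array
        out[i] = in[i] * gain;
}

// Plain C export so the LabVIEW Call Library Function Node can bind to it.
// LabVIEW passes the arrays as pointers to single-precision (SGL) data.
extern "C" __declspec(dllexport)
int ScaleArray(const float* in, float* out, float gain, int n)
{
    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, in, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;      // enough blocks to cover all n samples
    ScaleKernel<<<blocks, threads>>>(d_in, d_out, gain, n);

    cudaMemcpy(out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return (int)cudaGetLastError();                // 0 on success, for error checking in LabVIEW
}
```

In the Call Library Function Node, you would then configure ScaleArray with the arrays passed as array data pointers and the return value as a signed 32-bit integer.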
GPU Hardware
To get started, all you need is a CUDA-compatible GPU. Several form factors are available:
- PCIe GPU: For maximum GPU performance, installing a CUDA-compatible GPU in a computer’s PCIe slot is a versatile, straightforward way to get started, and allows the latest GPUs to be installed as soon as they hit the market.
- eGPU: If the PC has no PCIe slots left, or if the application runs on a PXIe system, the GPU can be installed in an eGPU chassis and connected over Thunderbolt. Check whether your PXIe controller or PC has a Thunderbolt port. The convenience of eGPUs tends to come at the expense of latency and bandwidth.
- PXIe GPU: For the lowest latency and highest bandwidth on a PXIe system, some manufacturers offer GPU cards that insert into a PXIe chassis slot. This is ideal for applications that require real-time decisions based on GPU-processed data.
Always check the power and cooling requirements of your GPU card against what the power supply and chassis can provide.
Practical Example
Refactoring to incorporate FPGAs and GPUs
The example below presents a simple producer-consumer architecture. Raw data is acquired and sent to a consumer loop, where it is filtered and an FFT is performed; the result is then presented on the UI. How might the dataflow be refactored to push processing off the CPU?
Supposing the filtered data were required for time-critical decision making, it would be appropriate to move acquisition and filtering to an FPGA - digital filters require very little logic and memory, so they are a perfect fit. Meanwhile, if the FFT were performed on very large arrays and needed only for post-processing, the application could be accelerated by offloading it to a GPU (a sketch of the GPU side follows below). Now the CPU is only needed for user interaction and managing the data transport.
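As a hedged sketch of that GPU offload (the entry point GpuFft and its data layout are illustrative assumptions; cuFFT and its calls are NVIDIA’s real library API), the consumer loop’s FFT could be replaced with a Call Library Function Node invoking something like:

```cpp
// gpu_fft.cu - build with, e.g.:  nvcc -shared -o gpu_fft.dll gpu_fft.cu -lcufft
#include <cuda_runtime.h>
#include <cufft.h>

// in:  n real samples (SGL from LabVIEW)
// out: n/2 + 1 complex bins, interleaved re/im, so 2*(n/2 + 1) SGL values
extern "C" __declspec(dllexport)
int GpuFft(const float* in, float* out, int n)
{
    cufftReal* d_in = nullptr;
    cufftComplex* d_out = nullptr;
    cudaMalloc((void**)&d_in, n * sizeof(cufftReal));
    cudaMalloc((void**)&d_out, (n / 2 + 1) * sizeof(cufftComplex));
    cudaMemcpy(d_in, in, n * sizeof(cufftReal), cudaMemcpyHostToDevice);

    // Plan and run a 1D real-to-complex transform entirely on the device.
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_R2C, 1);
    cufftExecR2C(plan, d_in, d_out);
    cufftDestroy(plan);

    cudaMemcpy(out, d_out, (n / 2 + 1) * sizeof(cufftComplex),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return (int)cudaGetLastError();                // 0 on success
}
```

In a real consumer loop you would create the cuFFT plan once and reuse it across iterations, rather than re-planning on every call.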
Conclusion
Of course, any combination of GPUs and FPGAs can be incorporated into an application. Initial testing with some of the methods above may reveal which parts of your application would benefit most from being offloaded from the CPU, allowing a more mature design to emerge.
Ready to Talk to JKI?
JKI brings years of specialized experience working with every NI platform, GPUs, and FPGAs. We can help you find the hardware you need and develop your application to the edge of what’s possible.
Enjoyed the article? Leave us a comment