
Using oneAPI for Low Latency AI Inference with Altera® FPGAs

By Duncan Mackay

Hyperdimensional Computing (HDC) is a machine learning method inspired by the brain's high-dimensional and distributed representation of information. HDC is based on the concept of hypervectors, and like other AI methods, HDC uses training and inference steps in its implementation. 

HDC Hypervectors, Training and Inference 

HDC represents data as large vectors called hypervectors. Using basic HDC operations (similarity, bundling, binding, and permutation), hypervectors can be compared for similarities and differences, and an encoding system can be developed.  
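
To make these primitives concrete, here is a minimal C++ sketch of the four operations on binary hypervectors. The dimensionality, the binary element type, and the function names are illustrative assumptions; the FPGA design discussed later uses 32-bit floating-point hypervector elements.

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr int D = 10000;                     // dimensionality (assumed)
using Hypervector = std::array<uint8_t, D>;  // one bit per element

// Binding: element-wise XOR associates two hypervectors.
Hypervector bind(const Hypervector& a, const Hypervector& b) {
    Hypervector out{};
    for (int i = 0; i < D; ++i) out[i] = a[i] ^ b[i];
    return out;
}

// Bundling: element-wise majority vote superimposes a set of hypervectors.
Hypervector bundle(const std::vector<Hypervector>& vs) {
    Hypervector out{};
    for (int i = 0; i < D; ++i) {
        int ones = 0;
        for (const auto& v : vs) ones += v[i];
        out[i] = (2 * ones > static_cast<int>(vs.size())) ? 1 : 0;
    }
    return out;
}

// Permutation: cyclic shift, used to encode sequence position.
Hypervector permute(const Hypervector& a, int shift) {
    Hypervector out{};
    for (int i = 0; i < D; ++i) out[(i + shift) % D] = a[i];
    return out;
}

// Similarity: normalized Hamming similarity in [0, 1].
double similarity(const Hypervector& a, const Hypervector& b) {
    int match = 0;
    for (int i = 0; i < D; ++i) match += (a[i] == b[i]);
    return static_cast<double>(match) / D;
}
```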

[Figures 1 and 2: HDC hypervectors, training, and inference]

  • Training: During training, input feature vectors are converted to hypervectors, which are compared and grouped into classes based on their similarity. Finally, the hypervectors within each class are bundled to create a single class hypervector representing all of the class's elements. These class hypervectors form the basis of an HDC classification system. 
  • Inference: Once training has produced class hypervectors from a series of input feature vectors, each new feature vector is converted to a hypervector in the same way and compared against the existing class hypervectors to find the closest match. The result is a fast, efficient, and robust classification system (see the sketch after this list). 
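
A minimal sketch of these training and inference steps, building on the primitives from the previous sketch (`Hypervector`, `bundle`, `similarity`); the `encode` function stands in for an application-specific feature encoder and is an assumption, not taken from the article's design.

```cpp
#include <vector>

// Assumed application-specific encoder: maps a feature vector to a
// hypervector. Its definition depends on the application.
Hypervector encode(const std::vector<float>& features);

// Training: bundle the encoded examples of each class into one class
// hypervector.
std::vector<Hypervector> train(const std::vector<std::vector<float>>& samples,
                               const std::vector<int>& labels,
                               int num_classes) {
    std::vector<std::vector<Hypervector>> per_class(num_classes);
    for (size_t i = 0; i < samples.size(); ++i)
        per_class[labels[i]].push_back(encode(samples[i]));
    std::vector<Hypervector> classes;
    for (const auto& vs : per_class) classes.push_back(bundle(vs));
    return classes;
}

// Inference: encode the query and pick the most similar class hypervector.
int classify(const std::vector<float>& features,
             const std::vector<Hypervector>& classes) {
    const Hypervector query = encode(features);
    int best = 0;
    double best_sim = -1.0;
    for (size_t c = 0; c < classes.size(); ++c) {
        const double s = similarity(query, classes[c]);
        if (s > best_sim) { best_sim = s; best = static_cast<int>(c); }
    }
    return best;
}
```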

A classification system built on HDC has the following inherent benefits: 

  • Efficiency: Hypervectors allow for efficient and compact representation of complex patterns, making processing and classification tasks more efficient than traditional methods. 
  • Low Power: Due to its simplicity and the binary nature of its operations (for example, XOR, AND), HDC can be implemented in hardware with low power consumption. This is particularly beneficial for wearable devices, IoT devices, and edge computing applications where energy efficiency is crucial. 
  • Highly Parallel:  The distributed nature of HDC allows for parallel processing, akin to the parallel processing capabilities of the human brain. This can significantly speed up computation times for classification tasks. 
  • Fast Learning: HDC can perform one-shot or few-shot learning, where the system learns from very few examples, unlike deep learning models that often require extensive training data. This capability makes HDC highly advantageous in scenarios where data is scarce or rapidly changing. 
  • Robust & Reliable: The high-dimensional nature of HDC makes it inherently robust to noise and errors. Small changes or distortions in the input data do not significantly affect the overall representation, enabling reliable classification even in noisy environments. 

Additionally, manipulating hypervectors involves many repeated, basic operations, making them very amenable to acceleration with hardware platforms.   

HDC applications are efficient, low-power, and highly parallel, and they map well to hardware, making them ideal for implementation on efficient, low-power¹, and highly parallel Altera® FPGAs.  

Using oneAPI and Altera® FPGAs  

The Intel® oneAPI Base Toolkit is a software development toolkit designed to simplify the creation of high-performance, cross-architecture applications. As Intel's implementation of the oneAPI industry standard, the Intel® oneAPI Base Toolkit works across various processors, including CPUs, GPUs, FPGAs, and AI accelerators. 

One of the main benefits of oneAPI is that it simplifies the development process. With oneAPI, developers can create SYCL/C++ applications that run on different architectures without learning different programming languages. Developers write code once and run it on different processors, saving significant time and effort: applications can be developed in software and then implemented on the most efficient and cost-effective platform without any code changes.  

HDC applications written in SYCL/C++ can be implemented directly on Altera FPGAs using the Intel® oneAPI Base Toolkit.  
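
The sketch below illustrates the idea, assuming the standard oneAPI FPGA extension headers and the `FPGA_HARDWARE`/`FPGA_EMULATOR` compile-time macros that Intel's FPGA code samples conventionally use: the device selector is the only line that changes when retargeting, while the kernel stays plain SYCL/C++. The kernel body is a trivial stand-in, not an HDC design.

```cpp
#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>

int main() {
    // Select the target. FPGA_HARDWARE / FPGA_EMULATOR are compile-time
    // macros the developer defines; the kernel below never changes.
#if defined(FPGA_HARDWARE)
    sycl::queue q{sycl::ext::intel::fpga_selector_v};
#elif defined(FPGA_EMULATOR)
    sycl::queue q{sycl::ext::intel::fpga_emulator_selector_v};
#else
    sycl::queue q{sycl::cpu_selector_v};
#endif

    constexpr int D = 2048;
    float* data = sycl::malloc_shared<float>(D, q);
    for (int i = 0; i < D; ++i) data[i] = 1.0f;

    // Trivial stand-in kernel: doubles every element.
    q.single_task([=] {
        for (int i = 0; i < D; ++i) data[i] *= 2.0f;
    }).wait();

    sycl::free(data, q);
    return 0;
}
```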

Details on getting started with oneAPI for FPGAs, including self-start learning videos, tutorial examples, and reference designs, can be found in Boosting Productivity with High-Level Synthesis in Intel® Quartus® with oneAPI. 

HDC Image Classification on Altera FPGAs 

An example of using HDC in a real-world application is image classification. In this example², created by Ian Peitzsch at the Center for Space, High-Performance, and Resilient Computing (SHREC), an HDC image classification system was implemented on an Altera FPGA for training and inference.  

[Figures 2 and 3: HDC training and inference data flow on the FPGA]

The data flow in both cases is similar, but the compute differs. Feature vectors stream from the host to the FPGA; in the training flow, each vector is routed to one of 8 compute units, while in the inference flow, each vector is routed to one of 25 compute units. 

  • Training: 
    • Feature vectors stream from the host to the FPGA. 
    • Each vector is routed to parallel encode compute units (CUs).  
    • The encoded partial hypervectors output by the CUs are then bundled, and a label is generated. 
    • The hypervectors are bundled into the class corresponding to the label. 
    • After all the training data is processed, the class hypervectors are streamed to the host for normalization. 
  • Inference:   
    • Feature vectors stream from the host to the FPGA. 
    • Each vector is routed to parallel encode CUs.  
    • The data is piped to the classification kernel to form a single hypervector.  
    • This hypervector is compared to each class hypervector, and the class with the highest similarity is selected as the prediction.  
    • The prediction is then streamed back to the host.  
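
As a rough illustration of this streaming structure, the following sketch connects one encode kernel to a classification kernel through an on-chip FPGA pipe. All names, sizes, and the encoding itself are assumptions; the real design replicates the encode kernel into 8 or 25 parallel CUs, each feeding its own pipe.

```cpp
#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>

// On-chip pipe carrying encoded hypervector elements from the encode CU
// to the classification kernel.
class EncToClsID;
using EncToCls = sycl::ext::intel::pipe<EncToClsID, float, 64>;

constexpr int D = 2048;        // hypervector length (assumed)
constexpr int kFeatures = 64;  // feature vector length (assumed)
constexpr int kClasses = 10;   // number of classes (assumed)

// features, class_hvs, and prediction are assumed to be USM allocations.
void run(sycl::queue& q, const float* features, const float* class_hvs,
         int* prediction) {
    // Encode CU: turns the feature vector into hypervector elements and
    // streams them into the pipe. (The real design replicates this kernel
    // into many parallel CUs, each with its own pipe.)
    q.single_task<class Encode>([=] {
        for (int d = 0; d < D; ++d) {
            float elem = 0.0f;
            // Stand-in encoding: a weighted sum of the features.
            for (int f = 0; f < kFeatures; ++f)
                elem += features[f] * static_cast<float>(d + f);
            EncToCls::write(elem);
        }
    });

    // Classification kernel: reads the hypervector off the pipe and picks
    // the class with the highest dot-product similarity.
    q.single_task<class Classify>([=] {
        float score[kClasses] = {};
        for (int d = 0; d < D; ++d) {
            const float elem = EncToCls::read();
            for (int c = 0; c < kClasses; ++c)
                score[c] += elem * class_hvs[c * D + d];
        }
        int best = 0;
        for (int c = 1; c < kClasses; ++c)
            if (score[c] > score[best]) best = c;
        *prediction = best;
    });
    q.wait();
}
```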

In addition to HDC applications generally being very amenable to implementation with hardware, inherent features of oneAPI and FPGAs contribute to making HDC classification algorithms a natural fit for Altera FPGAs.  

  • oneAPI supports the SYCL Unified Shared Memory (USM) feature. Unlike C or C++ solutions, SYCL USM uniquely allows the host and accelerator to share the same memory in both the code base and the final hardware. This enables the intuitive, industry-standard coding practice of using a pointer to explicitly access data, whether it resides on the host or the accelerator, and reduces system latency to improve overall performance (see the USM sketch after this list).  
  • The encoding stage is the bottleneck. Since each dimension of a hypervector can be encoded independently, the parallel nature of a programmable FPGA allows multiple compute units to be used in parallel. This parallel execution significantly reduces the inference time compared to using a single compute unit.  
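
A minimal sketch of the USM pattern, assuming only standard SYCL 2020 APIs: one `malloc_shared` allocation is written by the host, updated by a device kernel, and read back by the host through the same pointer, with no explicit buffers or copies.

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;

    constexpr int N = 1024;
    // One allocation visible to both host and accelerator.
    float* vec = sycl::malloc_shared<float>(N, q);

    for (int i = 0; i < N; ++i) vec[i] = float(i);   // host writes...

    q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
        vec[i] *= 2.0f;                              // ...device updates...
    }).wait();

    float sum = 0.0f;
    for (int i = 0; i < N; ++i) sum += vec[i];       // ...host reads back.

    sycl::free(vec, q);
    return sum > 0.0f ? 0 : 1;
}
```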

Parallel processing was used in the encoding stage during both training and inference, and the USM feature ensured a low-latency solution.  

Altera FPGAs Excel at AI Inference   

In a real-world evaluation, this HDC image classification algorithm was implemented on CPU, GPU, and FPGA resources³: 

  • Intel Xeon® Platinum 8256 (Cascade Lake) CPU (3.8GHz, 4 Cores). 
  • Intel UHD 630 11th generation GPU. 
  • Intel Stratix® 10 GX FPGA. 

All implementations used an HDC classification model with 2000 hyperdimensions of 32-bit floating-point values. Using the NeuralHD retraining algorithm, all three implementations achieved similar accuracy of approximately 94-95%.  

The results show the benefits of using an FPGA in the inference stage.  

[Figures 5 and 6: training time and inference latency results]

  • During training, the key metric is the total time to train the model.  
  • During inference, the key metric is latency. 

The Intel GPU achieved a 60× speed-up over the CPU during the training process, while the FPGA provided an 18× speed-up. However, the CPU and FPGA implementations achieved an accuracy of approximately 97%, while the GPU implementation only achieved an accuracy of approximately 94%.  

In this training example, the FPGA suffered from being memory-bound, and only 8 CUs could be used in parallel during encoding. A larger FPGA would allow more parallel CUs and reduce the FPGA training time, but this evaluation was conducted on Intel DevCloud with fixed resources, and a larger FPGA was unavailable.   

FPGA AI Latency Advantages 

The FPGA shows the best latency during the inference stage, with a 3× speed-up over the CPU. For inference, the GPU is overkill, taking much longer than either the FPGA or even the CPU (a slow-down rather than an acceleration).  

These results highlight, once again⁴ ⁵ ⁶, the benefits of using an FPGA for AI inference. Given that an AI algorithm is trained and developed once but deployed for inference at scale, the performance benefits and typically lower cost, coupled with the license-free oneAPI development tools, make Altera FPGAs the ideal choice for fast and efficient AI inference in data center and edge applications.  

About the Author
Duncan Mackay is the Product Manager for the Intel High-Level Design tools, including Intel oneAPI, DSP Builder, and the HLS Compiler. He has over 25 years' experience supporting customers with High-Level Synthesis (HLS) design tools and has evangelized HLS throughout his career by authoring comprehensive HLS training, documentation, and examples. He was a leading contributor and manager at three successful HLS start-ups: Calypto, AutoESL, and Silexica. Duncan is currently focused on the dual goals of making Intel oneAPI produce the highest-quality RTL and be the easiest-to-use HLS tool in the industry. He graduated from the University of the West of Scotland with a master's degree in Electrical and Electronic Engineering.