
Multiarchitecture Hardware Acceleration of Hyperdimensional Computing Using oneAPI

Anshul_Gupta
Employee

 

Ian Peitzsch, PhD Student

University of Pittsburgh

NSF Center for Space, High-performance, and Resilient Computing 

 

What Is Hyperdimensional Computing?

Hyperdimensional computing (HDC) is a novel machine learning paradigm inspired by the high-dimensional nature of the cerebellum. HDC has seen use in tasks such as activity recognition, language recognition, and robot navigation. HDC is well suited to many applications because HDC models can fit on embedded and edge devices, are highly parallel, lend themselves to hardware-level optimization, are human interpretable, and are robust to noise. At the National Science Foundation Center for Space, High-performance, and Resilient Computing (SHREC), Ian Peitzsch is investigating the use of high-level tools for accelerating HDC.

 

Fig1.png

 

The Challenge

HDC models are naturally parallel and translate easily into a pipelined execution flow, making them an ideal target for hardware acceleration on both GPUs and FPGAs. Current acceleration work uses a different toolchain for each platform, which increases the complexity of the build system. Furthermore, for FPGA acceleration of HDC specifically, most previous work has been implemented at the register-transfer level (RTL) in a hardware description language (HDL) such as Verilog*. While HDL designs can deliver greater speedups, their development process is slow and costly, limiting the design-space exploration that is practical for complex HDC applications.

 

The Solution

Cross-architecture Intel® oneAPI Software Developer Tools (oneAPI): free tools that enable applications written in a single language to be ported to (and optimized for) multiple single-architecture and heterogeneous platforms.

That is precisely what Ian did. He used oneAPI to develop accelerators for HDC targeting both FPGA and GPU platforms and benchmarked them for inference latency and inference throughput.

 

GPU Design

For the GPU designs, Ian utilized the Intel® oneAPI Math Kernel Library (oneMKL) to perform the matrix-vector multiplication and element-wise cosine needed to encode the feature vector into hyperdimensional space. This encoding implementation is easily batched: batching turns the matrix-vector multiplication into a true matrix-matrix multiplication. The encoded hypervector is then sent to the classification kernel, which calculates the similarity between the hypervector and each class in separate parallel work items. Finally, the maximum similarity is found using a reduction, and the class associated with it is sent back to the host.

Picture2.png

The encoding stage for the single-pass training with the GPU is exactly the same as the encoding stage for inference. After the encoding stage, the encoded hypervectors are sent to the fitting stage. The fitting stage creates a separate work item for each class. Each work item goes through the entire training set of hypervectors and bundles hypervectors with labels matching their designated class value into that class hypervector.

Picture3.png
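To make the fitting stage concrete, here is a minimal sketch of such a per-class fitting kernel, assuming the encoded training hypervectors and labels already reside in device USM. The names fit_classes, D, and NUM_CLASSES, the fixed sizes, and the simple additive bundling are illustrative assumptions for this sketch, not the original code.

#include <sycl/sycl.hpp>
#include <cstdint>

// Hypothetical dimensions for illustration only.
constexpr std::size_t D = 4096;          // hyperdimensional space size
constexpr std::size_t NUM_CLASSES = 10;  // e.g., MNIST digits

// Bundle encoded training hypervectors into class hypervectors.
// One work-item per class: each scans the whole training set and
// accumulates the hypervectors whose label matches its class.
void fit_classes(sycl::queue &q,
                 const float *train_hvs,   // [n_train x D], device USM
                 const int   *labels,      // [n_train],     device USM
                 float       *class_hvs,   // [NUM_CLASSES x D], device USM, zero-initialized
                 std::size_t  n_train) {
  q.parallel_for(sycl::range<1>{NUM_CLASSES}, [=](sycl::id<1> cls) {
     const int c = static_cast<int>(cls[0]);
     for (std::size_t i = 0; i < n_train; ++i) {
       if (labels[i] == c) {
         for (std::size_t d = 0; d < D; ++d)
           class_hvs[c * D + d] += train_hvs[i * D + d];
       }
     }
   }).wait();
}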

Below is the implementation of the encoding stage on the GPU. For the sake of readability, a wrapper function around oneMKL’s GEMM is used. Written more densely, this encoding implementation requires only a handful of API calls, far fewer lines of code than an implementation from scratch would need.

Picture4.png
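For reference, the following is a minimal sketch of how such an encoding step can be written with oneMKL’s USM APIs. It is not the original code: the function name encode_batch, the dimension names F and D, and the choice to call oneapi::mkl::blas::row_major::gemm and oneapi::mkl::vm::cos directly (rather than through the wrapper mentioned above) are assumptions for illustration.

#include <sycl/sycl.hpp>
#include <oneapi/mkl.hpp>
#include <cstdint>

// Encode a batch of feature vectors into hyperdimensional space as
// H = cos(X * B), where X is [batch x F], B is a fixed [F x D] basis
// matrix, and all pointers are device USM.
void encode_batch(sycl::queue &q,
                  const float *X,   // [batch x F] feature vectors
                  const float *B,   // [F x D] basis matrix
                  float       *H,   // [batch x D] encoded hypervectors
                  std::int64_t batch, std::int64_t F, std::int64_t D) {
  namespace blas = oneapi::mkl::blas::row_major;
  namespace vm   = oneapi::mkl::vm;

  // Scratch buffer for the raw projection X * B.
  float *P = sycl::malloc_device<float>(batch * D, q);

  // Batching turns the per-sample matrix-vector products into one GEMM:
  // P = 1.0 * X * B + 0.0 * P
  auto gemm_done = blas::gemm(q,
                              oneapi::mkl::transpose::nontrans,
                              oneapi::mkl::transpose::nontrans,
                              batch, D, F,
                              1.0f, X, F, B, D,
                              0.0f, P, D);

  // Element-wise cosine over the whole encoded batch.
  vm::cos(q, batch * D, P, H, {gemm_done}).wait();

  sycl::free(P, q);
}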

 

Additionally, below is the GPU implementation of the classification stage of the inference dataflow. The classification stage is implemented in a way that gives each query hypervector its own work item. This implementation strategy allows for good throughput scaling with larger input batch sizes.

 

Picture5.png
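A minimal sketch of this per-query classification strategy might look like the following. The cosine-similarity formulation and the names classify_batch, D, and NUM_CLASSES are illustrative assumptions, not the original implementation; each work-item handles one query hypervector, which is what lets throughput scale with batch size.

#include <sycl/sycl.hpp>
#include <cstdint>

constexpr std::size_t D = 4096;          // hypervector dimension (illustrative)
constexpr std::size_t NUM_CLASSES = 10;  // e.g., MNIST digits

// Classify a batch of encoded query hypervectors: one work-item per
// query computes the cosine similarity to every class hypervector and
// records the index of the most similar class.
void classify_batch(sycl::queue &q,
                    const float *queries,     // [batch x D], device USM
                    const float *class_hvs,   // [NUM_CLASSES x D], device USM
                    int         *predictions, // [batch], device USM
                    std::size_t  batch) {
  q.parallel_for(sycl::range<1>{batch}, [=](sycl::id<1> idx) {
     const std::size_t i = idx[0];
     float best_sim = -1e30f;
     int   best_cls = 0;
     for (std::size_t c = 0; c < NUM_CLASSES; ++c) {
       float dot = 0.f, qn = 0.f, cn = 0.f;
       for (std::size_t d = 0; d < D; ++d) {
         const float qv = queries[i * D + d];
         const float cv = class_hvs[c * D + d];
         dot += qv * cv;
         qn  += qv * qv;
         cn  += cv * cv;
       }
       const float sim = dot / (sycl::sqrt(qn) * sycl::sqrt(cn) + 1e-12f);
       if (sim > best_sim) { best_sim = sim; best_cls = static_cast<int>(c); }
     }
     predictions[i] = best_cls;
   }).wait();
}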

 

FPGA Design

For the first FPGA design, Ian made use of Intel® oneAPI DPC++/C++ Compiler to develop an accelerated inference pipeline. It begins with streaming in feature vectors from the host using unified shared memory (USM) with explicit data movement. The input vectors are then scattered to 25 compute units for encoding from the feature space into hyperdimensional space. As each dimension of a hypervector can be encoded independently, each compute unit can run in parallel. This parallel execution significantly reduces the inference time compared to using a single compute unit, as the encoding stage is the bottleneck of the data pipeline. From the encoders, the encoded hypervectors are then piped to a single classification kernel. This classification kernel pieces the parts from each encoding compute unit together to form a single hypervector. Then, this hypervector is compared to each class hypervector and the class with the highest similarity is selected as the prediction. The prediction is then streamed out using USM with explicit data movement. All data transfers between kernels make use of FIFO pipes, which greatly reduce the number of reads/writes from/to global memory.

Picture6.png
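The scaled-down sketch below illustrates the plumbing described above: explicit USM data movement between host and device, and a FIFO pipe connecting two kernels so intermediate data never touches global memory. It is a stand-in for the full scatter/encode/classify pipeline rather than the original design; the kernel names, pipe capacity, and toy sizes are all illustrative.

#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <vector>
#include <cstdio>

// A FIFO pipe carrying single floats between two kernels. The real design
// declares one such pipe per encoder compute unit; capacity 64 is arbitrary.
using FeaturePipe = sycl::ext::intel::pipe<class FeaturePipeID, float, 64>;

constexpr std::size_t F = 16;  // toy feature-vector length

int main() {
  sycl::queue q{sycl::ext::intel::fpga_emulator_selector_v};

  // USM with explicit data movement: the host copies the feature vector
  // into device memory before launching the pipeline and copies the
  // result back afterwards.
  std::vector<float> host_features(F, 0.25f);
  float *dev_features = sycl::malloc_device<float>(F, q);
  float *dev_result   = sycl::malloc_device<float>(1, q);
  q.memcpy(dev_features, host_features.data(), F * sizeof(float)).wait();

  // Producer: streams the feature vector into the pipe (stands in for the
  // scatter kernel feeding the encoder compute units).
  q.single_task<class Producer>([=] {
    for (std::size_t i = 0; i < F; ++i)
      FeaturePipe::write(dev_features[i]);
  });

  // Consumer: reads from the pipe and accumulates (stands in for an
  // encoder/classifier stage). No global-memory traffic between kernels.
  q.single_task<class Consumer>([=] {
    float acc = 0.0f;
    for (std::size_t i = 0; i < F; ++i)
      acc += FeaturePipe::read();
    dev_result[0] = acc;
  });

  q.wait();
  float result = 0.0f;
  q.memcpy(&result, dev_result, sizeof(float)).wait();
  std::printf("sum = %f\n", result);

  sycl::free(dev_features, q);
  sycl::free(dev_result, q);
  return 0;
}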

For the second FPGA design, Ian made use of oneAPI DPC++/C++ Compiler to develop an accelerated single-pass training pipeline. Again, training feature data is first streamed onto the FPGA from the host using USM with explicit data movement. This data is then scattered to eight compute units for the encoding stage. The number of compute units is reduced because of memory constraints. The output encoded partial hypervectors are sent to the fitting kernel. This kernel combines the partial hypervectors to form a single hypervector and reads in the corresponding label. Then the kernel bundles the hypervector into the class corresponding to that label. After all the training data has gone through the encoding and fitting stages, the class hypervectors are streamed from the FPGA to the host using USM with explicit data movement.

Picture7.png
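As a rough illustration, a single-pass fitting kernel of this kind can be structured as a single_task that streams encoded hypervectors from a pipe and accumulates them into on-chip class hypervectors. For brevity, the sketch below assumes one pipe carrying already assembled hypervectors instead of one pipe per encoder compute unit; the names launch_fit and EncodedPipe and the sizes are hypothetical.

#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <cstdint>

constexpr std::size_t D = 1024;          // hypervector dimension (illustrative)
constexpr std::size_t NUM_CLASSES = 10;

using EncodedPipe = sycl::ext::intel::pipe<class EncodedPipeID, float, 64>;

// Launches the fitting kernel: for each training sample it reads the
// encoded hypervector from the pipe, looks up the label, and bundles
// (accumulates) the hypervector into that label's class hypervector.
// The finished class hypervectors are written back to USM so the host
// can copy them out with explicit data movement. This kernel is meant
// to run concurrently with the encoder kernels that feed EncodedPipe.
sycl::event launch_fit(sycl::queue &q,
                       const int *labels,     // [n_train], device USM
                       float     *class_hvs,  // [NUM_CLASSES x D], device USM
                       std::size_t n_train) {
  return q.single_task<class Fit>([=] {
    // Class hypervectors accumulate in on-chip memory during training.
    float local_classes[NUM_CLASSES][D] = {};
    for (std::size_t s = 0; s < n_train; ++s) {
      const int label = labels[s];  // assumed to be in [0, NUM_CLASSES)
      for (std::size_t d = 0; d < D; ++d)
        local_classes[label][d] += EncodedPipe::read();
    }
    for (std::size_t c = 0; c < NUM_CLASSES; ++c)
      for (std::size_t d = 0; d < D; ++d)
        class_hvs[c * D + d] = local_classes[c][d];
  });
}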

Below is the implementation of the encoding modules used by both the inference and training implementations. First, each module copies in basis vectors from USM to local Block RAM (BRAM). Then, each module receives the feature vector from its corresponding pipe in the pipe array and puts the feature vector into local BRAM. Next, the module calculates its portion of the corresponding hypervector. Finally, the portion is sent to either the classification kernel in the inference implementation, or to the bundling kernel for training.

Picture8.png
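The sketch below captures the shape of one such encoding compute unit, assuming the encoding is a cos(basis · feature) projection as in the GPU design. The pipe names, sizes, and the launch_encoder helper are hypothetical, and the kernel is meant to be launched alongside the scatter kernel (which feeds its input pipe) and the classification or bundling kernel (which drains its output pipe).

#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <cstdint>

// Illustrative sizes: each encoder produces D_PART of the D-dimensional
// hypervector.
constexpr std::size_t F      = 784;  // feature-vector length (e.g., MNIST)
constexpr std::size_t D_PART = 64;   // dimensions handled by this unit

using FeatureInPipe  = sycl::ext::intel::pipe<class FeatureInID, float, 64>;
using PartialOutPipe = sycl::ext::intel::pipe<class PartialOutID, float, 64>;

// Launches one encoder compute unit.
sycl::event launch_encoder(sycl::queue &q,
                           const float *basis /* [D_PART x F], device USM */) {
  return q.single_task<class Encoder>([=] {
    // Copy this unit's slice of the basis matrix from USM into a local
    // array, which the FPGA compiler implements in on-chip memory (BRAM).
    float local_basis[D_PART][F];
    for (std::size_t d = 0; d < D_PART; ++d)
      for (std::size_t i = 0; i < F; ++i)
        local_basis[d][i] = basis[d * F + i];

    // Read the feature vector from this unit's input pipe into BRAM.
    float feature[F];
    for (std::size_t i = 0; i < F; ++i)
      feature[i] = FeatureInPipe::read();

    // Compute this unit's portion of the hypervector and stream it out.
    for (std::size_t d = 0; d < D_PART; ++d) {
      float dot = 0.0f;
      for (std::size_t i = 0; i < F; ++i)
        dot += local_basis[d][i] * feature[i];
      PartialOutPipe::write(sycl::cos(dot));
    }
  });
}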

 

Results

Both inference designs were benchmarked for throughput and inference latency on the MNIST handwritten digits dataset, and their performance was compared against a serial CPU implementation. The GPU design exhibited the greatest throughput, thanks to its data-parallel structure, which allows many inferences to be conducted at once. The FPGA design exhibited the lowest inference latency, owing to its optimized dataflow. The GPU actually showed a slowdown in inference latency, but this is because its implementation is optimized for throughput rather than latency.

 

Picture9-MOD.png

 Picture91-MOD.png

 

Both single-pass training designs were also benchmarked on the MNIST handwritten digits dataset and compared to a serial CPU baseline. All implementations achieved similar accuracy of approximately 85%. Of the three architectures benchmarked, the GPU achieved the fastest training time, with a speedup of almost 60x over the CPU baseline, while the FPGA design achieved a speedup of 17.9x.

 

Picture92-MOD.png

 

 

Platform Configuration Used for Tests

Software

  • Intel® oneAPI DPC++/C++ Compiler
  • Intel® oneAPI Math Kernel Library (oneMKL)

Hardware

  • Intel® Xeon® Platinum 8256 processor (four 3.8 GHz cores) with 192 GB of memory.
  • Intel® UHD Graphics 630 with 32 GB of memory and up to 512 work-items, which were used to parallelize the HDC operations.
  • Intel® Stratix® 10 FPGA (PAC D5005) with 16 GB of RAM.

 

Let’s Get Started

Picture93.png

 

Conclusion

Future plans for this project are to fine-tune the parameters to get the best possible performance from all implementations, to explore model quantization for the FPGA design, and to benchmark the work on newer FPGA hardware, such as the Intel® Agilex® FPGA product family, and on discrete GPUs, such as the Intel® Data Center GPU Max Series. This work is part of a larger goal of expanding the application space of HDC to more complex datasets. oneAPI helps achieve this goal by allowing quick and easy development for multiple hardware platforms, shortening development time while making it possible to target more platforms for benchmarking.

 

Notes and Disclaimers


Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex​.  

Your costs and results may vary. 

Intel technologies may require enabled hardware, software, or service activation.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. 

*Other names and brands may be claimed as the property of others.  ​