
oneAPI and SIMD Instructions are a natural fit for database acceleration on Intel FPGAs

DuncanMackay
Employee

Single Instruction Multiple Data (SIMD) is a well-established technique for enhancing the computational performance of single-threaded tasks on modern CPUs.

FPGAs are renowned for delivering high-performance computing by tailoring circuits to specific algorithms. They provide a customized and optimized hardware solution, which can significantly accelerate complex computations.

While SIMD and FPGAs have little in common at first glance, this blog post will show that they are a natural fit. By enabling data parallel processing, FPGAs can harness the benefits of SIMD, further enhancing their processing speed. FPGA adaptability and SIMD efficiency offer a compelling solution for a wide range of computationally intensive tasks.  

SIMDified: High-Performance Programming

SIMD is a form of parallel processing in which a single instruction is applied to multiple data items simultaneously. It is realized through dedicated hardware extensions that execute the same instruction on multiple data items in parallel.

SIMDified processing, consequently, is a technique that exploits data independence to improve the performance of software applications by rewriting the application code to make heavy use of SIMD instructions.

Some of the primary benefits of using SIMDified processing are:

  • Increased performance: SIMDified processing can significantly improve the performance of computationally intensive software applications [FKLU05, KYLT12, LLS+11, MJR+13, UPD+20].
  • Integrability: SIMDified processing is appealing as it offers direct usability through intrinsics and dedicated data types.
  • Availability: SIMDified processing is readily available on many modern processors, making it an accessible and reasonable choice for enhancing computational performance.

Despite these benefits, SIMDified processing is not the best solution for every type of application. For example, applications with low data parallelism will not benefit from it. It remains, however, a compelling technique for improving the performance of data-intensive software applications.

SIMD Portability Embraces Heterogeneous Design

SIMD instruction sets consist of SIMD registers and the instructions that work on them. Low-level programming using SIMD intrinsics in C/C++ is the state-of-the-art for achieving the best performance.
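As a minimal illustration, the following sketch adds two float arrays eight elements at a time using AVX2 intrinsics (assuming an AVX2-capable x86 CPU; the intrinsics shown are standard Intel intrinsics):

    #include <immintrin.h>  // Intel SIMD intrinsics (AVX2)
    #include <cstddef>

    // Element-wise addition of two float arrays using 256-bit SIMD registers.
    // Assumes n is a multiple of 8; a scalar tail loop would handle remainders.
    void add_avx2(const float* a, const float* b, float* out, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);  // load 8 floats from a
            __m256 vb = _mm256_loadu_ps(b + i);  // load 8 floats from b
            __m256 vc = _mm256_add_ps(va, vb);   // one instruction, 8 additions
            _mm256_storeu_ps(out + i, vc);       // store 8 results
        }
    }

The same logic written for SSE, AVX-512, or Arm NEON requires different register types and function names, which is exactly the portability problem discussed next.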

However, in heterogeneous environments that incorporate different hardware platforms, operating systems, architectures, and technologies, low-level programming entails severe challenges: hardware capabilities, degrees of data parallelism, and naming conventions all differ between targets.

Because such specialized implementations are not portable across platforms, SIMD abstraction libraries have been developed to provide a unified SIMD interface. These libraries rely on C++ template metaprogramming and function template specializations to map that interface to the different SIMD intrinsics, compensating within the library for any functionality a target is missing.

Libraries written in C/C++ enable developers to write SIMD-hardware-oblivious application code and to generate code for specific SIMD extensions with little overhead. The separation into SIMD-hardware-oblivious code and a SIMD abstraction library reduces complexity on both sides: application developers write portable code once, while hardware specifics stay confined to the library.
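A minimal sketch of the idea, using hypothetical names (real libraries such as those listed below are considerably more elaborate): a generic kernel is written once against an abstract register type, and a template specialization maps it to the intrinsics of each backend.

    #include <immintrin.h>
    #include <cstddef>

    // Hypothetical abstraction: a register wrapper specialized per SIMD extension.
    template <typename T, typename Extension> struct simd;

    struct avx2 {};  // tag type identifying the AVX2 backend

    template <> struct simd<float, avx2> {
        __m256 reg;
        static constexpr std::size_t width = 8;
        static simd load(const float* p) { return {_mm256_loadu_ps(p)}; }
        static simd add(simd a, simd b)  { return {_mm256_add_ps(a.reg, b.reg)}; }
        void store(float* p) const       { _mm256_storeu_ps(p, reg); }
    };

    // SIMD-hardware-oblivious application code: written once, specialized per backend.
    template <typename Ext>
    void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
        using vec = simd<float, Ext>;
        for (std::size_t i = 0; i < n; i += vec::width)
            vec::add(vec::load(a + i), vec::load(b + i)).store(out + i);
    }

Supporting another extension, or an FPGA, then amounts to providing one more specialization while the application code stays untouched.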

The success of this approach can be seen in the number of SIMD libraries and abstraction layers that have been developed to address this problem:

  • Template SIMD Library (TSL¹)
  • Google Highway², an open-source SIMD library
  • Xsimd³, which provides C++ wrappers for SIMD intrinsics.

The advantage of such libraries is that SIMDified code only needs to be programmed once, and it is then specialized for the target SIMD platform through the SIMD abstraction library. SIMD instructions and their abstraction perfectly fit libraries and heterogeneous design environments.

Using FPGAs for Acceleration

FPGAs offer a cost- and power-efficient means to accelerate software applications. However, FPGAs have historically required a deep understanding of digital design principles and extensive knowledge of specialized languages like VHDL or Verilog. The complexity of programming and the lack of code portability have made FPGA-based solutions difficult to access, so they have remained more specialized than CPU- or GPU-based computing platforms. With the introduction of Intel oneAPI, this is changing.

Intel® oneAPI is a software development kit that provides a unified programming model for CPUs, GPUs, and FPGAs. It includes programming tools and libraries that support various languages, such as C++, Fortran, Python, and Data Parallel C++ (DPC++), designed for heterogeneous computing to optimize performance, increase productivity, and reduce development time.

[Figure: Intel oneAPI Overview]

Intel oneAPI includes the ability to target FPGAs from SYCL/C++, and Intel FPGAs are becoming an increasingly attractive option for software developers seeking efficient data processing. To use FPGAs from SIMDified applications, they simply need to be added as another backend in the SIMD abstraction library.
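A minimal sketch of how an FPGA is targeted from SYCL (assuming the oneAPI FPGA support add-on; fpga_emulator_selector_v is the emulation selector shipped with recent oneAPI releases, and fpga_selector_v targets real hardware):

    #include <sycl/sycl.hpp>
    #include <sycl/ext/intel/fpga_extensions.hpp>
    #include <vector>

    int main() {
        // Select the FPGA emulator for functional testing; swap in
        // sycl::ext::intel::fpga_selector_v to compile for actual hardware.
        sycl::queue q{sycl::ext::intel::fpga_emulator_selector_v};

        constexpr int N = 1024;
        std::vector<float> a(N, 1.0f), b(N, 2.0f), out(N);
        {
            sycl::buffer bufA{a}, bufB{b}, bufOut{out};
            q.submit([&](sycl::handler& h) {
                sycl::accessor A{bufA, h, sycl::read_only};
                sycl::accessor B{bufB, h, sycl::read_only};
                sycl::accessor O{bufOut, h, sycl::write_only};
                // single_task is the idiomatic FPGA style: one deep pipeline.
                h.single_task([=]() {
                    for (int i = 0; i < N; ++i) O[i] = A[i] + B[i];
                });
            });
        }  // buffer destruction waits for the kernel and copies results back
        return 0;
    }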

FPGAs and SIMD: A Natural Fit

The Intel DPC++ compiler can synthesize arbitrary C++ code into circuits and auto-vectorize data-parallel processing with the help of annotations. Taking advantage of the flexibility of an FPGA, arrays in the code may be annotated and implemented as simple registers, removing data access bottlenecks and allowing parallel processing from sink to source. Together, this makes it possible to realize performance acceleration using FPGAs for SIMD in a simple and programmable way.

SIMD abstraction libraries are an obvious place to incorporate SIMD processing capabilities for FPGAs. As mentioned, these libraries already support the common SIMD instruction set extensions of Intel and Arm processors. The following example shows how SIMD instructions are easily implemented on an FPGA using the TSL abstraction library. In the general example of element-wise addition shown below, the scalar code describes the functionality of loading registers, and the pragma unroll attribute instructs the DPC++ compiler to implement all paths in parallel.

[Figure: element-wise addition implemented with the TSL abstraction library for an FPGA]
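A sketch in the same spirit (illustrative only, not the actual TSL source; the names are hypothetical):

    // Element-wise SIMD addition as synthesized by the DPC++ compiler for an
    // FPGA (illustrative sketch). ElementCount is a compile-time constant, so
    // the fully unrolled loop becomes ElementCount parallel adders operating
    // on register-implemented arrays.
    template <typename T, int ElementCount>
    void simd_add(const T (&a)[ElementCount], const T (&b)[ElementCount],
                  T (&result)[ElementCount]) {
    #pragma unroll
        for (int i = 0; i < ElementCount; ++i) {
            result[i] = a[i] + b[i];  // each iteration maps to its own adder
        }
    }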

 

In this simple element-wise example, there are no dependencies between the instructions, and similar implementations work for other SIMD instructions, such as scatter, gather, and store. More complex instructions can also be accelerated effectively.

A horizontal reduction requires an adder tree of depth ld(N) (ld denoting log₂), where N is the number of elements and is known at compile time. The following code example shows how this can be implemented in a scalable manner, using an unroll pragma with a compile-time constant to build the adder tree. In the examples below, the loops are unrolled in hardware to a depth of ld(RegisterSize/sizeof(DataType)) = ld(SIMDElementCount); for instance, a 512-bit (64-byte) register holding 32-bit (4-byte) elements gives SIMDElementCount = 64/4 = 16 and an adder tree of depth ld(16) = 4.

[Figure: scalable horizontal reduction built with an unroll pragma and a compile-time adder tree]
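A sketch of such a reduction (again illustrative, not the actual TSL source; assumes a power-of-two ElementCount):

    // Horizontal reduction (sum) over a register-implemented array. Halving the
    // stride each iteration yields an adder tree of depth log2(ElementCount);
    // because ElementCount is known at compile time, all loops fully unroll.
    template <typename T, int ElementCount>
    T simd_reduce_add(const T (&in)[ElementCount]) {
        T tree[ElementCount];
    #pragma unroll
        for (int i = 0; i < ElementCount; ++i) tree[i] = in[i];
    #pragma unroll
        for (int stride = ElementCount / 2; stride > 0; stride /= 2) {
    #pragma unroll
            for (int i = 0; i < stride; ++i)
                tree[i] += tree[i + stride];  // one tree level per stride value
        }
        return tree[0];
    }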

 

By adding the examples above to a library of similar SIMD elements, SIMD instructions can be accelerated on Intel FPGAs by any software application that calls the library.

Additional system benefits are provided by the Intel FPGA Board Support Package (BSP). Intel® FPGAs employ a BSP to describe the hardware interfaces to the FPGA and to provide a shell in which the kernel is implemented.

The BSP enables the SYCL Unified Shared Memory (USM) feature, which allows direct data sharing between the CPU and the accelerator, freeing the CPU from data transfer management. The FPGA can thus be used as a coprocessor.
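A minimal sketch of USM in this setting (assuming a BSP with USM support; malloc_shared and the queue::single_task shortcut are standard SYCL 2020 APIs):

    #include <sycl/sycl.hpp>
    #include <cstddef>

    void filter_count_usm(sycl::queue& q, std::size_t n) {
        // Shared allocations are visible to both the CPU and the FPGA kernel,
        // so no explicit buffer management or copies are needed.
        int* data = sycl::malloc_shared<int>(n, q);
        int* hits = sycl::malloc_shared<int>(1, q);
        for (std::size_t i = 0; i < n; ++i) data[i] = static_cast<int>(i % 100);
        *hits = 0;

        q.single_task([=]() {
            int count = 0;
            for (std::size_t i = 0; i < n; ++i)
                if (data[i] < 42) ++count;  // filter-count over shared memory
            *hits = count;
        }).wait();

        sycl::free(data, q);
        sycl::free(hits, q);
    }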

The pre-compiled BSP also reduces overall compile time by ensuring that only the kernel logic has to be generated.

By providing support for C++/SYCL, support for CPU data transfer offloading, and pre-compiled BSPs, Intel FPGAs are a natural fit for SIMD applications and streaming applications such as modern composable databases.

Simplicity with SIMD and FPGAs

Using FPGAs to accelerate SIMD instructions has been investigated by Dirk Habich, Alexander Krause, Johannes Pietrzyk, and Wolfgang Lehner at TU Dresden and documented in their paper “Simplicity done right for SIMDified query processing on CPU and FPGA,” presented at SiMoD@SIGMOD 2023 in Seattle, USA. With support from Christian Färber of Intel, the work shows how practical and efficient it is to implement a SIMDified kernel on an FPGA while achieving peak performance.

In their paper, they evaluated FPGA acceleration of SIMD instructions using a dual-socket system with 3rd-generation Intel® Xeon® Scalable processors (code-named "Ice Lake"), each with 36 cores at a 2.2 GHz base frequency, and a BittWare® IA-840f acceleration card equipped with an Intel® Agilex® 7 AGF027 FPGA and 4x 16 GB of DDR4 memory.

First, they examined the impact of the register width on the maximum acceleration bandwidth by progressively increasing the register width of the SIMD instance. In the first case, a simple aggregation, their results showed that the bandwidth of the FPGA accelerator increases with each doubling of the data width until the global bandwidth saturates, which is the ideal acceleration case.

In the second case, a filter-count kernel, which has a data dependency in the last stage of the adder tree, similar behavior was seen, but the bandwidth saturates earlier, at the width of the PCIe link. Both cases highlight the expected benefit of natively parallel instructions executing on a highly parallel architecture, namely impressive acceleration gains, and indicate that these gains would continue to scale with wider memory accesses.

 

[Figure: FPGA accelerator bandwidth versus SIMD register width for the aggregation and filter-count kernels]

 

A final performance comparison pitted the FPGA against the CPU. The same multi-threaded AVX512-based filter-count kernel was dispatched to the CPU and the FPGA. As more concurrently working threads were added and the number of active CPU cores increased, the per-core CPU bandwidth degraded, as expected. The FPGA maintained peak performance across the entire range of the workload.

[Figure: CPU versus FPGA bandwidth for the multi-threaded AVX512-based filter-count kernel]

 

Building upon this work, the TU Dresden and Intel team further investigated how the TSL approach can be employed to exploit an FPGA as a custom SIMD processor. That work was published at CIDR 2024 [4] in Chaminade, USA.

Conclusion

SIMD instructions offer a means to accelerate software applications. With the advent of popular libraries for SIMD instructions, it is now possible to easily extend the applicability of SIMD acceleration to FPGAs. The parallel architecture of an FPGA and its ability to re-configure data paths into direct sink-to-source connections make FPGAs an ideal fit for SIMD acceleration. When used through hardware abstraction libraries like TSL, a single code base can be shipped and executed on various hardware, from different CPUs to previously unattainable hardware like FPGAs, while achieving peak performance.

References

[FKLU05] F. Franchetti, S. Kral, J. Lorenz, and C. W. Ueberhuber, "Efficient Utilization of SIMD Extensions," in Proceedings of the IEEE, vol. 93, no. 2, pp. 409-425, Feb. 2005, doi: 10.1109/JPROC.2004.840491.
[KYLT12] P. Kristof, H. Yu, Z. Li, and X. Tian, "Performance Study of SIMD Programming Models on Intel Multicore Processors," 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, Shanghai, China, 2012, pp. 2423-2432, doi: 10.1109/IPDPSW.2012.299.
[LLS+11] W.-Y. Lo, D. P.-K. Lun, W.-C. Siu, W. Wang, and J. Song, "Improved SIMD Architecture for High-Performance Video Processors," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 12, pp. 1769-1783, Dec. 2011, doi: 10.1109/TCSVT.2011.2130250.
[MJR+13] G. Mitra, B. Johnston, A. P. Rendell, E. McCreath, and J. Zhou, "Use of SIMD Vector Operations to Accelerate Application Code Performance on Low-Powered ARM and Intel Platforms," 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Ph.D. Forum, Cambridge, MA, USA, 2013, pp. 1107-1116, doi: 10.1109/IPDPSW.2013.207.
[UPD+20] A. Ungethüm, J. Pietrzyk, P. Damme, A. Krause, D. Habich, W. Lehner, and E. Focht, "Hardware-Oblivious SIMD Parallelism for In-Memory Column-Stores," 10th Conference on Innovative Data Systems Research, CIDR 2020, Amsterdam, The Netherlands, Jan. 2020. http://cidrdb.org/cidr2020/papers/p28-ungethuem-cidr20.pdf

 

[1] https://github.com/db-tu-dresden/TSL

[2] https://github.com/google/highway

[3] https://github.com/xtensor-stack/xsimd

[4] https://www.cidrdb.org/cidr2024/papers/p53-pietrzyk.pdf

About the Author
Duncan Mackay is the Product Manager for the Intel High-Level Design tools, including Intel oneAPI, DSP Builder, and the HLS Compiler. He has over 25 years’ experience supporting customers with High-Level Synthesis (HLS) design tools and has evangelized HLS throughout his career by authoring comprehensive HLS training, documentation, and examples. He was a leading contributor and manager at three successful HLS start-ups: Calypto, AutoESL, and Silexica. Duncan is currently focused on the dual goals of making Intel oneAPI the highest-quality RTL generator and the easiest-to-use HLS tool in the industry. He graduated from the University of the West of Scotland with a master’s degree in Electrical and Electronic Engineering.