I am exploring the possibility of offloading some particle simulation code to an FPGA, and I have been testing as a member of the oneAPI Beta program.
Using the VectorAdd sample/tutorial, the CPU:GPU:FPGA time comparisons are less than stellar.
My question is: What kind of performance ratios should I expect for floating point?
What precision are you looking at? Vector add on non-HBM FPGAs will become memory-bound long before you can use up the compute resources. If your application is compute-heavy with little data reuse, there is no point in using FPGAs for it: it will always be faster on GPUs, simply because they have considerably higher peak compute performance and memory bandwidth than FPGAs of the same generation.
On current-generation Intel Arria 10 and Stratix 10 FPGAs, each DSP can perform one FP32 MAC per cycle. The biggest Arria 10 has 1518 DSPs, and the peak DSP operating frequency on Arria 10 is 480 MHz. Counting each MAC as two floating-point operations, the theoretical peak FP32 performance of Arria 10 is therefore ~1.45 TFLOP/s. In reality, however, you will never be able to use all the DSPs and run at 480 MHz; in the best case you will run at ~350 MHz. Moreover, it is near-impossible for an application to map entirely to MAC operations, and it is also impossible to map every application to exactly 1518 DSPs, so some DSPs will always go unused. The real-world peak FP32 performance of Arria 10 is around 900 GFLOP/s. Similarly, for the biggest Stratix 10 GX, do not expect a frequency above 450 MHz with Hyperflex, which gives a real-world peak FP32 performance of ~4.5 TFLOP/s. These numbers are achievable only if your application has an unusually high amount of data reuse that you exploit through on-chip memory to minimize external memory accesses; otherwise, as I mentioned above, your performance will be bound by the extremely low external memory bandwidth of these FPGAs long before you can fully use their compute potential. There is also the Stratix 10 MX with HBM, which gives you a reasonable amount of memory bandwidth but has 30% lower peak compute performance than the Stratix 10 GX: the HBM controller takes up a lot of FPGA area, so the largest MX FPGA has far fewer DSPs and much less logic, BRAM, etc. than the largest GX FPGA.
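For reference, the Arria 10 arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the back-of-the-envelope estimate: the function name is made up for illustration, and it assumes each FP32 MAC counts as two floating-point operations (one multiply, one add) and that every DSP issues one MAC per cycle.

```python
def peak_fp32_gflops(num_dsps, freq_mhz, flops_per_mac=2):
    """Peak FP32 throughput in GFLOP/s, assuming one MAC per DSP per cycle.

    flops_per_mac=2 is the usual convention (multiply + add); it is an
    assumption of this sketch, not a vendor-published figure.
    """
    return num_dsps * flops_per_mac * freq_mhz * 1e6 / 1e9


# Numbers quoted in the post for the biggest Arria 10.
ARRIA10_DSPS = 1518
ARRIA10_PEAK_MHZ = 480       # peak DSP operating frequency
ARRIA10_REALISTIC_MHZ = 350  # best-case frequency of a real design

theoretical = peak_fp32_gflops(ARRIA10_DSPS, ARRIA10_PEAK_MHZ)
at_realistic_clock = peak_fp32_gflops(ARRIA10_DSPS, ARRIA10_REALISTIC_MHZ)

print(f"theoretical peak:    {theoretical:.1f} GFLOP/s")
print(f"all DSPs at 350 MHz: {at_realistic_clock:.1f} GFLOP/s")
```

Even with every DSP busy, 350 MHz gives only a little over 1 TFLOP/s; the ~900 GFLOP/s figure additionally accounts for unused DSPs and operations that do not map to MACs.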
Take a look at the roofline model, if you are not already familiar with it, and use the numbers I mentioned above together with the theoretical peak memory bandwidth of these FPGAs to determine what kind of performance ratios you can expect compared to CPUs and GPUs.
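As a rough illustration of the roofline model, attainable performance is the smaller of the compute peak and the memory bandwidth multiplied by the kernel's arithmetic intensity (FLOPs per byte moved to or from external memory). The sketch below applies that to FP32 vector add. The 34 GB/s bandwidth is a hypothetical placeholder for a board with two DDR4 banks, not a number from this thread; substitute the figure for your own board.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


PEAK_GFLOPS = 900.0    # real-world Arria 10 FP32 estimate from the post
BANDWIDTH_GB_S = 34.0  # HYPOTHETICAL external memory bandwidth; check your board

# FP32 vector add: c[i] = a[i] + b[i] is 1 FLOP per 12 bytes
# (two 4-byte loads plus one 4-byte store), with no data reuse.
VECTOR_ADD_INTENSITY = 1.0 / 12.0

vector_add_gflops = attainable_gflops(
    PEAK_GFLOPS, BANDWIDTH_GB_S, VECTOR_ADD_INTENSITY
)

# Arithmetic intensity above which the kernel becomes compute-bound.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GB_S

print(f"vector add bound: {vector_add_gflops:.2f} GFLOP/s")
print(f"ridge point:      {ridge_point:.1f} FLOP/byte")
```

Under these assumptions vector add is limited to a few GFLOP/s no matter how many DSPs are available, and a kernel would need more than roughly 26 FLOPs per byte of external memory traffic before the DSPs become the bottleneck. Running the same calculation with a CPU's or GPU's peak and bandwidth gives the ratios to expect.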
Thank you. Your detailed comments are quite helpful. Given the price:performance curve of the hardware, I don't think the FPGA board is suitable for offloading physics equations (mostly Navier-Stokes calculations in my target case). Another problem is 32-bit float. Most of these calculations need double precision for industrial-sized problems, but I thought it was worth a try with the current hardware.
Indeed, FPGAs are not ready (and probably never will be) for 64-bit calculations; however, there is a strong initiative in the HPC and scientific computing communities towards rethinking precision requirements, since 64-bit computation is too expensive on any hardware and the majority of real-world applications are memory-bound on modern hardware anyway, which means we are just wasting silicon area by putting more 64-bit ALUs in our chips. You can refer to the following publication for quantitative results: