
Unexpected performance results of OpenCL kernel execution

Honored Contributor II



I have a simple dummy kernel that I am using for benchmarking. Here is the OpenCL kernel:


    __attribute__((num_compute_units(1)))
    __attribute__((num_simd_work_items(16)))
    __attribute__((reqd_work_group_size(256,1,1)))
    __kernel void WGS256MAPI16LL1048576(const __global float *GIn,
                                        __global float *GOut,
                                        const int M, const int N, const int P)
    {
        const int XGL = get_global_id(0);
        const int XGRid = get_group_id(0);
        const int XGRnum = get_num_groups(0);
        const int XLSize = get_local_size(0);
        const int XLid = get_local_id(0);

        // Just a private variable
        float temp = 1.0f;

        // Start of a new level of for loop
        long baseIndex1 = XGRid*XLSize*16+XLid;

        temp += temp * M;
        temp += temp * M;
        temp += temp * M;
        temp += temp * M;
        temp += temp * M;
        temp += temp * M;
        temp += temp * M;
        temp += temp * M;
        ...
        temp += temp * M;

        // Store index (XGL) is assumed; the code as posted reads "GOut = temp;"
        GOut[XGL] = temp;
    }


As the code shows, my kernel does nothing special. I have also removed all DRAM reads. The kernel performs 1024 "fma" operations and saves the result to memory, just to make sure the computation is not optimized out by the compiler.

I deploy this kernel as an NDRange kernel on a Nallatech 385A. It was compiled with the Altera 16.0 compiler. I would like to measure the performance of my kernel in floating-point operations per second (GFLOPS). Here is the section of my host code that launches the kernel and collects the timing:


    Event evKernel (algorithm->getKernelName ());
    err = clEnqueueNDRangeKernel (queue, kernel, algorithm->getWorkDim(), NULL,
                                  globalWorkSize, localMemSize, 0, NULL,
                                  &evKernel.CLEvent());
    CL_CHECK_ERROR (err);
    err = clWaitForEvents (1, &evKernel.CLEvent());
    evKernel.FillTimingInfo ();


As shown, it records the start and end of the kernel execution. I also make sure to wait for the kernel to finish before reading the timing information.


With the configuration above, the FPGA appears to deliver around 2451 GFLOPS, which is well above the theoretical peak of about 1.5 TFLOPS. I have checked the generated Verilog file and confirmed that all of the FMA operations have been generated. My question is: am I doing something wrong that leads to this nonsensical performance number?



1 Reply
Honored Contributor II

Have you checked the final post-place-and-route resource utilization? The computation might well have been optimized out during synthesis. The OpenCL compiler does generate the FMA operations, but it appears to ignore the SIMD attribute: the number of DSPs used is exactly equal to the number of FMA operations in the code, rather than that number multiplied by the SIMD factor. Assuming nothing is being optimized out, you have probably made a mistake in calculating the FLOPS value.


Arria 10 GX 1150 has 1518 DSPs, each capable of performing one single-precision FMA operation (counted as two floating-point operations) per clock cycle. With the DSPs running at their maximum frequency of 482 MHz, that gives 1518 × 2 × 482 MHz ≈ 1.46 TFLOPS. It is not possible to exceed this number.