I looked through the Intel OpenCL for FPGA documentation, but it doesn't say which critical paths in a kernel determine the clock frequency. Is there a way to find this information?
There is no straightforward way to find or optimize the critical path from the OpenCL kernel. You can recompile the project generated by the OpenCL compiler directly in Quartus and check the timing report, but mapping the HDL signals back to the OpenCL kernel is next to impossible. Another option is to use the -fmax switch to raise the target operating frequency until the II of your loop goes up. The report will then show the path responsible for the increase in II, which is basically the critical path, though this information is not necessarily accurate. Newer versions of the compiler (18+) give more accurate information in this case.
Based on my personal experience, the critical path of the kernel is usually in only a few places:
NDRange: The critical path for NDRange kernels nearly always falls on the 2x clock for Block RAM double-pumping. This effectively limits your operating frequency to 200-260 MHz on Arria 10 depending on area utilization, since the Block RAMs then run at twice the kernel clock and their maximum operating frequency is 500-550 MHz depending on speed grade. If your kernel doesn’t use double-pumping, you can probably reach 350+ MHz, and Fmax will be limited either by placement and routing restrictions or by the OpenCL BSP.
Single work-item: If you have loop-carried dependencies, i.e. you are reading some data every loop iteration that was updated in the previous loop iteration, the feedback path will limit your operating frequency to something between 150 and 220 MHz on Arria 10. The operating frequency cannot go higher in this case unless you sacrifice the loop II.
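To make "loop-carried dependency" concrete, here is a minimal sketch in plain C (not OpenCL, so it can be compiled on the host). The function names running_sum and scale are made up for illustration and come from no Intel example. The first loop reads a value that the previous iteration wrote, which is the kind of feedback path described above; the second loop has no such data dependency between iterations.

```c
#include <stddef.h>

/* Loop-carried dependency: every iteration reads `acc`, which the previous
 * iteration wrote. In a pipelined single work-item kernel this
 * read-modify-write forms a feedback path. */
int running_sum(const int *in, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc = acc + in[i];   /* depends on acc from iteration i-1 */
    }
    return acc;
}

/* No loop-carried data dependency: out[i] depends only on in[i]. The only
 * value carried across iterations is the loop variable itself. */
void scale(const int *in, int *out, size_t n, int k) {
    for (size_t i = 0; i < n; i++) {
        out[i] = k * in[i];
    }
}
```

Calling running_sum on {1, 2, 3, 4} walks the dependency chain four times, while scale could in principle process all four elements independently.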
If there are no loop-carried dependencies in the kernel, the critical path will be the chain of updates and comparisons on the loop variables for the deepest loop nest that has an II of one. Manual loop flattening, together with what I call “loop exit condition optimization”, can help achieve over 300 MHz on Arria 10 in such cases, even with high area utilization. Still, even in this case, the deeper the original loop nest was, the lower your Fmax will be. Take a look at Sections 126.96.36.199 and 188.8.131.52 in the following document for more info on this optimization:
And associated example code can be found here:
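Separately from the linked code, the idea behind manual flattening plus exit condition optimization can be roughly sketched in plain C as below. This is only an illustrative sketch and not the code from the document above: the names fill_nested and fill_flat are hypothetical, the loop body is a placeholder, and the sketch assumes the optimization amounts to testing a single global iteration counter instead of the nested loop variables. In a real kernel the body would be your computation, and the result should still be checked in the compiler report.

```c
#include <stddef.h>

/* Original two-level loop nest: the exit of the outer loop depends on y,
 * whose update depends on the comparison on x. */
void fill_nested(int *out, int rows, int cols) {
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < cols; x++) {
            out[(size_t)y * cols + x] = x + 10 * y;   /* placeholder body */
        }
    }
}

/* Manually flattened version with a separate exit counter.
 * x and y are updated by hand, but the loop exit is decided only by
 * `iter`, so the exit test does not sit behind the x -> y update chain. */
void fill_flat(int *out, int rows, int cols) {
    long total = (long)rows * cols;   /* total trip count of the nest */
    long iter = 0;                    /* global counter, used only for exit */
    int x = 0, y = 0;

    while (iter != total) {
        out[(size_t)y * cols + x] = x + 10 * y;       /* same placeholder body */

        iter++;
        x++;
        if (x == cols) {   /* wrap the inner index, advance the outer one */
            x = 0;
            y++;
        }
    }
}
```

Both functions write the same values for any non-negative rows and cols. The difference is only in how the exit condition is formed, which is the part that matters for the chain of updates and comparisons mentioned above.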