Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™

Unexpected low Kernel Clock Frequency

JButt5
Novice

I'm working on an OpenCL kernel targeting a Cyclone V SoC that should process a continuous real-time sample stream at a sample rate of 16 MHz, which requires a certain kernel clock frequency so that the kernel can keep up with the data stream. Coming from traditional VHDL design flows, I'm quite certain that a clock frequency of approx. 40 MHz should not be an issue for the Cyclone V.

 

However, the kernel is extremely slow. The Dynamic Profiler shows that the kernel clock runs at 1.3 MHz. How can I investigate what slows the kernel clock down to such a low frequency, and what are the best practices for increasing the kernel clock frequency?

See the attached screenshot for details.

Profiling Results:

[Attached image: profilerKernelFreq.png]

The Qsys System:

[Attached image: qsys.png]

The kernel code:

#pragma OPENCL EXTENSION cl_intel_channels: enable

struct TwoChannelSample
{
    short2 chanA;
    short2 chanB;
};

#define FIFO_DEPTH 32768

channel struct TwoChannelSample rxSamps __attribute__((depth(0))) __attribute__((io("THDB_ADA_rxSamples")));
channel struct TwoChannelSample txSamps __attribute__((depth(0))) __attribute__((io("THDB_ADA_txSamples")));
channel ushort stateChan __attribute__((depth(0))) __attribute__((io("THDB_ADA_state")));

kernel void thdbADARxTxCallback (global const float2* restrict txSamples,
                                 global float2* restrict rxSamples,
                                 global ushort* restrict interfaceState)
{
    // get state from interface
    *interfaceState = read_channel_intel (stateChan);

    // Process sample-wise
    for (int i = 0; i < FIFO_DEPTH; ++i)
    {
        struct TwoChannelSample rxSample = read_channel_intel (rxSamps);

        rxSamples[i].x = (float)rxSample.chanA.x;
        rxSamples[i].y = (float)rxSample.chanA.y;
        rxSamples[i + FIFO_DEPTH].x = (float)rxSample.chanB.x;
        rxSamples[i + FIFO_DEPTH].y = (float)rxSample.chanB.y;
    }
}

1 Solution
HRZ
Valued Contributor III

You seem to be using a custom-made BSP with multiple custom I/O channels; your critical path very likely lies in your BSP. You can try compiling an empty OpenCL kernel to see what operating frequency you will get. If what you get is still in the same range, then your critical path is in the BSP and you should optimize your BSP.

5 Replies
MEIYAN_L_Intel
Employee

Hi,

You can review the Fmax information described in chapter 2.3.2 of the guide linked below:

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl-best-practices-guide.pdf

 

The Fmax II report provides key performance metrics on all blocks, including scheduled fmax, sustainable II, block latency, and maximum interleaving iterations.

You can go through the best practices document linked above. Usually, loops and memory usage have the larger impact on Fmax.

 

Thanks.

JButt5
Novice

@HRZ was right, the problem was in my BSP. Running the Quartus timing analysis for my BSP revealed that an unconstrained path that intentionally crosses clock domains led to bad results. Adjusting the SDC file for the project fixed this and brought the clock back up to 150 MHz, which is absolutely fine for my use case.
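As a rough sanity check on why 150 MHz is comfortable while 1.3 MHz is not, the sketch below compares the kernel clock with the 16 MHz sample rate from the question. It assumes the sample loop is pipelined with one channel read per iteration and an initiation interval of 1, and it ignores global memory stalls; the function name and the II parameter are illustrative and not part of any SDK.

```python
def max_sample_rate_hz(kernel_clock_hz: float, initiation_interval: int = 1) -> float:
    """Upper bound on samples consumed per second by a pipelined loop.

    Assumes one channel read per loop iteration and a new iteration
    launched every `initiation_interval` kernel clock cycles.
    """
    if initiation_interval < 1:
        raise ValueError("initiation interval must be at least 1")
    return kernel_clock_hz / initiation_interval


SAMPLE_RATE_HZ = 16e6  # sample rate stated in the original question

for label, clock_hz in [("before the SDC fix", 1.3e6), ("after the SDC fix", 150e6)]:
    capacity = max_sample_rate_hz(clock_hz)  # assumes II = 1
    verdict = "keeps up" if capacity >= SAMPLE_RATE_HZ else "falls behind"
    print(f"{label}: up to {capacity / 1e6:.1f} MS/s, {verdict}")
```

If the compiler report shows a larger II for the loop, pass it as the second argument to see how much clock headroom remains.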

It seems that the aocl compiler tool runs the same timing analysis as Quartus does and adjusts the kernel clock based on those results. Is that right, @MeiYanL_Intel? I did not find any information on that in the OpenCL SDK documentation, nor did I find, in the compiler report, the results of the timing analysis that is obviously running in the background. Did I overlook something here? Last but not least, I also did not find the report you mentioned. As I'm on a Cyclone V, I use the latest version of the Standard SDK, which is 18.1; however, you linked me to the 19.x Pro version docs. Is it possible that these reports were added with the 19.x releases and are not contained in the 18.x releases of the SDK?

HRZ
Valued Contributor III

This is not documented anywhere, but apparently the OpenCL compiler first uses a very high target frequency to place and route the design and then, based on the timing report, adjusts the kernel PLL and re-routes the design with the maximum achievable value determined by the timing report. If the re-route fails timing, the compiler incrementally reduces the frequency and redoes the routing until timing is met or the maximum number of retries has been reached.
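The retry behaviour described above can be sketched as a small loop. This is only a toy model of the idea, not the compiler's actual implementation: the function name, the 5 MHz step, the retry limit, and the `meets_timing` callback (standing in for a re-route plus timing analysis) are all hypothetical.

```python
from typing import Callable, Optional


def settle_kernel_clock(
    achievable_mhz: float,
    meets_timing: Callable[[float], bool],
    step_mhz: float = 5.0,   # hypothetical decrement per retry
    max_retries: int = 10,   # hypothetical retry limit
) -> Optional[float]:
    """Return the first target frequency that passes timing, or None.

    `achievable_mhz` stands for the value read from the timing report of
    the initial, deliberately over-constrained place-and-route run.
    """
    target = achievable_mhz
    # One initial attempt plus up to `max_retries` reduced-frequency retries.
    for _ in range(max_retries + 1):
        if target <= 0:
            return None
        if meets_timing(target):
            return target
        target -= step_mhz
    return None


# Toy model: pretend the re-routed design only closes timing at or below 142 MHz.
print(settle_kernel_clock(150.0, lambda mhz: mhz <= 142.0))
```

In this model a single slow path anywhere in the design, including the BSP, lowers `achievable_mhz` for the whole kernel clock, which is consistent with the unconstrained clock-domain crossing dragging the result down to 1.3 MHz.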

 

If you look in the folder that is created by the OpenCL compiler when compiling a kernel, you will find a set of *.rpt files which are the text reports for synthesis, fitting, routing, etc. The timing report is in *.sta.rpt. In the same folder, there is another folder called "report" in which you can find the pre-synthesis HTML report generated by the OpenCL compiler which includes the information @MeiYanL_Intel​ mentioned above; however, the report tends to change quite a bit with every new version of the compiler.
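To pull the Fmax figures out of those text reports without opening them by hand, a small script along the following lines may help. It assumes that Fmax summary rows in `*.sta.rpt` use the semicolon-delimited table layout of Quartus text reports; the exact layout can vary between versions. The clock names and values in `SAMPLE` are invented for illustration, and `fmax_rows` and `scan_build_dir` are hypothetical helper names.

```python
import re
from pathlib import Path
from typing import Dict, List, Tuple

# Matches table rows such as "; 151.2 MHz ; 151.2 MHz ; some_clock ; ;"
FMAX_ROW = re.compile(
    r"^;\s*([0-9]+(?:\.[0-9]+)?)\s*MHz\s*;"
    r"\s*([0-9]+(?:\.[0-9]+)?)\s*MHz\s*;"
    r"\s*([^;]+?)\s*;"
)


def fmax_rows(report_text: str) -> List[Tuple[float, float, str]]:
    """Return (fmax_mhz, restricted_fmax_mhz, clock_name) for each matching row."""
    rows = []
    for line in report_text.splitlines():
        match = FMAX_ROW.match(line.strip())
        if match:
            rows.append((float(match.group(1)), float(match.group(2)), match.group(3)))
    return rows


def scan_build_dir(build_dir: str) -> Dict[str, List[Tuple[float, float, str]]]:
    """Collect Fmax rows from every *.sta.rpt file directly inside build_dir."""
    results = {}
    for report in sorted(Path(build_dir).glob("*.sta.rpt")):
        results[report.name] = fmax_rows(report.read_text(errors="replace"))
    return results


# Invented excerpt in the assumed table layout, for demonstration only.
SAMPLE = """\
; Fmax       ; Restricted Fmax ; Clock Name      ; Note ;
; 1.3 MHz    ; 1.3 MHz         ; kernel_clk      ;      ;
; 212.45 MHz ; 212.45 MHz      ; bsp_clk         ;      ;
"""

for fmax, restricted, clock in fmax_rows(SAMPLE):
    print(f"{clock}: {fmax} MHz (restricted {restricted} MHz)")
```

Calling `scan_build_dir` on the folder created by the OpenCL compiler returns one entry per timing report; an unexpectedly low row points at the clock domain worth inspecting in the Quartus timing analyzer.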

MEIYAN_L_Intel
Employee

Hi,

Thanks @Hamid Reza Zohouri.

After comparing both editions of Quartus, I found that the Fmax report can be viewed directly in the 19.x GUI, while in the Standard edition you can still view Fmax in the loop analysis report, as shown in Figure 53 of the document linked here: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/ug-aoclstd-best-practices-guide.pdf

Thanks
