Solved: Unexpectedly Low Kernel Clock Frequency

JButt5 · ‎09-11-2019

I'm working on an OpenCL kernel targeting a Cyclone V SoC that should process a continuous real-time sample stream at a sample rate of 16 MHz, which requires a certain kernel clock frequency so that the kernel can keep up with the data stream. Coming from traditional VHDL design flows, I'm quite certain that a clock frequency of approx. 40 MHz should not be an issue for the Cyclone V.

However, the kernel is extremely slow. The Dynamic Profiler shows that the kernel clock runs at 1.3 MHz. How can I investigate what slows the kernel clock down to such a low frequency, and what are best practices for increasing the kernel clock frequency?
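
For reference, the throughput requirement behind this question can be put into numbers. The following is a back-of-the-envelope sketch in Python, assuming the sample loop reaches an initiation interval (II) of 1, i.e. one channel read per kernel clock cycle. That II is an assumption, not a value taken from the compiler report, and the function names are purely illustrative.

```python
def min_kernel_clock_hz(sample_rate_hz, initiation_interval=1):
    """Lowest kernel clock that still consumes every sample.

    Assumes one sample is read per loop iteration and a new iteration
    starts every `initiation_interval` clock cycles (assumed, not measured).
    """
    return sample_rate_hz * initiation_interval


def sustained_fraction(kernel_clock_hz, sample_rate_hz, initiation_interval=1):
    """Fraction of the incoming stream the kernel can keep up with (capped at 1)."""
    needed = min_kernel_clock_hz(sample_rate_hz, initiation_interval)
    return min(1.0, kernel_clock_hz / needed)


SAMPLE_RATE_HZ = 16e6      # sample rate stated in the post
REPORTED_CLOCK_HZ = 1.3e6  # kernel clock shown by the Dynamic Profiler

# Required kernel clock in MHz when II = 1
print(min_kernel_clock_hz(SAMPLE_RATE_HZ) / 1e6)
# Share of the sample stream the kernel can handle at the reported clock
print(sustained_fraction(REPORTED_CLOCK_HZ, SAMPLE_RATE_HZ))
```

Under that assumption the kernel needs at least 16 MHz, and at 1.3 MHz it can consume only roughly 8% of the incoming samples.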

See the attached screenshots for details.

Profiling results: [screenshot not included]

The Qsys system: [screenshot not included]

The kernel code:

#pragma OPENCL EXTENSION cl_intel_channels: enable
 
struct TwoChannelSample
{
    short2 chanA;
    short2 chanB;
};
 
#define FIFO_DEPTH 32768
 
channel struct TwoChannelSample rxSamps __attribute__((depth(0))) __attribute__((io("THDB_ADA_rxSamples")));
channel struct TwoChannelSample txSamps __attribute__((depth(0))) __attribute__((io("THDB_ADA_txSamples")));
channel ushort                  stateChan    __attribute__((depth(0))) __attribute__((io("THDB_ADA_state")));
 
kernel void thdbADARxTxCallback (global const       float2* restrict txSamples,
                                 global             float2* restrict rxSamples,
                                 global             ushort* restrict interfaceState)
{
    // get state from interface
    *interfaceState = read_channel_intel (stateChan);
 
    // Process sample-wise
    for (int i = 0; i < FIFO_DEPTH; ++i)
    {
        struct TwoChannelSample rxSample = read_channel_intel (rxSamps);
 
        rxSamples[i].x = (float)rxSample.chanA.x;
        rxSamples[i].y = (float)rxSample.chanA.y;
        rxSamples[i + FIFO_DEPTH].x = (float)rxSample.chanB.x;
        rxSamples[i + FIFO_DEPTH].y = (float)rxSample.chanB.y;
    }
}

MEIYAN_L_Intel · ‎09-13-2019

Hi,

You can review the Fmax information described in chapter 2.3.2 of the guide linked below:

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl-best-practices-guide.pdf

The Fmax II report provides key performance metrics on all blocks, including scheduled fmax, sustainable II, block latency, and maximum interleaving iterations.

You can go through the best practices document linked above; usually, loops and memory usage have more impact on Fmax.

Thanks.

HRZ · ‎09-14-2019

You seem to be using a custom-made BSP with multiple custom I/O channels; your critical path very likely lies in your BSP. You can try compiling an empty OpenCL kernel to see what operating frequency you will get. If what you get is still in the same range, then your critical path is in the BSP and you should optimize your BSP.
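
The comparison suggested above can be written down as a small decision rule. This is only a sketch: the kernel name `empty_kernel`, the helper `likely_bottleneck`, and the `same_range_ratio` threshold for "in the same range" are all hypothetical choices made for illustration, and the empty-kernel frequencies in the usage lines are made-up numbers.

```python
# Source text of a do-nothing kernel to compile against the same BSP.
EMPTY_KERNEL_SOURCE = "kernel void empty_kernel(void) {}\n"


def likely_bottleneck(full_kernel_mhz, empty_kernel_mhz, same_range_ratio=2.0):
    """Guess where the critical path lies from two reported kernel clocks.

    `same_range_ratio` is an arbitrary illustrative threshold: if the empty
    kernel is less than this many times faster than the real kernel, both
    results are treated as being "in the same range".
    """
    if empty_kernel_mhz < full_kernel_mhz * same_range_ratio:
        return "bsp"
    return "kernel"


# 1.3 MHz is the figure from the question; the second arguments are invented.
print(likely_bottleneck(1.3, 1.4))    # empty kernel is just as slow
print(likely_bottleneck(1.3, 150.0))  # empty kernel is much faster
```

With these inputs the first call returns "bsp" and the second returns "kernel". In practice the threshold matters less than the basic check: if removing all kernel logic does not raise the clock, the kernel code is not what limits it.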

JButt5 · ‎09-14-2019

@HRZ was right, the problem was in my BSP. Running the Quartus timing analysis on my BSP revealed that an unconstrained path that intentionally crosses clock domains led to bad results. Adjusting the SDC file for the project fixed this and brought the clock back up to 150 MHz, which is absolutely fine for my use case.

It seems the aocl compiler tool runs the same timing analysis as Quartus does and adjusts the kernel clock based on those results. Is that right, @MeiYanL_Intel? I did not find any information on that in the CL SDK documentation, nor did I find the results of the timing analysis, which is obviously running in the background, in the compiler report. Did I overlook something here? Last but not least, I also did not find the report you mentioned. As I'm on a Cyclone V, I use the latest version of the Standard SDK, which is 18.1; however, you linked me to the 19.x Pro version docs. Is it possible that these reports were added with the 19.x releases and are not contained in the 18.x releases of the SDK?

HRZ · ‎09-14-2019

This is not documented anywhere, but apparently the OpenCL compiler first places and routes the design using a very high target frequency. Then, based on the timing report, it adjusts the kernel PLL and re-routes the design at the maximum achievable frequency determined by that report. If the re-route fails timing, the compiler incrementally reduces the frequency and redoes the routing until timing is met or the maximum number of retries has been reached.
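
To make that flow concrete, here is a toy Python model of the retry loop as described in this post. It is not the compiler's actual implementation (which, as noted, is undocumented): the `meets_timing` callback stands in for a full re-route plus timing analysis, and `step_mhz` and `max_retries` are invented parameters.

```python
def settle_kernel_clock(first_pass_fmax_mhz, meets_timing, step_mhz=5.0, max_retries=10):
    """Toy model of the described flow; returns the settled clock or None.

    first_pass_fmax_mhz: achievable frequency taken from the first timing report.
    meets_timing: callable(freq_mhz) -> bool, a stand-in for re-route + timing check.
    step_mhz, max_retries: hypothetical values, chosen only for illustration.
    """
    freq = first_pass_fmax_mhz
    # One initial attempt at the reported frequency, then up to max_retries retries.
    for _ in range(max_retries + 1):
        if meets_timing(freq):
            return freq
        freq -= step_mhz  # lower the PLL target and try again
        if freq <= 0:
            break
    return None


# Pretend the re-routed design only closes timing at 150 MHz or below.
print(settle_kernel_clock(160.0, lambda f: f <= 150.0))
```

In this model a design whose timing can never be met (for example because of an unconstrained path in the BSP) simply runs out of retries, whereas the behaviour reported in this thread was a very low final clock, so the real tool evidently handles that case differently.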

If you look in the folder that is created by the OpenCL compiler when compiling a kernel, you will find a set of *.rpt files which are the text reports for synthesis, fitting, routing, etc. The timing report is in *.sta.rpt. In the same folder, there is another folder called "report" in which you can find the pre-synthesis HTML report generated by the OpenCL compiler which includes the information @MeiYanL_Intel mentioned above; however, the report tends to change quite a bit with every new version of the compiler.
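
A quick way to pull the Fmax figures out of those text reports is a small script. The Python sketch below assumes that Fmax summary rows in `*.sta.rpt` look like `; 150.02 MHz ; 150.02 MHz ; clock_name ; ... ;`; that layout is an assumption and should be checked against your own report, and the sample row and clock name `kernel_clk` are made up for the demonstration.

```python
import re
from pathlib import Path

# Assumed row layout: "; <Fmax> MHz ; <Restricted Fmax> MHz ; <clock name> ; ..."
FMAX_ROW = re.compile(r"^;\s*([\d.]+)\s*MHz\s*;\s*([\d.]+)\s*MHz\s*;\s*([^;]+?)\s*;")


def fmax_rows(report_text):
    """Return (clock_name, fmax_mhz, restricted_fmax_mhz) for each matching row."""
    rows = []
    for line in report_text.splitlines():
        match = FMAX_ROW.match(line.strip())
        if match:
            rows.append((match.group(3), float(match.group(1)), float(match.group(2))))
    return rows


def scan_build_dir(build_dir):
    """Map each *.sta.rpt file name in build_dir to the Fmax rows found in it."""
    return {
        path.name: fmax_rows(path.read_text(errors="replace"))
        for path in sorted(Path(build_dir).glob("*.sta.rpt"))
    }


# Made-up sample row in the assumed layout, just to exercise the parser.
sample = "; 150.02 MHz ; 150.02 MHz      ; kernel_clk ;      ;"
print(fmax_rows(sample))
```

Calling `scan_build_dir` on the folder created by the OpenCL compiler would then list the clocks per report file. A clock with a surprisingly low Fmax is a good starting point for finding the critical path.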

MEIYAN_L_Intel · ‎09-17-2019

Hi,

Thanks @Hamid Reza Zohouri.

After comparing both editions of Quartus, I found that the fmax report can be viewed directly in the 19.x GUI, while you can still view fmax in the loop analysis report, as shown in figure 53 of the document linked here: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/ug-aoclstd-best-practices-guide.pdf

Thanks