Intel® Quartus® Prime Software

OpenCL SDK unable to calculate II

aejjeh
Beginner

I am looking at the generated report for my matrix multiply OpenCL kernel. The report says it is unable to estimate the II, and the bottleneck is shown as 'n/a'. Any ideas what the reason might be? I am using version 18.0 of the tools, and I am attaching a screenshot of the loop analysis report (which shows my code): screenshot.png


Update: Kernel Code below (wasn't able to attach):

    __kernel void mysgemmNT_c_fpga_c_c_c(__global float* restrict A, ulong bytes_A,
                                         __global float* restrict B, ulong bytes_B,
                                         __global float* restrict C, ulong bytes_C,
                                         uint mt, uint nt, uint kt)
    {
        // unsigned tid = get_global_id(0);
        // printf("Thread: %d\n", tid);
        const int size = 1024;
        __local float rowA[size];
        __local float rowC[size];
        __local float localB[size*size];
        const int n = size;
        const int m = size;
        const int k = size;

        // for (int x = 0; x < m; ++x)
        //     for (int y = 0; y < k; ++y)
        //         localA[y+x*k] = A[y+x*k];

        for (int z = 0; z < n; ++z)
            for (int y = 0; y < k; ++y)
                localB[y+z*k] = B[y+z*k];

        for (int x = 0; x < m; ++x) {
            for (int y = 0; y < k; ++y)
                rowA[y] = A[y+x*k];

            for (int z = 0; z < n; ++z) {
                float c = 0.0;
                for (int y = 0; y < k; ++y) {
                    int indexA = y+x*k;
                    int indexB = z*k+y;
                    // printf("%d: A[%d][%d]=%f,\tB[%d][%d]=%f\n", y,x,y,A[indexA],z,y,B[indexB]);
                    c += rowA[y]*localB[indexB];
                }
                rowC[z] = c;
                // printf("C[%d][%d] = %f, %f\n", x, z, C[indexC], c);
            }

            for (int z = 0; z < n; ++z) {
                int indexC = z+x*n;
                C[indexC] = rowC[z];
            }
        }
    }
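For anyone who wants to check the kernel's output on the host, the same computation can be written as plain C. The sketch below is not part of the original post: the function name `sgemm_nt_ref` and its `m`, `n`, `k` parameters are hypothetical, and it assumes the row-major indexing used in the kernel (A is m x k, B is n x k, C is m x n, so the result is A times the transpose of B).

```c
#include <stddef.h>

/* Host-side reference for the kernel's arithmetic: C = A * B^T.
 * Indexing mirrors the kernel: A[y + x*k], B[y + z*k], C[z + x*n].
 * Name and signature are illustrative, not from the original post. */
void sgemm_nt_ref(const float *A, const float *B, float *C,
                  size_t m, size_t n, size_t k)
{
    for (size_t x = 0; x < m; ++x) {          /* row of A and C */
        for (size_t z = 0; z < n; ++z) {      /* row of B, column of C */
            float c = 0.0f;
            for (size_t y = 0; y < k; ++y)    /* dot product over k */
                c += A[y + x * k] * B[y + z * k];
            C[z + x * n] = c;
        }
    }
}
```

Calling it with `m = n = k = 1024` reproduces the sizes hard-coded in the kernel. When comparing against the FPGA result, a small tolerance is advisable, since the hardware may accumulate the floating-point sum in a different order.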


3 Replies
HRZ
Valued Contributor III

Can you attach the kernel file? I'd like to generate the report using an earlier version of the compiler and check the report.

aejjeh
Beginner

@HRZ I added my code to my original post. (For some reason it wouldn't let me attach a .cl file, so I just pasted it into the post itself.)

HRZ
Valued Contributor III

I checked your code with 16.1.2; all loops are reported to have an II of one, while the loop on "x" is limited to 16 "threads" (i.e. every access to the local buffers in the loop will be stalled at least once every 16 iterations) to limit on-chip memory replication. This information, which is VERY important, seems to have been removed in the newer versions of the report. I think what the new report is trying to say is something along the lines of "II is an approximation due to stallable accesses/inner loops".
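To put the on-chip memory point in perspective, the size of the kernel's `__local` buffers can be worked out from its declarations alone. The snippet below is only that arithmetic, a rough sketch assuming a 4-byte `float` (as OpenCL defines it); the helper names are made up for illustration, and it says nothing about how the compiler actually maps or replicates these buffers.

```c
#include <stddef.h>

/* Footprint of the __local buffers declared in the kernel above.
 * FLOAT_BYTES = 4 follows OpenCL's 32-bit float; helper names are
 * illustrative only. */
enum { SIZE = 1024, FLOAT_BYTES = 4 };

/* rowA and rowC: SIZE floats each */
size_t row_buffer_bytes(void)
{
    return (size_t)SIZE * FLOAT_BYTES;
}

/* localB: SIZE*SIZE floats */
size_t local_b_bytes(void)
{
    return (size_t)SIZE * SIZE * FLOAT_BYTES;
}
```

That is 4 KiB each for `rowA` and `rowC` and 4 MiB for `localB`, before any replication, so every extra copy of `localB` costs another 4 MiB of on-chip RAM.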
