Beginner

OpenCL SDK unable to calculate II

I am looking at the generated report for my matrix multiply OpenCL kernel. The report says it is unable to estimate the II, and the bottleneck is shown as 'n/a'. Any ideas what the reason might be? I am using version 18.0 of the tools, and I am attaching a screenshot of the loop analysis report (which shows my code): screenshot.png

 

Update: Kernel Code below (wasn't able to attach):

__kernel void mysgemmNT_c_fpga_c_c_c(__global float* restrict A, ulong bytes_A,
                                     __global float* restrict B, ulong bytes_B,
                                     __global float* restrict C, ulong bytes_C,
                                     uint mt, uint nt, uint kt)
{
    // unsigned tid = get_global_id(0);
    // printf("Thread: %d\n", tid);
    const int size = 1024;
    __local float rowA[size];
    __local float rowC[size];
    __local float localB[size*size];
    const int n = size;
    const int m = size;
    const int k = size;

    // for (int x = 0; x < m; ++x)
    //     for (int y = 0; y < k; ++y)
    //         localA[y+x*k] = A[y+x*k];

    for (int z = 0; z < n; ++z)
        for (int y = 0; y < k; ++y)
            localB[y+z*k] = B[y+z*k];

    for (int x = 0; x < m; ++x) {
        for (int y = 0; y < k; ++y)
            rowA[y] = A[y+x*k];

        for (int z = 0; z < n; ++z) {
            float c = 0.0;
            for (int y = 0; y < k; ++y) {
                int indexA = y+x*k;
                int indexB = z*k+y;
                // printf("%d: A[%d][%d]=%f,\tB[%d][%d]=%f\n", y,x,y,A[indexA],z,y,B[indexB]);
                c += rowA[y]*localB[indexB];
            }
            rowC[z] = c;
            // printf("C[%d][%d] = %f, %f\n", x, z, C[indexC], c);
        }

        for (int z = 0; z < n; ++z) {
            int indexC = z+x*n;
            C[indexC] = rowC[z];
        }
    }
}
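
For anyone who wants to check the kernel's output on the host, here is a minimal plain-C sketch of the same computation as I read it from the code above: C[x][z] is the dot product of row x of A with row z of B (the "NT" case, i.e. B is used transposed), with all matrices stored row-major. The function name sgemm_nt_reference and the explicit m, n, k parameters are my own assumptions for illustration; they are not part of the original post or of any SDK.

```c
#include <stddef.h>

/* Hypothetical host-side reference, not from the original post.
 * Computes C[x][z] = sum over y of A[x][y] * B[z][y], using the same
 * row-major indexing as the kernel (A: m x k, B: n x k, C: m x n). */
void sgemm_nt_reference(const float *A, const float *B, float *C,
                        size_t m, size_t n, size_t k)
{
    for (size_t x = 0; x < m; ++x) {
        for (size_t z = 0; z < n; ++z) {
            float c = 0.0f;
            for (size_t y = 0; y < k; ++y)
                c += A[y + x * k] * B[y + z * k];
            C[z + x * n] = c;
        }
    }
}
```

Calling it with m = n = k = 1024 should match the sizes hard-coded in the kernel. The FPGA result may differ slightly in the last bits, since the compiler can reorder floating-point accumulation, so a tolerance-based comparison is probably safer than exact equality.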

 

3 Replies
Valued Contributor II

Can you attach the kernel file? I'd like to generate the report using an earlier version of the compiler and check the report.

Beginner

@HRZ I added my code to my original post. (For some reason it wouldn't let me attach a .cl file, so I pasted it into the post itself.)

Valued Contributor II

I checked your code with 16.1.2: all loops are reported to have an II of one, while the loop on "x" is limited to 16 "threads" (i.e. every access to the local buffers in that loop will be stalled at least once every 16 iterations) to limit on-chip memory replication. This information, which is VERY important, seems to have been removed from the newer versions of the report. I think the new report is trying to say something along the lines of "II is an approximation due to stallable accesses/inner loops".
