Profiler showing results that are off from measured results

Altera_Forum · ‎07-06-2015

We are using the DE5-net and are running a kernel that should be memory bound. When we run it and look at the profiler, however, we see very large swings in performance. We measure performance in three ways: 1) wall clock time, 2) EventProfilingInfo, and 3) the actual profiler from Altera. Methods 1 and 2 always match, which is what we would expect. The profiler, however, shows different numbers than we would expect.

The swings are large: something like 1 GFLOP/s on the first run and then 7 GFLOP/s on the next. Has anyone seen swings like this?

The profiler is also telling us that we are getting ~945 MB/s from each memory bank, while the optimal is 12800 MB/s. I know that I will never see the full 12800 MB/s, but I would have expected a lot more than the ~945 MB/s we are currently getting. Is there any way to optimize our code to push this number closer to 10000 MB/s (~80% of the optimal)?

We have verification code on the host side to make sure that the kernel returns correct results, and it has never flagged a problem. Can anyone point out what would be causing such low memory performance? Has anyone else seen anything like this before, and if so, how did you get around it?
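To make methods 2 and 3 directly comparable, the event timestamps can be turned into the same units the profiler reports. This is only a rough sketch: the helper names are made up for illustration, and it assumes the two timestamps are the nanosecond values returned by clGetEventProfilingInfo for CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, and that the byte and flop counts are supplied by the caller.

```c
#include <stdint.h>

/* Hypothetical helper: effective bandwidth in MB/s (1 MB = 1e6 bytes)
 * for `bytes` bytes moved between two device timestamps given in ns.
 * Returns 0.0 for an empty or inverted interval. */
double effective_bandwidth_mb_s(uint64_t bytes, uint64_t start_ns, uint64_t end_ns)
{
    if (end_ns <= start_ns) {
        return 0.0;
    }
    /* bytes/ns is GB/s, so multiply by 1000 to get MB/s. */
    return (double)bytes * 1000.0 / (double)(end_ns - start_ns);
}

/* Hypothetical helper: achieved GFLOP/s for `flops` floating-point
 * operations over the same interval (flop per ns equals GFLOP/s). */
double achieved_gflops(uint64_t flops, uint64_t start_ns, uint64_t end_ns)
{
    if (end_ns <= start_ns) {
        return 0.0;
    }
    return (double)flops / (double)(end_ns - start_ns);
}
```

If the MB/s figure computed this way disagrees with the profiler's Global BW number for the same run, that would at least narrow down which of the two measurements is off.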

Here is the profiler info that I have been talking about:

Statistic                    Measured      Optimal
Kernel Clock Frequency       236.8 MHz     n/a
Global BW (MEMORY:bank1)     942.8 MB/s    12800 MB/s
  Average Write Burst        15            16
  Average Read Burst         3             16
Global BW (MEMORY:bank2)     944.5 MB/s    12800 MB/s
  Average Write Burst        15            16
  Average Read Burst         4             16
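As a sanity check on these numbers, the profiler's bandwidth can be converted into the GFLOP/s ceiling a memory-bound kernel could reach at that bandwidth. This is a back-of-the-envelope sketch, not a definitive model: the function names are hypothetical, and the flops and bytes per element are assumptions the caller has to supply (for the kernel in this post, 14 flops from the 7 mad calls and 12 bytes from one float read of x, one of coeff, and one float write to d would be a reasonable guess).

```c
/* Hypothetical helper: fraction of the optimal bandwidth achieved. */
double bandwidth_utilization(double measured_mb_s, double optimal_mb_s)
{
    if (optimal_mb_s <= 0.0) {
        return 0.0;
    }
    return measured_mb_s / optimal_mb_s;
}

/* Hypothetical helper: GFLOP/s ceiling for a kernel limited purely by
 * memory traffic, given total bandwidth in MB/s (1 MB = 1e6 bytes) and
 * the assumed flops and bytes per processed element. */
double bandwidth_bound_gflops(double total_mb_s,
                              double flops_per_elem,
                              double bytes_per_elem)
{
    if (bytes_per_elem <= 0.0) {
        return 0.0;
    }
    double elems_per_s = total_mb_s * 1e6 / bytes_per_elem;
    return elems_per_s * flops_per_elem / 1e9;
}
```

With both banks added together (942.8 + 944.5 = 1887.3 MB/s) and the 14 flops / 12 bytes assumption, this gives a ceiling of about 2.2 GFLOP/s, and each bank is running at roughly 7% of its optimal bandwidth.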

Also, here is the kernel code:

#define N        (1024*1024*128)
#define VS       1
#define WGS      1
#define NWG      (N/(4*VS*WGS))

__attribute__((reqd_work_group_size(256,1,1)))
__attribute__((num_simd_work_items(8)))
__kernel void native_opencl (global float const * restrict x, global float * restrict d, global float * restrict coeff)
{
     float temp, xtemp;
     int i = get_global_id(0),
         start = 0,
         end = N/(4*NWG*WGS*VS);   /* elements handled per work-item */
     int ii;
     for (ii = start; ii < end; ii++)
     {
          /* The array subscripts were stripped by the forum formatting;
             [i] is restored here as the most likely original. */
          i = get_global_id(0) * N/(4*NWG*WGS*VS) + ii;
          xtemp = x[i];
          temp = coeff[i];
          /* 7 multiply-adds per element */
          temp = mad(temp, xtemp, coeff[i]);
          temp = mad(temp, xtemp, coeff[i]);
          temp = mad(temp, xtemp, coeff[i]);
          temp = mad(temp, xtemp, coeff[i]);
          temp = mad(temp, xtemp, coeff[i]);
          temp = mad(temp, xtemp, coeff[i]);
          temp = mad(temp, xtemp, coeff[i]);
          d[i] = temp;
     }
}
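For reference, the host-side verification mentioned above amounts to something like the following. This is a simplified sketch rather than our exact code: the function names are illustrative, it assumes the kernel reads x[i] and coeff[i] and writes d[i] for each element, and it compares with a relative tolerance because mad on the device is allowed to round differently from a separate multiply and add on the host.

```c
#include <stddef.h>

/* Host reference for one element: start from c, then apply
 * temp = temp * x + c seven times, matching the 7 mad calls. */
float reference_element(float x, float c)
{
    float temp = c;
    for (int k = 0; k < 7; k++) {
        temp = temp * x + c;
    }
    return temp;
}

/* Count elements of d that differ from the host reference by more than
 * rel_tol relative to the magnitude of the reference value. */
size_t count_mismatches(const float *x, const float *coeff, const float *d,
                        size_t n, float rel_tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        float ref = reference_element(x[i], coeff[i]);
        float diff = d[i] - ref;
        if (diff < 0.0f) diff = -diff;
        float mag = ref < 0.0f ? -ref : ref;
        if (diff > rel_tol * mag) {
            bad++;
        }
    }
    return bad;
}
```

A return value of 0 from count_mismatches over the whole buffer is what we treat as a passing run.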