
problem in streaming access to global memory in OpenCL

hamze60
New Contributor I

Hello,

 

I have a rather simple OpenCL kernel (below), compiled in the NDRange configuration. In some parts of the code I have random memory accesses, and as expected, the profiler shows that the efficiency of those accesses is low (very low hit rate, one cache line fetched per data access).

 

However, there are also some fully consecutive memory accesses that I expected to be very efficient, since the cache line should be fully utilized. My measurements showed unexpected results: after investigating with the profiler, these accesses show almost the same minimal efficiency as the random ones.

 

Can anybody suggest anything for this?

I would appreciate it if @HRZ could take a look.

 

   float toAdd = 0;
   unsigned ei;
   unsigned si;
   float div;
   unsigned ovid; // other vertex id
   unsigned start_of_chunk = glb_id * chunk_size;
   unsigned end_of_chunk = start_of_chunk + chunk_size;

   for (unsigned i = start_of_chunk; i < end_of_chunk; i++)
   {
      toAdd = 0.0;
      ei = end_edge[i];     //? why not streaming?
      si = start_edge[i];   //? why not streaming?
      div = div_array[i];   //? why not streaming?
      for (unsigned j = si; j < ei; j++) // edge loop
      {
         ovid = ovid_of_edge[j]; //? why not streaming?
         toAdd += val[ovid];
      }
      val_next[i] = toAdd * div;
   }

4 Replies
HRZ
Valued Contributor III

Are you talking about the cache efficiency or the external memory bandwidth utilization efficiency? The compiler does not necessarily instantiate the cache for every access. Furthermore, for high memory bandwidth utilization in NDRange kernels, you MUST use SIMD, and your accesses must be consecutive based on the work-item ID so that they can be coalesced at compile time. Based on the snippet of your code, your accesses are NOT actually consecutive based on work-item ID. Note that loops are NOT pipelined in NDRange kernels and hence, the accesses inside your for loops will not actually be consecutive even though they look like it. Scheduling goes like this:

   Thread 0:  i = start_of_chunk
   Thread 1:  i = start_of_chunk
   Thread 2:  i = start_of_chunk
   . . .
   Thread 0:  i = start_of_chunk + 1
   Thread 1:  i = start_of_chunk + 1
   Thread 2:  i = start_of_chunk + 1
   . . .
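By contrast, the kind of access pattern that can be coalesced at compile time indexes global memory directly with the work-item ID inside a SIMD-vectorized kernel. A minimal sketch (the kernel, buffer names, and SIMD/work-group sizes here are illustrative, not taken from your code):

   // Hypothetical NDRange kernel: each work-item touches index == global ID,
   // so with SIMD the compiler can merge neighbouring work-items' accesses
   // into one wide, coalesced memory transaction at compile time.
   __attribute__((reqd_work_group_size(256, 1, 1)))
   __attribute__((num_simd_work_items(16)))
   __kernel void scale(__global const float * restrict in,
                       __global float * restrict out,
                       const float factor)
   {
      unsigned gid = get_global_id(0);
      out[gid] = in[gid] * factor; // consecutive across work-items -> coalesced
   }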

If you post all of your code, I can put it into the compiler on my machine and take a look at the report.

hamze60
New Contributor I

Thanks for following up on this,

I am working on a Xeon+FPGA HARP machine. I have saturated the external memory bandwidth, but performance was lower than I expected. Then, using the profiler, I saw this sequential-access problem.

  • I already couldn't use a Task kernel, due to the variable-length inner loop (we discussed this a while ago, if you remember).
  • The same goes for ND-range with SIMD, because the kernel body is work-item dependent.
  • So ND-range without SIMD is the only option for me.

I assumed a hidden buffer was already implemented for every sequential access port like this, in every work-item's hardware (max_work_group_size == 256), so that even if all work-items access memory in parallel, each keeps its own copy of the cache. Anyway, what can be done in this case? The solution I have is to implement a local buffer for every input argument in the kernel (referred to as local memory banking, 1.8.4 in the Best Practices Guide), something like the following (I am currently generating the bitstream for it):

   #define BUF_SIZE 16 // why 16? because 16 * 4 bytes == cache line size

   for (i = start; i < end; i++)
   {
      ...
      buf_cntr = (buf_cntr + 1) % BUF_SIZE;
      if (buf_cntr == 0)
      {
         #pragma unroll
         for (int k = 0; k < BUF_SIZE; k++)
         {
            ei_local[k] = end_edge[i + k];
         }
      }
      ei = ei_local[buf_cntr];
   }

I attached the original OCL code, and the one after the above change.

Thank you

 

 

HRZ
Valued Contributor III

If you cannot use SIMD, utilizing the memory bandwidth efficiently will be difficult, since your only remaining tool would be using multiple compute units, and in the end you will always get low external memory bandwidth efficiency because you will have multiple narrow accesses competing with each other for the memory bus.
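As a rough illustration of the compute-unit alternative (the replication factor and kernel here are placeholders, not a recommendation for your design):

   // Hypothetical example: replicating the whole kernel pipeline instead of
   // using SIMD. Each compute unit issues its own narrow memory accesses,
   // which then compete with each other for the memory bus as described above.
   __attribute__((num_compute_units(4)))
   __kernel void copy_kernel(__global const float * restrict in,
                             __global float * restrict out)
   {
      unsigned gid = get_global_id(0);
      out[gid] = in[gid];
   }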

 

The information about the private cache does not seem to have been carried over to the new HTML report yet. I used the report from an older version of the compiler to see which accesses are being cached by the compiler. Based on what I can see, in both of your kernels only the ovid_of_edge[j] and pg_val[ovid] reads are cached and the rest are not. If you can manually perform caching as you have done in your second kernel without breaking data consistency, then that is indeed a very good way to improve performance. The compiler is also correctly coalescing the unrolled reads in your second kernel into wide 512-bit accesses, though this configuration could result in memory bandwidth overutilization on boards with one or two memory banks. Maybe an unroll size (cache line size) of 8 would be more appropriate in this case, but you should probably compile and test both cases to see which one is faster.

 

Another thing I can think of is to merge all the buffers that are only read at index i into a struct, so that instead of multiple narrow reads you can read all of them at once with one large read from the struct; something like an array of structs. This could also improve memory efficiency.
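A rough sketch of such a merged layout, assuming the three i-indexed read-only buffers from your original snippet (the struct and field names are illustrative):

   // Hypothetical array-of-structs layout: one 16-byte read per vertex
   // instead of three separate 4-byte reads.
   typedef struct {
      unsigned start_edge;
      unsigned end_edge;
      float    div;
      unsigned pad; // pad to a power-of-two size to keep accesses aligned
   } vertex_info_t;

   // Inside the kernel, assuming a __global vertex_info_t *vertex_info argument:
   //    vertex_info_t v = vertex_info[i];
   //    si = v.start_edge;  ei = v.end_edge;  div = v.div;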

 

Finally, since you are separating compute from memory accesses by performing manual caching, you should make sure to have enough work-groups running concurrently in each compute unit to keep the pipeline busy.

 

P.S. I think you need to add a local memory barrier at the end of your unrolled memory loop.
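As an illustration of the placement only (a sketch assuming ei_local is a __local buffer; the exact fill pattern in your attached kernel may differ):

   if (buf_cntr == 0)
   {
      #pragma unroll
      for (int k = 0; k < BUF_SIZE; k++)
      {
         ei_local[k] = end_edge[i + k];
      }
   }
   // Every work-item in the group must reach the barrier, so keep it outside
   // the conditional refill; it makes the __local writes visible group-wide.
   barrier(CLK_LOCAL_MEM_FENCE);
   ei = ei_local[buf_cntr];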

hamze60
New Contributor I

Hi,

I tested that local caching, and I was able to get 100% memory efficiency for all sequential reads and writes!

Thank you for your help.
