OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

vload4 vs. 4 individual memory accesses: bank conflicts

ABoxe
Beginner
678 Views

What is the advantage of vload4 over 4 single memory accesses?

Suppose I am loading data from local memory. Below are two kernels. The second kernel should exhibit no bank conflicts.

Does the first have bank conflicts? If one vload is executed per clock, then there should be conflicts within a half wave.

 

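__kernel void kernel1(__local const int *localBuffer) {
     // vload4 takes its offset in units of int4, so work-item i reads
     // localBuffer[4*i .. 4*i+3], the same elements as kernel2.
     int4 test = vload4(get_global_id(0), localBuffer);
}

__kernel void kernel2(__local const int *localBuffer) {
     int4 test;
     // Four scalar loads from consecutive elements.
     int start = get_global_id(0)*4;
     test.x = localBuffer[start];
     test.y = localBuffer[start+1];
     test.z = localBuffer[start+2];
     test.w = localBuffer[start+3];
}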
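
One detail worth checking when comparing the two kernels: in OpenCL C, vload4(offset, p) reads the four elements starting at p + 4*offset, so the offset counts vectors, not scalars. For a vector load to cover the same elements as the four scalar loads, the offset must be the work-item index rather than the element index start. The plain C sketch below models both access patterns on the host; int4_model, vload4_model and scalar_loads_model are illustrative names, not OpenCL APIs.

```c
#include <string.h>

/* Plain C stand-in for OpenCL's int4 (illustrative type only). */
typedef struct { int x, y, z, w; } int4_model;

/* Models vload4(offset, p): reads the four ints at p[4*offset .. 4*offset+3]. */
static int4_model vload4_model(size_t offset, const int *p)
{
    int4_model v;
    memcpy(&v, p + 4 * offset, sizeof v);   /* one 16-byte read */
    return v;
}

/* Models the four scalar loads of kernel2 for work-item `id`. */
static int4_model scalar_loads_model(size_t id, const int *p)
{
    size_t start = id * 4;
    int4_model v = { p[start], p[start + 1], p[start + 2], p[start + 3] };
    return v;
}
```

Calling vload4_model(id, buf) and scalar_loads_model(id, buf) with the same id returns the same four values, whereas passing id*4 as the vector offset would read elements 16*id onward.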

4 Replies
Jeffrey_M_Intel1
Employee

Sorry for the delayed reply. If you have not seen it already, the optimization guide has some good tips in its "Memory Access Considerations" section. In general, loading vector types is more efficient than loading single values, and using vector types for memory transfers makes it easier for the compiler to generate efficient code. If the compiler recognizes the pattern in kernel2, it might combine the loads into something close enough to kernel1 that the performance is practically identical, but kernel2 isn't expected to have any advantage over kernel1.

 

ABoxe
Beginner

Thanks, Jeffrey. So, vload may exhibit fewer bank conflicts than single loads? Is there ever a situation when vload performs worse than individual loads?

 

 

Jeffrey_M_Intel1
Employee

From asking around, my understanding is that vload is almost always the better choice. The compiler may combine single loads, so in many cases there may not be a big difference, but vload is expected to be the best case. There may be a few corner cases with odd borders where single loads have a minor advantage, but I suspect these are rare. In the future I hope we can publish more memory transfer optimization tutorials with deeper analysis, so that the guidelines do not need to be so high level. Please watch for documentation and example improvements as we are able to prioritize them.
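
To make the "odd borders" remark concrete, here is a minimal host-side C sketch. It assumes "odd borders" means buffer lengths that are not a multiple of 4, which is only one possible reading. The bulk of the data is read in 4-wide chunks (standing in for vload4) and the 0 to 3 leftover elements are read with scalar loads. The function name sum_with_tail is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Sums n ints: 4-wide chunks for the bulk, scalar loads for the leftover tail. */
static long sum_with_tail(const int *p, size_t n)
{
    long total = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        int chunk[4];
        memcpy(chunk, p + i, sizeof chunk);   /* stands in for vload4(i / 4, p) */
        total += (long)chunk[0] + chunk[1] + chunk[2] + chunk[3];
    }
    for (; i < n; ++i)                        /* border: fewer than 4 elements remain */
        total += p[i];

    return total;
}
```

A vector load over the last partial chunk would read past the end of the buffer, which is why the tail falls back to single loads here.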

 

 

ABoxe
Beginner

Hi Jeffrey,

Thanks for pursuing this. I am particularly interested in avoiding bank conflicts. Is there some way of confirming that vload does not trigger more bank conflicts than separate loads?

Thanks so much,

Aaron
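
One rough way to reason about this before profiling is a small host-side model. The sketch below assumes a hypothetical banked local memory with 32 banks, each one 4-byte word wide, where word w lives in bank w % 32 and 32 work-items issue a scalar load together. None of these numbers come from this thread, and CPU implementations typically back __local memory with ordinary cached memory, where this model does not apply. The function max_bank_load is an illustrative name.

```c
#include <stddef.h>

enum { NUM_BANKS = 32, GROUP = 32 };   /* assumed values, hardware-specific */

/* Worst-case number of work-items hitting the same bank when work-item i
   reads word (i * stride + component). Returns 1 when conflict-free. */
static int max_bank_load(size_t stride, size_t component)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;

    for (size_t i = 0; i < GROUP; ++i) {
        size_t bank = (i * stride + component) % NUM_BANKS;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Under these assumptions, max_bank_load(4, 0) returns 4 for the stride-4 scalar pattern (test.x = localBuffer[start]), while a stride of 1 or 5 returns 1. The model covers scalar accesses only; how a 16-byte vector access is split across banks depends on the hardware, so the vload4 case still has to be confirmed with a profiler or vendor documentation.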
