OpenCL* for CPU

Performance benefits of vector load/store functions

Shobana_Lakhsminaras
What are the performance benefits of using vload4 instead of loading the data one element at a time if the buffers are not aligned on a float4 boundary? On the other hand, if the buffers are aligned on a float4 boundary, is there a performance penalty for using vload4 instead of dereferencing *float4Ptr?

Thanks in advance
3 Replies
Raghupathi_M_Intel
According to the spec, the behavior is undefined if the data you are trying to load with vloadn is not correctly aligned (the vloadn functions take two arguments, an offset and a start address, so start + offset*n should be aligned).

For the second part of your question, if your buffers are aligned (for float4, that means aligned to 16 bytes), there should be no difference in performance.

Thanks,
Raghu
Shobana_Lakhsminaras

As per the spec, the start address passed to vloadn for the float data type must be 4-byte aligned; it is not required to be 16-byte aligned. Please correct me if I am wrong. I would like to know the performance benefit of using vloadn in the scenario where the buffer address is aligned on a float boundary but not on a float4 boundary.
Thanks.

Raghupathi_M_Intel

Sorry I misread your original post.

Yes, vloadn requires the address (address + offset*n) to be aligned to sizeof(gentype), which is 4 bytes for float. If the data is already aligned to 16 bytes, I don't think there is any performance difference between the two approaches. If the data is only aligned to a float boundary, you have to use vload4, since dereferencing a float4 pointer requires 16-byte alignment.
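To make the two alignment rules concrete, here is a minimal host-side sketch in plain C rather than OpenCL C. It only emulates the semantics discussed above and says nothing about performance. The names float4_emul, is_aligned, vload4_emul, and float4_deref_ok are hypothetical helpers made up for this illustration, not part of any OpenCL API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for OpenCL's float4; plain C has no such built-in. */
typedef struct { float s[4]; } float4_emul;

/* Returns 1 if p is aligned to `alignment` bytes (alignment must be a power of two). */
static int is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Emulates vload4(offset, p): reads the 4 floats starting at p + offset*4.
   Only float (4-byte) alignment of p is assumed; memcpy needs nothing stronger. */
static float4_emul vload4_emul(size_t offset, const float *p) {
    float4_emul v;
    memcpy(v.s, p + offset * 4, sizeof v.s);
    return v;
}

/* Returns 1 if the address p + offset*4 is 16-byte aligned, i.e. if reading it
   through a float4 pointer would meet the float4 alignment requirement. */
static int float4_deref_ok(size_t offset, const float *p) {
    return is_aligned(p + offset * 4, 16);
}
```

With a 16-byte-aligned buffer buf, float4_deref_ok(0, buf) holds, so either form of load is valid. For buf + 1 it does not hold, and only the vload4-style read is safe.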

Thanks,
Raghu
