Performance benefits of vector load/store functions

Shobana_Lakhsminaras · ‎07-12-2012

What are the performance benefits of using vload4 instead of loading data one element at a time if the buffers are not aligned on a float4 boundary? On the other hand, if the buffers are aligned on a float4 boundary, will there be a performance penalty for using vload4 instead of dereferencing a float4 pointer (*float4Ptr)?

Thanks in advance

Raghupathi_M_Intel · ‎07-13-2012

According to the spec, the behavior is undefined if the data you are trying to load using vloadn is not correctly aligned. The vloadn functions take two arguments, a start address and an offset, so the address start + offset*n must be aligned.

For the second part of your question, if your buffers are aligned (for float4, that means aligned on a float4 boundary), there should be no difference in performance.

Thanks,
Raghu

Shobana_Lakhsminaras · ‎07-16-2012

As per the spec, the start address passed to vloadn for the float data type must be 4-byte aligned; it is not required to be 16-byte aligned. Please correct me if I am wrong. I would like to know the performance benefit of using vloadn when the buffer address is aligned on a float boundary but not on a float4 boundary.
Thanks.

Raghupathi_M_Intel · ‎07-16-2012

Sorry, I misread your original post.

Yes, vloadn requires the data (address + offset*n) to be aligned to sizeof(gentype). If the data is already aligned to 16 bytes, I don't think there is any performance difference between the two approaches. If the data is only aligned to a float boundary, you have to use vload4, since float4 data types require 16-byte alignment.
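
To make the distinction concrete, here is a minimal host-side sketch in plain C (not OpenCL C, and not the actual built-in). It assumes a C11 compiler; float4_t, vload4_emul, is_float4_aligned and sum4_from are hypothetical names made up for this illustration. The emulated load only needs float alignment, while the direct dereference is only taken when the pointer sits on a 16-byte boundary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side stand-in for OpenCL's float4:
   four floats with 16-byte alignment. */
typedef struct {
    _Alignas(16) float s[4];
} float4_t;

/* Emulates the addressing of vload4(offset, p): reads four floats
   starting at p + offset * 4. Only float (4-byte) alignment of p is
   assumed, because memcpy does not require anything stricter. */
static float4_t vload4_emul(size_t offset, const float *p)
{
    float4_t v;
    memcpy(v.s, p + offset * 4, sizeof v.s);
    return v;
}

/* Returns 1 when p lies on a 16-byte boundary, i.e. when treating it
   as a pointer to float4_t would satisfy the alignment requirement. */
static int is_float4_aligned(const float *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Sums four consecutive floats starting at p. The direct dereference is
   used only on a 16-byte boundary; otherwise the vload4-style path is
   taken, which is valid for any float-aligned pointer. */
static float sum4_from(const float *p)
{
    float4_t v = is_float4_aligned(p)
                     ? *(const float4_t *)p   /* like *float4Ptr */
                     : vload4_emul(0, p);     /* like vload4(0, p) */
    return v.s[0] + v.s[1] + v.s[2] + v.s[3];
}
```

In a real kernel the runtime check would normally not be needed: if the buffer is known to be 16-byte aligned either form can be used, and if it is only float-aligned, vload4 is the form the spec allows.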

Thanks,
Raghu