OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

Convenience of vector data types on current GPUs

Edgardo_Doerner

I have a question regarding the use of vector data types inside OpenCL kernels. Since I started working with OpenCL I have heard about the advantages of using vector data types to bulk load/store data from/to device memory and to take advantage of the SSE and/or AVX instructions available on CPUs. However, looking at the CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT/LONG/FLOAT property of several GPUs (including Intel HD, AMD and NVIDIA graphics processors), all of them report a value of 1. So it seems that current GPU architectures do not take any advantage of vector data types, as pointed out in the following Stack Overflow discussion: https://stackoverflow.com/questions/16258930/speedup-when-using-float4-opencl
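
For reference, this is roughly how I query those values on the host (a minimal sketch, error handling omitted; "device" is assumed to be a valid cl_device_id):

#include <CL/cl.h>
#include <stdio.h>

void print_preferred_widths(cl_device_id device)
{
    cl_uint w_int = 0, w_float = 0;
    // CL_DEVICE_PREFERRED_VECTOR_WIDTH_* queries return a cl_uint
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT,
                    sizeof(w_int), &w_int, NULL);
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof(w_float), &w_float, NULL);
    printf("preferred int width: %u, preferred float width: %u\n",
           w_int, w_float);
}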

Therefore, would you recommend (excluding CPUs) using vector data types to store data in global memory? I am currently working on a Monte Carlo code for particle transport and I use float4 data types to store particle information (position, energy, etc.). The particle attributes are encoded in these data types, so I usually have to extract them by addressing the vector components, for example:

// store is a float4 data type!!
float3 position = store.xyz;
float  energy   = store.w;

Or would it be more advisable, from a performance point of view, to just use plain int or float data types?

Thanks for your help!

2 Replies
Ben_A_Intel
Employee

Hi Edgardo,

This is a deceptively complex question!  Picking values to return for "preferred" queries like CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT is more of an art than a science, unfortunately, particularly when there's more than one "right" answer for our GPUs:

For ALU operations, we "scalarize" most operations and execute them in a "SIMT" manner. So, assuming the number of work items per EU thread (a.k.a. the "SIMD size" or "subgroup size") is relatively large, and/or there are enough instructions to break up back-to-back dependencies, there's no inherent advantage to using vectors vs. scalars for computation.
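
For example, a purely scalar kernel like this sketch (names are just illustrative) compiles to SIMD instructions across the work items of a subgroup, so there is nothing to gain by vectorizing the arithmetic by hand:

// Scalar math per work item; the compiler maps work items to SIMD lanes.
__kernel void scale(__global const float* in, __global float* out, float k)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * k;
}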

For IO operations though, using vectors is usually beneficial, since it increases the odds of reading or writing full GPU cache lines, and the compiler will try to "coalesce" scalar loads and stores into vector loads and stores when possible.  Of course, if you load or store a vector in your code, you won't need to rely on the compiler to do the coalescing for you.
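
As a rough illustration (again just a sketch, with made-up names), a float4 access per work item hands the compiler a ready-made wide memory operation instead of relying on it to coalesce four scalar accesses:

// One vector load and one vector store per work item.
__kernel void copy_particles(__global const float4* src, __global float4* dst)
{
    size_t i = get_global_id(0);
    dst[i] = src[i];
}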

From your description it sounds like your code should run well on our GPUs - you are storing float4s even though you may be computing the position component and energy component separately.
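
In other words, a pattern along these lines (the field layout is assumed from your description) should work well: read and write the particle as a single float4, even though the position and energy components are updated separately:

// Sketch only: position in .xyz, energy in .w, as in your example.
__kernel void step_particles(__global float4* particles,
                             float4 displacement, float energy_loss)
{
    size_t i = get_global_id(0);
    float4 p = particles[i];                     // one vector load
    float3 position = p.xyz + displacement.xyz;  // components updated separately
    float  energy   = p.w - energy_loss;
    particles[i] = (float4)(position, energy);   // one vector store
}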

If you have any follow-on questions don't hesitate to ask.

Edgardo_Doerner

Thanks for your answer! It seems that I will stick with scalar types for calculations and vector types for data transfers from/to global memory.
