Alexander_Karsakov · ‎05-15-2014

Hello.

I have been trying to do some OpenCL performance optimization for Intel devices. I want to use vectorization and a vector data type with the optimal length for a given device. I called clGetDeviceInfo(.., CL_DEVICE_PREFERRED_VECTOR_WIDTH, ..), but the values it returns are not really optimal:

uchar - 1
short - 1
int - 1
float - 1

I checked this on an Intel HD 4600 GPU and an Intel Core i5-4570 CPU.

I then tried to find the optimal vector length for my problem experimentally and got the following values:

uchar - 16
short - 8
int - 1
float - 1

If I use uchar16 instead of uchar, I get a 3x speedup.
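To make the comparison concrete, here is a minimal sketch of what a scalar and a uchar16 variant of the same kernel might look like. The kernels (a saturating add named add_scalar and add_vec16) and the helper global_size_for are hypothetical and are not the code that was actually benchmarked; the kernel source is kept in C strings, as it would be when passed to clCreateProgramWithSource. The sketch assumes the buffer length is a multiple of the vector width.

```c
#include <stddef.h>

/* Hypothetical scalar kernel: one work-item handles one uchar. */
const char *const kAddScalarSrc =
    "__kernel void add_scalar(__global const uchar *a,\n"
    "                         __global const uchar *b,\n"
    "                         __global uchar *out)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = add_sat(a[i], b[i]);\n"
    "}\n";

/* Hypothetical uchar16 kernel: one work-item handles 16 uchars. */
const char *const kAddVec16Src =
    "__kernel void add_vec16(__global const uchar16 *a,\n"
    "                        __global const uchar16 *b,\n"
    "                        __global uchar16 *out)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = add_sat(a[i], b[i]);\n"
    "}\n";

/* Number of work-items to launch for n_elems elements at a given
 * vector width. Returns 0 when width is 0 or n_elems is not a
 * multiple of width (assumption: the caller pads the buffer). */
size_t global_size_for(size_t n_elems, size_t width)
{
    if (width == 0 || n_elems % width != 0)
        return 0;
    return n_elems / width;
}
```

For a 1920x1080 single-channel image, global_size_for gives 2073600 work-items for the scalar kernel and 129600 for the uchar16 one, so the vector version schedules 16 times fewer work-items over the same data.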

I have two questions:

1. Why does clGetDeviceInfo(.., CL_DEVICE_PREFERRED_VECTOR_WIDTH, ..) return these values?

2. Is it possible to change these values in future releases? That would make cross-platform optimization possible.

Thanks,

Alexander.

Dmitry_K_Intel · ‎05-17-2014

Hi Alexander,

You are right - Intel OpenCL devices report scalar as the preferred width because they assume that internal autovectorization will produce better results in most cases. And you are right once more - there are cases where internal autovectorization fails and manual tuning produces better results.

Please check https://software.intel.com/sites/products/documentation/ioclsdk/2013/Intel_SDK_for_OpenCL_Applications_2013_Optimization_Guide.pdf for more info
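One way to act on this in host code is to treat the width reported by the device as a default and let a manually tuned value override it per element type. The sketch below is only an illustration under that assumption: elem_type, tuning_table and pick_vector_width are hypothetical names, not part of the OpenCL API or the Intel SDK, and the reported width is assumed to have been queried already.

```c
#include <stddef.h>

/* Element types discussed in this thread. */
typedef enum { TYPE_UCHAR, TYPE_SHORT, TYPE_INT, TYPE_FLOAT, TYPE_COUNT } elem_type;

/* Hypothetical per-device tuning table filled from benchmarks.
 * An entry of 0 means "no manual result, trust the device". */
typedef struct {
    unsigned tuned_width[TYPE_COUNT];
} tuning_table;

/* Returns the manually tuned width when one exists, otherwise the
 * width the device reported for this element type. */
unsigned pick_vector_width(unsigned reported, const tuning_table *tuning,
                           elem_type type)
{
    if (tuning != NULL && tuning->tuned_width[type] != 0)
        return tuning->tuned_width[type];
    return reported;
}
```

With a table holding the values measured earlier in the thread (16 for uchar, 8 for short, 1 for int and float), pick_vector_width returns 16 for uchar even though the device reports 1, and falls back to the reported value on devices that have no tuning entry.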

Alexander_Karsakov · ‎05-19-2014

Hi Dmitry,

Thanks for the clarification!

Maxim_S_Intel · ‎05-20-2014

"Intel OpenCL devices prefer scalar values as they assume that internal autovectorization"

The caveat: the compiler does its best job when vectorizing 32-bit types (like int and float). In contrast, for char/uchar, explicitly using short vectors like uchar4 might be more performant, as it coalesces memory accesses better (since with uchar4/uchar8/etc. you operate on aligned data chunks) and also better amortizes the work-item scheduling costs (since you process multiple pixels simultaneously).
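The two effects can be seen in a plain-C emulation of a uchar4-style kernel. This is a sketch only: uchar4_t, add_sat_u8, brighten4_item and brighten_all are hypothetical stand-ins (a real kernel would use the built-in uchar4 type and add_sat), and the loop over gid stands in for the NDRange launch. It assumes the pixel count is a multiple of 4; any tail is left untouched.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Plain-C stand-in for OpenCL's uchar4: four pixels in one 4-byte chunk. */
typedef struct { uint8_t s[4]; } uchar4_t;

/* Saturating 8-bit add, mirroring what add_sat does for uchar. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Body of one emulated work-item: brighten 4 adjacent pixels. */
void brighten4_item(const uint8_t *in, uint8_t *out, size_t gid, uint8_t delta)
{
    uchar4_t px;
    memcpy(&px, in + 4 * gid, sizeof px);   /* one 4-byte load */
    for (int i = 0; i < 4; ++i)
        px.s[i] = add_sat_u8(px.s[i], delta);
    memcpy(out + 4 * gid, &px, sizeof px);  /* one 4-byte store */
}

/* Emulated launch: returns the number of work-items that ran. */
size_t brighten_all(const uint8_t *in, uint8_t *out, size_t n_pixels,
                    uint8_t delta)
{
    size_t items = n_pixels / 4;  /* 4x fewer work-items than scalar */
    for (size_t gid = 0; gid < items; ++gid)
        brighten4_item(in, out, gid, delta);
    return items;
}
```

Each emulated work-item touches memory in whole 4-byte chunks at offsets that are multiples of 4, and the scheduling cost of one work-item is spread over four pixels instead of one.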

CL_DEVICE_PREFERRED_VECTOR_WIDTH for Intel devices