Automatic vectorization in OpenCL on Xeon Phi

Christophe_A_ · ‎08-28-2013

Hi,

I have two questions about the automatic vectorization process in OpenCL on Xeon Phi.

First, what happens when you use a work-group size of 1 with an automatically vectorized kernel? Will it run at the same speed as it would without vectorization?

Second, is it possible to determine whether the automatic vectorization was done by merging multiple work-items into one, or inside a single work-item?

I hope my questions are clear enough, but let me know if something is unclear.

Thank you,

--
Christophe

Sumedh_N_Intel · ‎08-28-2013

Hi Christophe,

The Intel SDK for OpenCL Applications does not use the vectorized kernel on Intel Xeon Phi coprocessors if the work-group size in dimension 0 is less than 16. Hence, you will get performance equivalent to that of the non-vectorized code.

The OpenCL implicit vectorization module vectorizes the code by merging multiple work-items that belong to the same work-group. This is why a work-group whose dimension 0 size is less than 16 does not use the vectorized code. As far as I know, the vectorization module does NOT [edited] vectorize the code inside the work-item. Let me check with the experts and get back to you on that one.
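
To illustrate what "merging work-items" means, here is a minimal sketch in plain C rather than OpenCL. It is only a conceptual model and not the code the Intel compiler actually generates. The names `kernel_scalar`, `kernel_packed16`, and `VEC_WIDTH` are hypothetical, and the width of 16 is assumed from the threshold mentioned above.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_WIDTH 16  /* assumption: taken from the "less than 16" threshold */

/* Hypothetical kernel body for ONE work-item: out[gid] = a * x[gid] + y[gid]. */
static void kernel_scalar(size_t gid, float a,
                          const float *x, const float *y, float *out)
{
    assert(x && y && out);
    out[gid] = a * x[gid] + y[gid];
}

/* Conceptual "vectorized" version: one call does the work of VEC_WIDTH
 * consecutive work-items along dimension 0. The inner loop stands in for
 * what would be a single 16-lane SIMD instruction sequence. */
static void kernel_packed16(size_t first_gid, float a,
                            const float *x, const float *y, float *out)
{
    assert(x && y && out);
    for (size_t lane = 0; lane < VEC_WIDTH; ++lane) {
        size_t gid = first_gid + lane;  /* each lane is one merged work-item */
        out[gid] = a * x[gid] + y[gid];
    }
}
```

Calling `kernel_packed16(0, ...)` once should produce the same results as calling `kernel_scalar` for global IDs 0 through 15. In this model, a work-group with fewer than 16 work-items in dimension 0 cannot fill a single packed call.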

Christophe_A_ · ‎08-28-2013

Hi,

Thank you for the prompt response!

What is unclear to us is this: from our point of view, implicit vectorization across work-items should happen during the clBuildProgram() step, whereas the work-group size only becomes known later, at the clEnqueueNDRangeKernel() step.
What exactly happens when the kernel is implicitly vectorized during clBuildProgram() and later launched with, for example, only one work-item per work-group? Is a second, non-vectorized kernel used instead of the vectorized one? Or is there implicit vectorization among work-items from different work-groups?

If the vectorization module also vectorizes the code inside the work-item, is it possible to prevent this vectorization?
Our aim is to use only vectorization among the work-items and not inside each work-item.

Thank you again,

--
Christophe

Sumedh_N_Intel · ‎09-12-2013

Hi,

Sorry for the delayed reply.

The vectorization module creates both a vectorized and a non-vectorized version of the kernel. For kernels that do not contain barriers, the vectorized version is executed as much as possible, and the non-vectorized version handles the leftover iterations of dimension 0. For kernels that do contain barriers, either the vectorized or the non-vectorized version is executed for all work-items.

Also, the Intel OpenCL compiler currently vectorizes only along the dimension 0 work-item loop and does not vectorize the code within a work-item.
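
The split described above can be sketched as a small plain-C model. This is not the Intel runtime's actual logic. `dispatch_plan`, `plan_work_group`, and `VEC_WIDTH` are hypothetical names, and the width of 16 is assumed from the earlier reply. The rule used for kernels with barriers (take the vectorized version only when the dimension 0 size is an exact multiple of the width) is also an assumption, since the thread does not say how that choice is made.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_WIDTH 16  /* assumption: SIMD width implied by the earlier reply */

typedef struct {
    size_t vector_calls;  /* invocations of the vectorized version */
    size_t scalar_items;  /* work-items run by the non-vectorized version */
} dispatch_plan;

/* Model how one work-group's dimension 0 might be divided between the two
 * kernel versions produced at build time. */
static dispatch_plan plan_work_group(size_t local_size_dim0, int has_barrier)
{
    dispatch_plan p = {0, 0};
    assert(local_size_dim0 > 0);

    if (has_barrier) {
        /* One version runs for ALL work-items. The selection rule below is
         * an assumption made for this sketch only. */
        if (local_size_dim0 % VEC_WIDTH == 0)
            p.vector_calls = local_size_dim0 / VEC_WIDTH;
        else
            p.scalar_items = local_size_dim0;
    } else {
        /* Vectorized version as much as possible, scalar for the leftovers. */
        p.vector_calls = local_size_dim0 / VEC_WIDTH;
        p.scalar_items = local_size_dim0 % VEC_WIDTH;
    }
    return p;
}
```

Under these assumptions, a barrier-free kernel with a dimension 0 size of 40 would get two vectorized calls plus 8 scalar work-items, while a size of 1 would run entirely in the non-vectorized version, which matches the earlier answer about small work-groups.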