Solved: vectorization: no speedup?!?

eimunic · ‎09-28-2011

Hello,

I am a little confused about the vectorization capabilities of the Intel OCL SDK.

I tried to grasp the concept of vectorization using this kernel:

[bash]__kernel void dense(__global float *A, __global float *b, __global float *x, __global float *h1)
{
    size_t idx = get_global_id(0);

    int j;
    float tmp = 0.0f;

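    /* Loop truncated in the original post; reconstructed assuming a
       row-major A and a build-time constant K (like PNORM). */
    for (j = 0; j < K; j++) {
        tmp += A[idx * K + j] * x[j];
    }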
    h1[idx] = pow(fabs(tmp - b[idx]),PNORM);

}[/bash]

Where A is a K by K matrix and x and b are K-dimensional vectors.

To determine if this kernel has been auto-vectorized, I used the Intel OpenCL Offline compiler. Result: successfully vectorized!

So far so good... BUT then I tried to disable auto-vectorization.
The documentation "Writing Optimal OpenCL Code with Intel OpenCL SDK" (p.11) states the following:
"You must use vec_type_hint to disable vectorization if your kernel already processes data using mostly vector types."

So I used __attribute__((vec_type_hint(float))) to disable the auto-vectorization. Result: successfully vectorized!

What am I missing?

Out of curiosity I used __attribute__((vec_type_hint(float4))) for the same kernel. This resulted in: not vectorized!
BUT the runtime was the same, so there was no performance impact. This was unexpected, because I thought that
this kernel should be a perfect fit for vectorization and expected performance to drop by a factor of close to 4 once vectorization was disabled.

I finally wrote a vectorized version of the kernel above:

[bash]__kernel void __attribute__((vec_type_hint(float4))) dense(__global float4 *A, __global float *b, __global float4 *x, __global float *h1)
{
    size_t idx = get_global_id(0);

    int j;
    float4 tmp = (float4)0.0f;

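    /* Loop truncated in the original post; reconstructed assuming a
       row-major A, a build-time constant K, and K divisible by 4. */
    for (j = 0; j < K / 4; j++) {
        tmp += A[idx * (K / 4) + j] * x[j];
    }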
    h1[idx] = pow(fabs(tmp.x + tmp.y + tmp.z + tmp.w - b[idx]),PNORM);

}[/bash]
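For checking results on the host, the accumulation pattern of the float4 kernel can be mimicked in plain C: four independent partial sums, followed by the horizontal add `tmp.x + tmp.y + tmp.z + tmp.w` from the kernel's last line. This is only a sketch; the function names `dot_scalar` and `dot_4lane` are made up for illustration, and it assumes the row length is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Plain left-to-right dot product, as in a scalar accumulation loop. */
float dot_scalar(const float *a, const float *x, size_t k)
{
    float tmp = 0.0f;
    for (size_t j = 0; j < k; j++)
        tmp += a[j] * x[j];
    return tmp;
}

/* Four independent partial sums, mimicking a float4 accumulator,
   followed by the horizontal add of the four lanes.
   Assumption: k is a multiple of 4. */
float dot_4lane(const float *a, const float *x, size_t k)
{
    assert(k % 4 == 0);
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t j = 0; j < k; j += 4)
        for (size_t l = 0; l < 4; l++)
            lane[l] += a[j + l] * x[j + l];
    return lane[0] + lane[1] + lane[2] + lane[3];
}
```

Because the two versions add the terms in a different order, their results can differ in the last bits for general floating-point input, so a comparison against device output should use a small tolerance.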

Result: NO speedup + not vectorized!
Result using vec_type_hint(float): NO speedup + successfully vectorized!

Once again, what am I missing?

Any advice is much appreciated.

Best

Maxim_S_Intel · ‎09-28-2011

Hi!
Regarding vec_type_hint, please read the Guide carefully:

"The implicit vectorization module works best for kernels that operate on elements of 4-byte width, such as float or int. In OpenCL, you can define the computational width of a kernel using the vec_type_hint attribute.
By default, kernels are always vectorized, since the default computation width is 4-byte. Specifying __attribute__((vec_type_hint(<typen>))) with typen of any vector type (e.g. float3 or char4) disables the vectorization module optimization for this kernel."

So specifying vec_type_hint with float (exactly 4 bytes) actually helps the vectorizer recognize the proper way to vectorize (and doesn't disable it).

The potential speedup from any vectorization depends on the number of computations that would benefit from SSE/AVX (in your case the math is heavy enough) and also on the input (small problem sizes will mostly exhibit various overheads, regardless of manual, automatic, or no vectorization).

You must provide a workgroup size that is a multiple of 8; otherwise the auto-vectorized version will not run (the scalar version runs instead).
Finally, with the manually vectorized version (float4), make sure you divide your global size by 4 (in the host code).
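The size arithmetic from the two points above can be sketched in host-side C. The helper names `global_size_for` and `round_up_to_multiple` are hypothetical, not part of any SDK. Dividing by the vector width only applies when each work-item actually covers that many of the elements the global size counts, so treat this as an illustration of the arithmetic rather than a rule for every kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-items when each one consumes `vec_width` elements
   (e.g. 4 for a float4 kernel). Rounds up if n_elements is not an
   exact multiple. */
size_t global_size_for(size_t n_elements, size_t vec_width)
{
    assert(vec_width > 0);
    return (n_elements + vec_width - 1) / vec_width;
}

/* Round `size` up to the next multiple of `multiple` (e.g. 8, the
   workgroup-size granularity mentioned above). */
size_t round_up_to_multiple(size_t size, size_t multiple)
{
    assert(multiple > 0);
    return ((size + multiple - 1) / multiple) * multiple;
}
```

For example, 1024 elements processed four at a time give a global size of 256, which is already a multiple of 8. If the global size has to be padded, the kernel needs a bounds check so the extra work-items do nothing.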

nurbs · ‎09-28-2011

Sorry if this is not related, but how can I check whether AVX is supported on my Intel CPU?

Brijender_B_Intel · ‎09-28-2011

Check this article; it has code too:

http://software.intel.com/en-us/blogs/2011/04/14/is-avx-enabled/?wapkw=(how+to+detect+avx)
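Independently of the linked article (this is not its code), a quick check is possible with a compiler builtin. The sketch below assumes GCC or Clang on an x86 target, where `__builtin_cpu_supports` is available; on other compilers or architectures it simply reports 0. The function name `cpu_has_avx` is made up for illustration.

```c
#include <assert.h>

/* Returns 1 if the compiler's CPU feature detection reports AVX,
   0 otherwise (including on non-x86 targets or other compilers). */
int cpu_has_avx(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* initialise the feature data before querying */
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return 0;
#endif
}
```

Calling `cpu_has_avx()` once at startup and printing the result is enough to answer the question for a given machine.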

nurbs · ‎09-28-2011

Cool, thanks. I am also going to get a new PC. How can I know whether it will support AVX?

Yuri_K_Intel · ‎09-29-2011

Hi,

2nd generation Intel Core processors (Intel microarchitecture code name Sandy Bridge), released in Q1, 2011, are the first from Intel supporting Intel AVX technology.
You can find a detailed list of processors here for example: http://en.wikipedia.org/wiki/Sandy_Bridge_(microarchitecture)

Thanks,
Yuri

eimunic · ‎09-30-2011

Hi.

Yes, I might have misunderstood this. Thanks for clearing this up :)

Moreover, I had already paid attention to the global and local sizes,
so I came to the conclusion that my runtime measurement might be a little off...
After measuring the runtime more carefully, the vectorized version shows a speedup of close to 2!

This is still less than I expected, but much better than before.
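The post does not say what changed in the measurement. One common approach, sketched here purely as an assumption, is to run the workload a few times untimed as warm-up and then keep the fastest of several timed runs. The names `best_time_seconds` and `count_call` are hypothetical, and the timer is the C11 wall clock `timespec_get`. For an OpenCL kernel, the timed function would need to enqueue the kernel and then wait for completion (for example with `clFinish`), since enqueue calls return before the kernel has finished.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*work_fn)(void *ctx);

/* Wall-clock time in seconds (C11 timespec_get; not guaranteed monotonic). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Calls `fn` `warmup` times untimed, then `runs` times timed, and
   returns the shortest timed duration in seconds. */
double best_time_seconds(work_fn fn, void *ctx, int warmup, int runs)
{
    assert(fn != NULL && warmup >= 0 && runs > 0);
    for (int i = 0; i < warmup; i++)
        fn(ctx);

    double best = 0.0;
    int have_best = 0;
    for (int i = 0; i < runs; i++) {
        double t0 = now_seconds();
        fn(ctx);
        double dt = now_seconds() - t0;
        if (!have_best || dt < best) {
            best = dt;
            have_best = 1;
        }
    }
    return best;
}

/* Hypothetical stand-in workload: counts how often it was called. */
void count_call(void *ctx)
{
    (*(int *)ctx)++;
}
```

Taking the minimum rather than the mean filters out runs disturbed by other processes; reporting both is also reasonable.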

Best