OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.
1718 Discussions

vectorization: no speedup?!?

eimunic
Beginner
875 Views
Hello,

I am a little confused about the vectorization capabilities of the Intel OCL SDK.

I tried to grasp the concept of vectorization using this kernel:

[bash]__kernel void dense(__global float *A, __global float *b, __global float *x, __global float *h1)
{
    size_t idx = get_global_id(0);

    int j;
    float tmp = 0.0f;

    /* K and PNORM are compile-time defines; dot product of row idx of A with x */
    for (j = 0; j < K; j++) {
        tmp += A[idx * K + j] * x[j];
    }
    h1[idx] = pow(fabs(tmp - b[idx]), PNORM);

}[/bash]



Where A is a K by K matrix and x and b are K-dimensional vectors.

To determine if this kernel has been auto-vectorized, I used the Intel OpenCL Offline compiler. Result: successfully vectorized!

So far so good... BUT then I tried to disable auto-vectorization.
The documentation "Writing Optimal OpenCL Code with Intel OpenCL SDK" (p.11) states the following:
You must use vec_type_hint to disable vectorization if your kernel already processes data using mostly vector types.

So I used __attribute__((vec_type_hint(float))) to disable the auto-vectorization. Result: successfully vectorized!

What am I missing?

Out of curiosity I used __attribute__((vec_type_hint(float4))) for the same kernel. This resulted in: not vectorized!
BUT the runtime was the same, so there were no performance ramifications. This was unexpected, because I thought
this kernel should be a perfect fit for vectorization and therefore expected a slowdown of close to 4x.

I finally wrote a vectorized version of the kernel above:

[bash]__kernel void __attribute__((vec_type_hint(float4))) dense(__global float4 *A, __global float *b, __global float4 *x, __global float *h1)
{
    size_t idx = get_global_id(0);

    int j;
    float4 tmp = (float4)0.0f;

    /* A now holds K/4 float4 elements per row */
    for (j = 0; j < K / 4; j++) {
        tmp += A[idx * (K / 4) + j] * x[j];
    }
    h1[idx] = pow(fabs(tmp.x + tmp.y + tmp.z + tmp.w - b[idx]), PNORM);

}[/bash]


Result: NO speedup + not vectorized!
Result using vec_type_hint(float): NO speedup + successfully vectorized!

Once again, what am I missing?

Any advice is much appreciated.

Best

0 Kudos
6 Replies
Maxim_S_Intel
Employee
876 Views
Hi!
Regarding vec_type_hint, please read the Guide carefully:

"The implicit vectorization module works best for kernels that operate on elements of 4-byte width, such as float or int. In OpenCL, you can define the computational width of a kernel using the vec_type_hint attribute.
By default, kernels are always vectorized, since the default computation width is 4 bytes. Specifying __attribute__((vec_type_hint(typen))) with typen of any vector type (e.g. float3 or char4) disables the vectorization module optimization for this kernel."

So specifying vec_type_hint with float (exactly 4 bytes) actually helps the vectorizer recognize the proper way to vectorize (and doesn't disable it).

Potential speed-up from any vectorization depends on the number of computations that would benefit from SSE/AVX (in your case the math is heavy enough) and also on the input (small problem sizes mostly exhibit various overheads, regardless of manual/automatic/no vectorization).

You must provide a work-group size that is a multiple of 8, otherwise the auto-vectorized version will not run (the scalar one will run instead).
Finally, with the manually vectorized version (float4), make sure you divide your global size by 4 (in the host code).
0 Kudos
nurbs
Beginner
875 Views

Sorry if it is not related, but how can I check if AVX is supported on my Intel CPU?

0 Kudos
nurbs
Beginner
875 Views
Cool, thanks. I am also going to get a new PC. How can I know whether it will support AVX?

0 Kudos
Yuri_K_Intel
Employee
875 Views
Hi,

2nd generation Intel Core processors (Intel microarchitecture code name Sandy Bridge), released in Q1 2011, are the first Intel processors to support Intel AVX technology.
You can find a detailed list of processors here for example: http://en.wikipedia.org/wiki/Sandy_Bridge_(microarchitecture)

Thanks,
Yuri
0 Kudos
eimunic
Beginner
875 Views
Hi.

Yes, I might have misunderstood this. Thanks for clearing this up :)

Moreover, I had already paid attention to the global and local sizes,
so I came to the conclusion that my runtime measurement might have been a little off...
After more careful time measurement, the vectorized version shows a speedup of close to 2!

This is still less than I expected, but much better than before.

Best
0 Kudos
Reply