I am a little confused about the vectorization capabilities of the Intel OCL SDK.

I tried to grasp the concept of vectorization using this kernel:

[bash]
__kernel void dense(__global float *A, __global float *b, __global float *x, __global float *h1)
{
    size_t idx = get_global_id(0);
    int j;
    float tmp = 0.0f;
    /* dot product of row idx of A with x */
    for (j = 0; j < K; j++) {
        tmp += A[idx * K + j] * x[j];
    }
    h1[idx] = pow(fabs(tmp - b[idx]), PNORM);
}
[/bash]

Where A is a K by K matrix and x and b are K-dimensional vectors.

To determine if this kernel has been auto-vectorized, I used the

*Intel OpenCL Offline compiler*. Result:

**successfully vectorized!**

So far so good... BUT then I tried to

**disable** auto-vectorization.

The documentation "Writing Optimal OpenCL Code with Intel OpenCL SDK" (p.11) states the following:

*You must use vec_type_hint to disable vectorization if your kernel already processes data using mostly vector types.*

So I used __attribute__((vec_type_hint(float))) to disable the auto-vectorization. Result:

**successfully vectorized!**

What am I missing?

Out of curiosity, I used __attribute__((vec_type_hint(float4))) for the same kernel. This resulted in: not vectorized!

BUT the runtime was the same, so there were no performance ramifications. This was unexpected, because I thought

this kernel should be a perfect fit for vectorization and expected a slowdown of close to 4x.

I finally wrote a vectorized version of the kernel above:

[bash]
__kernel void __attribute__((vec_type_hint(float4))) dense(__global float4 *A, __global float *b, __global float4 *x, __global float *h1)
{
    size_t idx = get_global_id(0);
    int j;
    float4 tmp = (float4)0.0f;
    /* dot product of row idx of A with x, four elements at a time */
    for (j = 0; j < K / 4; j++) {
        tmp += A[idx * (K / 4) + j] * x[j];
    }
    h1[idx] = pow(fabs(tmp.x + tmp.y + tmp.z + tmp.w - b[idx]), PNORM);
}
[/bash]

Result:

**NO speedup + not vectorized!**

Result using vec_type_hint(float):

**NO speedup + successfully vectorized!**

Once again, what am I missing?

Any advice is much appreciated.

Best

Regarding vec_type_hint: please read the Guide carefully. It states:

*"The implicit vectorization module works best for kernels that operate on elements of 4-byte width, such as float or int. In OpenCL, you can define the computational width of a kernel using the vec_type_hint attribute. By default, kernels are always vectorized, since the default computation width is 4 bytes. Specifying __attribute__((vec_type_hint(typen))) with typen of any vector type (e.g. float3 or char4) disables the vectorization module optimization for this kernel."*

So specifying vec_type_hint with float (exactly 4 bytes) actually helps the vectorizer recognize the proper way to vectorize; it does not disable it.

The potential speed-up from any vectorization depends on the amount of computation that would benefit from SSE/AVX (in your case the math is heavy enough) and also on the input: small problem sizes will mostly exhibit various overheads, regardless of manual, automatic, or no vectorization.

You must provide a workgroup size that is a multiple of 8; otherwise the auto-vectorized version will not run (the scalar version runs instead).

Finally, with the manually vectorized version (float4), make sure you divide your global size by 4 (in the host code).



Sorry if this is unrelated, but how can I check if AVX is supported on my Intel CPU?

Check this article; it has code too:

http://software.intel.com/en-us/blogs/2011/04/14/is-avx-enabled/?wapkw=(how+to+detect+avx)

2nd generation Intel Core processors (Intel microarchitecture code name Sandy Bridge), released in Q1 2011, are the first Intel processors to support Intel AVX technology.

You can find a detailed list of processors here for example: http://en.wikipedia.org/wiki/Sandy_Bridge_(microarchitecture)

Thanks,

Yuri

Yes, I might have misunderstood this. Thanks for clearing it up :)

Moreover, I had already paid attention to the global and local sizes,

so I concluded that my runtime measurement might have been a little off...

After more careful time measurement, the vectorized version shows a speedup of close to 2x!

This is still less than I expected, but much better than before.

Best
