I am a little confused about the vectorization capabilities of the Intel OCL SDK.
I tried to grasp the concept of vectorization using this kernel:
[bash]__kernel void dense(__global float *A, __global float *b,
                    __global float *x, __global float *h1)
{
    size_t idx = get_global_id(0);
    int j;
    float tmp = 0.0f;
    for (j = 0; j < K; j++) {
        tmp += A[idx * K + j] * x[j];
    }
    h1[idx] = pow(fabs(tmp - b[idx]), PNORM);
}[/bash]
Where A is a K by K matrix and x and b are K-dimensional vectors.
To determine if this kernel has been auto-vectorized, I used the Intel OpenCL Offline compiler. Result: successfully vectorized!
So far so good... BUT then I tried to disable auto-vectorization.
The documentation "Writing Optimal OpenCL Code with Intel OpenCL SDK" (p.11) states the following:
You must use vec_type_hint to disable vectorization if your kernel already processes data using mostly vector types.
So I used __attribute__((vec_type_hint(float))) to disable the auto-vectorization. Result: successfully vectorized!
What am I missing?
Out of curiosity I used __attribute__((vec_type_hint(float4))) for the same kernel. This resulted in: not vectorized!
BUT the runtime was the same, so there were no performance ramifications. This was unexpected, because I thought
this kernel would be a perfect fit for vectorization and therefore expected a slowdown of close to 4x.
I finally wrote a vectorized version of the kernel above:
[bash]__kernel void __attribute__((vec_type_hint(float4)))
dense(__global float4 *A, __global float *b,
      __global float4 *x, __global float *h1)
{
    size_t idx = get_global_id(0);
    int j;
    float4 tmp = (float4)0.0f;
    for (j = 0; j < K / 4; j++) {
        tmp += A[idx * (K / 4) + j] * x[j];
    }
    h1[idx] = pow(fabs(tmp.x + tmp.y + tmp.z + tmp.w - b[idx]), PNORM);
}[/bash]
Result: NO speedup + not vectorized!
Result using vec_type_hint(float): NO speedup + successfully vectorized!
Once again, what am I missing?
Any advice is much appreciated.
Best
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Regarding vec_type_hint, please read the Guide carefully:
"The implicit vectorization module works best for kernels that operate on elements of 4-byte width, such as float or int. In OpenCL, you can define the computational width of a kernel using the vec_type_hint attribute. By default, kernels are always vectorized, since the default computation width is 4 bytes. Specifying __attribute__((vec_type_hint(typen))) with typen of any vector type (e.g. float3 or char4) disables the vectorization module optimization for this kernel."
So specifying vec_type_hint with float (exactly 4 bytes) actually helps the vectorizer recognize the proper way to vectorize (and does not disable it).
The potential speed-up from any vectorization depends on the amount of computation that would benefit from SSE/AVX (in your case the math is heavy enough) and also on the input (small problem sizes mostly exhibit various overheads, regardless of manual/automatic/no vectorization).
You must provide a work-group size that is a multiple of 8; otherwise the auto-vectorized version will not run (the scalar version runs instead).
Finally, with the manually vectorized version (float4), make sure you divide your global size by 4 (in the host code).
Sorry if it is not related, but how can I check whether AVX is supported on my Intel CPU?
Check this article; it has code too:
http://software.intel.com/en-us/blogs/2011/04/14/is-avx-enabled/?wapkw=(how+to+detect+avx)
2nd generation Intel Core processors (Intel microarchitecture code name Sandy Bridge), released in Q1, 2011, are the first from Intel supporting Intel AVX technology.
You can find a detailed list of processors here for example: http://en.wikipedia.org/wiki/Sandy_Bridge_(microarchitecture)
Thanks,
Yuri
Yes, I might have misunderstood this. Thanks for clearing this up :)
Moreover, I had already paid attention to the global and local sizes,
so I concluded that my runtime measurement might have been a little off...
After more careful time measurement, the vectorized version shows a speedup of close to 2!
This is still less than I expected, but much better than before.
Best