<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic vectorization: no speedup?!? in OpenCL* for CPU</title>
    <link>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821830#M1102</link>
    <description>Hi,&lt;BR /&gt;&lt;BR /&gt;2nd generation Intel Core processors (Intel microarchitecture code name Sandy Bridge), released in Q1 2011, are the first from Intel to support Intel AVX technology.&lt;BR /&gt;You can find a detailed list of processors here, for example: &lt;A href="http://en.wikipedia.org/wiki/Sandy_Bridge_%28microarchitecture%29"&gt;http://en.wikipedia.org/wiki/Sandy_Bridge_(microarchitecture)&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;Yuri</description>
    <pubDate>Thu, 29 Sep 2011 08:58:45 GMT</pubDate>
    <dc:creator>Yuri_K_Intel</dc:creator>
    <dc:date>2011-09-29T08:58:45Z</dc:date>
    <item>
      <title>vectorization: no speedup?!?</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821825#M1097</link>
      <description>Hello,&lt;BR /&gt;&lt;BR /&gt;I am a little confused about the vectorization capabilities of the Intel OpenCL SDK.&lt;BR /&gt;&lt;BR /&gt;I tried to grasp the concept of vectorization using this kernel:&lt;BR /&gt;&lt;BR /&gt;&lt;PRE&gt;__kernel void dense(__global float *A, __global float *b, __global float *x, __global float *h1)
{
    size_t idx = get_global_id(0);

    int j;
    float tmp = 0.0f;

    /* Loop reconstructed: it was garbled in the original post.
       Assumes A is stored row-major, so row idx starts at A[idx*K]. */
    for (j=0; j&lt;K; j++) {
        tmp += A[idx*K + j] * x[j];
    }
    h1[idx] = pow(fabs(tmp - b[idx]),PNORM);

}&lt;/PRE&gt; &lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Here A is a K-by-K matrix, and x and b are K-dimensional vectors.&lt;BR /&gt;&lt;BR /&gt;To determine whether this kernel has been auto-vectorized, I used the &lt;I&gt;Intel OpenCL Offline compiler&lt;/I&gt;. Result: &lt;B&gt;successfully vectorized!&lt;/B&gt;&lt;BR /&gt;&lt;BR /&gt;So far so good... BUT then I tried to &lt;B&gt;disable&lt;/B&gt; auto-vectorization.&lt;BR /&gt;The documentation "Writing Optimal OpenCL Code with Intel OpenCL SDK" (p.11) states the following:&lt;BR /&gt;&lt;I&gt;You must use vec_type_hint to disable vectorization if your kernel already processes data using mostly vector types.&lt;/I&gt;&lt;BR /&gt;&lt;BR /&gt;So I used __attribute__((vec_type_hint(float))) to disable the auto-vectorization. Result: &lt;B&gt;successfully vectorized!&lt;/B&gt;&lt;BR /&gt;&lt;BR /&gt;What am I missing?&lt;BR /&gt;&lt;BR /&gt;Out of curiosity I used __attribute__((vec_type_hint(float4))) for the same kernel. This resulted in: not vectorized!&lt;BR /&gt;BUT the runtime was the same, so there was no performance impact. This was unexpected, because I thought that&lt;BR /&gt;this kernel should be a perfect fit for vectorization and expected performance to drop by a factor of close to 4.&lt;BR /&gt;&lt;BR /&gt;I finally wrote a vectorized version of the kernel above:&lt;BR /&gt;&lt;BR /&gt;&lt;PRE&gt;__kernel void __attribute__((vec_type_hint(float4))) dense(__global float4 *A, __global float *b, __global float4 *x, __global float *h1)
{
    size_t idx = get_global_id(0);

    int j;
    float4 tmp = (float4)0.0f;

    /* Loop reconstructed: it was garbled in the original post.
       Assumes K is a multiple of 4 and A is row-major, so each row holds K/4 float4 values. */
    for (j=0; j&lt;K/4; j++) {
        tmp += A[idx*(K/4) + j] * x[j];
    }
    h1[idx] = pow(fabs(tmp.x + tmp.y + tmp.z + tmp.w - b[idx]),PNORM);

}&lt;/PRE&gt; &lt;BR /&gt;&lt;BR /&gt;Result: &lt;B&gt;NO speedup + not vectorized!&lt;/B&gt;&lt;BR /&gt;Result using vec_type_hint(float): &lt;B&gt;NO speedup + successfully vectorized!&lt;/B&gt;&lt;BR /&gt;&lt;BR /&gt;Once again, what am I missing?&lt;BR /&gt;&lt;BR /&gt;Any advice is much appreciated.&lt;BR /&gt;&lt;BR /&gt;Best&lt;BR /&gt;&lt;BR /&gt;</description>
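      <![CDATA[
The two kernels in the question are meant to compute the same per-row value, |sum_j A[idx][j] * x[j] - b[idx]| raised to PNORM. A minimal Python sketch of that arithmetic follows; it is an emulation on the host, not OpenCL code. The row-major layout of A, the requirement that K be a multiple of 4, and the function names dense_scalar and dense_float4 are assumptions made for illustration.

```python
import math


def dense_scalar(A, b, x, K, pnorm):
    """Emulate the scalar kernel: one 'work-item' per row idx."""
    h1 = []
    for idx in range(K):
        tmp = 0.0
        for j in range(K):
            tmp += A[idx * K + j] * x[j]      # assumes row-major A
        h1.append(math.pow(abs(tmp - b[idx]), pnorm))
    return h1


def dense_float4(A, b, x, K, pnorm):
    """Emulate the float4 kernel: four partial sums per row, reduced at the end."""
    assert K % 4 == 0, "this sketch assumes K is a multiple of 4"
    h1 = []
    for idx in range(K):
        lanes = [0.0, 0.0, 0.0, 0.0]          # plays the role of 'float4 tmp'
        for j in range(K // 4):
            base = idx * K + 4 * j
            for lane in range(4):
                lanes[lane] += A[base + lane] * x[4 * j + lane]
        tmp = lanes[0] + lanes[1] + lanes[2] + lanes[3]
        h1.append(math.pow(abs(tmp - b[idx]), pnorm))
    return h1
```

Both versions perform the same K multiply-adds per row; the float4 form only groups them four at a time. Because it changes the order of the additions, its floating-point result can differ from the scalar one in the last bits.
]]>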
      <pubDate>Wed, 28 Sep 2011 11:39:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821825#M1097</guid>
      <dc:creator>eimunic</dc:creator>
      <dc:date>2011-09-28T11:39:48Z</dc:date>
    </item>
    <item>
      <title>vectorization: no speedup?!?</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821826#M1098</link>
      <description>Hi!&lt;BR /&gt;Regarding vec_type_hint, please read the guide carefully:&lt;BR /&gt;&lt;BR /&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;The implicit vectorization module works best for kernels that operate on elements of 4-byte width, such as float or int. In OpenCL, you can define the computational width of a kernel using the vec_type_hint attribute.&lt;BR /&gt;By default, kernels are always vectorized, since the default computation width is 4-byte. Specifying __attribute__((vec_type_hint(&amp;lt;typen&amp;gt;))) with typen of any vector type (e.g. float3 or char4) disables the vectorization module optimization for this kernel.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;So specifying vec_type_hint with float (exactly 4 bytes) actually helps the vectorizer recognize the proper way to vectorize (and doesn't disable it).&lt;BR /&gt;&lt;BR /&gt;The potential speed-up from any vectorization depends on the number of computations that would benefit from SSE/AVX (in your case the math is heavy enough) and also on the input (small problem sizes will mostly exhibit various overheads, regardless of whether vectorization is manual, automatic, or absent).&lt;BR /&gt;&lt;BR /&gt;You must provide a work-group size that is a multiple of 8; otherwise the auto-vectorized version will not run (the scalar version will run instead).&lt;BR /&gt;Finally, with the manually vectorized version (float4), make sure you divided your global size by 4 (in the host code).&lt;BR /&gt;</description>
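      <![CDATA[
The two sizing rules in the reply above (a work-group size that is a multiple of 8, and a global size divided by 4 for a float4 kernel) can be sketched as host-side arithmetic. This is a rough Python illustration, not host code from the thread; the names round_up and launch_sizes are hypothetical. Rounding the global size up to a multiple of the local size reflects the OpenCL 1.x requirement that the global work size be evenly divisible by the work-group size.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is not less than `value`."""
    return ((value + multiple - 1) // multiple) * multiple


def launch_sizes(n_elements, elements_per_item=1, local_size=8):
    """Return (global_size, local_size) for a 1-D launch.

    elements_per_item is 1 for a scalar kernel and 4 when each work-item
    consumes four elements of the index range (the float4 case).
    """
    if local_size % 8 != 0:
        raise ValueError("local_size should be a multiple of 8 "
                         "for the auto-vectorized path")
    n_items = -(-n_elements // elements_per_item)   # ceiling division
    return round_up(n_items, local_size), local_size
```

For example, launch_sizes(1000) gives (1000, 8) and launch_sizes(1000, 4) gives (256, 8). When the global size is padded like this, the kernel needs a bounds check so the extra work-items do nothing. Dividing by 4 only applies when the vector width is taken from the global index range; a kernel that keeps one work-item per row and vectorizes its inner loop keeps the same global size.
]]>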
      <pubDate>Wed, 28 Sep 2011 13:32:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821826#M1098</guid>
      <dc:creator>Maxim_S_Intel</dc:creator>
      <dc:date>2011-09-28T13:32:09Z</dc:date>
    </item>
    <item>
      <title>vectorization: no speedup?!?</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821827#M1099</link>
      <description>&lt;P&gt;Sorry if this is not related: how can I check whether AVX is supported on my Intel CPU?&lt;/P&gt;</description>
      <pubDate>Wed, 28 Sep 2011 16:32:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821827#M1099</guid>
      <dc:creator>nurbs</dc:creator>
      <dc:date>2011-09-28T16:32:47Z</dc:date>
    </item>
    <item>
      <title>vectorization: no speedup?!?</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821828#M1100</link>
      <description>&lt;P&gt;Check this article; it has code too:&lt;/P&gt;&lt;P&gt;&lt;A href="http://software.intel.com/en-us/blogs/2011/04/14/is-avx-enabled/?wapkw=(how+to+detect+avx)" target="_blank"&gt;http://software.intel.com/en-us/blogs/2011/04/14/is-avx-enabled/?wapkw=(how+to+detect+avx)&lt;/A&gt;&lt;/P&gt;</description>
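      <![CDATA[
As a quick alternative on Linux, the kernel lists CPU features on the "flags" line of /proc/cpuinfo, and AVX appears there as "avx". The Python sketch below parses that text. It is not the code from the linked article, it assumes a Linux x86 system, and the function names are made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx(cpuinfo_text):
    """True if the 'avx' flag is listed in the given cpuinfo text."""
    return "avx" in cpu_flags(cpuinfo_text)


def host_has_avx(path="/proc/cpuinfo"):
    """Check the running machine; returns None if the file is unavailable."""
    try:
        with open(path) as handle:
            return has_avx(handle.read())
    except OSError:
        return None
```

host_has_avx() returns None on systems without /proc/cpuinfo (for example Windows or macOS), where a different method is needed.
]]>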
      <pubDate>Wed, 28 Sep 2011 16:36:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821828#M1100</guid>
      <dc:creator>Brijender_B_Intel</dc:creator>
      <dc:date>2011-09-28T16:36:46Z</dc:date>
    </item>
    <item>
      <title>vectorization: no speedup?!?</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821829#M1101</link>
      <description>Cool, thanks. I am also going to get a new PC. How can I know whether it will support AVX?&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 28 Sep 2011 19:15:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821829#M1101</guid>
      <dc:creator>nurbs</dc:creator>
      <dc:date>2011-09-28T19:15:55Z</dc:date>
    </item>
    <item>
      <title>vectorization: no speedup?!?</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821830#M1102</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;2nd generation Intel Core processors (Intel microarchitecture code name Sandy Bridge), released in Q1 2011, are the first from Intel to support Intel AVX technology.&lt;BR /&gt;You can find a detailed list of processors here, for example: &lt;A href="http://en.wikipedia.org/wiki/Sandy_Bridge_%28microarchitecture%29"&gt;http://en.wikipedia.org/wiki/Sandy_Bridge_(microarchitecture)&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;Yuri</description>
      <pubDate>Thu, 29 Sep 2011 08:58:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821830#M1102</guid>
      <dc:creator>Yuri_K_Intel</dc:creator>
      <dc:date>2011-09-29T08:58:45Z</dc:date>
    </item>
    <item>
      <title>vectorization: no speedup?!?</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821831#M1103</link>
      <description>Hi.&lt;BR /&gt;&lt;BR /&gt;Yes, I might have misunderstood this. Thanks for clearing this up :)&lt;BR /&gt;&lt;BR /&gt;Moreover, I had already paid attention to the global and local sizes,&lt;BR /&gt;so I came to the conclusion that my runtime measurement might be a little off...&lt;BR /&gt;After more thorough timing, the vectorized version shows a speedup of close to 2!&lt;BR /&gt;&lt;BR /&gt;This is still less than I expected but much better than before.&lt;BR /&gt;&lt;BR /&gt;Best</description>
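      <![CDATA[
The post does not say how the timing was redone. One common pattern is to run a few untimed warm-up iterations and then keep the best of several timed runs, sketched below in Python. The helper name best_time and its default counts are assumptions, and `run` stands for whatever launches the kernel and waits for it to complete.

```python
import time


def best_time(run, repeats=10, warmup=2):
    """Return the shortest wall-clock time, in seconds, of `repeats` calls to run()."""
    for _ in range(warmup):
        run()                          # untimed: absorbs one-time setup costs
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best
```

For an OpenCL kernel, `run` has to block until the kernel has finished (for example by calling clFinish on the queue), because clEnqueueNDRangeKernel only enqueues the work and returns before it completes.
]]>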
      <pubDate>Fri, 30 Sep 2011 08:43:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/vectorization-no-speedup/m-p/821831#M1103</guid>
      <dc:creator>eimunic</dc:creator>
      <dc:date>2011-09-30T08:43:07Z</dc:date>
    </item>
  </channel>
</rss>

