<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic CPU vs GPU optimizations in OpenCL* for CPU</title>
    <link>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781901#M412</link>
    <description>Maybe it is because OpenMP does not optimize your code for data parallelism, for example by using SIMD instructions :-)&lt;BR /&gt;Also, as you say, you are working with different kinds of memory, which is very important even on a CPU (for example, avoiding switching registers, etc.).</description>
    <pubDate>Wed, 27 Jul 2011 20:40:23 GMT</pubDate>
    <dc:creator>Polar01</dc:creator>
    <dc:date>2011-07-27T20:40:23Z</dc:date>
    <item>
      <title>CPU vs GPU optimizations</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781900#M411</link>
      <description>Hello&lt;BR /&gt;&lt;BR /&gt;I have implemented a straightforward, naive matrix 
multiplication in OpenCL with the AMD SDK. I get a speedup of around 16 on 
just an 8-core CPU system, running on the CPU only. I then applied 
some popular optimizations, such as using private and local 
memory and grouping my matrix in one dimension so that I use 
both global and local dimension sizes. Now I get a speedup of around 24 
on the same 8-core CPU.&lt;BR /&gt;&lt;BR /&gt;First, I am surprised by this much speedup, because with 
8 cores I normally get a speedup of around 8 or less with OpenMP, for 
example. So how are these figures of 16 and 24 possible?&lt;BR /&gt;&lt;BR /&gt;Second, 
I have heard that local and private memory and grouping of work items are 
optimizations only for GPUs and not for CPUs, so I 
again wonder how I get such a boost in speedup when I run only on the 
CPU.&lt;BR /&gt;&lt;BR /&gt;Third, I wonder how local and private memory and grouping 
are handled on CPUs, given that they yield a speedup: through caches, processor 
registers, or something else? It seems like magic to get so much speedup...&lt;BR /&gt;&lt;BR /&gt;I also want to know what the CPU-specific optimizations in OpenCL are.&lt;BR /&gt;&lt;BR /&gt;Please 
help me clarify this, because I am new to OpenCL and it is giving me such a big 
performance gain that I can't believe it. I have verified the results and they are 
accurate.&lt;BR /&gt;Thanks in advance</description>
      <pubDate>Wed, 27 Jul 2011 17:37:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781900#M411</guid>
      <dc:creator>akhal</dc:creator>
      <dc:date>2011-07-27T17:37:53Z</dc:date>
    </item>
    <item>
      <title>CPU vs GPU optimizations</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781901#M412</link>
      <description>Maybe it is because OpenMP does not optimize your code for data parallelism, for example by using SIMD instructions :-)&lt;BR /&gt;Also, as you say, you are working with different kinds of memory, which is very important even on a CPU (for example, avoiding switching registers, etc.).</description>
      <pubDate>Wed, 27 Jul 2011 20:40:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781901#M412</guid>
      <dc:creator>Polar01</dc:creator>
      <dc:date>2011-07-27T20:40:23Z</dc:date>
    </item>
    <item>
      <title>CPU vs GPU optimizations</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781902#M413</link>
      <description>Hi,&lt;BR /&gt;&lt;BR /&gt;It depends on how you ported your code to OpenCL.&lt;BR /&gt;One major factor could be SIMD utilization, as mentioned by Polar01. Another could be better cache utilization: when you use local memory and it all maps to the L1 cache, you might see a significant speedup.&lt;BR /&gt;&lt;BR /&gt;However, these are only assumptions, and we can't comment on the AMD SDK.&lt;BR /&gt;&lt;BR /&gt;Thanks.</description>
      <pubDate>Thu, 28 Jul 2011 05:52:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781902#M413</guid>
      <dc:creator>Evgeny_F_Intel</dc:creator>
      <dc:date>2011-07-28T05:52:23Z</dc:date>
    </item>
    <item>
      <title>CPU vs GPU optimizations</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781903#M414</link>
      <description>But is SIMD utilization or auto-vectorization possible if I haven't used OpenCL vector types, for example? And can local/private memory really boost the speedup on CPUs? I am confused because someone told me that for CPU devices there is no local memory in OpenCL, so there is no benefit, and that it only improves performance on GPUs...</description>
      <pubDate>Thu, 28 Jul 2011 17:51:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781903#M414</guid>
      <dc:creator>akhal</dc:creator>
      <dc:date>2011-07-28T17:51:23Z</dc:date>
    </item>
    <item>
      <title>CPU vs GPU optimizations</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781904#M415</link>
      <description>CPUs do have local memories: the caches. When used properly, they can yield a significant speedup.&lt;BR /&gt;&lt;BR /&gt;As already said, it is very difficult to tell where the performance numbers come from without seeing the code.&lt;BR /&gt;&lt;BR /&gt;First of all, check the correctness of the code; in some cases you can see a speedup because the multithreaded code produces different results.&lt;BR /&gt;&lt;BR /&gt;You can try the Intel OpenCL SDK together with VTune Amplifier (http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/), which the SDK supports, to try to understand the real reason.&lt;BR /&gt;&lt;BR /&gt;I can also refer you to the Intel performance guidelines document (http://www.intel.com/Assets/PDF/manual/248966.pdf), which may provide a few answers.</description>
      <pubDate>Thu, 28 Jul 2011 19:27:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781904#M415</guid>
      <dc:creator>Evgeny_F_Intel</dc:creator>
      <dc:date>2011-07-28T19:27:01Z</dc:date>
    </item>
    <item>
      <title>CPU vs GPU optimizations</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781905#M416</link>
      <description>OK, thank you for the kind and useful information :)</description>
      <pubDate>Fri, 29 Jul 2011 10:45:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781905#M416</guid>
      <dc:creator>akhal</dc:creator>
      <dc:date>2011-07-29T10:45:33Z</dc:date>
    </item>
    <item>
      <title>CPU vs GPU optimizations</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781906#M417</link>
      <description>Continuing in Evgeny's direction: using OpenCL on the CPU is very relevant for boosting the performance of data-parallel workloads.&lt;DIV&gt;To better understand how to optimize your code, I suggest reading the material in the Intel OpenCL Community, and specifically:&lt;/DIV&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="http://software.intel.com/en-us/articles/optimize-opencl-code-with-intel-gpa/"&gt;Optimize OpenCL code with the Intel Graphics Performance Analyzers 4.0&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="http://software.intel.com/en-us/articles/working-with-the-intel-vtune-amplifier-xe-2011/"&gt;Optimize OpenCL code with the Intel VTune Amplifier XE&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="http://software.intel.com/en-us/articles/tips-and-tricks-for-kernel-development/"&gt;Tips and Tricks in writing OpenCL Code for CPU&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;DIV&gt;Good luck with code optimization,&lt;/DIV&gt;&lt;DIV&gt;Arnon&lt;/DIV&gt;</description>
      <pubDate>Sat, 30 Jul 2011 09:53:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781906#M417</guid>
      <dc:creator>ARNON_P_Intel</dc:creator>
      <dc:date>2011-07-30T09:53:22Z</dc:date>
    </item>
    <item>
      <title>CPU vs GPU optimizations</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781907#M418</link>
      <description>Thank you very much for the kind hints :)</description>
      <pubDate>Sun, 31 Jul 2011 18:03:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/CPU-vs-GPU-optimizations/m-p/781907#M418</guid>
      <dc:creator>akhal</dc:creator>
      <dc:date>2011-07-31T18:03:45Z</dc:date>
    </item>
  </channel>
</rss>

