<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Memory spill in my kernel in OpenCL* for CPU</title>
    <link>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146407#M5994</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I have a question about the information reported by the compiler.&lt;/P&gt;

&lt;P&gt;I compiled my kernel on a Gen9 GT2 system and see the following information.&lt;/P&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;1&amp;gt;&lt;SPAN style="white-space:pre"&gt; &lt;/SPAN&gt;Maximum work-group size: 256&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;1&amp;gt;&lt;SPAN style="white-space:pre"&gt; &lt;/SPAN&gt;Compiler work-group size: (0, 0, 0)&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;1&amp;gt;&lt;SPAN style="white-space:pre"&gt; &lt;/SPAN&gt;Local memory size: 0&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;1&amp;gt;&lt;SPAN style="white-space:pre"&gt; &lt;/SPAN&gt;Preferred multiple of work-group size: 8&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;1&amp;gt;&lt;SPAN style="white-space:pre"&gt; &lt;/SPAN&gt;Minimum amount of private memory: 768&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;1&amp;gt;&lt;SPAN style="white-space:pre"&gt; &lt;/SPAN&gt;Amount of spill memory used by the kernel: 1024&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;I assume that private memory is mapped to registers by the compiler. My work-group size is 1. Each EU thread has 128 GRFs and each GRF is 256 bits wide, so the total size of the private space is 4096 bytes. My kernel requires only 768 bytes of GRF space, which is much less than 4096 bytes, yet the compiler generates 1024 bytes of spill memory.&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;Could you please explain why this is happening?&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;Best Regards,&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;Rajesh&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Wed, 28 Feb 2018 11:26:53 GMT</pubDate>
    <dc:creator>rajesh_k_</dc:creator>
    <dc:date>2018-02-28T11:26:53Z</dc:date>
    <item>
      <title>Memory spill in my kernel</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146407#M5994</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I have a question about the information reported by the compiler.&lt;/P&gt;

&lt;P&gt;I compiled my kernel on a Gen9 GT2 system and see the following information.&lt;/P&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;1&amp;gt;&lt;SPAN style="white-space:pre"&gt; &lt;/SPAN&gt;Maximum work-group size: 256&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;1&amp;gt;&lt;SPAN style="white-space:pre"&gt; &lt;/SPAN&gt;Compiler work-group size: (0, 0, 0)&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;1&amp;gt;&lt;SPAN style="white-space:pre"&gt; &lt;/SPAN&gt;Local memory size: 0&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;1&amp;gt;&lt;SPAN style="white-space:pre"&gt; &lt;/SPAN&gt;Preferred multiple of work-group size: 8&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;1&amp;gt;&lt;SPAN style="white-space:pre"&gt; &lt;/SPAN&gt;Minimum amount of private memory: 768&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;1&amp;gt;&lt;SPAN style="white-space:pre"&gt; &lt;/SPAN&gt;Amount of spill memory used by the kernel: 1024&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;I assume that private memory is mapped to registers by the compiler. My work-group size is 1. Each EU thread has 128 GRFs and each GRF is 256 bits wide, so the total size of the private space is 4096 bytes. My kernel requires only 768 bytes of GRF space, which is much less than 4096 bytes, yet the compiler generates 1024 bytes of spill memory.&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;Could you please explain why this is happening?&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;Best Regards,&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;Rajesh&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV&gt;&lt;SPAN style="font-size: 13.008px;"&gt;&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 28 Feb 2018 11:26:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146407#M5994</guid>
      <dc:creator>rajesh_k_</dc:creator>
      <dc:date>2018-02-28T11:26:53Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146408#M5995</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Could you please respond?&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;Thanks&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Rajesh&lt;/P&gt;</description>
      <pubDate>Thu, 01 Mar 2018 09:52:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146408#M5995</guid>
      <dc:creator>rajesh_k_</dc:creator>
      <dc:date>2018-03-01T09:52:49Z</dc:date>
    </item>
    <item>
      <title>Odd, when I went back to</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146409#M5996</link>
      <description>Odd, when I went back to reply to this question yesterday I didn't see it.

Our GPU OpenCL compiler will compile kernels for 8, 16, or 32 work items per EU thread.  You may hear this referred to as the "vectorization width", or as compiling a kernel "SIMD8", "SIMD16", or "SIMD32".  What this means is that the 4K EU thread total GRF size is shared among more than one OpenCL work item, so the effective GRF size per work item is considerably less than 4K.  (In theory our compiler could compile a kernel where a single work item executes in one EU thread, but this would be a separate codepath through our compiler, and these kernels don't generally run well on a GPU.)

You can find out what SIMD size your kernel was compiled to by querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.  Since your preferred work group size multiple is 8, your kernel was compiled SIMD8.
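
The register arithmetic above can be sketched in plain C. This is only a back-of-the-envelope upper bound, assuming the Gen9 figures quoted earlier in this thread (128 GRF registers of 32 bytes each per EU thread). The function names are hypothetical, and the share actually usable for private variables is likely smaller, because the compiler also needs registers for other purposes.

```c
/* Assumed figures, taken from the question in this thread:
   128 GRF registers per EU thread, 256 bits (32 bytes) each. */
enum { GRF_COUNT = 128, GRF_BYTES = 32 };

/* Total register file bytes owned by one EU thread. */
static unsigned grf_bytes_per_thread(void)
{
    return GRF_COUNT * GRF_BYTES;
}

/* Upper bound on register bytes per work item when simd_width
   work items (8, 16 or 32) share a single EU thread. */
static unsigned grf_bytes_per_work_item(unsigned simd_width)
{
    return grf_bytes_per_thread() / simd_width;
}
```

With these assumptions a SIMD8 kernel has at most 512 bytes of register space per work item, a SIMD16 kernel 256 bytes, and a SIMD32 kernel 128 bytes, rather than the full 4096 bytes.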

The value we return for CL_KERNEL_PRIVATE_MEM_SIZE is additional private memory that we need per work item, above and beyond what we can store in the register file.  This is typically caused by large private memory arrays that we can't put in the register file, or at least can't put in the register file efficiently.  Most frequently, these are private memory arrays that are indexed dynamically, say based on a loop counter.
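
To illustrate the difference between dynamic and constant indexing, here is a minimal sketch in plain C (the same pattern applies inside an OpenCL C kernel). The function names, the tap count and the coefficient values are made up for the example. Whether a particular compiler version keeps either variant entirely in registers is an assumption, not something this sketch demonstrates.

```c
enum { TAPS = 4 };  /* hypothetical tap count, for illustration only */

/* Dynamic indexing: the trip count n is only known at run time, so the
   index k into the per-work-item array c cannot be resolved at compile
   time. Arrays used like this are candidates for private memory. */
static float dot_dynamic(const float *x, int n)
{
    float c[TAPS] = { 0.5f, 0.25f, 0.125f, 0.125f };
    float acc = 0.0f;
    for (int k = 0; k != n; ++k)
        acc += c[k] * x[k];
    return acc;
}

/* Constant indexing: the loop is unrolled by hand and every coefficient
   is a named scalar, so each value can simply live in a register. */
static float dot_static(const float *x)
{
    const float c0 = 0.5f, c1 = 0.25f, c2 = 0.125f, c3 = 0.125f;
    return c0 * x[0] + c1 * x[1] + c2 * x[2] + c3 * x[3];
}
```

Both functions compute the same result when n equals TAPS. The second form trades flexibility for indices that are all known at compile time.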

Hope this helps!</description>
      <pubDate>Fri, 02 Mar 2018 23:07:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146409#M5996</guid>
      <dc:creator>Ben_A_Intel</dc:creator>
      <dc:date>2018-03-02T23:07:07Z</dc:date>
    </item>
    <item>
      <title>Thanks Ben.</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146410#M5997</link>
      <description>&lt;P&gt;Thanks Ben.&lt;/P&gt;

&lt;P&gt;Let me give you some background on what I am trying to achieve.&lt;/P&gt;

&lt;P&gt;I am processing a 1D vector of length 64 and performing MAC operations on its samples. The algorithm is quite lengthy: the vector has to pass through many feed-forward stages. My idea was to assign a single work item and process the vector with SIMD additions and multiplications, that is, to process 8 elements per addition or multiplication in a single work item instead of spreading them across 8 work items. I would like to use the other code path that you mentioned above:&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;"&lt;/SPAN&gt;&lt;SPAN style="font-size: 12px;"&gt;(In theory our compiler could compile a kernel where a single work item executes in one EU thread, but this would be a separate codepath through our compiler, and these kernels don't generally run well on a GPU)"&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;If you compile 8 different work items per EU thread, will that not limit the program code of the kernel? And if I break my algorithm into smaller kernels, would that not incur the VFE-&amp;gt;TSG-&amp;gt;TDG latency of spawning new threads and reading the data again?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Please suggest a better way to structure the kernel for the scenario I have described.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Thanks&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Rajesh&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 06 Mar 2018 10:44:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146410#M5997</guid>
      <dc:creator>rajesh_k_</dc:creator>
      <dc:date>2018-03-06T10:44:28Z</dc:date>
    </item>
    <item>
      <title>Quote:rajesh k. wrote:</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146411#M5998</link>
      <description>&lt;BLOCKQUOTE&gt;rajesh k. wrote:&lt;BR /&gt; 
&lt;P&gt;please suggest a better kernel programming to process the scenario i have described.&lt;/P&gt;
 &lt;/BLOCKQUOTE&gt;

Are the operations on your 1D vector all component-wise operations?  If so, the usual way to do this is to have a work group process one (or possibly more than one) vector, and each work item in the work group process one (or possibly more than one) component of the vector.  I'd start with one vector per work group and one component per work item, and play around with the partitioning from there once it's working, since different partitioning might give different performance.
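
As a rough sketch of that partitioning, the index mapping can be emulated on the host in plain C. In an actual OpenCL kernel the two loop variables would come from get_group_id(0) and get_local_id(0), with a local size of 64. The function names and the scaling operation are placeholders chosen for the example, not taken from your code.

```c
enum { VEC_LEN = 64 };  /* vector length mentioned in this thread */

/* Flat buffer offset of one component: work group `group` owns one
   vector, work item `lid` owns one component of that vector. */
static int element_offset(int group, int lid)
{
    return group * VEC_LEN + lid;
}

/* Host-side emulation of the launch: the outer loop stands in for the
   work groups, the inner loop for the work items of one group. The
   component-wise operation here is a simple scale by `gain`. */
static void scale_vectors(const float *in, float *out,
                          int num_vectors, float gain)
{
    for (int group = 0; group != num_vectors; ++group) {
        for (int lid = 0; lid != VEC_LEN; ++lid) {
            int idx = element_offset(group, lid);
            out[idx] = gain * in[idx];
        }
    }
}
```

Consecutive work items then touch consecutive floats, so a SIMD8 EU thread reads 32 contiguous bytes per access.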

If they aren't component-wise operations then there are options for cross-work-item sharing, but this is more complicated.</description>
      <pubDate>Tue, 06 Mar 2018 23:42:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146411#M5998</guid>
      <dc:creator>Ben_A_Intel</dc:creator>
      <dc:date>2018-03-06T23:42:14Z</dc:date>
    </item>
    <item>
      <title>Hi Ben,</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146412#M5999</link>
      <description>&lt;P&gt;Hi Ben,&lt;/P&gt;

&lt;P&gt;Just sharing more information about my algorithm.&lt;/P&gt;

&lt;P&gt;My kernel function is a set of functions. The output of one function flows into the next, as shown below.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;Mykerne()&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;{&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; func_1();&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; func_2();&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; func_3();&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; func_4();&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; func_5();&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;}&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;one of the functions looks like below.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;It takes two input vectors, v1[17] and v2[80], and generates an output v3[64].&lt;/P&gt;

&lt;P&gt;To generate one component of the output v3, I have to shift the input v2 by one, multiply it by v1, and sum the products.&lt;/P&gt;

&lt;P&gt;Once the output is generated, v3 becomes the input to the next function, and so on.&lt;/P&gt;

&lt;P&gt;Generating v3 &lt;SPAN style="font-size: 13.008px;"&gt;using multiple work items may not be straightforward.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;This is the simplest function of the kernel. The other functions are a little more complex.&lt;/P&gt;

&lt;P&gt;What I implemented uses vector operations to generate v3: vload16 for v1 and v2, and then "dotp" to produce v3. I run all of my kernel functions in a single work item, and I expected it to be mapped to a single EU thread so that I would have enough registers to implement this functionality without memory spill.&lt;/P&gt;

&lt;P&gt;My question: is there any way to map a single work item to a single EU thread while still having the compiler generate SIMD8 additions and multiplications? Please let me know.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;void func_1(float *v1, float *v2, float *v3 )&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;{&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;float sum;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;int i;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;SPAN style="font-size: 1em;"&gt;&amp;nbsp; for (i = 0; i &amp;lt; 64; i++)&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp;{&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum = v2[i];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[1] * v2[i - 1];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[2] * v2[i - 2];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[3] * v2[i - 3];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[4] * v2[i - 4];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[5] * v2[i - 5];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[6] * v2[i - 6];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[7] * v2[i - 7];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[8] * v2[i - 8];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[9] * v2[i - 9];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[10] * v2[i - 10];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[11] * v2[i - 11];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[12] * v2[i - 12];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[13] * v2[i - 13];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[14] * v2[i - 14];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[15] * v2[i - 15];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; sum += v1[16] * v2[i - 16];&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; v2[i] = sum;&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp;}&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp;return;&lt;BR /&gt;
	}&lt;/P&gt;</description>
      <pubDate>Wed, 07 Mar 2018 09:00:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146412#M5999</guid>
      <dc:creator>rajesh_k_</dc:creator>
      <dc:date>2018-03-07T09:00:55Z</dc:date>
    </item>
    <item>
      <title>Hi Ben,</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146413#M6000</link>
      <description>&lt;P&gt;Hi Ben,&lt;/P&gt;

&lt;P&gt;Could you please share your comments?&lt;/P&gt;

&lt;P&gt;One more observation I would like to bring to your notice:&lt;/P&gt;

&lt;P&gt;I have created a work group of size 8. I did not change my kernel, which is basically written for a single work item, but I now read and write the data based on local_x rather than global_id_x.&lt;/P&gt;

&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;my old thread space looks like this:&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;&amp;nbsp;size_t globalWorkSize[2] = { 1024, 1};&lt;/SPAN&gt;&lt;/P&gt;

&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;On the kernel side I read the data as follows:&lt;/P&gt;

&lt;P style="font-size: 13.008px;"&gt;-----------------------------------------------------------------&lt;/P&gt;

&lt;P style="font-size: 13.008px;"&gt;&lt;SPAN style="font-size: 1em;"&gt;const int global_x&amp;nbsp; = get_global_id(0);&lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="font-size: 13.008px;"&gt;const int local_x = get_local_id(0);&lt;/P&gt;

&lt;P&gt;__global float* x;&lt;/P&gt;

&lt;P&gt;float16 temp;&lt;/P&gt;

&lt;P&gt;temp = vload16(0, x + global_x * 16);&lt;/P&gt;

&lt;P&gt;-----------------------------------------------------------&lt;/P&gt;

&lt;P&gt;&lt;EM style="font-size: 13.008px;"&gt;&lt;SPAN style="font-weight: 700;"&gt;my new thread space looks like this:&lt;/SPAN&gt;&lt;/EM&gt;&lt;/P&gt;

&lt;P style="font-size: 13.008px;"&gt;&lt;SPAN style="font-size: 1em;"&gt;size_t globalWorkSize[2] = { 1024, 1};&lt;/SPAN&gt;&lt;/P&gt;

&lt;DIV style="font-size: 13.008px;"&gt;size_t local_size[2] = { 8, 1 };&lt;/DIV&gt;

&lt;DIV style="font-size: 13.008px;"&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV style="font-size: 13.008px;"&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV style="font-size: 13.008px;"&gt;i didn't reduce the global size by 8 since i have written the SIMD code to be executed within single work item itself.&lt;/DIV&gt;

&lt;DIV style="font-size: 13.008px;"&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV style="font-size: 13.008px;"&gt;but in the kernel side i have made changes to the&amp;nbsp; global_id to incorporate the local id.&lt;/DIV&gt;

&lt;DIV style="font-size: 13.008px;"&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV style="font-size: 13.008px;"&gt;-----------------------------------------------------------------------------&lt;/DIV&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;const int global_x&amp;nbsp; &amp;nbsp; &amp;nbsp;= (get_global_id(0)&amp;gt;&amp;gt;3)*8;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;const int local_x = get_local_id(0);&lt;/P&gt;

&lt;P style="font-size: 13.008px;"&gt;__global float* x;&lt;/P&gt;

&lt;P style="font-size: 13.008px;"&gt;float16 temp;&lt;/P&gt;

&lt;P style="font-size: 13.008px;"&gt;vload16(&lt;SPAN style="font-size: 13.008px;"&gt;temp&lt;/SPAN&gt;,0,x+ (&lt;SPAN style="font-size: 13.008px;"&gt;global_x+local_x)&amp;nbsp;&lt;/SPAN&gt;*16&amp;nbsp;);&lt;/P&gt;

&lt;P style="font-size: 13.008px;"&gt;-------------------------------------------------------------------&lt;/P&gt;

&lt;P style="font-size: 13.008px;"&gt;this is functionally correct. instead of reading the data based on global_index i am reading it based on local_x and modified global_x.&lt;/P&gt;

&lt;P style="font-size: 13.008px;"&gt;But i see huge performance improvement with this change. But i am not able to understand why there is an improvement.&lt;/P&gt;

&lt;P style="font-size: 13.008px;"&gt;could you please explain why this happening ?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;Best Regards,&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Rajesh&lt;/P&gt;</description>
      <pubDate>Fri, 09 Mar 2018 10:30:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146413#M6000</guid>
      <dc:creator>rajesh_k_</dc:creator>
      <dc:date>2018-03-09T10:30:30Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146414#M6001</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Could you please respond?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Rajesh&lt;/P&gt;</description>
      <pubDate>Mon, 12 Mar 2018 05:19:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146414#M6001</guid>
      <dc:creator>rajesh_k_</dc:creator>
      <dc:date>2018-03-12T05:19:07Z</dc:date>
    </item>
    <item>
      <title>Hi Rajesh, it sounds like you</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146415#M6002</link>
      <description>&lt;P&gt;Hi Rajesh, it sounds like you're on the right track and I don't have too much more to add.&amp;nbsp; Nice job with the performance improvements so far!&lt;/P&gt;

&lt;P&gt;Going from a work group size of one to a work group size of eight (or even larger) will almost always result in a performance improvement because our Execution Unit (EU) ALUs are SIMD ALUs, and our IO instructions can load or store up to one cache line (64 bytes) per EU thread per clock.&amp;nbsp; With only one work item per EU thread it's very difficult to keep either of these resources busy.&lt;/P&gt;</description>
      <pubDate>Tue, 13 Mar 2018 23:38:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146415#M6002</guid>
      <dc:creator>Ben_A_Intel</dc:creator>
      <dc:date>2018-03-13T23:38:35Z</dc:date>
    </item>
    <item>
      <title>Thank you Ben!</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146416#M6003</link>
      <description>&lt;P&gt;Thank you Ben!&lt;/P&gt;</description>
      <pubDate>Thu, 22 Mar 2018 09:25:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Memory-spill-in-my-kernel/m-p/1146416#M6003</guid>
      <dc:creator>rajesh_k_</dc:creator>
      <dc:date>2018-03-22T09:25:06Z</dc:date>
    </item>
  </channel>
</rss>

