<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic If you haven't done already, in OpenCL* for CPU</title>
    <link>https://community.intel.com/t5/OpenCL-for-CPU/Optimizing-a-Matrix-Vector-multiplication-kernel/m-p/1069550#M4385</link>
    <description>&lt;P&gt;If you haven't done so already, you should use pinned host arrays and CL_MEM_USE_HOST_PTR-style direct memory access to lower memory read/write latency to 30%. I tested with an Intel HD 400: reading and computing on 1M data took 10-15 ms; after I switched to map-and-compute with a 4096-byte-aligned host buffer, it took 3-5 ms.&lt;/P&gt;

&lt;P&gt;With a single CPU core, it takes 55 ms, which is 10 times slower.&lt;/P&gt;

&lt;P&gt;This is because the data reuse ratio is too low, and an 8192x8192 matrix also does not fit in cache.&lt;/P&gt;</description>
    <pubDate>Tue, 08 Nov 2016 09:06:00 GMT</pubDate>
    <dc:creator>Huseyin_Tugrul_B_</dc:creator>
    <dc:date>2016-11-08T09:06:00Z</dc:date>
    <item>
      <title>Optimizing a Matrix-Vector multiplication kernel</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Optimizing-a-Matrix-Vector-multiplication-kernel/m-p/1069548#M4383</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;

&lt;P&gt;I am trying to optimize a matrix-vector multiplication kernel for an Intel CPU-GPU system. I know that gemv (BLAS level 2) is memory bound, but I want to obtain the best performance possible. Here's the code for the kernel:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;__kernel void gemv(const __global float4* M,
	const __global float4* V,
	uint width, uint height,
	__global float* W,
	__local float* partialDotProduct)
{
	// Each work-group handles as many matrix rows as necessary


	for (uint y = get_group_id(0); y &amp;lt; height; y += get_num_groups(0)) {

		// Row pointer
		const __global float4* row = M + (y * width/4);

		// Each work-item accumulates as many products as necessary
		// into local variable "sum"
		float4 sum = (float4) (0.0f);

		for (uint x = get_local_id(0); x &amp;lt; width/4; x += get_local_size(0))
			sum = fma(row[x], V[x], sum);

		// Each partial dot product is stored in SLM
		partialDotProduct[get_local_id(0)] = dot(sum, (float4) 1.0f);

		// Perform parallel reduction to add each work-item's
		// partial dot product together

		for (uint stride = get_local_size(0) / 2; stride &amp;gt; 0; stride /= 2) {

			// Synchronize to make sure each work-item is done updating SLM
			barrier(CLK_LOCAL_MEM_FENCE);

			// Only the first work-items in the work-group add elements together
			if (get_local_id(0) &amp;lt; stride) {

				// Add two elements from the "partialDotProduct" array
				// and store the result in partialDotProduct[index]
				partialDotProduct[get_local_id(0)] += partialDotProduct[get_local_id(0) + stride];
			}
		}

		// Write the result of the reduction to global memory
		if (get_local_id(0) == 0)
			W[y] = partialDotProduct[0];

	}
}
&lt;/PRE&gt;

&lt;P&gt;Measuring performance with the profiling queue, I found that the CPU is faster than the GPU by 10-20% for all data sizes (matrices ranging from 512x512 to 8192x8192).&lt;/P&gt;

&lt;P&gt;Is there room for any more optimization here? Or am I correct in assuming that the performance here is bounded by the memory accesses?&lt;/P&gt;

&lt;P&gt;I am using the latest OpenCL runtimes on an Intel 6300U / HD 520 running Windows.&lt;/P&gt;

&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Sun, 30 Oct 2016 13:48:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Optimizing-a-Matrix-Vector-multiplication-kernel/m-p/1069548#M4383</guid>
      <dc:creator>Supradeep_A_</dc:creator>
      <dc:date>2016-10-30T13:48:05Z</dc:date>
    </item>
    <item>
      <title>Sorry for the delayed reply. </title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Optimizing-a-Matrix-Vector-multiplication-kernel/m-p/1069549#M4384</link>
      <description>&lt;P&gt;Sorry for the delayed reply. There is still room for optimization. We're hoping to publish at least one article on this topic within a few weeks that goes into strategies for matrix-matrix and matrix-vector optimization in more detail. Until then:&lt;/P&gt;

&lt;P&gt;As you know, optimizing OpenCL is a combination of efficient operations and efficient data transfer. To reduce the number of operations, you could consider opportunities for unrolling. There should also be a way to write back results as a vector type instead of one at a time.&lt;/P&gt;

&lt;P&gt;Our tools can give excellent feedback to help with optimization.&amp;nbsp; You may have already seen the introductory videos here: &lt;A href="https://software.intel.com/en-us/intel-opencl"&gt;https://software.intel.com/en-us/intel-opencl&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;I'm hoping to have a more detailed response soon.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Nov 2016 03:33:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Optimizing-a-Matrix-Vector-multiplication-kernel/m-p/1069549#M4384</guid>
      <dc:creator>Jeffrey_M_Intel1</dc:creator>
      <dc:date>2016-11-08T03:33:04Z</dc:date>
    </item>
    <item>
      <title>If you haven't done already,</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Optimizing-a-Matrix-Vector-multiplication-kernel/m-p/1069550#M4385</link>
      <description>&lt;P&gt;If you haven't done so already, you should use pinned host arrays and CL_MEM_USE_HOST_PTR-style direct memory access to lower memory read/write latency to 30%. I tested with an Intel HD 400: reading and computing on 1M data took 10-15 ms; after I switched to map-and-compute with a 4096-byte-aligned host buffer, it took 3-5 ms.&lt;/P&gt;

&lt;P&gt;With a single CPU core, it takes 55 ms, which is 10 times slower.&lt;/P&gt;

&lt;P&gt;This is because the data reuse ratio is too low, and an 8192x8192 matrix also does not fit in cache.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Nov 2016 09:06:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Optimizing-a-Matrix-Vector-multiplication-kernel/m-p/1069550#M4385</guid>
      <dc:creator>Huseyin_Tugrul_B_</dc:creator>
      <dc:date>2016-11-08T09:06:00Z</dc:date>
    </item>
    <item>
      <title>Do you get the same result</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Optimizing-a-Matrix-Vector-multiplication-kernel/m-p/1069551#M4386</link>
      <description>&lt;P&gt;Do you get the same result with this as without vectorization?&lt;/P&gt;

&lt;P&gt;I have a vector of size 3150 and a 3150 x 3150 matrix. I am trying to use float2, but the result I get from the multiplication differs from the scalar one.&lt;/P&gt;

&lt;P&gt;Also, do you recommend using float4 over float2?&lt;/P&gt;

&lt;P&gt;Thanks,&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jan 2017 03:48:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Optimizing-a-Matrix-Vector-multiplication-kernel/m-p/1069551#M4386</guid>
      <dc:creator>Namrata_C_Intel</dc:creator>
      <dc:date>2017-01-17T03:48:16Z</dc:date>
    </item>
    <item>
      <title>We recently published some</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Optimizing-a-Matrix-Vector-multiplication-kernel/m-p/1069552#M4387</link>
      <description>&lt;P&gt;We recently published some best known methods for matrix multiply on Gen graphics here, which may help with matrix-vector multiply.&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/articles/sgemm-ocl-opt"&gt;Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen9&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;This includes float4 and even 16-bit (half) floats.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jan 2017 04:01:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Optimizing-a-Matrix-Vector-multiplication-kernel/m-p/1069552#M4387</guid>
      <dc:creator>Jeffrey_M_Intel1</dc:creator>
      <dc:date>2017-01-17T04:01:16Z</dc:date>
    </item>
    <item>
      <title>Thank you very much for</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Optimizing-a-Matrix-Vector-multiplication-kernel/m-p/1069553#M4388</link>
      <description>&lt;P&gt;Thank you very much for sharing this, Jeffrey.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;One question, though: if my matrix dimension is not a power of 2 (3150 in my case), are these techniques still recommended?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;If not, what can I do to get the best results?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;I have tried different kernels, and the vectorized gemv gives the best latency. The only problem is that the scalar and vector versions give different results.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Thanks,&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jan 2017 08:58:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Optimizing-a-Matrix-Vector-multiplication-kernel/m-p/1069553#M4388</guid>
      <dc:creator>Namrata_C_Intel</dc:creator>
      <dc:date>2017-01-17T08:58:51Z</dc:date>
    </item>
  </channel>
</rss>

