<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi Sun, in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/4x4-matrix-1x4-vector-slower-using-MKL/m-p/1090257#M23223</link>
    <description>&lt;P&gt;Hi Sun,&lt;/P&gt;

&lt;P&gt;As you can expect, there are overheads for calling a library; these are more apparent when the amount of computation done is small.&amp;nbsp; In Intel MKL 11.2, a new feature (direct call) was introduced, designed to help address this problem (see &lt;A href="https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call"&gt;https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call&lt;/A&gt; for more details).&amp;nbsp; While sgemv is not currently supported in direct call, you can replace the gemv call with one to gemm, which is supported:&lt;BR /&gt;
	cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,4,1,4,1,(float*)m,4,v,1,0,_r,1);&lt;/P&gt;

&lt;P&gt;You’ll also need to define the preprocessor macro MKL_DIRECT_CALL_SEQ when compiling (not when linking) to enable direct call with the sequential library.&lt;/P&gt;

&lt;P&gt;Thank you,&lt;/P&gt;

&lt;P&gt;Sarah&lt;/P&gt;</description>
    <pubDate>Fri, 18 Nov 2016 17:56:03 GMT</pubDate>
    <dc:creator>Sarah_K_Intel</dc:creator>
    <dc:date>2016-11-18T17:56:03Z</dc:date>
    <item>
      <title>4x4 matrix times 1x4 vector is slower using MKL</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/4x4-matrix-1x4-vector-slower-using-MKL/m-p/1090254#M23220</link>
      <description>&lt;P&gt;Using VS2013 on an Intel(R) Core(TM) i7-4790 CPU, with MKL 2017 (sequential).&lt;/P&gt;

&lt;P&gt;Test program:&lt;/P&gt;

&lt;P&gt;#include&amp;lt;stdio.h&amp;gt;&lt;BR /&gt;
	#include&amp;lt;string.h&amp;gt;&lt;BR /&gt;
	#include"mkl.h"&lt;BR /&gt;
	#include &amp;lt;time.h&amp;gt;&lt;BR /&gt;
	#include &amp;lt;pmmintrin.h&amp;gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;void matrixMultiplicationNormal(float _r[4], float m[4][4], float v[4])&lt;BR /&gt;
	{&lt;BR /&gt;
	&amp;nbsp;float r[4];&lt;/P&gt;

&lt;P&gt;&amp;nbsp;r[0] = m[0][0] * v[0] + m[0][1] * v[1] + m[0][2] * v[2] + m[0][3] * v[3];&lt;BR /&gt;
	&amp;nbsp;r[1] = m[1][0] * v[0] + m[1][1] * v[1] + m[1][2] * v[2] + m[1][3] * v[3];&lt;BR /&gt;
	&amp;nbsp;r[2] = m[2][0] * v[0] + m[2][1] * v[1] + m[2][2] * v[2] + m[2][3] * v[3];&lt;BR /&gt;
	&amp;nbsp;r[3] = m[3][0] * v[0] + m[3][1] * v[1] + m[3][2] * v[2] + m[3][3] * v[3];&lt;/P&gt;

&lt;P&gt;&amp;nbsp;memcpy(_r, r, sizeof r);&lt;/P&gt;

&lt;P&gt;}&lt;/P&gt;

&lt;P&gt;void matrixMultiplicationsse3(float _r[4], float m[4][4], float v[4])&lt;BR /&gt;
	{&lt;BR /&gt;
	&amp;nbsp;__m128 *matrix = (__m128 *)m, *vector = (__m128 *)v;&lt;BR /&gt;
	&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp;__m128 x = _mm_mul_ps(matrix[0], *vector);&lt;BR /&gt;
	&amp;nbsp;__m128 y = _mm_mul_ps(matrix[1], *vector);&lt;BR /&gt;
	&amp;nbsp;__m128 z = _mm_mul_ps(matrix[2], *vector);&lt;BR /&gt;
	&amp;nbsp;__m128 w = _mm_mul_ps(matrix[3], *vector);&lt;BR /&gt;
	&amp;nbsp;__m128 tmp1 = _mm_hadd_ps(x, y); // = [y2+y3, y0+y1, x2+x3, x0+x1]&lt;BR /&gt;
	&amp;nbsp;__m128 tmp2 = _mm_hadd_ps(z, w); // = [w2+w3, w0+w1, z2+z3, z0+z1]&lt;/P&gt;

&lt;P&gt;&amp;nbsp;_mm_storeu_ps(_r, _mm_hadd_ps(tmp1, tmp2)); // = [w0+w1+w2+w3, z0+z1+z2+z3, y0+y1+y2+y3, x0+x1+x2+x3]&lt;BR /&gt;
	}&lt;/P&gt;

&lt;P&gt;void matrixMultiplicationMKL(float _r[4], float m[4][4], float v[4])&lt;BR /&gt;
	{&lt;BR /&gt;
	&amp;nbsp;cblas_sgemv(CblasRowMajor, CblasNoTrans,4,4,1,(float*)m,4,v,1,0,_r,1);&lt;BR /&gt;
	}&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;int main()&lt;BR /&gt;
	{&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;__declspec(align(32)) float outNormal[4] = { 0 };&lt;BR /&gt;
	&amp;nbsp;__declspec(align(32)) float outMKL[4] = { 0 };&lt;BR /&gt;
	&amp;nbsp;__declspec(align(32)) float outsse3[4] = { 0 };&lt;/P&gt;

&lt;P&gt;&amp;nbsp;__declspec(align(32)) float in[4] = { 1.0, 2.3, 5.5, 4.4 };&lt;/P&gt;

&lt;P&gt;&amp;nbsp;__declspec(align(32)) float matrix[4][4] = { 1, 0, 0, 2,&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0,1,0,2,&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;0,0,1,3,&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;0,0,0,1};&lt;BR /&gt;
	&amp;nbsp;int s1, s2,s3;&lt;BR /&gt;
	&amp;nbsp;int t = clock();&lt;BR /&gt;
	&amp;nbsp;for (int i = 0; i &amp;lt; 5000000; i++)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;matrixMultiplicationNormal(outNormal, matrix, in);&lt;BR /&gt;
	&amp;nbsp;s1 =&amp;nbsp; clock()-t;&lt;BR /&gt;
	&amp;nbsp;t = clock();&lt;BR /&gt;
	&amp;nbsp;for (int i = 0; i &amp;lt; 5000000; i++)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;matrixMultiplicationsse3(outsse3, matrix, in);&lt;BR /&gt;
	&amp;nbsp;s2 = clock() - t;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;t = clock();&lt;BR /&gt;
	&amp;nbsp;for (int i = 0; i &amp;lt; 5000000; i++)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;matrixMultiplicationMKL(outMKL, matrix, in);&lt;BR /&gt;
	&amp;nbsp;s3 = clock() - t;&lt;BR /&gt;
	&amp;nbsp;printf("use normal time=%dms\nuse sse3 time=%dms\nuse intel mkl time=%dms", s1, s2, s3);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;return 0;&lt;BR /&gt;
	}&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;use normal time =218 ms&lt;/P&gt;

&lt;P&gt;use sse3 time =156ms&lt;/P&gt;

&lt;P&gt;use intel mkl time=374ms&lt;/P&gt;

&lt;P&gt;Is Intel MKL expected to be slower for such small matrix-vector multiplications?&lt;/P&gt;</description>
      <pubDate>Fri, 18 Nov 2016 03:09:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/4x4-matrix-1x4-vector-slower-using-MKL/m-p/1090254#M23220</guid>
      <dc:creator>sun_c_</dc:creator>
      <dc:date>2016-11-18T03:09:07Z</dc:date>
    </item>
    <item>
      <title>Dear customer,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/4x4-matrix-1x4-vector-slower-using-MKL/m-p/1090255#M23221</link>
      <description>&lt;P&gt;Dear customer,&lt;/P&gt;

&lt;P&gt;For small matrices, initialization time makes up a substantial part of the measured time for matrix-matrix and matrix-vector multiplication.&amp;nbsp;A small matrix in the first call will not initialize the threads, because Intel MKL executes multi-threaded code only for sufficiently large matrices. Your post is valuable, and I will pass the feedback to the development team. If you plan to use very small matrices, smaller than 20 x 20, I recommend plain C code. Another point: in your C version, much of the time is spent in the memory copy (memcpy), not in the matrix calculation.&amp;nbsp;&lt;/P&gt;
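
&lt;P&gt;A minimal sketch of the plain C route, under the assumption that the output array never overlaps the input vector or the matrix, so the temporary buffer and the memcpy can be dropped. The name matvec4 is hypothetical and is not part of MKL.&lt;/P&gt;

&lt;PRE&gt;
```c
/* Hypothetical helper: r = m * v for a row-major 4x4 matrix.
   Assumption: r does not overlap v or m, so each result element can be
   written directly without a temporary array and a memcpy. */
void matvec4(float r[4], float m[4][4], const float v[4])
{
    r[0] = m[0][0] * v[0] + m[0][1] * v[1] + m[0][2] * v[2] + m[0][3] * v[3];
    r[1] = m[1][0] * v[0] + m[1][1] * v[1] + m[1][2] * v[2] + m[1][3] * v[3];
    r[2] = m[2][0] * v[0] + m[2][1] * v[1] + m[2][2] * v[2] + m[2][3] * v[3];
    r[3] = m[3][0] * v[0] + m[3][1] * v[1] + m[3][2] * v[2] + m[3][3] * v[3];
}
```
&lt;/PRE&gt;

&lt;P&gt;It takes the same arguments as matrixMultiplicationNormal in the question, for example matvec4(outNormal, matrix, in). If the caller may pass the same array as input and output, the temporary copy is still needed.&lt;/P&gt;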

&lt;P&gt;Best regards,&lt;BR /&gt;
	Fiona&lt;/P&gt;</description>
      <pubDate>Fri, 18 Nov 2016 04:14:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/4x4-matrix-1x4-vector-slower-using-MKL/m-p/1090255#M23221</guid>
      <dc:creator>Zhen_Z_Intel</dc:creator>
      <dc:date>2016-11-18T04:14:49Z</dc:date>
    </item>
    <item>
      <title>Thanks.</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/4x4-matrix-1x4-vector-slower-using-MKL/m-p/1090256#M23222</link>
      <description>&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 18 Nov 2016 07:44:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/4x4-matrix-1x4-vector-slower-using-MKL/m-p/1090256#M23222</guid>
      <dc:creator>sun_c_</dc:creator>
      <dc:date>2016-11-18T07:44:06Z</dc:date>
    </item>
    <item>
      <title>Hi Sun,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/4x4-matrix-1x4-vector-slower-using-MKL/m-p/1090257#M23223</link>
      <description>&lt;P&gt;Hi Sun,&lt;/P&gt;

&lt;P&gt;As you can expect, there are overheads for calling a library; these are more apparent when the amount of computation done is small.&amp;nbsp; In Intel MKL 11.2, a new feature (direct call) was introduced, designed to help address this problem (see &lt;A href="https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call"&gt;https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call&lt;/A&gt; for more details).&amp;nbsp; While sgemv is not currently supported in direct call, you can replace the gemv call with one to gemm, which is supported:&lt;BR /&gt;
	cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,4,1,4,1,(float*)m,4,v,1,0,_r,1);&lt;/P&gt;

&lt;P&gt;You’ll also need to define the preprocessor macro MKL_DIRECT_CALL_SEQ when compiling (not when linking) to enable direct call with the sequential library.&lt;/P&gt;
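
&lt;P&gt;To illustrate why a gemm call with N = 1 gives the same result as gemv, here is a small sketch that uses a naive reference routine in place of MKL. The names ref_sgemm_rowmajor and matvec4_via_gemm are hypothetical; the routine only mirrors the argument order of cblas_sgemm for row-major, non-transposed operands and makes no claim about how MKL implements it.&lt;/P&gt;

&lt;PRE&gt;
```c
/* Hypothetical reference routine, not part of MKL.
   Computes C = alpha * A * B + beta * C for row-major, non-transposed
   operands: A is m x k, B is k x n, C is m x n. */
void ref_sgemm_rowmajor(int m, int n, int k, float alpha,
                        const float *a, int lda,
                        const float *b, int ldb,
                        float beta, float *c, int ldc)
{
    for (int i = 0; i != m; ++i) {
        for (int j = 0; j != n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p != k; ++p)
                acc += a[i * lda + p] * b[p * ldb + j];
            /* When beta is zero, C is treated as output only and is not read. */
            c[i * ldc + j] = (beta == 0.0f)
                                 ? alpha * acc
                                 : alpha * acc + beta * c[i * ldc + j];
        }
    }
}

/* With n = 1, B is the 4x1 column vector v and C is the 4x1 result r,
   so the product equals the matrix-vector product computed by gemv. */
void matvec4_via_gemm(float r[4], float m[4][4], const float v[4])
{
    ref_sgemm_rowmajor(4, 1, 4, 1.0f, (const float *)m, 4, v, 1, 0.0f, r, 1);
}
```
&lt;/PRE&gt;

&lt;P&gt;The argument list 4, 1, 4, 1, m, 4, v, 1, 0, r, 1 matches the cblas_sgemm call shown above. With ldb = 1 and ldc = 1, the vector and the result are read and written as contiguous 4x1 columns.&lt;/P&gt;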

&lt;P&gt;Thank you,&lt;/P&gt;

&lt;P&gt;Sarah&lt;/P&gt;</description>
      <pubDate>Fri, 18 Nov 2016 17:56:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/4x4-matrix-1x4-vector-slower-using-MKL/m-p/1090257#M23223</guid>
      <dc:creator>Sarah_K_Intel</dc:creator>
      <dc:date>2016-11-18T17:56:03Z</dc:date>
    </item>
  </channel>
</rss>

