<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re:Why BLAS SGEMM is slow? in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1415856#M33673</link>
    <description>&lt;P&gt;Hi Anton,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for providing the verbose output.&lt;/P&gt;&lt;P&gt;Could you please attach the complete code here in the forum so that we can do a quick check on our end?&lt;/P&gt;&lt;P&gt;If you do not want to post it here, please let us know so that we can contact you privately.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Vidya.&lt;/P&gt;&lt;BR /&gt;</description>
    <pubDate>Tue, 20 Sep 2022 06:24:33 GMT</pubDate>
    <dc:creator>VidyalathaB_Intel</dc:creator>
    <dc:date>2022-09-20T06:24:33Z</dc:date>
    <item>
      <title>Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1414967#M33657</link>
      <description>&lt;P&gt;I'm comparing the performance of three approaches to matrix multiplication: a naive blocked OpenMP implementation, Eigen, and SGEMM from MKL 2021.4.0. For simplicity, all matrices are square, of type `float`, size `n x n`, and aligned to 64 bytes. The compiler is `GCC 8.3.1` with the flags `-msse4.2 -O3 -fopenmp`. The OS is `CentOS 7`.&lt;/P&gt;
&lt;P&gt;I don't understand why MKL SGEMM is the slowest. Why is a naive OpenMP implementation faster than a fancy-optimized library?&lt;/P&gt;
&lt;P&gt;**Blocked OpenMP (`BS = n / 64`):**&lt;/P&gt;
&lt;DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;#pragma&lt;/SPAN&gt; &lt;SPAN&gt;omp&lt;/SPAN&gt; &lt;SPAN&gt;parallel&lt;/SPAN&gt; &lt;SPAN&gt;for&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;#pragma&lt;/SPAN&gt; &lt;SPAN&gt;vector&lt;/SPAN&gt; &lt;SPAN&gt;aligned&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;int&lt;/SPAN&gt; &lt;SPAN&gt;i&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;0&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;i&lt;/SPAN&gt;&lt;SPAN&gt;&amp;lt;&lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;i&lt;/SPAN&gt;&lt;SPAN&gt;++)&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;int&lt;/SPAN&gt; &lt;SPAN&gt;j&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;0&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;j&lt;/SPAN&gt;&lt;SPAN&gt;&amp;lt;&lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;j&lt;/SPAN&gt;&lt;SPAN&gt;++)&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;C&lt;/SPAN&gt;&lt;SPAN&gt;[&lt;/SPAN&gt;&lt;SPAN&gt;i&lt;/SPAN&gt;&lt;SPAN&gt;*&lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;+&lt;/SPAN&gt;&lt;SPAN&gt;j&lt;/SPAN&gt;&lt;SPAN&gt;] *= &lt;/SPAN&gt;&lt;SPAN&gt;beta&lt;/SPAN&gt;&lt;SPAN&gt;;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;BR /&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;#pragma&lt;/SPAN&gt; &lt;SPAN&gt;omp&lt;/SPAN&gt; &lt;SPAN&gt;parallel&lt;/SPAN&gt; &lt;SPAN&gt;for&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;#pragma&lt;/SPAN&gt; &lt;SPAN&gt;vector&lt;/SPAN&gt; &lt;SPAN&gt;aligned&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; (&lt;/SPAN&gt;&lt;SPAN&gt;int&lt;/SPAN&gt; &lt;SPAN&gt;i&lt;/SPAN&gt;&lt;SPAN&gt; = &lt;/SPAN&gt;&lt;SPAN&gt;0&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;i&lt;/SPAN&gt;&lt;SPAN&gt; &amp;lt; &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;i&lt;/SPAN&gt;&lt;SPAN&gt;+=&lt;/SPAN&gt;&lt;SPAN&gt;BS&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; (&lt;/SPAN&gt;&lt;SPAN&gt;int&lt;/SPAN&gt; &lt;SPAN&gt;k&lt;/SPAN&gt;&lt;SPAN&gt; = &lt;/SPAN&gt;&lt;SPAN&gt;0&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;k&lt;/SPAN&gt;&lt;SPAN&gt; &amp;lt; &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;k&lt;/SPAN&gt;&lt;SPAN&gt;+=&lt;/SPAN&gt;&lt;SPAN&gt;BS&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; (&lt;/SPAN&gt;&lt;SPAN&gt;int&lt;/SPAN&gt; &lt;SPAN&gt;j&lt;/SPAN&gt;&lt;SPAN&gt; = &lt;/SPAN&gt;&lt;SPAN&gt;0&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;j&lt;/SPAN&gt;&lt;SPAN&gt; &amp;lt; &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;j&lt;/SPAN&gt;&lt;SPAN&gt;+=&lt;/SPAN&gt;&lt;SPAN&gt;BS&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; (&lt;/SPAN&gt;&lt;SPAN&gt;int&lt;/SPAN&gt; &lt;SPAN&gt;ii&lt;/SPAN&gt;&lt;SPAN&gt; = &lt;/SPAN&gt;&lt;SPAN&gt;i&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;ii&lt;/SPAN&gt;&lt;SPAN&gt; &amp;lt; &lt;/SPAN&gt;&lt;SPAN&gt;i&lt;/SPAN&gt;&lt;SPAN&gt;+&lt;/SPAN&gt;&lt;SPAN&gt;BS&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;ii&lt;/SPAN&gt;&lt;SPAN&gt;++)&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; (&lt;/SPAN&gt;&lt;SPAN&gt;int&lt;/SPAN&gt; &lt;SPAN&gt;kk&lt;/SPAN&gt;&lt;SPAN&gt; = &lt;/SPAN&gt;&lt;SPAN&gt;k&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;kk&lt;/SPAN&gt;&lt;SPAN&gt; &amp;lt; &lt;/SPAN&gt;&lt;SPAN&gt;k&lt;/SPAN&gt;&lt;SPAN&gt;+&lt;/SPAN&gt;&lt;SPAN&gt;BS&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;kk&lt;/SPAN&gt;&lt;SPAN&gt;++)&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; (&lt;/SPAN&gt;&lt;SPAN&gt;int&lt;/SPAN&gt; &lt;SPAN&gt;jj&lt;/SPAN&gt;&lt;SPAN&gt; = &lt;/SPAN&gt;&lt;SPAN&gt;j&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;jj&lt;/SPAN&gt;&lt;SPAN&gt; &amp;lt; &lt;/SPAN&gt;&lt;SPAN&gt;j&lt;/SPAN&gt;&lt;SPAN&gt;+&lt;/SPAN&gt;&lt;SPAN&gt;BS&lt;/SPAN&gt;&lt;SPAN&gt;; &lt;/SPAN&gt;&lt;SPAN&gt;jj&lt;/SPAN&gt;&lt;SPAN&gt;++)&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;C&lt;/SPAN&gt;&lt;SPAN&gt;[&lt;/SPAN&gt;&lt;SPAN&gt;ii&lt;/SPAN&gt;&lt;SPAN&gt;*&lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;+&lt;/SPAN&gt;&lt;SPAN&gt;jj&lt;/SPAN&gt;&lt;SPAN&gt;] += &lt;/SPAN&gt;&lt;SPAN&gt;alpha&lt;/SPAN&gt;&lt;SPAN&gt;*&lt;/SPAN&gt;&lt;SPAN&gt;A&lt;/SPAN&gt;&lt;SPAN&gt;[&lt;/SPAN&gt;&lt;SPAN&gt;ii&lt;/SPAN&gt;&lt;SPAN&gt;*&lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;+&lt;/SPAN&gt;&lt;SPAN&gt;kk&lt;/SPAN&gt;&lt;SPAN&gt;]*&lt;/SPAN&gt;&lt;SPAN&gt;B&lt;/SPAN&gt;&lt;SPAN&gt;[&lt;/SPAN&gt;&lt;SPAN&gt;kk&lt;/SPAN&gt;&lt;SPAN&gt;*&lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;+&lt;/SPAN&gt;&lt;SPAN&gt;jj&lt;/SPAN&gt;&lt;SPAN&gt;];&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;**Eigen**&lt;/P&gt;
&lt;DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;Eigen&lt;/SPAN&gt;&lt;SPAN&gt;::&lt;/SPAN&gt;&lt;SPAN&gt;Map&lt;/SPAN&gt;&lt;SPAN&gt;&amp;lt;&lt;/SPAN&gt;&lt;SPAN&gt;const&lt;/SPAN&gt; &lt;SPAN&gt;Eigen&lt;/SPAN&gt;&lt;SPAN&gt;::&lt;/SPAN&gt;&lt;SPAN&gt;MatrixXf&lt;/SPAN&gt;&lt;SPAN&gt;&amp;gt; &lt;/SPAN&gt;&lt;SPAN&gt;AM&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;A&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;);&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;Eigen&lt;/SPAN&gt;&lt;SPAN&gt;::&lt;/SPAN&gt;&lt;SPAN&gt;Map&lt;/SPAN&gt;&lt;SPAN&gt;&amp;lt;&lt;/SPAN&gt;&lt;SPAN&gt;const&lt;/SPAN&gt; &lt;SPAN&gt;Eigen&lt;/SPAN&gt;&lt;SPAN&gt;::&lt;/SPAN&gt;&lt;SPAN&gt;MatrixXf&lt;/SPAN&gt;&lt;SPAN&gt;&amp;gt; &lt;/SPAN&gt;&lt;SPAN&gt;BM&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;B&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;);&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;Eigen&lt;/SPAN&gt;&lt;SPAN&gt;::&lt;/SPAN&gt;&lt;SPAN&gt;Map&lt;/SPAN&gt;&lt;SPAN&gt;&amp;lt;&lt;/SPAN&gt;&lt;SPAN&gt;Eigen&lt;/SPAN&gt;&lt;SPAN&gt;::&lt;/SPAN&gt;&lt;SPAN&gt;MatrixXf&lt;/SPAN&gt;&lt;SPAN&gt;&amp;gt; &lt;/SPAN&gt;&lt;SPAN&gt;CM&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;C&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;);&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;CM&lt;/SPAN&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;SPAN&gt;noalias&lt;/SPAN&gt;&lt;SPAN&gt;() &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;beta&lt;/SPAN&gt;&lt;SPAN&gt;*&lt;/SPAN&gt;&lt;SPAN&gt;CM&lt;/SPAN&gt; &lt;SPAN&gt;+&lt;/SPAN&gt; &lt;SPAN&gt;alpha&lt;/SPAN&gt;&lt;SPAN&gt;*&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;BM&lt;/SPAN&gt; &lt;SPAN&gt;*&lt;/SPAN&gt; &lt;SPAN&gt;AM&lt;/SPAN&gt;&lt;SPAN&gt;);&lt;/SPAN&gt;&lt;SPAN&gt; // fortran order!&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
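&lt;P&gt;The operand swap in the Eigen call can be read as a transpose identity. A sketch, assuming (as the "fortran order" comment suggests) that Eigen::MatrixXf is column-major while the buffers A, B and C hold row-major data, so each Map sees the transpose of the stored matrix. The subscripted symbols are notation introduced here for what AM, BM and CM see, not names from the code:&lt;/P&gt;
&lt;PRE&gt;
```latex
\[ A_{\mathrm{col}} = A^{\top}, \qquad B_{\mathrm{col}} = B^{\top}, \qquad C_{\mathrm{col}} = C^{\top} \]
\[ \beta\, C_{\mathrm{col}} + \alpha\, B_{\mathrm{col}} A_{\mathrm{col}}
   = \beta\, C^{\top} + \alpha\, B^{\top} A^{\top}
   = \left( \beta\, C + \alpha\, A B \right)^{\top} \]
```
&lt;/PRE&gt;
&lt;P&gt;Writing the right-hand side back through the column-major map CM therefore leaves the row-major result of beta*C + alpha*A*B in the buffer C, which is the same quantity the other two variants compute.&lt;/P&gt;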
&lt;P&gt;**MKL SGEMM**&lt;/P&gt;
&lt;DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;cblas_sgemm&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;CblasRowMajor&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;CblasNoTrans&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;CblasNoTrans&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;alpha&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;A&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;B&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;beta&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;C&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;);&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;The google benchmark results for Intel Xeon Silver 4114, 2 sockets, 2 NUMA nodes:&lt;/P&gt;
&lt;PRE&gt;Benchmark                                  Time            CPU   Iterations
---------------------------------------------------------------------------
MatMul/OmpBlk/4096/64/real_time         1132 ms        1038 ms            1
MatMul/OmpBlk/16384/64/real_time       83668 ms       80612 ms            1
MatMul/OmpBlk/32768/64/real_time     1562980 ms     1492184 ms            1
MatMul/Eigen/4096/real_time              878 ms         867 ms            1
MatMul/Eigen/16384/real_time           36140 ms       31629 ms            1
MatMul/Eigen/32768/real_time          259762 ms      246788 ms            1
MatMul/Blas/4096/real_time              4091 ms        3719 ms            1
MatMul/Blas/16384/real_time           219940 ms      219581 ms            1
MatMul/Blas/32768/real_time          1773874 ms     1750015 ms            1
&lt;/PRE&gt;
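&lt;P&gt;To put the three variants on a common scale, the wall times can be converted to GFLOP/s. A minimal sketch, assuming the usual count of 2*n^3 floating-point operations for a square GEMM; gemm_flops and gemm_gflops are hypothetical helper names introduced here, not part of MKL, Eigen or the benchmark code:&lt;/P&gt;
&lt;PRE&gt;
```cpp
// Sketch only: converts a measured GEMM wall time into GFLOP/s.
// gemm_flops and gemm_gflops are hypothetical helpers, not library APIs.

// A square n x n GEMM performs n^3 multiplies and n^3 additions.
constexpr double gemm_flops(double n) {
    return 2.0 * n * n * n;
}

// Throughput in GFLOP/s for a run that took `milliseconds` of wall time.
constexpr double gemm_gflops(double n, double milliseconds) {
    return gemm_flops(n) / (milliseconds * 1.0e-3) / 1.0e9;
}

// Timings taken from the real_time column of the benchmark output, n = 4096.
constexpr double omp_blk_4096 = gemm_gflops(4096.0, 1132.0);
constexpr double eigen_4096   = gemm_gflops(4096.0, 878.0);
constexpr double blas_4096    = gemm_gflops(4096.0, 4091.0);
```
&lt;/PRE&gt;
&lt;P&gt;For n = 4096 this works out to roughly 121 GFLOP/s for the blocked OpenMP version, 157 GFLOP/s for Eigen, and 34 GFLOP/s for MKL SGEMM.&lt;/P&gt;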
&lt;P&gt;**ldd snippet:**&lt;/P&gt;
&lt;P&gt;libmkl_intel_ilp64.so.1 =&amp;gt; /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_ilp64.so.1 &lt;BR /&gt;libmkl_core.so.1 =&amp;gt; /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_core.so.1 &lt;BR /&gt;libmkl_intel_thread.so.1 =&amp;gt; /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_thread.so.1 &lt;BR /&gt;libiomp5.so =&amp;gt; /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64/libiomp5.so&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;(code formatting apparently doesn't work)&lt;/P&gt;</description>
      <pubDate>Thu, 15 Sep 2022 13:28:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1414967#M33657</guid>
      <dc:creator>AntK</dc:creator>
      <dc:date>2022-09-15T13:28:47Z</dc:date>
    </item>
    <item>
      <title>Re:Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1415203#M33658</link>
      <description>&lt;P&gt;Hi Anton,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for reaching out to us.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;&amp;gt;&amp;gt;SGEMM from MKL 2021.4.0 ... I don't understand why MKL SGEMM is the slowest&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Could you please try the latest MKL version, 2022.1.0, and see if there is any improvement?&lt;/P&gt;&lt;P&gt;It would be a great help if you could provide us with the complete sample reproducer code, the steps to reproduce the issue, and the output produced with the MKL_VERBOSE variable set (usage: &lt;STRONG&gt;export MKL_VERBOSE=1&lt;/STRONG&gt; before running the executable) so that we can check this issue from our end as well.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Vidya.&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 16 Sep 2022 09:41:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1415203#M33658</guid>
      <dc:creator>VidyalathaB_Intel</dc:creator>
      <dc:date>2022-09-16T09:41:35Z</dc:date>
    </item>
    <item>
      <title>Re: Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1415423#M33662</link>
      <description>&lt;P&gt;Hi VidyalathaB,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thank you for your kind answer. Unfortunately, I cannot change the MKL version. The code snippets contain all the information and logic you may be interested in. If it is not obvious how to work out the call signatures, I'm always glad to help. Here they are:&lt;/P&gt;
&lt;DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;void&lt;/SPAN&gt; &lt;SPAN&gt;matmat_mul_*&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;int&lt;/SPAN&gt; &lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;const&lt;/SPAN&gt; &lt;SPAN&gt;float*&lt;/SPAN&gt; &lt;SPAN&gt;A_mat&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;const&lt;/SPAN&gt; &lt;SPAN&gt;float*&lt;/SPAN&gt; &lt;SPAN&gt;B_mat&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;float*&lt;/SPAN&gt; &lt;SPAN&gt;C_out&lt;/SPAN&gt;&lt;SPAN&gt;);&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV&gt;This code will help to allocate the chunk of memory for the call:&lt;/DIV&gt;
&lt;DIV&gt;
&lt;DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; float *&lt;/SPAN&gt;&lt;SPAN&gt;A&lt;/SPAN&gt;&lt;SPAN&gt;=(&lt;/SPAN&gt;&lt;SPAN&gt;float&lt;/SPAN&gt;&lt;SPAN&gt;*)&lt;/SPAN&gt;&lt;SPAN&gt;_mm_malloc&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;*&lt;/SPAN&gt;&lt;SPAN&gt;n&lt;/SPAN&gt;&lt;SPAN&gt;*&lt;/SPAN&gt;&lt;SPAN&gt;sizeof&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;float&lt;/SPAN&gt;&lt;SPAN&gt;), &lt;/SPAN&gt;&lt;SPAN&gt;64&lt;/SPAN&gt;&lt;SPAN&gt;);&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;And this is to free the chunk:&lt;/DIV&gt;
&lt;DIV&gt;
&lt;DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;_mm_free&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;A&lt;/SPAN&gt;&lt;SPAN&gt;);&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;Please don't forget to restart the computer before running the experiments.&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV&gt;This is the MKL_VERBOSE console output for n = 1024:&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;MKL_VERBOSE oneMKL 2021.0 Update 4 Product build 20210904 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.20GHz ilp64 intel_thread&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;MKL_VERBOSE SGEMM(N,N,1024,1024,1024,0x7ffe04f66f98,0x7f831ef8c040,1024,0x7f831f38d040,1024,0x7ffe04f66fa0,0x7f831eb8b040,1024) 137.42ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;MKL_VERBOSE SGEMM(N,N,1024,1024,1024,0x7ffe04f66f98,0x7f831ef8c040,1024,0x7f831f38d040,1024,0x7ffe04f66fa0,0x7f831eb8b040,1024) 100.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;MKL_VERBOSE SGEMM(N,N,1024,1024,1024,0x7ffe04f66f98,0x7f831ef8c040,1024,0x7f831f38d040,1024,0x7ffe04f66fa0,0x7f831eb8b040,1024) 99.37ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;MKL_VERBOSE SGEMM(N,N,1024,1024,1024,0x7ffe04f66f98,0x7f831ef8c040,1024,0x7f831f38d040,1024,0x7ffe04f66fa0,0x7f831eb8b040,1024) 93.22ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV&gt;And this is for n = 4096&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;MKL_VERBOSE oneMKL 2021.0 Update 4 Product build 20210904 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.20GHz ilp64 intel_thread&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;MKL_VERBOSE SGEMM(N,N,4096,4096,4096,0x7fffbca922c8,0x7f59259f2040,4096,0x7f59299f3040,4096,0x7fffbca922d0,0x7f59219f1040,4096) 4.34s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;MKL_VERBOSE SGEMM(N,N,4096,4096,4096,0x7fffbca922c8,0x7f59259f2040,4096,0x7f59299f3040,4096,0x7fffbca922d0,0x7f59219f1040,4096) 4.19s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;MKL_VERBOSE SGEMM(N,N,4096,4096,4096,0x7fffbca922c8,0x7f59259f2040,4096,0x7f59299f3040,4096,0x7fffbca922d0,0x7f59219f1040,4096) 3.85s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;MKL_VERBOSE SGEMM(N,N,4096,4096,4096,0x7fffbca922c8,0x7f59259f2040,4096,0x7f59299f3040,4096,0x7fffbca922d0,0x7f59219f1040,4096) 3.73s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:20&lt;/FONT&gt;&lt;/DIV&gt;
&lt;/DIV&gt;</description>
      <pubDate>Sun, 18 Sep 2022 08:13:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1415423#M33662</guid>
      <dc:creator>AntK</dc:creator>
      <dc:date>2022-09-18T08:13:52Z</dc:date>
    </item>
    <item>
      <title>Re:Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1415856#M33673</link>
      <description>&lt;P&gt;Hi Anton,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for providing the verbose output.&lt;/P&gt;&lt;P&gt;Could you please attach the complete code here in the forum so that we can do a quick check on our end?&lt;/P&gt;&lt;P&gt;If you do not want to post it here, please let us know so that we can contact you privately.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Vidya.&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 20 Sep 2022 06:24:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1415856#M33673</guid>
      <dc:creator>VidyalathaB_Intel</dc:creator>
      <dc:date>2022-09-20T06:24:33Z</dc:date>
    </item>
    <item>
      <title>Re: Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1416000#M33680</link>
      <description>&lt;P&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#include &amp;lt;stdio.h&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#include &amp;lt;stdlib.h&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#include &amp;lt;math.h&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#include &amp;lt;time.h&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#include &amp;lt;sys/time.h&amp;gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#include &amp;lt;iostream&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#include &amp;lt;vector&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#include &amp;lt;algorithm&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#include &amp;lt;numeric&amp;gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#include &amp;lt;omp.h&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#include &amp;lt;mkl.h&amp;gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#include &amp;lt;immintrin.h&amp;gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#include &amp;lt;Eigen/Dense&amp;gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#include &amp;lt;benchmark/benchmark.h&amp;gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#define USECPSEC 1000000ULL&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;unsigned long long dtime_usec(unsigned long long start=0) {&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;timeval tv;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;gettimeofday(&amp;amp;tv, 0);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;int perfcheck();&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;void init_data(int n, float* A_mat, float* B_mat, float seed) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* A = A_mat;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* B = B_mat;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;for ( int i = 0 ; i &amp;lt; n ; i++) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for ( int j = 0 ; j &amp;lt; n ; j++) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;A[i*n+j]=(float)i/(float)n;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;B[i*n+j]=(float)j/(float)n;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;void verify_res(int n, const float* C1, const float* C2, int ncase)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;{&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float norm = 0.0;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* C = C2;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float rtol=1e-04, atol=1e-05;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int i = 0 ; i &amp;lt; n ; i++)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int j = 0 ; j &amp;lt; n ; j++)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;{&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;norm += (C[i*n+j]-(float)(i*j)/(float)n)*(C[i*n+j]-(float)(i*j)/(float)n);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;if (abs(C1[i*n+j]-C2[i*n+j]) &amp;gt; atol + rtol * C1[i*n+j]) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;printf("Error in (%d, %d)\n", i, j);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;printf("%d - C1: %f C2: %f\n", ncase, C1[i*n+j], C2[i*n+j]);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;throw 1;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;if (norm &amp;gt; 1e-8)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;{&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;printf("Error: %f\n", norm);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;throw 1;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;void matmat_mul(int n, const float* A_mat, const float* B_mat, float* C_out) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;// C = alpha * A x B + beta * C&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float alpha = 1.0, beta = 0.0;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* A = &amp;amp;A_mat[0];&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* B = &amp;amp;B_mat[0];&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C = &amp;amp;C_out[0];&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;for ( int i = 0 ; i &amp;lt; n ; i++) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for ( int j = 0 ; j &amp;lt; n ; j++) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;C[i*n+j] *= beta;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int k = 0 ; k &amp;lt; n ; k++) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;C[i*n+j] += alpha*A[i*n+k]*B[k*n+j];&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;void matmat_mul_simd(int n, const float* A_mat, const float* B_mat, float* C_out) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;// C = alpha * A x B + beta * C&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float alpha = 1.0, beta = 0.0;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* A = (const float*)__builtin_assume_aligned(&amp;amp;A_mat[0], 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* B = (const float*)__builtin_assume_aligned(&amp;amp;B_mat[0], 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C = (float*) __builtin_assume_aligned(&amp;amp;C_out[0], 64);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;__m128 alpha4 = _mm_set1_ps(alpha);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;__m128 beta4 = _mm_set1_ps(beta);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;for(int i=0; i&amp;lt;n; i++) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for(int j=0; j&amp;lt;n; j+=4) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;__m128 c4 = _mm_load_ps(&amp;amp;C[i*n+j]);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;c4 = _mm_mul_ps(beta4,c4);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_store_ps(&amp;amp;C[i*n+j], c4);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;for(int i=0; i&amp;lt;n; i++) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for(int k=0; k&amp;lt;n; k++) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;__m128 a4 = _mm_set1_ps(A[i*n+k]);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;a4 = _mm_mul_ps(alpha4,a4);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for(int j=0; j&amp;lt;n; j+=4) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;__m128 c4 = _mm_load_ps(&amp;amp;C[i*n+j]);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;__m128 b4 = _mm_load_ps(&amp;amp;B[k*n+j]);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;c4 = _mm_add_ps(_mm_mul_ps(a4,b4),c4);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_store_ps(&amp;amp;C[i*n+j], c4);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;void matmat_mul_omp(int n, const float* A_mat, const float* B_mat, float* C_out) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;// C = alpha * A x B + beta * C&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float alpha = 1.0, beta = 0.0;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* A = (const float*)__builtin_assume_aligned(&amp;amp;A_mat[0], 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* B = (const float*)__builtin_assume_aligned(&amp;amp;B_mat[0], 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C = (float*) __builtin_assume_aligned(&amp;amp;C_out[0], 64);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#pragma omp parallel for schedule(dynamic)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#pragma vector aligned&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int i = 0 ; i &amp;lt; n ; i++) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int j = 0 ; j &amp;lt; n ; j++) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float tmpSum = 0.0;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#pragma omp reduction (+: tmpSum)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#pragma GCC unroll 8&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int k = 0 ; k &amp;lt; n ; k++) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;tmpSum += A[i*n+k]*B[k*n+j];&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;C[i*n+j] = beta * C[i*n+j] + alpha * tmpSum;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;void matmat_mul_simd_omp(int n, const float* A_mat, const float* B_mat, float* C_out) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;// C = alpha * A x B + beta * C&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float alpha = 1.0, beta = 0.0;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* A = (const float*)__builtin_assume_aligned(&amp;amp;A_mat[0], 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* B = (const float*)__builtin_assume_aligned(&amp;amp;B_mat[0], 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C = (float*) __builtin_assume_aligned(&amp;amp;C_out[0], 64);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;__m128 alpha4 = _mm_set1_ps(alpha);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;__m128 beta4 = _mm_set1_ps(beta);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#pragma omp parallel for collapse(2)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for(int i=0; i&amp;lt;n; i++) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for(int j=0; j&amp;lt;n; j+=4) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;__m128 c4 = _mm_load_ps(&amp;amp;C[i*n+j]);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;c4 = _mm_mul_ps(beta4,c4);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_store_ps(&amp;amp;C[i*n+j], c4);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#pragma omp parallel for schedule(dynamic)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for(int i=0; i&amp;lt;n; i++) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for(int k=0; k&amp;lt;n; k++) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;__m128 a4 = _mm_set1_ps(A[i*n+k]);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;a4 = _mm_mul_ps(alpha4,a4);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for(int j=0; j&amp;lt;n; j+=4) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;__m128 c4 = _mm_load_ps(&amp;amp;C[i*n+j]);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;__m128 b4 = _mm_load_ps(&amp;amp;B[k*n+j]);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;c4 = _mm_add_ps(_mm_mul_ps(a4,b4),c4);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_store_ps(&amp;amp;C[i*n+j], c4);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;void matmat_mul_omp_blk(int n, int BS, const float* A_mat, const float* B_mat, float* C_out) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;// C = alpha * A x B + beta * C&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float alpha = 1.0, beta = 0.0;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* A = (const float*)__builtin_assume_aligned(&amp;amp;A_mat[0], 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* B = (const float*)__builtin_assume_aligned(&amp;amp;B_mat[0], 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C = (float*) __builtin_assume_aligned(&amp;amp;C_out[0], 64);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#pragma omp parallel for collapse(2)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#pragma vector aligned&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for(int i=0; i&amp;lt;n; i++)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for(int j=0; j&amp;lt;n; j++)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;C[i*n+j] *= beta;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#pragma omp parallel for schedule(dynamic)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#pragma vector aligned&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int i = 0; i &amp;lt; n; i+=BS)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int k = 0; k &amp;lt; n; k+=BS)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int j = 0; j &amp;lt; n; j+=BS)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int ii = i; ii &amp;lt; i+BS; ii++)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int kk = k; kk &amp;lt; k+BS; kk++)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int jj = j; jj &amp;lt; j+BS; jj++)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;C[ii*n+jj] += alpha*A[ii*n+kk]*B[kk*n+jj];&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;void matmat_mul_eigen(int n, const float* A_mat, const float* B_mat, float* C_out) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;// C = alpha * A x B + beta * C&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float alpha = 1.0, beta = 0.0;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* A = (const float*)__builtin_assume_aligned(&amp;amp;A_mat[0], 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* B = (const float*)__builtin_assume_aligned(&amp;amp;B_mat[0], 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C = (float*) __builtin_assume_aligned(&amp;amp;C_out[0], 64);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;// "The best code is the code I don't have to write"&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Eigen::Map&amp;lt;const Eigen::MatrixXf&amp;gt; AM(A, n, n);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Eigen::Map&amp;lt;const Eigen::MatrixXf&amp;gt; BM(B, n, n);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;Eigen::Map&amp;lt;Eigen::MatrixXf&amp;gt; CM(C, n, n);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;CM.noalias() = beta*CM + alpha*(BM * AM); // fortran order!&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;void matmat_mul_sgemm(const int n, float* A_mat, float* B_mat, float* C_out) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float alpha = 1.0, beta = 0.0;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* A = (const float*)__builtin_assume_aligned(&amp;amp;A_mat[0], 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const float* B = (const float*)__builtin_assume_aligned(&amp;amp;B_mat[0], 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C = (float*) __builtin_assume_aligned(&amp;amp;C_out[0], 64);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;// "The best code is the code I don't have to write"&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;n, n, n, alpha, A, n, B, n, beta, C, n);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;class MatMul : public benchmark::Fixture {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;protected:&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;int i=0;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;int n;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* b;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* A, * B, * C;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;public:&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;void SetUp(const ::benchmark::State&amp;amp; state) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;n = state.range(0);&lt;/FONT&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;A=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;B=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;C=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;init_data(n, A, B, (float)time(NULL));&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;void TearDown(const ::benchmark::State&amp;amp; state) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(A);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(B);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;};&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;BENCHMARK_DEFINE_F(MatMul, Verify)(benchmark::State&amp;amp; st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;n = 16;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* A2=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* B2=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C1=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C2=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C3=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C4=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C5=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C6=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C7=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;init_data(n, A2, B2, (float)time(NULL));&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;for (auto _ : st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul(n, A2, B2, C1);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_omp(n, A2, B2, C2);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_simd(n, A2, B2, C3);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_omp_blk(n, 4, A2, B2, C4);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_simd_omp(n, A2, B2, C5);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_eigen(n, A2, B2, C6);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_sgemm(n, A2, B2, C7);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;verify_res(n, C1, C2, 1);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;verify_res(n, C2, C3, 2);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;verify_res(n, C3, C4, 3);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;verify_res(n, C4, C5, 4);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;verify_res(n, C5, C6, 5);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;verify_res(n, C6, C7, 6);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(A2);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(B2);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(C1);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(C2);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(C3);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(C4);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(C5);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(C6);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(C7);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;BENCHMARK_REGISTER_F(MatMul, Verify)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;-&amp;gt;Unit(benchmark::kMillisecond)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;-&amp;gt;Arg(8)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;-&amp;gt;UseRealTime();&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;BENCHMARK_DEFINE_F(MatMul, SingleThread)(benchmark::State&amp;amp; st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (auto _ : st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul(n, A, B, C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;benchmark::DoNotOptimize(C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;benchmark::ClobberMemory();&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;BENCHMARK_DEFINE_F(MatMul, Simd)(benchmark::State&amp;amp; st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (auto _ : st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_simd(n, A, B, C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;benchmark::DoNotOptimize(C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;benchmark::ClobberMemory();&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;BENCHMARK_DEFINE_F(MatMul, Omp)(benchmark::State&amp;amp; st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (auto _ : st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_omp(n, A, B, C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;benchmark::DoNotOptimize(C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;benchmark::ClobberMemory();&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;BENCHMARK_DEFINE_F(MatMul, SimdOmp)(benchmark::State&amp;amp; st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT 
face="courier new,courier"&gt;for (auto _ : st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_simd_omp(n, A, B, C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;benchmark::DoNotOptimize(C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;benchmark::ClobberMemory();&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;BENCHMARK_DEFINE_F(MatMul, OmpBlk)(benchmark::State&amp;amp; st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;int bs = n / st.range(1);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (auto _ : st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_omp_blk(n, bs, A, B, C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;benchmark::DoNotOptimize(C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;benchmark::ClobberMemory();&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;BENCHMARK_DEFINE_F(MatMul, Eign)(benchmark::State&amp;amp; st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (auto _ : st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_eigen(n, A, B, C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;benchmark::DoNotOptimize(C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;benchmark::ClobberMemory();&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;BENCHMARK_DEFINE_F(MatMul, MklBlas)(benchmark::State&amp;amp; st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (auto _ : 
st) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_sgemm(n, A, B, C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;benchmark::DoNotOptimize(C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;benchmark::ClobberMemory();&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;int perfbench(int n) {&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* A=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* B=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;float* C=(float*)_mm_malloc(n*n*sizeof(float), 64);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;unsigned long long dt = 0;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;int nrepeats = 3;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;std::vector&amp;lt;unsigned long long&amp;gt; times(nrepeats);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;int bs = 128;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_omp_blk(n, bs, A, B, C); // warm up&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;times.clear(); times.resize(nrepeats);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int i = 0; i &amp;lt; nrepeats; i++)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;{&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;dt = dtime_usec(0);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_omp_blk(n, bs, A, B, C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;times[i] = dtime_usec(dt);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;dt = std::accumulate(times.begin(), times.end(), 0.0) / times.size() / 1000;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;std::cout &amp;lt;&amp;lt; "omp_blk time: " &amp;lt;&amp;lt; dt &amp;lt;&amp;lt; "ms" &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_eigen(n, A, B, C); // warm up&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;times.clear(); times.resize(nrepeats);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int i = 0; i &amp;lt; nrepeats; i++)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;{&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;dt = dtime_usec(0);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_eigen(n, A, B, C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;times[i] = dtime_usec(dt);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;dt = std::accumulate(times.begin(), times.end(), 0.0) / times.size() / 1000;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;std::cout &amp;lt;&amp;lt; "Eigen time: " &amp;lt;&amp;lt; dt &amp;lt;&amp;lt; "ms" &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_sgemm(n, A, B, C); // warm up&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;times.clear(); times.resize(nrepeats);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;for (int i = 0; i &amp;lt; nrepeats; i++)&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;{&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;dt = dtime_usec(0);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;matmat_mul_sgemm(n, A, B, C);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;times[i] = dtime_usec(dt);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;dt = std::accumulate(times.begin(), times.end(), 0.0) / times.size() / 1000;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;std::cout &amp;lt;&amp;lt; "MKL sgemm time: " &amp;lt;&amp;lt; dt &amp;lt;&amp;lt; "ms" &amp;lt;&amp;lt; std::endl;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(A);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(B);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;_mm_free(C);&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;return 0;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;#if 1&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;int main() &lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;{&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;const int n = 4*1024;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;perfbench(n);&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;// perfcheck();&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#else&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;int from = 512; // 1 MB&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;// int to = 2048; // 2k * 2k * 8 = 32M&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;int to = 32*1024;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;int mult = 8;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;int step = 512;&lt;/FONT&gt;&lt;/P&gt;
&lt;DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;BENCHMARK_REGISTER_F&lt;/SPAN&gt;&lt;SPAN&gt;(MatMul, SingleThread)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;Unit&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;benchmark&lt;/SPAN&gt;&lt;SPAN&gt;::kMillisecond)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;DenseRange&lt;/SPAN&gt;&lt;SPAN&gt;(from, to, step)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;UseRealTime&lt;/SPAN&gt;&lt;SPAN&gt;();&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;BR /&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;BENCHMARK_REGISTER_F&lt;/SPAN&gt;&lt;SPAN&gt;(MatMul, Simd)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;Unit&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;benchmark&lt;/SPAN&gt;&lt;SPAN&gt;::kMillisecond)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;RangeMultiplier&lt;/SPAN&gt;&lt;SPAN&gt;(mult)-&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;Range&lt;/SPAN&gt;&lt;SPAN&gt;(from, to)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;UseRealTime&lt;/SPAN&gt;&lt;SPAN&gt;();&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;BR /&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;BENCHMARK_REGISTER_F&lt;/SPAN&gt;&lt;SPAN&gt;(MatMul, Omp)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;Unit&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;benchmark&lt;/SPAN&gt;&lt;SPAN&gt;::kMillisecond)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;RangeMultiplier&lt;/SPAN&gt;&lt;SPAN&gt;(mult)-&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;Range&lt;/SPAN&gt;&lt;SPAN&gt;(from, to)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;UseRealTime&lt;/SPAN&gt;&lt;SPAN&gt;();&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;BR /&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;BENCHMARK_REGISTER_F&lt;/SPAN&gt;&lt;SPAN&gt;(MatMul, SimdOmp)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;Unit&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;benchmark&lt;/SPAN&gt;&lt;SPAN&gt;::kMillisecond)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;RangeMultiplier&lt;/SPAN&gt;&lt;SPAN&gt;(mult)-&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;Range&lt;/SPAN&gt;&lt;SPAN&gt;(from, to)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;UseRealTime&lt;/SPAN&gt;&lt;SPAN&gt;();&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;BR /&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;BENCHMARK_REGISTER_F&lt;/SPAN&gt;&lt;SPAN&gt;(MatMul, OmpBlk)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;Unit&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;benchmark&lt;/SPAN&gt;&lt;SPAN&gt;::kMillisecond)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;ArgsProduct&lt;/SPAN&gt;&lt;SPAN&gt;({&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;benchmark&lt;/SPAN&gt;&lt;SPAN&gt;::&lt;/SPAN&gt;&lt;SPAN&gt;CreateRange&lt;/SPAN&gt;&lt;SPAN&gt;(from, to, mult),&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;benchmark&lt;/SPAN&gt;&lt;SPAN&gt;::&lt;/SPAN&gt;&lt;SPAN&gt;CreateRange&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;64&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;256&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;SPAN&gt; /*multi=*/&lt;/SPAN&gt;&lt;SPAN&gt;4&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; })&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;UseRealTime&lt;/SPAN&gt;&lt;SPAN&gt;();&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;BR /&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;BENCHMARK_REGISTER_F&lt;/SPAN&gt;&lt;SPAN&gt;(MatMul, Eign)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;Unit&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;benchmark&lt;/SPAN&gt;&lt;SPAN&gt;::kMillisecond)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;RangeMultiplier&lt;/SPAN&gt;&lt;SPAN&gt;(mult)-&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;Range&lt;/SPAN&gt;&lt;SPAN&gt;(from, to)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;UseRealTime&lt;/SPAN&gt;&lt;SPAN&gt;();&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;BR /&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;BENCHMARK_REGISTER_F&lt;/SPAN&gt;&lt;SPAN&gt;(MatMul, MklBlas)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;Unit&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;benchmark&lt;/SPAN&gt;&lt;SPAN&gt;::kMillisecond)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;RangeMultiplier&lt;/SPAN&gt;&lt;SPAN&gt;(mult)-&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;Range&lt;/SPAN&gt;&lt;SPAN&gt;(from, to)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; -&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt;UseRealTime&lt;/SPAN&gt;&lt;SPAN&gt;();&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;
&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;BENCHMARK_MAIN();&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#endif&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Sep 2022 16:57:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1416000#M33680</guid>
      <dc:creator>AntK</dc:creator>
      <dc:date>2022-09-20T16:57:34Z</dc:date>
    </item>
    <item>
      <title>Re:Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1416190#M33686</link>
      <description>&lt;P&gt;Hi Anton,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for sharing the reproducer here.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;I ran the code, and the output below shows that MKL takes the least time:&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;omp_blk time: &lt;STRONG&gt;1179ms&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Eigen time: &lt;STRONG&gt;535ms&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;MKL sgemm time: &lt;STRONG&gt;172ms&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Command used:&lt;/P&gt;&lt;P&gt;g++ main.cpp -O3 -fopenmp -DMKL_ILP64 -m64 -I"/usr/local/include/eigen3/" -I"${MKLROOT}/include" -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -lbenchmark -static-libstdc++ -msse4.2&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Vidya.&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 21 Sep 2022 08:28:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1416190#M33686</guid>
      <dc:creator>VidyalathaB_Intel</dc:creator>
      <dc:date>2022-09-21T08:28:27Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1416718#M33696</link>
      <description>&lt;P&gt;I'm so excited for your success, Vidya! The code works on your machine! At this point, it may be time to research the situation. What if we both start reading this thread from the beginning to find the CPU and OS specifications?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Just in case, here are my compilation flags:&lt;/P&gt;
&lt;P&gt;CC&lt;BR /&gt;&lt;FONT face="courier new,courier"&gt;-Wall -Wno-unknown-pragmas -mavx2 -O3 -DNDEBUG -fopenmp -std=gnu++17&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;LINK&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;-lgomp -lpthread -Wl,-rpath=.../oneapi/mkl/2021.4.0/lib/intel64 .../oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_ilp64.so .../oneapi/mkl/2021.4.0/lib/intel64/libmkl_core.so .../oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_thread.so .../oneapi/compiler/latest/linux/compiler/lib/intel64/libiomp5.so -lm -ldl -lpthread -pthread -lrt&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The flags &lt;FONT face="courier new,courier"&gt;-DMKL_ILP64 -m64&lt;/FONT&gt; and &lt;FONT face="courier new,courier"&gt;-Wl,--no-as-needed&lt;/FONT&gt; didn't change anything.&lt;/P&gt;</description>
      <pubDate>Thu, 22 Sep 2022 22:28:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1416718#M33696</guid>
      <dc:creator>AntK</dc:creator>
      <dc:date>2022-09-22T22:28:22Z</dc:date>
    </item>
    <item>
      <title>Re: Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1418070#M33717</link>
      <description>&lt;P&gt;Any new ideas?&lt;/P&gt;</description>
      <pubDate>Wed, 28 Sep 2022 19:13:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1418070#M33717</guid>
      <dc:creator>AntK</dc:creator>
      <dc:date>2022-09-28T19:13:35Z</dc:date>
    </item>
    <item>
      <title>Re:Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1418287#M33720</link>
      <description>&lt;P&gt;Hi Anton,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;&amp;gt;&amp;gt;What if we both start reading this thread from the beginning to find the CPU and OS specifications?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;I apologize for the delay and I appreciate your patience.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;It took me a while to find a CentOS machine, set up the environment, and install the dependencies to test the code. Here are the results:&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Output:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;omp_blk time: &lt;STRONG&gt;1025ms&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Eigen time: &lt;STRONG&gt;453ms&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;MKL sgemm time: &lt;STRONG&gt;50ms&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Even here, MKL performs better than the others.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;CPU Model:&lt;/STRONG&gt; &lt;/P&gt;&lt;P&gt;Intel(R) Xeon(R) Platinum 8260M CPU @ 2.40GHz&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;CentOS 8 &lt;/STRONG&gt;(I could see that you are trying it on CentOS 7, but support for CentOS* 7 is deprecated in this release, Intel oneAPI 2022.1, and will be removed in a future release. Refer to: &lt;A href="https://www.intel.com/content/www/us/en/developer/articles/system-requirements/oneapi-math-kernel-library-system-requirements.html" target="_blank"&gt;https://www.intel.com/content/www/us/en/developer/articles/system-requirements/oneapi-math-kernel-library-system-requirements.html&lt;/A&gt;)&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Compilation command used:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;g++ test.cpp -Wall -Wno-unknown-pragmas -mavx2 -DNDEBUG -std=gnu++17 -O3 -fopenmp -DMKL_ILP64 -m64 -I"/usr/local/include/eigen3/" -I"/home/administrator/vidya/benchmark/include/" -I"${MKLROOT}/include" -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl -L"./benchmark/build/src" -lbenchmark&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;g++ --version &amp;gt; (GCC) 8.5.0 20210514&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;The results in my previous post were obtained on Ubuntu 18.04.6 with CPU model Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Vidya.&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 29 Sep 2022 11:38:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1418287#M33720</guid>
      <dc:creator>VidyalathaB_Intel</dc:creator>
      <dc:date>2022-09-29T11:38:18Z</dc:date>
    </item>
    <item>
      <title>Re: Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1418468#M33722</link>
      <description>&lt;P&gt;Great. Things started moving. At least I have some hope now. OS is not quite relevant (I appreciate you found CentOS anyway). Your CPU is much better though.&lt;/P&gt;
&lt;P&gt;To match things up, I reran the task on&amp;nbsp;&lt;STRONG&gt;Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;I also tuned the block size parameter; it is now 32.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;My output&lt;/STRONG&gt;:&lt;/P&gt;
&lt;P&gt;omp_blk(bs32) &lt;STRONG&gt;41&amp;nbsp;ms&lt;/STRONG&gt;&lt;BR /&gt;Eigen &lt;STRONG&gt;15 ms&lt;/STRONG&gt;&lt;BR /&gt;MKL sgemm &lt;STRONG&gt;42 ms&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The MKL time is quite similar to yours, but my Eigen is still way better and MKL is still the slowest.&lt;/P&gt;
&lt;P&gt;I tried adjusting the&amp;nbsp;MKL_ENABLE_INSTRUCTIONS variable, but it didn't help. I also &lt;STRONG&gt;increased n to 2048&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P class="sub_section_element_selectors"&gt;&lt;STRONG class="sub_section_element_selectors"&gt;CPU Model:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;AVX2&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;$ MKL_ENABLE_INSTRUCTIONS=AVX2 ./build/Release/perfdemo&lt;BR /&gt;perfbench for n = &lt;STRONG&gt;2048&lt;/STRONG&gt;&lt;BR /&gt;OpenMP Tile32 &lt;STRONG&gt;139 ms&lt;/STRONG&gt;&lt;BR /&gt;Eigen &lt;STRONG&gt;91 ms&lt;/STRONG&gt;&lt;BR /&gt;MKL sgemm &lt;STRONG&gt;456 ms&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;AVX512&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;$ MKL_ENABLE_INSTRUCTIONS=AVX512 ./build/Release/perfdemo&lt;BR /&gt;perfbench for n = &lt;STRONG&gt;2048&lt;/STRONG&gt;&lt;BR /&gt;OpenMP Tile32 &lt;STRONG&gt;137 ms&lt;/STRONG&gt;&lt;BR /&gt;Eigen &lt;STRONG&gt;108 ms&lt;/STRONG&gt;&lt;BR /&gt;MKL sgemm &lt;STRONG&gt;273 ms&lt;/STRONG&gt;&lt;/P&gt;
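&lt;P&gt;For scale, the timings above can be converted to throughput. This is only a minimal sketch, assuming the usual count of 2 * n^3 floating-point operations for an n x n matrix product; the helper name gflops is hypothetical and is not part of perfdemo:&lt;/P&gt;
&lt;PRE&gt;
```python
def gflops(n, elapsed_ms):
    """Throughput in GFLOP/s, assuming 2 * n**3 operations per n x n product."""
    return 2 * n ** 3 / (elapsed_ms / 1000.0) / 1e9


# Timings copied from the AVX512 run above (n = 2048).
timings_ms = {"OpenMP Tile32": 137, "Eigen": 108, "MKL sgemm": 273}
for name, ms in timings_ms.items():
    print(f"{name}: {gflops(2048, ms):.1f} GFLOP/s")
```
&lt;/PRE&gt;
&lt;P&gt;Comparing these figures with the theoretical peak of the machine would show how far each implementation is from what the hardware can deliver.&lt;/P&gt;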
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The Eigen time seems pretty noisy, varying by up to 5x to 10x. What if you run the binary a few times? Does the Eigen time change?&lt;/P&gt;
&lt;P&gt;I'm totally puzzled. What's going on?&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 22:20:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1418468#M33722</guid>
      <dc:creator>AntK</dc:creator>
      <dc:date>2022-09-29T22:20:12Z</dc:date>
    </item>
    <item>
      <title>Re: Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1418630#M33724</link>
      <description>&lt;P&gt;Hi Anton,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks for getting back to us.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This time I changed the n value of perfbench to 2048 and set bs to 32 (please let me know if there is any mistake here).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&amp;gt;&amp;gt;What if you run the binary a few times? Will the eigen time change?&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Sure, this is how I ran it:&lt;/P&gt;
&lt;P&gt;for i in {1..20}; do ./a.out $i; done &amp;gt; out.txt&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please find the attached file out.txt to see the output.&lt;/P&gt;
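&lt;P&gt;To judge run-to-run noise, the repeated runs can be reduced to a minimum, median, and maximum per benchmark. This is only a rough sketch: it assumes each timing line looks like the outputs posted in this thread ("Eigen time: 535ms" or "MKL sgemm 273 ms"), and the function name summarize is hypothetical:&lt;/P&gt;
&lt;PRE&gt;
```python
import re
import statistics
from collections import defaultdict

# Assumed line shape, taken from the outputs posted in this thread:
#   "Eigen time: 535ms"  or  "MKL sgemm 273 ms"
LINE = re.compile(r"^(.+?):?\s+(\d+)\s*ms\s*$")


def summarize(lines):
    """Group timings by label and return {label: (min, median, max)} in ms."""
    samples = defaultdict(list)
    for line in lines:
        match = LINE.match(line.strip())
        if match:  # lines without a timing, such as banners, are skipped
            samples[match.group(1)].append(int(match.group(2)))
    return {
        label: (min(ms), statistics.median(ms), max(ms))
        for label, ms in samples.items()
    }
```
&lt;/PRE&gt;
&lt;P&gt;For example, summarize(open("out.txt")) would give one (min, median, max) triple per benchmark label, provided the file follows the assumed format.&lt;/P&gt;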
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&amp;gt;&amp;gt;I'm totally puzzled. What's going on?&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;I tried it using MKL 2022.1.0 and could not reproduce the timings that you are getting (I guess the only difference between our environments is the MKL version being used, as everything else is almost the same). You can give it a try with the latest version, 2022.2.0, which is now available for download, and let us know if the issue still persists.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Regards,&lt;/P&gt;
&lt;P&gt;Vidya.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 30 Sep 2022 10:47:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1418630#M33724</guid>
      <dc:creator>VidyalathaB_Intel</dc:creator>
      <dc:date>2022-09-30T10:47:49Z</dc:date>
    </item>
    <item>
      <title>Re:Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1420191#M33745</link>
      <description>&lt;P&gt;Hi Anton,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;As we haven't heard back from you, could you please provide us with an update regarding the issue? Please let us know if you still observe the same timings with the latest oneMKL version which is 2022.2.0.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Vidya.&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 07 Oct 2022 05:14:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1420191#M33745</guid>
      <dc:creator>VidyalathaB_Intel</dc:creator>
      <dc:date>2022-10-07T05:14:25Z</dc:date>
    </item>
    <item>
      <title>Re: Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1420194#M33746</link>
      <description>&lt;P&gt;Hi Vidya,&lt;/P&gt;
&lt;P&gt;On my workstation I cannot update MKL and installing everything locally will take too much time, which I need elsewhere. You cannot use an older MKL version either. Therefore, I'm postponing the investigation and waiting for the deployment team to update my libs later.&lt;/P&gt;
&lt;P&gt;My results are quite consistent though.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Cheers,&lt;/P&gt;
&lt;P&gt;Anton&lt;/P&gt;</description>
      <pubDate>Fri, 07 Oct 2022 05:20:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1420194#M33746</guid>
      <dc:creator>AntK</dc:creator>
      <dc:date>2022-10-07T05:20:16Z</dc:date>
    </item>
    <item>
      <title>Re:Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1420688#M33755</link>
      <description>&lt;P&gt;Hi Anton,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;&amp;gt;&amp;gt;Therefore, I'm postponing the investigation and waiting for the deployment team to update my libs later.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Could you please let us know if you (or your company or institution) have priority support? If yes, we would recommend you post the issue at &lt;A href="https://supporttickets.intel.com/servicecenter?lang=en-US" target="_blank"&gt;https://supporttickets.intel.com/servicecenter?lang=en-US&lt;/A&gt; &lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;If not, as per your request we can postpone it and close this thread for now.&lt;/P&gt;&lt;P&gt;Please do let us know.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Vidya.&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 10 Oct 2022 08:55:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1420688#M33755</guid>
      <dc:creator>VidyalathaB_Intel</dc:creator>
      <dc:date>2022-10-10T08:55:38Z</dc:date>
    </item>
    <item>
      <title>Re: Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1420866#M33764</link>
      <description>&lt;P&gt;OK. Let's close it.&lt;/P&gt;</description>
      <pubDate>Mon, 10 Oct 2022 22:31:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1420866#M33764</guid>
      <dc:creator>AntK</dc:creator>
      <dc:date>2022-10-10T22:31:25Z</dc:date>
    </item>
    <item>
      <title>Re:Why BLAS SGEMM is slow?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1420951#M33765</link>
      <description>&lt;P&gt;Hi Anton,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;&amp;gt;&amp;gt;OK. Let's close it.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for the confirmation!&lt;/P&gt;&lt;P&gt;We are closing this thread for now. Please post a new question if you need any additional assistance from Intel, as this thread will no longer be monitored.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Vidya.&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 11 Oct 2022 04:29:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Why-BLAS-SGEMM-is-slow/m-p/1420951#M33765</guid>
      <dc:creator>VidyalathaB_Intel</dc:creator>
      <dc:date>2022-10-11T04:29:49Z</dc:date>
    </item>
  </channel>
</rss>

