<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Blas Performance in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Blas-Performance/m-p/954721#M15424</link>
    <description>&lt;P&gt;Hi there,&lt;/P&gt;
&lt;P&gt;I am running cblas routines on an older Ubuntu 12.04 (64bit) machine, Intel Core 2 Duo (E6600@2.4 GHz) using the latest 11.0 MKL.&lt;/P&gt;
&lt;P&gt;For data of size &amp;gt; 10 MB, the performance of&lt;BR /&gt;saxpy is 0.9 Gflops, e.g. n =&amp;nbsp;16777216, t = 0.039717s, where the opcount = 2 * n.&lt;BR /&gt;sdot is 1.4 Gflops, e.g. n =&amp;nbsp;16777216, t = 0.024379s, where the opcount = 2 * n - 1.&lt;BR /&gt;sgemv is 2.5 Gflops, e.g. m,n = 4096, t = 0.021503s, where the opcount = (2 * n - 1) * m.&lt;BR /&gt;&lt;BR /&gt;However, in the case of sgemm, the performance exceeds 35 Gflops, e.g. m,n,k = 4096, t =&amp;nbsp;4.114639s, where the opcount = (2*k-1)*m*n.&lt;BR /&gt;Yet this should be impossible, as the peak performance of the E6600 is 19.2 Gflops for single precision.&lt;/P&gt;
&lt;P&gt;lda,ldb,ldc = 4096, alpha=1, beta=0 and&lt;BR /&gt;cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, 4096, 4096, 4096, 1.0, A, 4096, B, 4096, 0.0, C, 4096);&lt;BR /&gt;I have verified the results for smaller sizes.&lt;/P&gt;
&lt;P&gt;Could someone please tell me how this is possible?&lt;/P&gt;
&lt;P&gt;Thanks a lot,&lt;BR /&gt;Cem&lt;/P&gt;</description>
    <pubDate>Tue, 26 Feb 2013 16:37:42 GMT</pubDate>
    <dc:creator>Cem_Savas_B_</dc:creator>
    <dc:date>2013-02-26T16:37:42Z</dc:date>
    <item>
      <title>Blas Performance</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Blas-Performance/m-p/954721#M15424</link>
      <description>&lt;P&gt;Hi there,&lt;/P&gt;
&lt;P&gt;I am running cblas routines on an older Ubuntu 12.04 (64bit) machine, Intel Core 2 Duo (E6600@2.4 GHz) using the latest 11.0 MKL.&lt;/P&gt;
&lt;P&gt;For data of size &amp;gt; 10 MB, the performance of&lt;BR /&gt;saxpy is 0.9 Gflops, e.g. n =&amp;nbsp;16777216, t = 0.039717s, where the opcount = 2 * n.&lt;BR /&gt;sdot is 1.4 Gflops, e.g. n =&amp;nbsp;16777216, t = 0.024379s, where the opcount = 2 * n - 1.&lt;BR /&gt;sgemv is 2.5 Gflops, e.g. m,n = 4096, t = 0.021503s, where the opcount = (2 * n - 1) * m.&lt;BR /&gt;&lt;BR /&gt;However, in the case of sgemm, the performance exceeds 35 Gflops, e.g. m,n,k = 4096, t =&amp;nbsp;4.114639s, where the opcount = (2*k-1)*m*n.&lt;BR /&gt;Yet this should be impossible, as the peak performance of the E6600 is 19.2 Gflops for single precision.&lt;/P&gt;
&lt;P&gt;lda,ldb,ldc = 4096, alpha=1, beta=0 and&lt;BR /&gt;cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, 4096, 4096, 4096, 1.0, A, 4096, B, 4096, 0.0, C, 4096);&lt;BR /&gt;I have verified the results for smaller sizes.&lt;/P&gt;
&lt;P&gt;Could someone please tell me how this is possible?&lt;/P&gt;
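&lt;P&gt;A minimal sketch of how timings like the ones above translate into rates, assuming each time t covers exactly one call and using the opcount formulas quoted in this post. The helper name gflops is illustrative and not part of MKL.&lt;/P&gt;

```python
def gflops(opcount, seconds):
    """Convert an operation count and a wall-clock time into Gflop/s."""
    return opcount / seconds / 1e9


# Sizes and timings quoted in the post.
n = 16777216          # vector length for saxpy and sdot
m = k = 4096          # matrix dimensions for sgemv and sgemm

rates = {
    "saxpy": gflops(2 * n, 0.039717),
    "sdot": gflops(2 * n - 1, 0.024379),
    "sgemv": gflops((2 * m - 1) * m, 0.021503),
    "sgemm": gflops((2 * k - 1) * m * m, 4.114639),
}

for name, rate in rates.items():
    print(f"{name:6s} {rate:6.2f} Gflop/s")
```

&lt;P&gt;Recomputing the rates this way is a quick cross-check that the same opcount and timing conventions were used for every routine.&lt;/P&gt;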
&lt;P&gt;Thanks a lot,&lt;BR /&gt;Cem&lt;/P&gt;</description>
      <pubDate>Tue, 26 Feb 2013 16:37:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Blas-Performance/m-p/954721#M15424</guid>
      <dc:creator>Cem_Savas_B_</dc:creator>
      <dc:date>2013-02-26T16:37:42Z</dc:date>
    </item>
    <item>
      <title>Dear Customer,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Blas-Performance/m-p/954722#M15425</link>
      <description>&lt;P&gt;Cem,&lt;/P&gt;
&lt;P&gt;Can you please provide me with the test case you used to verify your results? I'll test it and let you know my results and comments.&lt;/P&gt;
&lt;P&gt;Thank you,&lt;/P&gt;
&lt;P&gt;Sridevi&lt;/P&gt;</description>
      <pubDate>Tue, 26 Feb 2013 20:38:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Blas-Performance/m-p/954722#M15425</guid>
      <dc:creator>Sridevi_A_Intel</dc:creator>
      <dc:date>2013-02-26T20:38:00Z</dc:date>
    </item>
    <item>
      <title>Cem,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Blas-Performance/m-p/954723#M15426</link>
      <description>&lt;P&gt;Cem,&lt;/P&gt;
&lt;P&gt;I notice that you are running your test case on a dual-core machine. MKL uses both cores by default. Can you please set "export MKL_NUM_THREADS=1" to measure the performance on a single core?&lt;/P&gt;
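&lt;P&gt;A rough sketch of the peak arithmetic behind this suggestion. It assumes 8 single-precision flops per cycle per core (one 4-wide SSE add plus one 4-wide SSE multiply), which is an assumption about the E6600 and is not stated in this thread. The function name peak_gflops is illustrative.&lt;/P&gt;

```python
def peak_gflops(clock_ghz, flops_per_cycle, cores):
    """Theoretical peak rate in Gflop/s for the given core count."""
    return clock_ghz * flops_per_cycle * cores


CLOCK_GHZ = 2.4            # E6600 clock, from the original post
SP_FLOPS_PER_CYCLE = 8     # assumption: 4-wide SSE add plus 4-wide SSE multiply

one_core = peak_gflops(CLOCK_GHZ, SP_FLOPS_PER_CYCLE, 1)
two_cores = peak_gflops(CLOCK_GHZ, SP_FLOPS_PER_CYCLE, 2)

print(f"peak on one core:  {one_core:.1f} Gflop/s")
print(f"peak on two cores: {two_cores:.1f} Gflop/s")
```

&lt;P&gt;Under that assumption, the 19.2 Gflops figure quoted in the question corresponds to a single core, and a run that uses both cores has twice that as its ceiling.&lt;/P&gt;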
&lt;P&gt;-Sridevi&lt;/P&gt;</description>
      <pubDate>Mon, 01 Apr 2013 21:55:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Blas-Performance/m-p/954723#M15426</guid>
      <dc:creator>Sridevi_A_Intel</dc:creator>
      <dc:date>2013-04-01T21:55:54Z</dc:date>
    </item>
  </channel>
</rss>

