<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: MKL "only" twice faster on a 8-Cores Machine in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-quot-only-quot-twice-faster-on-a-8-Cores-Machine/m-p/895163#M10835</link>
    <description>Sparse matrix multiplication is typically memory-bandwidth limited, with a high cache miss rate. In such cases, you may find that performance saturates at 2 or 4 threads. You should test the KMP_AFFINITY environment variable settings KMP_AFFINITY=compact and KMP_AFFINITY=scatter; the two are likely to behave the same, except at 4 threads.&lt;BR /&gt;Disabling second/alternate sector prefetch might also improve performance. I don't have access to Mac-specific information on this. A few platforms may offer a BIOS setup option. Without that, on Linux, it requires root privilege and an application to alter MSR (model-specific register) settings.&lt;BR /&gt;</description>
    <pubDate>Wed, 30 Apr 2008 21:11:22 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2008-04-30T21:11:22Z</dc:date>
    <item>
      <title>MKL "only" twice faster on a 8-Cores Machine</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-quot-only-quot-twice-faster-on-a-8-Cores-Machine/m-p/895162#M10834</link>
      <description>Hello!
&lt;BR /&gt;
&lt;BR /&gt;I wrote a Conjugate Gradient solver to solve big sparse systems and I parallelized it with SSE and OpenMP.
&lt;BR /&gt;
&lt;BR /&gt;I'm working on an 8-core Mac Pro, and I reached a speedup of 9.0x in some cases (with respect to my non-parallel implementation).
&lt;BR /&gt;I wanted to compare my results with MKL, so I used the dcg_init, dcg_check, dcg and dcg_get routines to solve my system.
&lt;BR /&gt;For the crucial part of the Conjugate Gradient (the sparse matrix-vector multiplication) I used the mkl_dcsrmv routine.
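&lt;BR /&gt;For reference, the kernel in question can be sketched in a few lines. The Python below only illustrates the operation y := alpha*A*x + beta*y on a CSR matrix; it is not MKL's implementation. The function name csr_matvec and the 3-array, zero-based layout (values, col_idx, row_ptr) are assumptions made for the example.

```python
def csr_matvec(values, col_idx, row_ptr, x, alpha=1.0, beta=0.0, y=None):
    """Return alpha*A*x + beta*y for A stored in 3-array, zero-based CSR."""
    n_rows = len(row_ptr) - 1
    if y is None:
        y = [0.0] * n_rows
    result = []
    for i in range(n_rows):
        # The nonzeros of row i are values[row_ptr[i]:row_ptr[i + 1]].
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read at the scattered positions col_idx[k]; this indirect
            # access is one reason the kernel tends to miss in cache.
            acc += values[k] * x[col_idx[k]]
        result.append(alpha * acc + beta * y[i])
    return result


# Small 3x3 example: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))  # [7.0, 6.0, 17.0]
```

&lt;BR /&gt;Each nonzero contributes one multiply and one add but needs a value, a column index and an entry of x to be loaded, so the loop does little arithmetic per byte read.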
&lt;BR /&gt;
&lt;BR /&gt;I set the environment variable OMP_NUM_THREADS to 8 in my shell and I checked with a profiler that all 8 cores were running at 100%. Unfortunately, with MKL I get a speedup of "only" 2.0x (with respect to MKL running serially).
&lt;BR /&gt;
&lt;BR /&gt;The matrix of my problem is square and sparse with ~2400000 non-zero elements and ~35000 rows.
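&lt;BR /&gt;To give a feel for the size, here is a back-of-the-envelope estimate of the memory traffic of one sparse matrix-vector product with these dimensions. It assumes CSR storage with 8-byte values and 4-byte indices, and that x and y are each touched once; those byte sizes and the variable names are assumptions, not measurements.

```python
nnz = 2_400_000   # approximate number of non-zero elements
rows = 35_000     # approximate number of rows (square matrix)

VALUE_BYTES = 8   # assumed: double-precision values
INDEX_BYTES = 4   # assumed: 32-bit column indices and row pointers

# Matrix data: one value and one column index per nonzero, plus row pointers.
matrix_bytes = nnz * (VALUE_BYTES + INDEX_BYTES) + (rows + 1) * INDEX_BYTES
# Vectors: x read once and y written once (an optimistic assumption for x).
vector_bytes = 2 * rows * VALUE_BYTES

total_bytes = matrix_bytes + vector_bytes
flops = 2 * nnz   # one multiply and one add per nonzero
intensity = flops / total_bytes

print(f"{total_bytes / 1e6:.1f} MB per product, {intensity:.2f} flop/byte")
# prints: 29.5 MB per product, 0.16 flop/byte
```

&lt;BR /&gt;Under these assumptions every product streams roughly 29.5 MB for 4.8 million floating-point operations, so the time per iteration is likely to depend more on memory throughput than on the number of cores.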
&lt;BR /&gt;
&lt;BR /&gt;Am I doing something wrong, or am I forgetting something? Or is my problem too small to see a big performance improvement with MKL?
&lt;BR /&gt;
&lt;BR /&gt;Thanks in advance!
&lt;BR /&gt;Best!
&lt;BR /&gt;
&lt;BR /&gt;dario
&lt;BR /&gt;</description>
      <pubDate>Wed, 30 Apr 2008 16:39:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-quot-only-quot-twice-faster-on-a-8-Cores-Machine/m-p/895162#M10834</guid>
      <dc:creator>dar_io</dc:creator>
      <dc:date>2008-04-30T16:39:41Z</dc:date>
    </item>
    <item>
      <title>Re: MKL "only" twice faster on a 8-Cores Machine</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-quot-only-quot-twice-faster-on-a-8-Cores-Machine/m-p/895163#M10835</link>
      <description>Sparse matrix multiplication is typically memory-bandwidth limited, with a high cache miss rate. In such cases, you may find that performance saturates at 2 or 4 threads. You should test the KMP_AFFINITY environment variable settings KMP_AFFINITY=compact and KMP_AFFINITY=scatter; the two are likely to behave the same, except at 4 threads.&lt;BR /&gt;Disabling second/alternate sector prefetch might also improve performance. I don't have access to Mac-specific information on this. A few platforms may offer a BIOS setup option. Without that, on Linux, it requires root privilege and an application to alter MSR (model-specific register) settings.&lt;BR /&gt;</description>
      <pubDate>Wed, 30 Apr 2008 21:11:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-quot-only-quot-twice-faster-on-a-8-Cores-Machine/m-p/895163#M10835</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2008-04-30T21:11:22Z</dc:date>
    </item>
    <item>
      <title>Re: MKL "only" twice faster on a 8-Cores Machine</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-quot-only-quot-twice-faster-on-a-8-Cores-Machine/m-p/895164#M10836</link>
      <description>Thank you very much for your answer!
&lt;BR /&gt;I will check and redo the tests following your advice!
&lt;BR /&gt;Googling a bit, I discovered that "...thread affinity can have a dramatic effect on the application speed" (from intel.com).
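&lt;BR /&gt;A minimal sketch of how such a re-test could be organized: the Python below builds one environment per combination of thread count and affinity setting. The helper name sweep_configs and the binary ./cg_solver in the comment are hypothetical, and whether the OpenMP runtime on a given platform honors KMP_AFFINITY is an assumption to verify.

```python
import itertools
import os


def sweep_configs(thread_counts=(1, 2, 4, 8), affinities=("compact", "scatter")):
    """Yield one environment dict per (threads, affinity) combination."""
    for threads, affinity in itertools.product(thread_counts, affinities):
        env = dict(os.environ)            # start from the current environment
        env["OMP_NUM_THREADS"] = str(threads)
        env["KMP_AFFINITY"] = affinity
        yield env


configs = list(sweep_configs())
for env in configs:
    print(env["OMP_NUM_THREADS"], env["KMP_AFFINITY"])
    # To time a run, pass env to the solver process, for example:
    # subprocess.run(["./cg_solver"], env=env, check=True)  # hypothetical binary
```

&lt;BR /&gt;Timing each configuration separately shows where the speedup stops growing, for example at 2 or 4 threads, and whether the affinity setting changes that point.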
&lt;BR /&gt;
&lt;BR /&gt;Best,
&lt;BR /&gt;
&lt;BR /&gt;dario</description>
      <pubDate>Thu, 01 May 2008 07:11:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-quot-only-quot-twice-faster-on-a-8-Cores-Machine/m-p/895164#M10836</guid>
      <dc:creator>dar_io</dc:creator>
      <dc:date>2008-05-01T07:11:57Z</dc:date>
    </item>
  </channel>
</rss>

