<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Mkl introduced optimized in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/mkl-dcsrmv-slower-than-openMP-implementation/m-p/936694#M14148</link>
    <description>&lt;P&gt;MKL introduced an optimized dcsrmv only a year ago, so your conclusion would be expected for earlier versions.&lt;/P&gt;

&lt;P&gt;If you call MKL inside a parallel region, the default is not to use additional threads.&lt;/P&gt;</description>
    <pubDate>Wed, 05 Mar 2014 03:30:00 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2014-03-05T03:30:00Z</dc:date>
    <item>
      <title>mkl_dcsrmv slower than openMP implementation</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/mkl-dcsrmv-slower-than-openMP-implementation/m-p/936693#M14147</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I'm trying to find the fastest way to do a multithreaded sparse matrix-vector multiply. I've written some benchmarking code to form a large random sparse matrix in CSR format, and then time 3 different implementations to compute y = y + A*x. I have a serial implementation, an openMP implementation, and mkl_dcsrmv. I'm computing the average and minimum time over a number of runs, say, 10.&lt;/P&gt;
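&lt;P&gt;For readers unfamiliar with CSR, here is a minimal sketch of y = y + A*x over a CSR matrix, written in Python for brevity. It is not the attached benchmark code; the function name csr_spmv_add and the 0-based row_ptr, col_idx and values arrays are assumptions made for this illustration. Each row writes only its own entry of y, which is why the outer loop over rows is the natural one to parallelize.&lt;/P&gt;

&lt;PRE&gt;
```python
def csr_spmv_add(row_ptr, col_idx, values, x, y):
    """Compute y = y + A*x in place for a CSR matrix A (0-based indexing).

    Hypothetical helper for illustration only.
    """
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        # Nonzeros of row i live at positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] += acc


# A = [[1, 0, 2],
#      [0, 3, 0]]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
values = [1.0, 2.0, 3.0]
x = [1.0, 1.0, 1.0]
y = [10.0, 20.0]
csr_spmv_add(row_ptr, col_idx, values, x, y)
print(y)
```
&lt;/PRE&gt;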

&lt;P&gt;Strangely, though, the OpenMP implementation always beats MKL. For the matrix sizes in the code, OpenMP has a minimum time of 0.199272 seconds, while MKL has a minimum time of 0.249399 seconds over 10 runs. This is for a matrix with about 256 million nonzeros.&lt;/P&gt;

&lt;P&gt;I'm running this on a machine with 32 cores. I've adjusted the number of threads and played with the KMP_AFFINITY environment variable. The openMP code does better in every case.&lt;/P&gt;

&lt;P&gt;Any idea why I'm getting these results? Perhaps I'm using MKL sub-optimally? Any help would be greatly appreciated.&lt;/P&gt;

&lt;P&gt;I've attached the code I'm running. I compile with "icc -mkl -openmp rand_mat.c"&lt;/P&gt;

&lt;P&gt;Thanks,&lt;/P&gt;

&lt;P&gt;AJ&lt;/P&gt;</description>
      <pubDate>Wed, 05 Mar 2014 01:29:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/mkl-dcsrmv-slower-than-openMP-implementation/m-p/936693#M14147</guid>
      <dc:creator>AJ_F_</dc:creator>
      <dc:date>2014-03-05T01:29:28Z</dc:date>
    </item>
    <item>
      <title>Mkl introduced optimized</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/mkl-dcsrmv-slower-than-openMP-implementation/m-p/936694#M14148</link>
      <description>&lt;P&gt;MKL introduced an optimized dcsrmv only a year ago, so your conclusion would be expected for earlier versions.&lt;/P&gt;

&lt;P&gt;If you call MKL inside a parallel region, the default is not to use additional threads.&lt;/P&gt;</description>
      <pubDate>Wed, 05 Mar 2014 03:30:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/mkl-dcsrmv-slower-than-openMP-implementation/m-p/936694#M14148</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-03-05T03:30:00Z</dc:date>
    </item>
    <item>
      <title>I'm using Intel Composer XE</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/mkl-dcsrmv-slower-than-openMP-implementation/m-p/936695#M14149</link>
      <description>&lt;P&gt;I'm using Intel Composer XE 2013 SP1, which would come with MKL 11.1, so the version shouldn't be an issue, yes?&lt;/P&gt;

&lt;P&gt;I'm not calling MKL in a parallel region. I use OpenMP only in my own SpMV implementation. Also, I can see MKL speed up as I add threads, so multiple threads are definitely being used.&lt;/P&gt;

&lt;P&gt;What else might be causing it? Are there any tricks I'm missing to maximize the mkl speedup?&lt;/P&gt;

&lt;P&gt;AJ&lt;/P&gt;</description>
      <pubDate>Wed, 05 Mar 2014 04:07:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/mkl-dcsrmv-slower-than-openMP-implementation/m-p/936695#M14149</guid>
      <dc:creator>AJ_F_</dc:creator>
      <dc:date>2014-03-05T04:07:49Z</dc:date>
    </item>
    <item>
      <title>AJ,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/mkl-dcsrmv-slower-than-openMP-implementation/m-p/936696#M14150</link>
      <description>&lt;P&gt;AJ,&lt;/P&gt;

&lt;P&gt;Thanks for the test code. We will look into it further. By the way, which processor did you see this issue on?&lt;/P&gt;

&lt;P&gt;regards,&lt;BR /&gt;
	Chao&lt;/P&gt;</description>
      <pubDate>Wed, 05 Mar 2014 06:09:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/mkl-dcsrmv-slower-than-openMP-implementation/m-p/936696#M14150</guid>
      <dc:creator>Chao_Y_Intel</dc:creator>
      <dc:date>2014-03-05T06:09:30Z</dc:date>
    </item>
    <item>
      <title>I'm running on an Intel(R)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/mkl-dcsrmv-slower-than-openMP-implementation/m-p/936697#M14151</link>
      <description>&lt;P&gt;I'm running on an&amp;nbsp;Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz&lt;/P&gt;

&lt;P&gt;I have 32 cores over 4 sockets.&lt;/P&gt;</description>
      <pubDate>Wed, 05 Mar 2014 09:32:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/mkl-dcsrmv-slower-than-openMP-implementation/m-p/936697#M14151</guid>
      <dc:creator>AJ_F_</dc:creator>
      <dc:date>2014-03-05T09:32:30Z</dc:date>
    </item>
    <item>
      <title>Hi AJ, </title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/mkl-dcsrmv-slower-than-openMP-implementation/m-p/936698#M14152</link>
      <description>&lt;P&gt;Hi AJ,&amp;nbsp;&lt;/P&gt;

&lt;P&gt;The engineer who owns this function has investigated the performance problem.&amp;nbsp;&lt;BR /&gt;
	The matrix uses unsorted column indices, which leads to ineffective cache utilization during the row-by-vector multiplication.&amp;nbsp;&lt;BR /&gt;
	This also explains why increasing the number of parallel jobs yields more GFlop/s: jobs waiting on cache misses give way to other jobs whose data is already in the cache.&amp;nbsp;&lt;BR /&gt;
	So the suggestion is to sort the column indices before calling MKL CSRMV. You should then get the full performance benefit from MKL.&amp;nbsp;&lt;/P&gt;
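
&lt;P&gt;As a rough illustration of the sorting step, here is a minimal Python sketch. It is not MKL code; the function name sort_csr_rows and the 0-based row_ptr, col_idx and values arrays are assumptions made for this example. It puts each row's column indices into ascending order and moves the matching values along with them. The same per-row sort would need to be applied to the C arrays before the CSRMV call.&lt;/P&gt;

&lt;PRE&gt;
```python
def sort_csr_rows(row_ptr, col_idx, values):
    """Sort the column indices of each CSR row in ascending order, in place.

    Hypothetical helper for illustration; assumes 0-based indexing where
    row i occupies positions row_ptr[i] .. row_ptr[i + 1] - 1.
    """
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        # Pair each column index with its value so both move together.
        pairs = sorted(zip(col_idx[start:end], values[start:end]))
        for offset, (col, val) in enumerate(pairs):
            col_idx[start + offset] = col
            values[start + offset] = val


# Two rows of a 2 x 4 matrix with unsorted column indices.
row_ptr = [0, 3, 5]
col_idx = [2, 0, 3, 1, 0]
values = [20.0, 10.0, 30.0, 5.0, 4.0]
sort_csr_rows(row_ptr, col_idx, values)
print(col_idx)
print(values)
```
&lt;/PRE&gt;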

&lt;P&gt;thanks,&lt;BR /&gt;
	Chao&lt;/P&gt;</description>
      <pubDate>Mon, 15 Jun 2015 03:19:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/mkl-dcsrmv-slower-than-openMP-implementation/m-p/936698#M14152</guid>
      <dc:creator>Chao_Y_Intel</dc:creator>
      <dc:date>2015-06-15T03:19:13Z</dc:date>
    </item>
  </channel>
</rss>

