<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic MKL Diagonal SpMV in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Diagonal-SpMV/m-p/1080248#M22773</link>
    <description>MKL Diagonal SpMV</description>
    <pubDate>Tue, 02 Aug 2016 18:16:24 GMT</pubDate>
    <dc:creator>Mohammad_A_</dc:creator>
    <dc:date>2016-08-02T18:16:24Z</dc:date>
    <item>
      <title>MKL Diagonal SpMV</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Diagonal-SpMV/m-p/1080248#M22773</link>
      <description>&lt;P&gt;Hi Experts,&lt;/P&gt;

&lt;P&gt;I am trying to do SpMV using the diagonal storage format. I found two routines that do this operation for real double-precision data with one-based indexing: mkl_ddiamv and mkl_ddiagemv.&lt;/P&gt;
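<![CDATA[<P>For readers unfamiliar with the format, here is a minimal serial sketch of what a diagonal-format SpMV computes. It assumes a square matrix and the row-aligned layout described in the MKL documentation: column k of val (leading dimension lval) holds the diagonal whose offset from the main diagonal is idiag[k], and the entry for row i sits at position i of that column. The function name dia_spmv_reference is hypothetical and is not an MKL routine.</P>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference y = A * x for a square m x m matrix in diagonal (DIA) storage.
// Assumptions (not taken from the original post):
//   - val is column-major with leading dimension lval >= m
//   - idiag[k] = (column - row) offset of diagonal k, negative below the main diagonal
//   - array positions are zero-based inside this function
std::vector<double> dia_spmv_reference(int m, int lval,
                                       const std::vector<double>& val,
                                       const std::vector<int>& idiag,
                                       const std::vector<double>& x) {
    assert(lval >= m);
    assert(val.size() >= static_cast<std::size_t>(lval) * idiag.size());
    assert(x.size() >= static_cast<std::size_t>(m));

    std::vector<double> y(m, 0.0);
    for (std::size_t k = 0; k < idiag.size(); ++k) {
        const int d = idiag[k];
        // Rows for which column i + d lies inside the matrix.
        const int first = std::max(0, -d);
        const int last = std::min(m, m - d);
        const double* diag = val.data() + k * static_cast<std::size_t>(lval);
        for (int i = first; i < last; ++i) {
            y[i] += diag[i] * x[i + d];
        }
    }
    return y;
}
```

<P>Each diagonal contributes one multiply-add per stored entry, which is where the 2 * nnz flop count used for GFLOPS comes from.</P>]]>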

&lt;P&gt;The results are correct, but when I change the number of threads I get almost the same GFLOPS (i.e. the same execution time).&lt;/P&gt;

&lt;P&gt;I checked the results on KNL (64 cores) and a dual E5 Broadwell (72 cores), using 26 diagonal matrices from the University of Florida sparse matrix collection (example: &lt;STRONG&gt;&lt;A href="http://www.cise.ufl.edu/research/sparse/matrices/McRae/index.html"&gt;McRae&lt;/A&gt;/&lt;A href="http://www.cise.ufl.edu/research/sparse/matrices/McRae/ecology1.html"&gt;ecology1&lt;/A&gt;&lt;/STRONG&gt;).&lt;/P&gt;

&lt;P&gt;For &lt;STRONG&gt;&lt;A href="http://www.cise.ufl.edu/research/sparse/matrices/McRae/index.html"&gt;McRae&lt;/A&gt;/&lt;A href="http://www.cise.ufl.edu/research/sparse/matrices/McRae/ecology1.html"&gt;ecology1&lt;/A&gt;&lt;/STRONG&gt; on E5: around 2 GFLOPS with 1, 4, 8, 18, 36, 54, and 72 threads.&lt;/P&gt;

&lt;P&gt;For &lt;STRONG&gt;&lt;A href="http://www.cise.ufl.edu/research/sparse/matrices/McRae/index.html"&gt;McRae&lt;/A&gt;/&lt;A href="http://www.cise.ufl.edu/research/sparse/matrices/McRae/ecology1.html"&gt;ecology1&lt;/A&gt;&lt;/STRONG&gt; on KNL: around 0.9 GFLOPS with 1, 4, 16, 32, 64, 128, 192, and 256 threads.&lt;/P&gt;

&lt;P&gt;Note: with the CSR and BSR storage formats, the GFLOPS do change with the number of threads.&lt;/P&gt;
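<![CDATA[<P>If the diagonal routines themselves do not scale with the thread count, one possible point of comparison is a hand-written kernel parallelized over rows. The following is only a sketch under stated assumptions: a square matrix, column-major val with leading dimension lval, idiag holding (column - row) offsets, and OpenMP enabled at compile time. The name dia_spmv_omp is hypothetical and is not part of MKL.</P>

```cpp
#include <cassert>
#include <cstddef>

// Row-parallel y = A * x for a square m x m matrix in diagonal storage.
// Each thread owns a disjoint set of rows, so no two threads write the same y[i].
// Without OpenMP enabled, the pragma is ignored and the loop runs serially.
void dia_spmv_omp(int m, int lval, int ndiag,
                  const double* val, const int* idiag,
                  const double* x, double* y) {
    assert(lval >= m);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m; ++i) {
        double sum = 0.0;
        for (int k = 0; k < ndiag; ++k) {
            const int j = i + idiag[k];  // column touched by diagonal k in row i
            if (j >= 0 && j < m) {
                sum += val[static_cast<std::size_t>(k) * lval + i] * x[j];
            }
        }
        y[i] = sum;
    }
}
```

<P>Looping over rows on the outside trades the contiguous per-diagonal access of the storage format for race-free writes to y. Whether it actually outperforms the library routine depends on the matrix and on memory bandwidth, so it would need to be measured.</P>]]>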

&lt;P&gt;This is a part of my test code:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;double exTime = 0.0, stime = 0.0, etime = 0.0;

if (nr == nc) // square matrix (m x m)
{
	// Check the result once before timing.
	mkl_ddiagemv(&amp;amp;transa, &amp;amp;dia-&amp;gt;m, dia-&amp;gt;val, &amp;amp;dia-&amp;gt;lval, dia-&amp;gt;idiag, &amp;amp;dia-&amp;gt;ndiag, x, y);
	bool resultNotRight = IsResultsWrong(y, y_ref, nr);
	if (resultNotRight)
		return -5;

	// Time each run separately.
	for (int i = 0; i &amp;lt; runs; i++) {
		stime = dsecnd();
		mkl_ddiagemv(&amp;amp;transa, &amp;amp;dia-&amp;gt;m, dia-&amp;gt;val, &amp;amp;dia-&amp;gt;lval, dia-&amp;gt;idiag, &amp;amp;dia-&amp;gt;ndiag, x, y);
		etime = dsecnd();
		runResults[i] = etime - stime;
	}
}
else // rectangular matrix (m x k)
{
	// Check the result once before timing.
	mkl_ddiamv(&amp;amp;transa, &amp;amp;nr, &amp;amp;nc, &amp;amp;alpha, matdescra, dia-&amp;gt;val, &amp;amp;dia-&amp;gt;lval, dia-&amp;gt;idiag, &amp;amp;dia-&amp;gt;ndiag, x, &amp;amp;beta, y);
	bool resultNotRight = IsResultsWrong(y, y_ref, nr);
	if (resultNotRight)
		return -5;

	// Time each run separately.
	for (int i = 0; i &amp;lt; runs; i++) {
		stime = dsecnd();
		mkl_ddiamv(&amp;amp;transa, &amp;amp;nr, &amp;amp;nc, &amp;amp;alpha, matdescra, dia-&amp;gt;val, &amp;amp;dia-&amp;gt;lval, dia-&amp;gt;idiag, &amp;amp;dia-&amp;gt;ndiag, x, &amp;amp;beta, y);
		etime = dsecnd();
		runResults[i] = etime - stime;
	}
}

// Calculate best execution time
bestExTime = GetMaxExcutionTime(runResults);

// Print GFLOPS
// NOTE: dividing by runs is only correct if GetMaxExcutionTime returns a total over all runs.
bestExTime = bestExTime / (double)runs;
double gflops = 1.e-9 * (2.0 * nnz / bestExTime);
cout &amp;lt;&amp;lt; gflops &amp;lt;&amp;lt; "," &amp;lt;&amp;lt; dia-&amp;gt;ndiag &amp;lt;&amp;lt; ",";
&lt;/PRE&gt;
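<![CDATA[<P>The throughput figure in the code above is 2 * nnz divided by the execution time. A minimal sketch of that calculation follows, assuming the fastest of the timed runs is the one to report. The helper names spmv_gflops and best_time_seconds are hypothetical, and std::chrono stands in for dsecnd() so the snippet does not depend on MKL.</P>

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// GFLOPS for one SpMV: 2 flops (one multiply, one add) per stored nonzero.
double spmv_gflops(long long nnz, double seconds) {
    assert(seconds > 0.0);
    return 1.e-9 * (2.0 * static_cast<double>(nnz) / seconds);
}

// Call `kernel` `runs` times, timing each call separately,
// and return the fastest single run in seconds.
template <class Kernel>
double best_time_seconds(Kernel&& kernel, int runs) {
    assert(runs > 0);
    std::vector<double> times(runs);
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        kernel();
        const auto stop = std::chrono::steady_clock::now();
        times[i] = std::chrono::duration<double>(stop - start).count();
    }
    return *std::min_element(times.begin(), times.end());
}
```

<P>Because a single fastest run is used directly, no division by the run count is needed here. Very short kernels may fall below the timer resolution, in which case several calls per timed interval would be a safer measurement.</P>]]>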


&lt;P&gt;Thanks,&lt;/P&gt;

&lt;P&gt;Mohammad Almasri&lt;/P&gt;</description>
      <pubDate>Tue, 02 Aug 2016 18:16:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Diagonal-SpMV/m-p/1080248#M22773</guid>
      <dc:creator>Mohammad_A_</dc:creator>
      <dc:date>2016-08-02T18:16:24Z</dc:date>
    </item>
    <item>
      <title>you see the same performance</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Diagonal-SpMV/m-p/1080249#M22774</link>
      <description>&lt;P&gt;You see the same performance because these routines (for the diagonal format) are not threaded.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Aug 2016 06:28:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-Diagonal-SpMV/m-p/1080249#M22774</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2016-08-03T06:28:11Z</dc:date>
    </item>
  </channel>
</rss>

