<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Performance bug in GEMM? in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-bug-in-GEMM/m-p/1085254#M22944</link>
<description>&lt;P&gt;Hi all,&lt;/P&gt;

&lt;P&gt;I just noticed a potential performance bug in the DGEMM implementation of MKL (16.0.1) when using a single thread. I merely want to make someone at Intel aware of it, in case it is of interest.&lt;/P&gt;

&lt;P&gt;Strangely, DGEMM performs better for beta=1 than for beta=0 in certain situations. Here is an example:&lt;/P&gt;

&lt;P&gt;Intel(R) Xeon(R) CPU E5-2650:&lt;BR /&gt;
m=72, n=373248, k=72, beta=0.00 : 14.25 GF&lt;BR /&gt;
m=72, n=373248, k=72, beta=1.00 : 18.36 GF&lt;/P&gt;

&lt;P&gt;Intel(R) Xeon(R) CPU E5-2650:&lt;BR /&gt;
m=72, n=373248, k=72, beta=0.00 : 19.25 GF&lt;BR /&gt;
m=72, n=373248, k=72, beta=1.00 : 28.34 GF&lt;/P&gt;

&lt;P&gt;As you can see, the performance difference is significant: beta=1 is roughly 29% faster in the first measurement and 47% faster in the second. The gap is large enough that it pays off to set C to zero explicitly before calling MKL and then use the more efficient beta=1 implementation instead.&lt;/P&gt;

&lt;P&gt;Here is a quick test driver:&lt;/P&gt;

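&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;omp.h&amp;gt;

// Fortran-style BLAS entry point, declared by hand (32-bit integers assumed).
// The const qualifiers allow string literals such as "T" to be passed from C++.
extern "C"
void dgemm_(const char *transa, const char *transb, const int *m, const int *n,
      const int *k, const double *alpha, const double *a, const int *lda,
      const double *b, const int *ldb, const double *beta, double *c, const int *ldc);

// Stream through two large buffers to evict A, B and C from the caches.
void trashCache(float* trash1, float* trash2, int nTotal){
   for(int i = 0; i &amp;lt; nTotal; i ++)
      trash1[i] += 0.99f * trash2[i];
}

int main(int argc, char ** argv)
{
  if(argc &amp;lt; 2 ){
   printf("Usage: %s &amp;lt;beta&amp;gt;\n", argv[0]);
   return EXIT_FAILURE;
  }
  // calloc instead of malloc so that trashCache never reads uninitialized memory
  int nTotal = 1024*1024*100;
  float *trash1 = (float*) calloc(nTotal, sizeof(float));
  float *trash2 = (float*) calloc(nTotal, sizeof(float));

  int m = 72;
  int n = 72*72*72;
  int k = 72;
  double flops = 2.E-9 * m*n*k;   // operation count per call, in units of 1e9
  double alpha=1;
  double beta=atof(argv[1]);
  double *A, *B, *C;
  int ret = posix_memalign((void**) &amp;amp;A, 64, sizeof(double) * m*k);
  ret += posix_memalign((void**) &amp;amp;B, 64, sizeof(double) * n*k);
  ret += posix_memalign((void**) &amp;amp;C, 64, sizeof(double) * m*n);
  if(ret != 0 || !trash1 || !trash2){
   printf("Allocation failed\n");
   return EXIT_FAILURE;
  }

  // Initialize the operands so that DGEMM never reads uninitialized memory
  // (C is read whenever beta is nonzero) and all pages are touched before timing.
  for (int i=0; i&amp;lt;m*k; i++) A[i] = 1.0;
  for (int i=0; i&amp;lt;n*k; i++) B[i] = 1.0;
  for (int i=0; i&amp;lt;m*n; i++) C[i] = 0.0;

  double minTime = 1e100;
  for (int i=0; i&amp;lt;3; i++){
     trashCache(trash1, trash2, nTotal);
     double t = omp_get_wtime();
     // With transa="T", A is stored as a k x m matrix, so its leading dimension is k.
     dgemm_("T", "N", &amp;amp;m, &amp;amp;n, &amp;amp;k, &amp;amp;alpha, A, &amp;amp;k, B, &amp;amp;k, &amp;amp;beta, C, &amp;amp;m);
     t = omp_get_wtime() - t;
     minTime = (minTime &amp;lt; t) ? minTime : t;   // keep the best of three runs
  }
  printf("m=%d, n=%d, k=%d, beta=%.2f : %.2lf GF\n", m,n,k,beta,flops/minTime);

  free(A);
  free(B);
  free(C);
  free(trash1);
  free(trash2);
  return 0;
}&lt;/PRE&gt;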

&lt;P&gt;Best, Paul&lt;/P&gt;</description>
    <pubDate>Sat, 04 Jun 2016 11:30:16 GMT</pubDate>
    <dc:creator>Paul_S_</dc:creator>
    <dc:date>2016-06-04T11:30:16Z</dc:date>
    <item>
      <title>Performance bug in GEMM?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-bug-in-GEMM/m-p/1085254#M22944</link>
      <pubDate>Sat, 04 Jun 2016 11:30:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-bug-in-GEMM/m-p/1085254#M22944</guid>
      <dc:creator>Paul_S_</dc:creator>
      <dc:date>2016-06-04T11:30:16Z</dc:date>
    </item>
    <item>
      <title>thanks Paul, we will check</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-bug-in-GEMM/m-p/1085255#M22945</link>
      <description>&lt;P&gt;Thanks, Paul. We will check this on our side.&lt;/P&gt;</description>
      <pubDate>Sun, 05 Jun 2016 20:36:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-bug-in-GEMM/m-p/1085255#M22945</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2016-06-05T20:36:58Z</dc:date>
    </item>
    <item>
      <title>Paul, we confirm the problem</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-bug-in-GEMM/m-p/1085256#M22946</link>
      <description>&lt;P&gt;Paul, we have confirmed the problem with this case. The issue has been escalated, and we will let you know when it is resolved. Thanks for reporting it.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jun 2016 09:42:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-bug-in-GEMM/m-p/1085256#M22946</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2016-06-06T09:42:32Z</dc:date>
    </item>
    <item>
      <title>Paul, please check is the</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-bug-in-GEMM/m-p/1085257#M22947</link>
      <description>&lt;P&gt;Paul, please check whether the problem still exists with the latest version of MKL, 11.3 Update 4, and let us know the results. Thanks, Gennady&lt;/P&gt;</description>
      <pubDate>Mon, 26 Sep 2016 04:30:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-bug-in-GEMM/m-p/1085257#M22947</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2016-09-26T04:30:08Z</dc:date>
    </item>
  </channel>
</rss>

