<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi Youwei, in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145946#M26691</link>
    <description>&lt;P&gt;Hi Youwei,&lt;/P&gt;

&lt;P&gt;The dimension n = 9 might be too small for Intel(R) MKL to reuse the data.&amp;nbsp; Try increasing n and see what happens.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Khang&lt;/P&gt;</description>
    <pubDate>Mon, 02 Jul 2018 19:25:46 GMT</pubDate>
    <dc:creator>Khang_N_Intel</dc:creator>
    <dc:date>2018-07-02T19:25:46Z</dc:date>
    <item>
      <title>MKL GEMM slower for larger matrices</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145940#M26685</link>
      <description>&lt;P&gt;For the matrix multiplication&amp;nbsp;A(m,k) * B(k,n):&lt;/P&gt;

&lt;P&gt;m=9, k=256, n=256 is faster than m=9, k=512, n=512 and all larger k and n.&lt;/P&gt;

&lt;P&gt;On my E5-2630v3 (16 cores, HT disabled), k=n=256 gets 850 GFLOPS while k=n=512 gets only 257 GFLOPS.&lt;/P&gt;

&lt;P&gt;Here is my test code. It runs 64 GEMMs&amp;nbsp;per iteration; the variables n, c and k in the code correspond to m, k and n above:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;cstdio&amp;gt;
#include &amp;lt;cstdlib&amp;gt;
#include &amp;lt;chrono&amp;gt;
#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;functional&amp;gt;
#include &amp;lt;random&amp;gt;
#include &amp;lt;omp.h&amp;gt;
#include &amp;lt;mkl.h&amp;gt;
#include &amp;lt;unistd.h&amp;gt;


#define ITERATION 10000

int main(int argc, char *argv[])
{
	int opt;
	int n = 9;
	int c = 256; int c_block = 256;
	int k = 256; int k_block = 256;
	int t = 1;
	while ((opt = getopt(argc, argv, "n:c:k:t:")) != -1) {
		switch (opt) {
			case 'n': n = strtol(optarg, NULL, 10); break;
			case 'c': c = strtol(optarg, NULL, 10); break;
			case 'k': k = strtol(optarg, NULL, 10); break;
			case 't': t = strtol(optarg, NULL, 10); break;
			default: printf("unknown option\n");
		}
	}

	omp_set_dynamic(0);
	omp_set_num_threads(t);
	
	float *AS[64], *BS[64], *CS[64];
	for (int i = 0; i &amp;lt; 64; ++i) {
		AS[i] = (float*)mkl_malloc(sizeof(float)*n*c, 64);
		BS[i] = (float*)mkl_malloc(sizeof(float)*c*k, 64);
		CS[i] = (float*)mkl_malloc(sizeof(float)*n*k, 64);
	}

	auto randgen = std::bind(std::uniform_real_distribution&amp;lt;float&amp;gt;(), std::mt19937(0));
	for (int i = 0; i &amp;lt; 64; ++i) {
		std::generate(AS[i], AS[i]+n*c, std::ref(randgen));
		std::generate(BS[i], BS[i]+c*k, std::ref(randgen));
		// std::generate(CS[i], CS[i]+n*k, std::ref(randgen));
	}

	using Clock = std::chrono::high_resolution_clock;
	auto t1 = Clock::now();
	for (int iter = 0; iter &amp;lt; ITERATION; ++iter) {
		#pragma omp parallel
		{
			const int nthreads = omp_get_num_threads();
			const int mythread = omp_get_thread_num();
			const int start = mythread*64/nthreads;
			const int finish = (mythread+1)*64/nthreads;
			mkl_set_num_threads_local(1);
			for (int i = start; i &amp;lt; finish; ++i)
			{
				float * A = AS[i];
				float * B = BS[i];
				float * C = CS[i];
				cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, k, c, 1, A, c, B, k, 0, C, k);
			}
		}

	}
	auto t2 = Clock::now();
	auto elapsed = t2 - t1;
	auto time = std::chrono::duration_cast&amp;lt;std::chrono::nanoseconds&amp;gt;(elapsed).count();
	// printf("%.1lfs\n", 1e-9 * time);
	printf("%.lfGFLOPS\n", 1.0 * ITERATION * 64 * 2 * n * c * k / time);

	for (int i = 0; i &amp;lt; 64; ++i) {
		mkl_free(AS[i]);
		mkl_free(BS[i]);
		mkl_free(CS[i]);
	}
	return 0;
}
&lt;/PRE&gt;
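
&lt;P&gt;The GFLOPS figure printed by the program counts 2*n*c*k floating-point operations per GEMM and divides the total by the elapsed time in nanoseconds. A minimal sketch of that arithmetic follows; gemm_gflops and its parameter names are hypothetical, chosen only for illustration, and are not part of MKL:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;cstdint&amp;gt;

// Hypothetical helper (not an MKL function): GFLOP/s for `count` GEMMs of
// shape (rows x inner) * (inner x cols), repeated `iterations` times within
// `elapsed_ns` nanoseconds. Each GEMM performs rows*cols*inner multiply-add
// pairs, i.e. 2*rows*cols*inner floating-point operations.
double gemm_gflops(int rows, int cols, int inner, int count, int iterations,
                   std::int64_t elapsed_ns)
{
	// Starting the product with a double keeps it from overflowing a 32-bit int.
	const double flops = 2.0 * rows * cols * inner * count * iterations;
	// Operations per nanosecond are numerically equal to GFLOP/s.
	return flops / static_cast&amp;lt;double&amp;gt;(elapsed_ns);
}
&lt;/PRE&gt;

&lt;P&gt;For example, gemm_gflops(9, 256, 256, 64, 10000, 1000000000) evaluates to about 755: 64 GEMMs of this shape repeated 10000 times in one second would correspond to roughly 755 GFLOPS.&lt;/P&gt;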

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jun 2018 00:56:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145940#M26685</guid>
      <dc:creator>Youwei_Z_</dc:creator>
      <dc:date>2018-06-26T00:56:09Z</dc:date>
    </item>
    <item>
      <title>Hi Youwei,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145941#M26686</link>
      <description>&lt;P&gt;Hi Youwei,&lt;/P&gt;

&lt;P&gt;When the problem is about the size of the L2 cache (k=n=256), performance is much better.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;Once the problem is larger (k=n=512), all the data come from the LLC and performance is lower.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;Hope this helps!&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;Khang&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Jun 2018 18:24:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145941#M26686</guid>
      <dc:creator>Khang_N_Intel</dc:creator>
      <dc:date>2018-06-28T18:24:07Z</dc:date>
    </item>
    <item>
      <title>The way you use omp parallel</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145942#M26687</link>
      <description>The way you use omp parallel doesn't make sense. It appears you are forcing each thread to perform the entire problem and turning off MKL local threading. As you appear to be using the same data on each iteration while not taking advantage of MKL internal partitioning, you will see the effect Khang describes.</description>
      <pubDate>Thu, 28 Jun 2018 20:36:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145942#M26687</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2018-06-28T20:36:00Z</dc:date>
    </item>
    <item>
      <title>Quote:Nguyen, Khang T (Intel)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145943#M26688</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Nguyen, Khang T (Intel) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Hi Youwei,&lt;/P&gt;

&lt;P&gt;When the problem is about the size of the L2 cache (k=n=256), performance is much better.&lt;/P&gt;

&lt;P&gt;Once the problem is larger (k=n=512), all the data come from the LLC and performance is lower.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Hope this helps!&lt;/P&gt;

&lt;P&gt;Khang&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Thanks Khang. An interesting follow-up question: will tiling work in this case? If Intel&amp;nbsp;MKL can tile the problem into 256x256 blocks, there should be no significant FLOPS degradation, right?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Jun 2018 20:42:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145943#M26688</guid>
      <dc:creator>Youwei_Z_</dc:creator>
      <dc:date>2018-06-28T20:42:25Z</dc:date>
    </item>
    <item>
      <title>Quote:Tim P. wrote:</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145944#M26689</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Tim P. wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;The way you use omp parallel doesn't make sense. It appears you are forcing each thread to perform the entire problem and turning off MKL local threading. As you appear to be using the same data on each iteration while not taking advantage of MKL internal partitioning, you will see the effect Khang describes.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Hi Tim,&lt;/P&gt;

&lt;P&gt;I think the OpenMP&amp;nbsp;usage here is correct. Here is the Intel article:&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;&lt;A href="https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications" target="_blank"&gt;https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications&lt;/A&gt;. Setting the number of MKL threads to 1 inside the parallel region, as mkl_set_num_threads_local(1) does in my code, is good practice.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;Moreover, the GEMM&amp;nbsp;I do in each thread is small. Even if I set the MKL thread count to a larger value (say 16), MKL&amp;nbsp;is free to determine the actual number of threads (1 in my case).&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;&amp;nbsp;In fact, in my first attempt, I did not do openmp&amp;nbsp;threading and let&amp;nbsp;mkl&amp;nbsp;do its internal threading. I got much lower FLOPS.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Jun 2018 20:48:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145944#M26689</guid>
      <dc:creator>Youwei_Z_</dc:creator>
      <dc:date>2018-06-28T20:48:59Z</dc:date>
    </item>
    <item>
      <title>Following the advice from</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145945#M26690</link>
      <description>&lt;P&gt;Following the advice from Khang and Tim, I tried tiling.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;// replace this line
// cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, k, c, 1, A, c, B, k, 0, C, k);
// with
// where c_block, k_block is 256, 256
for (int c_i = 0; c_i &amp;lt; c; c_i += c_block)
{
	float *TA = A + n * c_i;
	float beta = c_i ? 0 : 1;
	for (int k_i = 0; k_i &amp;lt; k; k_i += k_block)
	{
		float *TB = B + c_block * k_i + c_i * k;
		float *TC = C + n * k_i;
		cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, k_block, c_block, 1, TA, c_block, TB, k_block, beta, TC, k_block);
	}
}&lt;/PRE&gt;

&lt;P&gt;This way, even for a larger problem (9,512,512), the working set of each GEMM call stays the same size as in the (9,256,256) case. &lt;STRONG&gt;However, I do not see any speedup.&lt;/STRONG&gt;&lt;/P&gt;
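
&lt;P&gt;The snippet above steps through A, B and C as if each tile were stored contiguously. If the matrices are instead kept in ordinary row-major order, a tile is usually addressed by offsetting the base pointer and keeping the full leading dimension. A minimal sketch under that assumption follows; ref_sgemm is a hypothetical naive stand-in for the row-major, no-transpose case of cblas_sgemm so that the indexing can be checked without MKL, and tiled_sgemm is likewise a made-up name. It illustrates the indexing only and says nothing about performance:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;algorithm&amp;gt;

// Hypothetical stand-in for row-major, no-transpose sgemm:
// C = alpha * A * B + beta * C, with A (m x kk), B (kk x n), C (m x n).
void ref_sgemm(int m, int n, int kk, float alpha, const float *A, int lda,
               const float *B, int ldb, float beta, float *C, int ldc)
{
	for (int i = 0; i &amp;lt; m; ++i) {
		for (int j = 0; j &amp;lt; n; ++j) {
			float acc = 0.0f;
			for (int p = 0; p &amp;lt; kk; ++p)
				acc += A[i * lda + p] * B[p * ldb + j];
			// With beta == 0 the old value of C is never read.
			C[i * ldc + j] = (beta == 0.0f) ? alpha * acc
			                                : alpha * acc + beta * C[i * ldc + j];
		}
	}
}

// Tiled product for plain row-major A (n x c), B (c x k), C (n x k).
// Leading dimensions stay c (for A) and k (for B and C); edge tiles may be
// smaller than the block size.
void tiled_sgemm(int n, int k, int c, const float *A, const float *B, float *C,
                 int c_block, int k_block)
{
	for (int c_i = 0; c_i &amp;lt; c; c_i += c_block) {
		const int cb = std::min(c_block, c - c_i);
		// The first slice along c overwrites C; later slices accumulate into it.
		const float beta = (c_i == 0) ? 0.0f : 1.0f;
		for (int k_i = 0; k_i &amp;lt; k; k_i += k_block) {
			const int kb = std::min(k_block, k - k_i);
			ref_sgemm(n, kb, cb, 1.0f, A + c_i, c, B + c_i * k + k_i, k, beta, C + k_i, k);
		}
	}
}
&lt;/PRE&gt;

&lt;P&gt;Calling tiled_sgemm(n, k, c, A, B, C, 256, 256) should produce the same C as a single untiled call, up to floating-point rounding from the different summation order.&lt;/P&gt;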

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Jun 2018 20:57:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145945#M26690</guid>
      <dc:creator>Youwei_Z_</dc:creator>
      <dc:date>2018-06-28T20:57:34Z</dc:date>
    </item>
    <item>
      <title>Hi Youwei,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145946#M26691</link>
      <description>&lt;P&gt;Hi Youwei,&lt;/P&gt;

&lt;P&gt;The dimension n = 9 might be too small for Intel(R) MKL to reuse the data.&amp;nbsp; Try increasing n and see what happens.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Khang&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jul 2018 19:25:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145946#M26691</guid>
      <dc:creator>Khang_N_Intel</dc:creator>
      <dc:date>2018-07-02T19:25:46Z</dc:date>
    </item>
    <item>
      <title>Quote:Nguyen, Khang T (Intel)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145947#M26692</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Nguyen, Khang T (Intel) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Hi Youwei,&lt;/P&gt;

&lt;P&gt;The dimension n = 9 might be too small for Intel(R) MKL to reuse the data.&amp;nbsp; Try increasing n and see what happens.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Khang&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Thanks Khang. I know that increasing n will increase data reuse. However, the question here is: why is there such a large gap in relative performance?&lt;/P&gt;

&lt;P&gt;After tiling, the working set of both programs is the same. The only difference is whether all the data can be held in the L3 cache. I would expect lower FLOPS because of LLC replacement.&lt;/P&gt;
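
&lt;P&gt;One way to make this comparison concrete is to count bytes. A minimal sketch follows, with gemm_bytes and batch_bytes as hypothetical helper names; the cache capacities to compare against are not stated in this thread and would have to be looked up for the specific CPU:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;cstddef&amp;gt;

// Bytes touched by one single-precision GEMM: A (n x c), B (c x k), C (n x k).
std::size_t gemm_bytes(std::size_t n, std::size_t c, std::size_t k)
{
	return sizeof(float) * (n * c + c * k + n * k);
}

// Bytes touched by a batch of `count` independent GEMMs of the same shape.
std::size_t batch_bytes(std::size_t n, std::size_t c, std::size_t k, std::size_t count)
{
	return count * gemm_bytes(n, c, k);
}
&lt;/PRE&gt;

&lt;P&gt;For (9,256,256) this gives 280,576 bytes per GEMM and about 17.1 MiB for the batch of 64. For (9,512,512) it gives 1,085,440 bytes per GEMM and about 66.2 MiB for the batch. Tiling reduces what a single call touches, but every iteration still passes over the same total amount of data.&lt;/P&gt;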

&lt;P&gt;I used to believe that if I keep the &lt;STRONG&gt;working set&lt;/STRONG&gt; the same, the total data size does not matter that much. In this example, if I can do 64 9x256x256 GEMMs&amp;nbsp;fast, I should also be able to do 256 9x256x256 GEMMs&amp;nbsp;fast (after tiling 64 9x512x512 GEMMs). Could you please comment on that?&lt;/P&gt;</description>
      <pubDate>Thu, 05 Jul 2018 18:20:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-GEMM-slower-for-larger-matrices/m-p/1145947#M26692</guid>
      <dc:creator>Youwei_Z_</dc:creator>
      <dc:date>2018-07-05T18:20:03Z</dc:date>
    </item>
  </channel>
</rss>

