<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Quote:Sergey Kostrov wrote: in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112903#M72999</link>
    <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&amp;gt;&amp;gt;...Is the Strassen MM open sourced?&lt;/P&gt;

&lt;P&gt;No and it won't be due to some issues. It is a part of a Linear Algebra domain of the ScaLib for BDP library and all codes are proprietary.&lt;/P&gt;

&lt;P&gt;&amp;gt;&amp;gt;Or is there any implementation of the advanced algorithms, which I can use directly so as to achieve better performance than MKL&lt;/P&gt;

&lt;P&gt;A very tuned implementation of the Classic Matrix multiplication algorithm could be considered as advanced. I was very surprised to learn that the Libxsmm library uses MKL internally and there is nothing wrong with that. The only issue is that based on my R&amp;amp;D CMMA for very small matrices outperforms MKL's sgemm. So, you need to complete a series of tests with CMMA that uses streaming stores since they boost performance and Intel C++ compiler ( see options ) could do that for you.&lt;/P&gt;

&lt;P&gt;Once again, MKL is a "heavy player" and it is recommended to use for larger matrices, that is significantly greater than 16x16.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Hi Sergey,&lt;/P&gt;

&lt;P&gt;Thanks for the reply. I have read the article. For streaming stores, do you mean that I can only rely on ICC to generate them? I think there is no&amp;nbsp;explicit streaming store in the CMMA code. I will benchmark the code using small matrix sizes. It is really&amp;nbsp;exciting that CMMA can outperform MKL, given that CMMA is not a very complicated implementation.&lt;/P&gt;

&lt;P&gt;Best regards,&lt;/P&gt;

&lt;P&gt;Zhen&lt;/P&gt;</description>
    <pubDate>Fri, 02 Jun 2017 14:40:19 GMT</pubDate>
    <dc:creator>Zhen</dc:creator>
    <dc:date>2017-06-02T14:40:19Z</dc:date>
    <item>
      <title>Knights Landing Cache Prefetchers question</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112872#M72968</link>
      <description>&lt;P style="box-sizing: border-box; word-wrap: break-word; margin-bottom: 20px; line-height: 1.4; color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 14px; background-color: rgb(255, 244, 244);"&gt;Hi&lt;/P&gt;

&lt;P style="box-sizing: border-box; word-wrap: break-word; margin-bottom: 20px; line-height: 1.4; color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 14px; background-color: rgb(255, 244, 244);"&gt;I am working on processing a batch of small matrix multiplications on Knights Landing. Since the MKL's performance is not good for small matrix, I use libxsmm. However I find there are a lot of cache misses by using Vtune. The L1 cache miss rate is about 10%, and L2 cache miss rate is about 17%.&amp;nbsp;&lt;BR style="box-sizing: border-box;" /&gt;
	The gflops it achieves is less than 20 for single thread. I also write a sample program to test the performance under a ideal condition (no or very few cache miss), it can achieve 50 gflops for single thread.&lt;/P&gt;

&lt;P style="box-sizing: border-box; word-wrap: break-word; margin-bottom: 20px; line-height: 1.4; color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 14px; background-color: rgb(255, 244, 244);"&gt;The code is:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt; for(i = 0; i &amp;lt; batch; i++){
         const float *  inA = A +  i*Arows*Acols;
         const float *  inB = B +  i*Bcols*Brows;
         float * oto = out + i*Orows*Ocols;
	 libxsmm_smm_16_16_16(inA,inB,oto);
	// libxsmm_smm_16_16_16(A,B,out); //ideal condition, no or very few cache miss
}&lt;/PRE&gt;

&lt;P style="box-sizing: border-box; word-wrap: break-word; margin-bottom: 20px; line-height: 1.4; color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 14px; background-color: rgb(255, 244, 244);"&gt;In this case, it seems that the hardware prefetchers do not work well. So I am very curious about why the hardware prefetchers can not always prefetch the next matrix? Each matrix has size of 16*16. So for each gemm, the input matrices and the result can fit into L1 cache. The memory access is consecutive, so the prefether should prefetch data for the next matrix multiplication. However, the prefetchers are not, otherwise there should not be so many cache misses.&lt;/P&gt;

&lt;P style="box-sizing: border-box; word-wrap: break-word; margin-bottom: 20px; line-height: 1.4; color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 14px; background-color: rgb(255, 244, 244);"&gt;According to the Intel Xeon Phi Book, the hardware prefetcher will not stream across a 4KB boundary. Is that the problem?&lt;BR style="box-sizing: border-box;" /&gt;
	Does the 4KB boundary mean the page size boundary? I also used the huge page, i.e., 2M page, through hugetlbfs. However, huge page is not going to help.&lt;/P&gt;

&lt;P style="box-sizing: border-box; word-wrap: break-word; margin-bottom: 20px; line-height: 1.4; color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 14px; background-color: rgb(255, 244, 244);"&gt;I also check the assembly code, even though the compiler’s software prefetching is enable, I do not see prefetch instruction in the assembly code. So I think I may need to trigger the software prefecther manually.&lt;/P&gt;

&lt;P style="box-sizing: border-box; word-wrap: break-word; margin-bottom: 20px; line-height: 1.4; color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 14px; background-color: rgb(255, 244, 244);"&gt;Any idea to optimize the program?&amp;nbsp;&lt;/P&gt;

&lt;P style="box-sizing: border-box; word-wrap: break-word; margin-bottom: 20px; line-height: 1.4; color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 14px; background-color: rgb(255, 244, 244);"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P style="box-sizing: border-box; word-wrap: break-word; margin-bottom: 20px; line-height: 1.4; color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 14px; background-color: rgb(255, 244, 244);"&gt;Thanks！&lt;/P&gt;

&lt;P style="box-sizing: border-box; word-wrap: break-word; margin-bottom: 20px; line-height: 1.4; color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 14px; background-color: rgb(255, 244, 244);"&gt;Best regards,&lt;/P&gt;

&lt;P style="box-sizing: border-box; word-wrap: break-word; margin-bottom: 20px; line-height: 1.4; color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 14px; background-color: rgb(255, 244, 244);"&gt;Zhen&lt;/P&gt;</description>
      <pubDate>Fri, 19 May 2017 15:27:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112872#M72968</guid>
      <dc:creator>Zhen</dc:creator>
      <dc:date>2017-05-19T15:27:07Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...Each matrix has size of</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112873#M72969</link>
      <description>&amp;gt;&amp;gt;...Each matrix has size of 16*16...

I think we've discussed that topic in the past, right?

With such small matrices you could use a fully inlined, that is completely unrolled, matrix multiplication ( MM ) routine with manual prefetches. Alternatively, if you use a block-based matrix multiplication algorithm, aka Tiled algorithm for MM, you don't need to prefetch data at all. But you need to reorganize your data sets so that the partitioned blocks of the matrices map onto cache lines by default. It means the smallest block of data, for example for a matrix A, needs to be 4x4 for a Single precision FP data type ( 4x4x4=64 bytes ).
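
A minimal sketch of the tiled layout described above, assuming row-major 16x16 single-precision matrices and 4x4 tiles. The names pack_tiles and mm_tiled are made up for illustration and are not part of MKL or LIBXSMM. Note that a 64-byte tile only coincides with one cache line if the packed buffer is 64-byte aligned.

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;stddef.h&amp;gt;

enum { N = 16, T = 4 };  /* 16x16 matrices; one 4x4 float tile is 4*4*4 = 64 bytes */

/* Repack a row-major N x N matrix so that every T x T tile is contiguous. */
void pack_tiles(const float *src, float *dst)
{
    size_t o = 0;
    for (int bi = 0; bi &amp;lt; N; bi += T)
        for (int bj = 0; bj &amp;lt; N; bj += T)
            for (int i = 0; i &amp;lt; T; i++)
                for (int j = 0; j &amp;lt; T; j++)
                    dst[o++] = src[(bi + i) * N + (bj + j)];
}

/* C += A * B, where A, B and C are all stored in the tile-packed layout. */
void mm_tiled(const float *A, const float *B, float *C)
{
    const int nt = N / T;  /* tiles per matrix dimension */
    for (int bi = 0; bi &amp;lt; nt; bi++)
        for (int bj = 0; bj &amp;lt; nt; bj++) {
            float *c = C + (bi * nt + bj) * T * T;
            for (int bk = 0; bk &amp;lt; nt; bk++) {
                const float *a = A + (bi * nt + bk) * T * T;
                const float *b = B + (bk * nt + bj) * T * T;
                for (int i = 0; i &amp;lt; T; i++)
                    for (int k = 0; k &amp;lt; T; k++)
                        for (int j = 0; j &amp;lt; T; j++)
                            c[i * T + j] += a[i * T + k] * b[k * T + j];
            }
        }
}&lt;/PRE&gt;

The repacking pass is the extra cost mentioned above: it only pays off if the packed operands are reused or produced in that layout in the first place.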

It is correct that the CBLAS sgemm / dgemm MKL functions perform poorly for very small matrices. A classic matrix multiplication algorithm outperforms MKL's functions for matrix sizes up to 2048 x 2048. This is because MKL's functions have more overhead at the very beginning of processing.</description>
      <pubDate>Fri, 19 May 2017 21:56:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112873#M72969</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-05-19T21:56:30Z</dc:date>
    </item>
    <item>
      <title>Hi Sergey,</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112874#M72970</link>
      <description>&lt;P&gt;Hi Sergey,&lt;BR /&gt;
	Thanks for the reply. I just saw your reply in my Half Precision Floats post. My bad, and thanks very much for that reply as well. This one is a different problem.&amp;nbsp;&lt;BR /&gt;
	As I understand it, you suggest two ways to achieve better performance for MM. One is to completely unroll the code and prefetch data manually. The other is to block the data, with a block size equal to one cache line. Right?&lt;BR /&gt;
	If the blocking method is used, does that mean the hardware prefetcher will stream the data in? And with the unrolling method, will the data not be prefetched?&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Zhen&lt;/P&gt;</description>
      <pubDate>Sun, 21 May 2017 02:24:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112874#M72970</guid>
      <dc:creator>Zhen</dc:creator>
      <dc:date>2017-05-21T02:24:39Z</dc:date>
    </item>
    <item>
      <title>Why do you think that your</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112875#M72971</link>
      <description>&lt;P&gt;Why do you think that your memory accesses are consecutive?&lt;/P&gt;

&lt;P&gt;I suggest dumping the addresses of those memory accesses to get a better look at the problem.&lt;/P&gt;</description>
      <pubDate>Sun, 21 May 2017 02:31:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112875#M72971</guid>
      <dc:creator>JWong19</dc:creator>
      <dc:date>2017-05-21T02:31:40Z</dc:date>
    </item>
    <item>
      <title>Hi Jeremy,</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112876#M72972</link>
      <description>&lt;P&gt;Hi Jeremy,&lt;/P&gt;

&lt;P&gt;I do a batch gemm. So the addresses of A1 and A2 are&amp;nbsp;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;consecutive in my code. So are B1 and B2, and C1 and C2. &amp;nbsp;C1 = A1*B1.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;A, B, C&amp;nbsp;&lt;/SPAN&gt;represent&amp;nbsp;matrices.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;BTW, could you explain more about how to dump the&amp;nbsp;addresses?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 21 May 2017 02:43:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112876#M72972</guid>
      <dc:creator>Zhen</dc:creator>
      <dc:date>2017-05-21T02:43:53Z</dc:date>
    </item>
    <item>
      <title>Quote:Zhen J. wrote:</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112877#M72973</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Zhen J. wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Hi Jeremy,&lt;/P&gt;

&lt;P&gt;I do a batch gemm. So the addresses of A1 and A2 are&amp;nbsp;consecutive in my code. So do B1 and B2; C1 and C2. &amp;nbsp;C1 = A1*B1.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;A, B, C&amp;nbsp;represent&amp;nbsp;matrices.&lt;/P&gt;

&lt;P&gt;BTW, could you explain more about how to dump the&amp;nbsp;addresses?&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;For your case, it would be straightforward to obtain the addresses with a debugger and step through each instruction. (there are only 256 elements in each 16x16 matrix...). For each memory access, you need the instruction address and the memory address. In addition, remember to obtain the addresses of your pointers so that you know which matrix element is being accessed in those memory accesses.&lt;/P&gt;
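
&lt;P&gt;As a lighter-weight complement to stepping through in a debugger, the operand base addresses can be printed directly. The sketch below assumes a packed batch of single-precision matrices; dump_operand_addresses and operand_stride_bytes are hypothetical helper names. It only shows where each operand starts, not the address of every individual load and store.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

/* Byte distance between operand i and operand i+1 in a packed batch. */
size_t operand_stride_bytes(const float *base, size_t elems, size_t i)
{
    const float *cur = base + i * elems;
    const float *next = base + (i + 1) * elems;
    return (size_t)((uintptr_t)next - (uintptr_t)cur);
}

/* Print the start address of each operand and the 4KB block it falls into. */
void dump_operand_addresses(const float *A, const float *B, const float *out,
                            size_t batch, size_t elems)
{
    for (size_t i = 0; i &amp;lt; batch; i++) {
        const float *inA = A + i * elems;
        const float *inB = B + i * elems;
        const float *oto = out + i * elems;
        printf("i=%zu A=%p (4KB block %ju) B=%p (4KB block %ju) C=%p (4KB block %ju)\n",
               i,
               (const void *)inA, (uintmax_t)((uintptr_t)inA / 4096),
               (const void *)inB, (uintmax_t)((uintptr_t)inB / 4096),
               (const void *)oto, (uintmax_t)((uintptr_t)oto / 4096));
    }
}&lt;/PRE&gt;

&lt;P&gt;For 16x16 single-precision operands (elems = 256) the stride between consecutive operands is 1024 bytes, so every fourth operand starts a new 4KB block if the batch itself is 4KB-aligned.&lt;/P&gt;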

&lt;P&gt;To my understanding, an implementation of a hardware prefetcher takes the instruction address as one of its inputs. Therefore, the instruction addresses of the memory accesses could be critical information for explaining your observation.&lt;/P&gt;

&lt;P&gt;Finally, you should first confirm whether your code is memory-bound or CPU-bound.&lt;/P&gt;</description>
      <pubDate>Sun, 21 May 2017 14:07:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112877#M72973</guid>
      <dc:creator>JWong19</dc:creator>
      <dc:date>2017-05-21T14:07:19Z</dc:date>
    </item>
    <item>
      <title>Thanks Jeremy, that is a good</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112878#M72974</link>
      <description>&lt;P&gt;Thanks Jeremy, that is a good suggestion.&lt;/P&gt;</description>
      <pubDate>Mon, 22 May 2017 18:34:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112878#M72974</guid>
      <dc:creator>Zhen</dc:creator>
      <dc:date>2017-05-22T18:34:12Z</dc:date>
    </item>
    <item>
      <title>Dear Zhen,</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112879#M72975</link>
      <description>&lt;P&gt;Dear Zhen,&lt;/P&gt;

&lt;P&gt;perhaps you can have a look at &lt;A href="https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133" target="_blank"&gt;https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133&lt;/A&gt;. Btw, you do not need to use C++ as used in this particular code sample. The mentioned code section does the following:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;Dispatch a matrix kernel, which applies for the entire batch ("xmm", see line #130).&lt;/LI&gt;
	&lt;LI&gt;The kernel in the sample is requested *without* prefetches, but that's easy to change; see &lt;A href="https://github.com/hfp/libxsmm/blob/master/src/template/libxsmm.h#L136"&gt;libxsmm_dmmdispatch&lt;/A&gt;.&lt;/LI&gt;
	&lt;LI&gt;An OpenMP parallelized loop sweeps over the batch of matrices.&lt;/LI&gt;
	&lt;LI&gt;There are several cases of which operands you want to stream: (A,B,C), (A,B), (A,C), (B,C).&lt;/LI&gt;
	&lt;LI&gt;The term "batched matrix multiplication" usually applies to the case (A,B,C).&lt;/LI&gt;
	&lt;LI&gt;The signature of an SMM kernel in LIBXSMM is: kernel(a, b, c).&lt;/LI&gt;
	&lt;LI&gt;However, when you dispatch a kernel with prefetch support, the signature is: kernel(a,b,c, pa,pb,pc).&lt;/LI&gt;
	&lt;LI&gt;With the latter you can do: kernel(a[i], b[i], c[i], a[i+1], b[i+1], c[i+1]), which prefetches the "next" operands.&lt;/LI&gt;
	&lt;LI&gt;You may request/try different prefetch strategies, see &lt;A href="https://github.com/hfp/libxsmm/blob/master/include/libxsmm_typedefs.h#L83"&gt;here&lt;/A&gt;.&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Tue, 23 May 2017 06:51:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112879#M72975</guid>
      <dc:creator>Hans_P_Intel</dc:creator>
      <dc:date>2017-05-23T06:51:24Z</dc:date>
    </item>
    <item>
      <title>Quote:Hans P. (Intel) wrote:</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112880#M72976</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Hans P. (Intel) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;perhaps you can have a look at &lt;/SPAN&gt;&lt;A href="https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133" style="font-size: 1em; line-height: 1.5;"&gt;https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133&lt;/A&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;. Btw, you do not need to use C++ as used in this particular code sample. The mentioned code section does the following:&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Thanks Hans. I will try. &amp;nbsp;If I use&amp;nbsp;multiple OMP threads to perform small gemms, the bandwidth may become a problem. As&amp;nbsp;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;Jeremy suggested, a batch of small gemms in the parallel version is&amp;nbsp;memory-bound. Do you have other suggestions if the&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px; line-height: 19.512px;"&gt;bandwidth becomes a problem?&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 24 May 2017 03:00:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112880#M72976</guid>
      <dc:creator>Zhen</dc:creator>
      <dc:date>2017-05-24T03:00:30Z</dc:date>
    </item>
    <item>
      <title>Quote:Zhen J. wrote:</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112881#M72977</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Zhen J. wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG class="quote-header"&gt;Quote:&lt;/STRONG&gt;&lt;/P&gt;

&lt;BLOCKQUOTE class="quote-msg quote-nest-1 odd"&gt;
	&lt;DIV class="quote-author"&gt;&lt;EM class="placeholder"&gt;Hans P. (Intel)&lt;/EM&gt; wrote:&lt;/DIV&gt;

	&lt;P&gt;&amp;nbsp;&lt;/P&gt;

	&lt;P&gt;perhaps you can have a look at &lt;A href="https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133" rel="nofollow"&gt;https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133&lt;/A&gt;. Btw, you do not need to use C++ as used in this particular code sample. The mentioned code section does the following:&lt;/P&gt;

	&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Thanks Hans. I will try. &amp;nbsp;If I use&amp;nbsp;multiple OMP threads to perform small gemms, the bandwidth may become a problem. Like&amp;nbsp;Jeremy suggested, a batch of small gemms for parallel version is&amp;nbsp;memory-bound. Do you have other suggestions if the&amp;nbsp;bandwidth becomes a problem.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;I was only thinking about small matrix multiplication within the batch (memory-bound). This has nothing to do with using OpenMP or not. In fact, it is recommended to use multiple threads scattered across the machine in order to harvest the aggregated memory bandwidth. You want at least as many threads as are needed to "visit" all memory controllers. There are two ingredients in LIBXSMM that allow for a higher sustained bandwidth: (1) multiplication kernels are direct kernels and not "copy-kernels", and (2) effectively prefetching the "next" operands while performing the "current" multiplication.&lt;/P&gt;
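
&lt;P&gt;Ingredient (2) can be illustrated outside of LIBXSMM with a plain loop. This is only a sketch under assumptions: smm_16x16x16 is a hypothetical stand-in kernel (not LIBXSMM's generated code), the batch is packed, and __builtin_prefetch is the GCC/Clang builtin (also accepted by the Intel compiler). Whether touching every cache line of the next operands helps or hurts has to be measured.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;stddef.h&amp;gt;

enum { M = 16, ELEMS = M * M, LINE_FLOATS = 16 /* 64-byte line / 4-byte float */ };

/* Hypothetical stand-in for a specialized kernel: C += A * B, row-major. */
static void smm_16x16x16(const float *a, const float *b, float *c)
{
    for (int i = 0; i &amp;lt; M; i++)
        for (int k = 0; k &amp;lt; M; k++)
            for (int j = 0; j &amp;lt; M; j++)
                c[i * M + j] += a[i * M + k] * b[k * M + j];
}

/* Issue one prefetch hint per cache line of an operand.
   The rw/locality arguments of the builtin must be compile-time constants. */
static void prefetch_operand(const float *p, int for_write)
{
    for (int e = 0; e &amp;lt; ELEMS; e += LINE_FLOATS) {
        if (for_write) __builtin_prefetch(p + e, 1, 3);
        else           __builtin_prefetch(p + e, 0, 3);
    }
}

/* Multiply a packed batch; prefetch operands i+1 while computing operands i. */
void batch_smm(const float *A, const float *B, float *C, size_t batch)
{
    for (size_t i = 0; i &amp;lt; batch; i++) {
        if (i + 1 &amp;lt; batch) {
            prefetch_operand(A + (i + 1) * ELEMS, 0);
            prefetch_operand(B + (i + 1) * ELEMS, 0);
            prefetch_operand(C + (i + 1) * ELEMS, 1);
        }
        smm_16x16x16(A + i * ELEMS, B + i * ELEMS, C + i * ELEMS);
    }
}&lt;/PRE&gt;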

&lt;P&gt;Btw, for some (other) reason I added a new variety of our classic SMM samples, which compares with the Eigen C++ library:&lt;BR /&gt;
	&lt;A href="https://github.com/hfp/libxsmm/tree/master/samples/eigen" target="_blank"&gt;https://github.com/hfp/libxsmm/tree/master/samples/eigen&lt;/A&gt;&lt;BR /&gt;
	If you don't mind the C++ focus of this sample code, you may give it a try. You can clone Eigen from the original Mercurial repository, grab a regular release or package, or clone from GitHub (e.g., "git clone https://github.com/hfp/eigen.git"). If you install/clone into your $HOME, you don't even need to say "make EIGENROOT=/path/to/eigen" but simply "make" it. I checked with 2 GB of small matrices of shape 23x23 (an arbitrary choice), and found performance on KNL using LIBXSMM is &lt;STRONG&gt;3x higher&lt;/STRONG&gt; than using the kernels as inlined via Eigen. This speedup is for the streaming case (A,B,C), which is typically referred to as "batch multiplication". For fairness I have to say, one can hard-code the matrix dimensions to make inlining/compiler-codegen even more effective (Eigen has some compile-time shaped matrices). However, those are somewhat inconvenient compared to the "dynamically sized" matrix class. Last note here: the performance of inlining an implementation written in a high-level language (e.g., C/C++) depends on the compiler used, whereas LIBXSMM's performance does not.&lt;/P&gt;

&lt;P&gt;One more comment about Intel MKL and small matrix multiplications: it is not true that MKL is bad with small matrix multiplication (as said at the beginning of this discussion). There is MKL_DIRECT since v11.0, and there is even a classic batch interface, which for instance spares you from writing a parallelized loop when processing a batch (and brings other optimizations). I haven't tried the latter, but MKL_DIRECT is definitely effective (although I don't know the status on MIC/KNL). In fact, MKL_DIRECT not only means that multiplication kernels are accessed with lower overhead (when compared to a full-blown and error-checked [xerbla] BLAS/GEMM call), but also implies omitting any tiling and hence does not rely on copy-kernels (which would just eat up memory bandwidth in case of small matrices with no chance of hiding it behind a sufficient amount of compute).&lt;/P&gt;</description>
      <pubDate>Wed, 24 May 2017 05:15:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112881#M72977</guid>
      <dc:creator>Hans_P_Intel</dc:creator>
      <dc:date>2017-05-24T05:15:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt; Do you have other</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112882#M72978</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt; Do you have other suggestions if the&amp;nbsp;bandwidth becomes a problem.&lt;/P&gt;

&lt;P&gt;As said, it's already a problem with single core and OpenMP (or whatever kind of TRT) just helps to harvest the aggregated bandwidth of the entire processor/socket. If your 16x16 matrices are batched like (A,B,C), this comes down as low as 1.33 FLOPS/Byte (DP) or 2.67 FLOPS/Byte (SP). This is not exactly an arithmetically intense workload!&lt;/P&gt;

&lt;P&gt;FLOPS: 2 * m * n * k = 2 * 16 * 16 * 16 = 8192&lt;BR /&gt;
	Assuming batched matrix multiplications i.e., case (A,B,C):&lt;BR /&gt;
	Bytes (DP): 3 * 16 * 16 * 8 = 6144&lt;BR /&gt;
	Bytes (SP): 3 * 16 * 16 * 4 = 3072&lt;/P&gt;

&lt;P&gt;Btw, the above assumes NTS rather than RFO. If you read for ownership it yields 8192 Bytes (DP) and 4096 Bytes (SP), i.e., 33% more traffic, which drops arithmetic intensity by another 25% (to 1.0 and 2.0 FLOPS/Byte).&lt;/P&gt;
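
&lt;P&gt;The figures above can be reproduced with a few lines of C. This is just the arithmetic from this post wrapped in a function; the name arithmetic_intensity and its parameters are chosen for illustration.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;/* FLOPS per byte for one m x n x k multiplication in the streaming case (A,B,C).
   elem_bytes is 8 for DP and 4 for SP. With read_for_ownership != 0, C is
   counted twice (read before written); otherwise NTS is assumed. */
double arithmetic_intensity(int m, int n, int k, int elem_bytes, int read_for_ownership)
{
    double flops = 2.0 * m * n * k;
    double elems = (double)m * k + (double)k * n + (double)m * n;
    if (read_for_ownership)
        elems += (double)m * n;
    return flops / (elems * elem_bytes);
}&lt;/PRE&gt;

&lt;P&gt;For m = n = k = 16 this gives 1.33 (DP) and 2.67 (SP) FLOPS/Byte with NTS, and 1.0 (DP) and 2.0 (SP) FLOPS/Byte with RFO.&lt;/P&gt;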

&lt;P&gt;&amp;gt;&amp;gt;&amp;gt; shape 23x23 (an arbitrary choice)&lt;/P&gt;

&lt;P&gt;It's not too arbitrary but "fair enough" since each operand (matrix) is definitely farther away than 4KB. Btw, these 4KB are not related to the page size, or if you increase the page to 2MB -- you will not bypass this property. You may rather see this value measured in "number of cachelines" (instead of "xKB"). However, at least a single 16x16 matrix fits into 64 cachelines (4KB) such that the "next operand" is nicely in reach (assuming a packed batch).&lt;/P&gt;
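
&lt;P&gt;A small sketch of that cache-line arithmetic, assuming 64-byte cache lines, a 4KB window, and a packed batch in which the distance to the "next" operand equals the size of one operand. The helper names are made up for illustration.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;stddef.h&amp;gt;

enum { CACHE_LINE = 64, WINDOW_BYTES = 4096 };

/* Size of one dense rows x cols operand in bytes. */
size_t operand_bytes(size_t rows, size_t cols, size_t elem_bytes)
{
    return rows * cols * elem_bytes;
}

/* Number of cache lines one operand occupies (rounded up). */
size_t operand_cache_lines(size_t rows, size_t cols, size_t elem_bytes)
{
    return (operand_bytes(rows, cols, elem_bytes) + CACHE_LINE - 1) / CACHE_LINE;
}

/* Non-zero if the next operand of a packed batch starts within the 4KB window. */
int next_operand_in_reach(size_t rows, size_t cols, size_t elem_bytes)
{
    return operand_bytes(rows, cols, elem_bytes) &amp;lt;= WINDOW_BYTES;
}&lt;/PRE&gt;

&lt;P&gt;A 16x16 matrix occupies 16 cache lines in SP (1024 bytes) and 32 in DP (2048 bytes), so the next operand is in reach either way. A 23x23 matrix in DP is 4232 bytes, i.e., beyond 4KB, while in SP it is 2116 bytes.&lt;/P&gt;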

&lt;P&gt;Another PLUS+ you get with LIBXSMM is that you don't need the index set explicitly; you may compute it somehow (no need to store it upfront).&lt;/P&gt;</description>
      <pubDate>Wed, 24 May 2017 19:23:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112882#M72978</guid>
      <dc:creator>Hans_P_Intel</dc:creator>
      <dc:date>2017-05-24T19:23:31Z</dc:date>
    </item>
    <item>
      <title>Hi Zhen,</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112883#M72979</link>
      <description>Hi Zhen,

&amp;gt;&amp;gt;In my understanding of your opinion, there are two ways to achieve better performance for MM...
 &amp;gt;&amp;gt;...
 &amp;gt;&amp;gt;One is completely unrolled and manual prefetches data.

For 16x16 matrices it should have the lowest overhead. But excessive use of software prefetches could degrade performance.

&amp;gt;&amp;gt;Another is blocking the data, and the block size should equal to one cache line size. Right?

Yes, something like that, but some time will be spent on data mining, that is, on transforming the data in memory. The Intel C++ compiler could do that for you; take a look at the compiler options.

Also, take into account that &lt;STRONG&gt;advanced algorithm&lt;/STRONG&gt; for Matrix Multiplication ( MM ) implemented in pure C language could outperform MKL's sgemm and dgemm highly optimized functions. Here are some performance results for two versions of Strassen Matrix Multiplication algorithm on a KNL system:

Abbreviations used in reports:

MKL - Math Kernel Library
 CMMA - Classic Matrix Multiplication Algorithm
 SHBI - Strassen Heap Based Incomplete
 SHBC - Strassen Heap Based Complete</description>
      <pubDate>Thu, 25 May 2017 22:05:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112883#M72979</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-05-25T22:05:00Z</dc:date>
    </item>
    <item>
      <title>/////////////////////////////</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112884#M72980</link>
      <description>&lt;PRE class="brush:cpp;"&gt;///////////////////////////////////////////////////////////////////////////////
// 16384 x 16384 - Processing using MCDRAM

&amp;nbsp;Strassen HBI
&amp;nbsp;Matrix Size&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; : 16384 x 16384
&amp;nbsp;Matrix Size Threshold :&amp;nbsp; 8192 x&amp;nbsp; 8192
&amp;nbsp;Matrix Partitions&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; :&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 8
&amp;nbsp;Degree of Recursion&amp;nbsp;&amp;nbsp; :&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1
&amp;nbsp;Result Sets Reflection: N/A
&amp;nbsp;Calculating...
&amp;nbsp;Strassen HBI - Pass 01 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4.88600 secs
&amp;nbsp;Strassen HBI - Pass 02 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4.27700 secs
&amp;nbsp;Strassen HBI - Pass 05 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4.24900 secs
&amp;nbsp;Strassen HBI - Pass 03 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4.24000 secs
&amp;nbsp;Strassen HBI - Pass 04 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4.24800 secs
&amp;nbsp;ALGORITHM_STRASSENHBI - Passed

&amp;nbsp;Strassen HBC
&amp;nbsp;Matrix Size&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; : 16384 x 16384
&amp;nbsp;Matrix Size Threshold :&amp;nbsp; 8192 x&amp;nbsp; 8192
&amp;nbsp;Matrix Partitions&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; :&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 8
&amp;nbsp;Degree of Recursion&amp;nbsp;&amp;nbsp; :&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1
&amp;nbsp;Result Sets Reflection: Disabled
&amp;nbsp;Calculating...
&amp;nbsp;Strassen HBC - Pass 01 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.03100 secs
&amp;nbsp;Strassen HBC - Pass 03 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4.96100 secs
&amp;nbsp;Strassen HBC - Pass 05 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4.94200 secs
&amp;nbsp;Strassen HBC - Pass 03 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4.96200 secs
&amp;nbsp;Strassen HBC - Pass 04 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4.95400 secs
&amp;nbsp;ALGORITHM_STRASSENHBC - 1 - Passed

&amp;nbsp;Cblas xGEMM
&amp;nbsp;Matrix Size&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; : 16384 x 16384
&amp;nbsp;Matrix Size Threshold : N/A
&amp;nbsp;Matrix Partitions&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; : N/A
&amp;nbsp;Degree of Recursion&amp;nbsp;&amp;nbsp; : N/A
&amp;nbsp;Result Sets Reflection: N/A
&amp;nbsp;Calculating...
&amp;nbsp;Cblas xGEMM&amp;nbsp; - Pass 01 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.57000 secs
&amp;nbsp;Cblas xGEMM&amp;nbsp; - Pass 02 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.61600 secs
&amp;nbsp;Cblas xGEMM&amp;nbsp; - Pass 03 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.56600 secs
&amp;nbsp;Cblas xGEMM&amp;nbsp; - Pass 04 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.61800 secs
&amp;nbsp;Cblas xGEMM&amp;nbsp; - Pass 05 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.60700 secs
&amp;nbsp;Cblas xGEMM&amp;nbsp; - Passed
&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 25 May 2017 22:06:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112884#M72980</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-05-25T22:06:57Z</dc:date>
    </item>
    <item>
      <title>/////////////////////////////</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112885#M72981</link>
      <description>&lt;PRE class="brush:cpp;"&gt;///////////////////////////////////////////////////////////////////////////////
// 24576 x 24576 - Processing using MCDRAM

&amp;nbsp;Strassen HBI
&amp;nbsp;Matrix Size&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; : 24576 x 24576
&amp;nbsp;Matrix Size Threshold : 12288 x 12288
&amp;nbsp;Matrix Partitions&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; :&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 8
&amp;nbsp;Degree of Recursion&amp;nbsp;&amp;nbsp; :&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1
&amp;nbsp;Result Sets Reflection: N/A
&amp;nbsp;Calculating...
&amp;nbsp;Strassen HBI - Pass 01 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp; 11.90400 secs
&amp;nbsp;Strassen HBI - Pass 02 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp; 11.54500 secs
&amp;nbsp;Strassen HBI - Pass 03 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp; 11.47500 secs
&amp;nbsp;Strassen HBI - Pass 04 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp; 11.54300 secs
&amp;nbsp;Strassen HBI - Pass 05 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp; 11.43400 secs
&amp;nbsp;ALGORITHM_STRASSENHBI - Passed

&amp;nbsp;Strassen HBC - Not Tested ( doesn't fit into MCDRAM )

&amp;nbsp;Cblas xGEMM
&amp;nbsp;Matrix Size&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; : 24576 x 24576
&amp;nbsp;Matrix Size Threshold : N/A
&amp;nbsp;Matrix Partitions&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; : N/A
&amp;nbsp;Degree of Recursion&amp;nbsp;&amp;nbsp; : N/A
&amp;nbsp;Result Sets Reflection: N/A
&amp;nbsp;Calculating...
&amp;nbsp;Cblas xGEMM&amp;nbsp; - Pass 01 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp; 15.26800 secs
&amp;nbsp;Cblas xGEMM&amp;nbsp; - Pass 02 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp; 15.30900 secs
&amp;nbsp;Cblas xGEMM&amp;nbsp; - Pass 03 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp; 15.32700 secs
&amp;nbsp;Cblas xGEMM&amp;nbsp; - Pass 04 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp; 15.26900 secs
&amp;nbsp;Cblas xGEMM&amp;nbsp; - Pass 05 - Completed:&amp;nbsp;&amp;nbsp;&amp;nbsp; 15.29700 secs
&amp;nbsp;Cblas xGEMM&amp;nbsp; - Passed
&lt;/PRE&gt;
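
&lt;P&gt;A rough sketch (not part of the original test program) of how the pass timings above translate into throughput. It assumes the classic 2*N^3 operation count for both algorithms, so the Strassen figure is only an "effective" rate; the helper name effective_gflops is hypothetical.&lt;/P&gt;

```python
# Effective throughput for the 24576 x 24576 runs in the log above.
# Assumption: both algorithms are charged the classic 2*N^3 operation
# count. Strassen executes fewer multiplications, so its number is an
# "effective" rate, not a count of operations actually performed.

N = 24576

# Pass timings in seconds, copied from the log above.
strassen_hbi = [11.904, 11.545, 11.475, 11.543, 11.434]
cblas_xgemm = [15.268, 15.309, 15.327, 15.269, 15.297]


def effective_gflops(n, seconds):
    """Return 2*n^3 floating-point operations per second, in GFLOP/s."""
    return 2.0 * n ** 3 / seconds / 1e9


# Use the best (smallest) pass time of each algorithm.
best_hbi = min(strassen_hbi)
best_gemm = min(cblas_xgemm)

gflops_hbi = effective_gflops(N, best_hbi)
gflops_gemm = effective_gflops(N, best_gemm)
speedup = best_gemm / best_hbi

print(f"Strassen HBI: {best_hbi:.3f} s, {gflops_hbi:.0f} effective GFLOP/s")
print(f"Cblas xGEMM : {best_gemm:.3f} s, {gflops_gemm:.0f} GFLOP/s")
print(f"Speedup of Strassen HBI over xGEMM: {speedup:.2f}x")
```

&lt;P&gt;Taking the best pass rather than the mean is a choice made for this sketch; the five passes differ by only a few percent either way.&lt;/P&gt;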

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 25 May 2017 22:09:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112885#M72981</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-05-25T22:09:08Z</dc:date>
    </item>
    <item>
      <title>Here are two more performance</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112886#M72982</link>
      <description>Here are two more performance reports...</description>
      <pubDate>Thu, 25 May 2017 22:13:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112886#M72982</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-05-25T22:13:26Z</dc:date>
    </item>
    <item>
      <title>  // Summary of Test Results</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112887#M72983</link>
      <description>&lt;PRE class="brush:cpp;"&gt;&amp;nbsp;&amp;nbsp;// Summary of Test Results
&amp;nbsp;&amp;nbsp;// KNL Server Modes: Cluster=All2All / MCDRAM=Cache
&amp;nbsp;&amp;nbsp;// KMP_AFFINITY = scatter
&amp;nbsp;&amp;nbsp;// Number of OpenMP threads = 64
&amp;nbsp;&amp;nbsp;// MAS=DDR4:DDR4:DDR4
&amp;nbsp;&amp;nbsp;// MKL cblas_sgemm vs. CMMA vs. SHBI vs. SHBC
&amp;nbsp;&amp;nbsp;// CMMA with LPS=1:1:1 and CS=ij:ik:jk
&amp;nbsp;&amp;nbsp;// Measurements are in seconds

&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; N&amp;nbsp;&amp;nbsp;&amp;nbsp; MMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Scatter

&amp;nbsp;&amp;nbsp;&amp;nbsp; 256&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0106379
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0008685
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0010000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0020000
&amp;nbsp;&amp;nbsp;&amp;nbsp; 512&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0110626
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0011320
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0020000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0030000
&amp;nbsp;&amp;nbsp; 1024   &amp;nbsp;MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0124671
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0152496
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0040000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0060000
&amp;nbsp;&amp;nbsp; 2048&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0329953
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.1201248
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0140000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0250000
&amp;nbsp;&amp;nbsp; 4096&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.1085401
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.9715966
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0730000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.1150000
&amp;nbsp;&amp;nbsp; 8192&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.6367596
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 9.4611956
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.6110000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.7640000
&amp;nbsp;&amp;nbsp;16384&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 5.9910029
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 64.1694818
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4.3020000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 4.9670000
&amp;nbsp;&amp;nbsp;32768&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 49.3818010
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp; 525.0617163
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 41.7650000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 44.4220000
&amp;nbsp;&amp;nbsp;65536&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 372.3268384
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp; 4459.1382627
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp; 361.8030100
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp; 377.5080000
&lt;/PRE&gt;
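
&lt;P&gt;To make the crossover points in the table above easier to see, here is a minimal sketch that picks the fastest algorithm for each N and prints the MKL-to-SHBI and MKL-to-CMMA time ratios. The numbers are copied from the table; the names cache_mode, fastest and winners are made up for this illustration.&lt;/P&gt;

```python
# Timings in seconds for MCDRAM=Cache mode, copied from the table above.
# Each entry maps N to the times for (MKL, CMMA, SHBI, SHBC).
cache_mode = {
    256: (0.0106379, 0.0008685, 0.0010000, 0.0020000),
    512: (0.0110626, 0.0011320, 0.0020000, 0.0030000),
    1024: (0.0124671, 0.0152496, 0.0040000, 0.0060000),
    2048: (0.0329953, 0.1201248, 0.0140000, 0.0250000),
    4096: (0.1085401, 0.9715966, 0.0730000, 0.1150000),
    8192: (0.6367596, 9.4611956, 0.6110000, 0.7640000),
    16384: (5.9910029, 64.1694818, 4.3020000, 4.9670000),
    32768: (49.3818010, 525.0617163, 41.7650000, 44.4220000),
    65536: (372.3268384, 4459.1382627, 361.8030100, 377.5080000),
}
names = ("MKL", "CMMA", "SHBI", "SHBC")


def fastest(times):
    """Return the name of the algorithm with the smallest time."""
    return names[times.index(min(times))]


winners = {n: fastest(times) for n, times in cache_mode.items()}

for n, (mkl, cmma, shbi, shbc) in cache_mode.items():
    print(
        f"N={n:6d}  fastest={winners[n]:4s}  "
        f"MKL/SHBI={mkl / shbi:6.2f}  MKL/CMMA={mkl / cmma:6.2f}"
    )
```

&lt;P&gt;A ratio above 1.00 means the algorithm in the denominator is faster than MKL for that N. The SHBI and SHBC timings for small N appear to be reported at millisecond resolution, so their ratios there are coarse.&lt;/P&gt;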

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 25 May 2017 22:14:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112887#M72983</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-05-25T22:14:00Z</dc:date>
    </item>
    <item>
      <title>  // Summary of Test Results</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112888#M72984</link>
      <description>&lt;PRE class="brush:cpp;"&gt;&amp;nbsp;&amp;nbsp;// Summary of Test Results
&amp;nbsp;&amp;nbsp;// KNL Server Modes: Cluster=All2All / MCDRAM=Flat
&amp;nbsp;&amp;nbsp;// KMP_AFFINITY = scatter
&amp;nbsp;&amp;nbsp;// Number of OpenMP threads = 64
&amp;nbsp;&amp;nbsp;// MAS=DDR4:DDR4:DDR4
&amp;nbsp;&amp;nbsp;// MKL cblas_sgemm vs. CMMA vs. SHBI vs. SHBC
&amp;nbsp;&amp;nbsp;// CMMA with LPS=1:1:1 and CS=ij:ik:jk
&amp;nbsp;&amp;nbsp;// Measurements are in seconds

&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; N&amp;nbsp;&amp;nbsp;&amp;nbsp; MMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Scatter

&amp;nbsp;&amp;nbsp;&amp;nbsp; 256&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0241061
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0000882
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0010000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0010000
&amp;nbsp;&amp;nbsp;&amp;nbsp; 512&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0234587
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0010301
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0010000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0020000
&amp;nbsp;&amp;nbsp; 1024&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0248575
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0151657
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0040000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0060000
&amp;nbsp;&amp;nbsp; 2048&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0352349
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.1201408
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0150000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.0250000
&amp;nbsp;&amp;nbsp; 4096&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.1296450
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.9940334
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.1030000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.1450000
&amp;nbsp;&amp;nbsp; 8192&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.9212241
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 11.3463882
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.7460000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0.8980000
&amp;nbsp;&amp;nbsp;16384&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 6.5647681
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 99.9722107
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 6.2860000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 6.7630000
&amp;nbsp;&amp;nbsp;32768&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 51.2515874
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp;&amp;nbsp;&amp;nbsp; 866.5838490
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 48.1560000
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 51.0280000
&amp;nbsp;&amp;nbsp;65536&amp;nbsp;   MKL&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 399.2602713
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;CMMA&amp;nbsp; 14912.4903906
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBI&amp;nbsp;&amp;nbsp;&amp;nbsp; 383.6749900
       &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;SHBC&amp;nbsp;&amp;nbsp;&amp;nbsp; 393.2439900
&lt;/PRE&gt;
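
&lt;P&gt;A classic O(N^3) multiplication should take roughly 8 times longer each time N doubles. The sketch below is only illustrative arithmetic on the Flat-mode numbers from the table above, checking how close CMMA and MKL come to that ideal; growth_factors is a hypothetical helper name.&lt;/P&gt;

```python
# Flat-mode timings in seconds, copied from the table above.
sizes = [256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]
cmma = [
    0.0000882, 0.0010301, 0.0151657, 0.1201408, 0.9940334,
    11.3463882, 99.9722107, 866.5838490, 14912.4903906,
]
mkl = [
    0.0241061, 0.0234587, 0.0248575, 0.0352349, 0.1296450,
    0.9212241, 6.5647681, 51.2515874, 399.2602713,
]


def growth_factors(times):
    """Ratio of each timing to the previous one (N doubles at every step)."""
    return [later / earlier for earlier, later in zip(times, times[1:])]


cmma_growth = growth_factors(cmma)
mkl_growth = growth_factors(mkl)

# Each row shows how much slower the run became when N doubled to this size.
for n, c, m in zip(sizes[1:], cmma_growth, mkl_growth):
    print(f"N={n:6d}  CMMA x{c:6.2f}  MKL x{m:6.2f}  (ideal cubic: x8.00)")
```

&lt;P&gt;Factors well below 8 at small N suggest the run is dominated by fixed overhead rather than arithmetic, and factors well above 8 suggest something other than arithmetic, memory traffic being one plausible candidate, is limiting the run at that size.&lt;/P&gt;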

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 25 May 2017 22:17:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112888#M72984</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-05-25T22:17:00Z</dc:date>
    </item>
    <item>
      <title>Zhen, please take a look at</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112889#M72985</link>
      <description>Zhen, please take a look at &lt;STRONG&gt;Performance of Classic Matrix Multiplication algorithm on a KNL system&lt;/STRONG&gt; article &lt;A href="https://software.intel.com/en-us/articles/performance-of-classic-matrix-multiplication-algorithm-on-intel-xeon-phi-processor-system" target="_blank"&gt;https://software.intel.com/en-us/articles/performance-of-classic-matrix-multiplication-algorithm-on-intel-xeon-phi-processor-system&lt;/A&gt;</description>
      <pubDate>Thu, 25 May 2017 22:21:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112889#M72985</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-05-25T22:21:31Z</dc:date>
    </item>
    <item>
      <title>Hi Hans,</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112890#M72986</link>
      <description>&lt;P&gt;Hi Hans,&lt;/P&gt;

&lt;P&gt;I have tried the prefetch method, and LIBXSMM_PREFETCH_AL1_BL1_CL1 gives me better performance. One question about Libxsmm: does it depend on MKL? I ran into the following errors when I tried to compile the code; after I added the -mkl flag, the errors disappeared.&lt;/P&gt;

&lt;P&gt;lib/libxsmm.a(libxsmm_gemm.o): In function `libxsmm_original_sgemm':&lt;BR /&gt;
	libxsmm_gemm.c:(.text+0xb88): undefined reference to `sgemm_'&lt;BR /&gt;
	lib/libxsmm.a(libxsmm_gemm.o): In function `libxsmm_original_dgemm':&lt;BR /&gt;
	libxsmm_gemm.c:(.text+0xfd8): undefined reference to `dgemm_'&lt;/P&gt;

&lt;P&gt;What do you mean by "multiplications kernels are direct kernels and not copy-kernels"? Could you explain more?&lt;BR /&gt;
	Could you also explain more about the 4KB case? I do not think I understand it.&lt;/P&gt;

&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Sun, 28 May 2017 23:37:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112890#M72986</guid>
      <dc:creator>Zhen</dc:creator>
      <dc:date>2017-05-28T23:37:05Z</dc:date>
    </item>
    <item>
      <title>Quote:Hans P. (Intel) wrote:</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112891#M72987</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Hans P. (Intel) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;One more comment about Intel MKL and small matrix multiplications: it is not true that MKL is bad with small matrix multiplication (as said at the begin of this discussion). There is MKL_DIRECT since v11.0, and there is even a classic batch interface, which gets you for instance around of writing a parallelized loop when processing a batch (and other optimizations). I haven't tried the latter, but MKL_DIRECT is definitely effective (although I don't know the status on MIC/KNL). In fact, MKL_DIRECT not only indicates that multiplication kernels are accessed with lower overhead (when compared to a full-blown and error-checked [xerbla] BLAS/GEMM call), but implies to omit any tiling and hence does not rely on copy-kernels (which would just eat-up memory bandwidth in case of small matrices with no chance of hiding it behind a sufficient amount of compute).&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Hi Hans,&lt;/P&gt;

&lt;P&gt;I will try&amp;nbsp;&lt;SPAN style="font-size: 13.008px; line-height: 19.512px;"&gt;MKL_DIRECT and see how it performs on KNL. Could you also share some information about&amp;nbsp;MKL_DIRECT, if you happen to have any?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="line-height: 19.512px;"&gt;Thanks!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 28 May 2017 23:44:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-Cache-Prefetchers-question/m-p/1112891#M72987</guid>
      <dc:creator>Zhen</dc:creator>
      <dc:date>2017-05-28T23:44:47Z</dc:date>
    </item>
  </channel>
</rss>

