Hi
I am working on processing a batch of small matrix multiplications on Knights Landing. Since MKL's performance is not good for small matrices, I use libxsmm. However, VTune shows a lot of cache misses: the L1 cache miss rate is about 10%, and the L2 cache miss rate is about 17%.
It achieves less than 20 GFLOPS on a single thread. I also wrote a sample program to test the performance under an ideal condition (no or very few cache misses); it can achieve 50 GFLOPS on a single thread.
The code is:
for (i = 0; i < batch; i++) {
    const float *inA = A + i*Arows*Acols;
    const float *inB = B + i*Bcols*Brows;
    float *oto = out + i*Orows*Ocols;
    libxsmm_smm_16_16_16(inA, inB, oto);
    // libxsmm_smm_16_16_16(A, B, out); // ideal condition: no or very few cache misses
}
In this case, it seems that the hardware prefetchers do not work well, so I am very curious why they cannot always prefetch the next matrix. Each matrix is 16x16, so for each GEMM the input matrices and the result fit into the L1 cache. The memory accesses are consecutive, so the prefetcher should be able to prefetch the data for the next matrix multiplication. Apparently it does not, otherwise there would not be so many cache misses.
According to the Intel Xeon Phi book, the hardware prefetcher will not stream across a 4KB boundary. Is that the problem?
Does the 4KB boundary mean the page-size boundary? I also tried huge pages (2MB pages) through hugetlbfs, but they did not help.
I also checked the assembly code: even though the compiler's software prefetching is enabled, I do not see any prefetch instructions. So I think I may need to issue software prefetches manually.
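Something like the following is what I have in mind (just a sketch; the distance of one matrix ahead and the T0 hint are guesses that would need tuning, and each _mm_prefetch only touches a single 64-byte cache line, so a full 16x16 float matrix (1KB) needs 16 of them):

#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */

for (i = 0; i < batch; i++) {
    const float *inA = A + i*Arows*Acols;
    const float *inB = B + i*Bcols*Brows;
    float *oto = out + i*Orows*Ocols;
    if (i + 1 < batch) { /* prefetch the operands of the next multiplication */
        int line;
        for (line = 0; line < 16; line++) { /* 16 lines x 64 bytes = 1KB per matrix */
            _mm_prefetch((const char*)(inA + Arows*Acols) + line*64, _MM_HINT_T0);
            _mm_prefetch((const char*)(inB + Bcols*Brows) + line*64, _MM_HINT_T0);
        }
    }
    libxsmm_smm_16_16_16(inA, inB, oto);
}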
Any ideas on how to optimize the program?
Thanks!
Best regards,
Zhen
Hi Sergey,
Thanks for the reply. I just saw your reply to my Half Precision Floats post. My bad, and also thanks very much for that reply. This one is a different problem from that one.
In my understanding of your opinion, there are two ways to achieve better performance for MM. One is to unroll completely and prefetch the data manually. The other is to block the data, with the block size equal to one cache line. Right?
If the blocking method is used, does that mean the hardware prefetcher will stream-prefetch the data, while with the unrolling method the data will not be prefetched?
Zhen
Why do you think that your memory accesses are consecutive?
I suggest you dump the addresses of those memory accesses to get a better look at the problem.
Hi Jeremy,
I do a batched GEMM, so the addresses of A1 and A2 are consecutive in my code, as are B1 and B2, and C1 and C2 (where C1 = A1*B1).
A, B, C represent matrices.
BTW, could you explain more about how to dump the addresses?
Zhen J. wrote:
Hi Jeremy,
I do a batched GEMM, so the addresses of A1 and A2 are consecutive in my code, as are B1 and B2, and C1 and C2 (where C1 = A1*B1).
A, B, C represent matrices.
BTW, could you explain more about how to dump the addresses?
For your case, it would be straightforward to obtain the addresses with a debugger and step through each instruction (there are only 256 elements in each 16x16 matrix...). For each memory access, you need the instruction address and the memory address. In addition, remember to record the values of your pointers so that you know which matrix element is being accessed by each memory access.
To my understanding, a hardware prefetcher implementation takes the instruction address as one of its inputs. Therefore, the instruction addresses of the memory accesses could be critical information for explaining your observation.
Finally, you had better first confirm whether your code is memory-bound or CPU-bound.
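If single-stepping in a debugger is too tedious, a quick printf over the loop from your first post would at least show the data addresses per iteration (not the instruction addresses, for which you still need a debugger or profiler); just a sketch:

#include <stdio.h>

for (i = 0; i < batch; i++) {
    const float *inA = A + i*Arows*Acols;
    const float *inB = B + i*Bcols*Brows;
    float *oto = out + i*Orows*Ocols;
    printf("i=%d  A=%p  B=%p  C=%p\n", i,
           (const void*)inA, (const void*)inB, (const void*)oto);
    libxsmm_smm_16_16_16(inA, inB, oto);
}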
Thanks Jeremy, that is a good suggestion.
Dear Zhen,
perhaps you can have a look at https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133. Btw, you do not need to use C++ as used in this particular code sample. The mentioned code section does the following:
- Dispatch a matrix kernel once, which applies to the entire batch ("xmm", see line #130).
- The kernel in the sample is requested *without* prefetches, but that is easy to change; see libxsmm_dmmdispatch.
- An OpenMP parallelized loop sweeps over the batch of matrices.
- There are several cases depending on which operands you want to stream: (A,B,C), (A,B), (A,C), (B,C).
- The term "batched matrix multiplication" usually applies to the case (A,B,C).
- The signature of an SMM kernel in LIBXSMM is: kernel(a, b, c).
- However, when you dispatch a kernel with prefetch support, the signature is: kernel(a,b,c, pa,pb,pc).
- With the latter you can do kernel(a, b, c, a[i+1], b[i+1], c[i+1]), which prefetches the "next" operands (see the sketch after this list).
- You may request/try different prefetch strategies, see here.
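To make the above concrete, here is a rough sketch of the prefetch-enabled path for your 16x16 single-precision case (a sketch only; the exact libxsmm_smmdispatch arguments and defaults may differ between LIBXSMM versions, and the prefetch strategy shown is just one choice):

#include <libxsmm.h>

const float alpha = 1.0f, beta = 0.0f;
const int flags = LIBXSMM_GEMM_FLAG_NONE;
const int prefetch = LIBXSMM_PREFETCH_AL1_BL1_CL1; /* one of several strategies */
/* dispatch a single-precision 16x16x16 kernel once for the entire batch */
libxsmm_smmfunction kernel = libxsmm_smmdispatch(16, 16, 16,
    NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/, &alpha, &beta, &flags, &prefetch);

for (i = 0; i < batch; i++) {
    const int j = (i + 1 < batch) ? (i + 1) : i; /* last iteration "prefetches" itself */
    /* 256 = 16*16 elements per matrix; the last three arguments are the "next" operands */
    kernel(A + i*256, B + i*256, C + i*256,
           A + j*256, B + j*256, C + j*256);
}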
Hans P. (Intel) wrote:
perhaps you can have a look at https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133. Btw, you do not need to use C++ as used in this particular code sample. The mentioned code section does the following:
Thanks Hans, I will give it a try. If I use multiple OMP threads to perform the small GEMMs, bandwidth may become a problem. As Jeremy suggested, a batch of small GEMMs in the parallel version is memory-bound. Do you have other suggestions if bandwidth becomes a problem?
Zhen J. wrote:
Quote:
Hans P. (Intel) wrote:
perhaps you can have a look at https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133. Btw, you do not need to use C++ as used in this particular code sample. The mentioned code section does the following:
Thanks Hans, I will give it a try. If I use multiple OMP threads to perform the small GEMMs, bandwidth may become a problem. As Jeremy suggested, a batch of small GEMMs in the parallel version is memory-bound. Do you have other suggestions if bandwidth becomes a problem?
I was only thinking about small matrix multiplications within the batch (memory-bound). This has nothing to do with using OpenMP or not. In fact, it is recommended to use multiple threads scattered across the machine in order to harvest the aggregate memory bandwidth; you want at least as many threads as are needed to "visit" all memory controllers. There are two ingredients in LIBXSMM that allow for a higher sustained bandwidth: (1) the multiplication kernels are direct kernels rather than "copy-kernels", and (2) the "next" operands are effectively prefetched while the "current" multiplication is performed.
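As a rough illustration only (this is a sketch, not the sample code): each OpenMP thread can sweep a contiguous chunk of the batch with a prefetch-enabled kernel dispatched as in the earlier sketch, so that the "next" operands it prefetches stay within its own range; thread placement (e.g., KMP_AFFINITY=scatter) is left to the environment.

#include <omp.h>

#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int tid = omp_get_thread_num();
    const int chunk = (batch + nthreads - 1) / nthreads;
    const int begin = tid * chunk;
    const int end = (begin + chunk < batch) ? (begin + chunk) : batch;
    int i;
    for (i = begin; i < end; i++) {
        const int j = (i + 1 < end) ? (i + 1) : i; /* keep prefetches inside this chunk */
        kernel(A + i*256, B + i*256, C + i*256,
               A + j*256, B + j*256, C + j*256);
    }
}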
Btw, for some (other) reason I added a new variety of our classic SMM samples, which compares with the Eigen C++ library:
https://github.com/hfp/libxsmm/tree/master/samples/eigen
If you don't mind the C++ focus of this sample code, you may give it a try. You can clone Eigen from the original Mercurial repository, grab a regular release or package, or clone it from GitHub (e.g., "git clone https://github.com/hfp/eigen.git"). If you install/clone into your $HOME, you don't even need to say "make EIGENROOT=/path/to/eigen"; simply "make" it. I checked with 2 GB of small matrices of shape 23x23 (an arbitrary choice) and found that on KNL the performance with LIBXSMM is 3x higher than with the kernels as inlined via Eigen. This speedup is for the streaming case (A,B,C), which is typically referred to as "batched multiplication". In fairness, one can hard-code the matrix dimensions to make inlining/compiler code generation even more effective (Eigen has some compile-time-shaped matrices); however, those are somewhat inconvenient compared to the "dynamically sized" matrix class. Last note here: the performance of inlining an implementation written in a high-level language (e.g., C/C++) depends on the compiler used, whereas LIBXSMM's performance does not.
One more comment about Intel MKL and small matrix multiplications: it is not true that MKL is bad at small matrix multiplication (as was said at the beginning of this discussion). There has been MKL_DIRECT since v11.0, and there is even a classic batch interface, which, for instance, saves you from writing a parallelized loop when processing a batch (among other optimizations). I haven't tried the latter, but MKL_DIRECT is definitely effective (although I don't know its status on MIC/KNL). In fact, MKL_DIRECT not only means that the multiplication kernels are reached with lower overhead (compared to a full-blown, error-checked [xerbla] BLAS/GEMM call); it also implies omitting any tiling and hence does not rely on copy-kernels (which would just eat up memory bandwidth in the case of small matrices, with no chance of hiding it behind a sufficient amount of compute).
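For completeness, a minimal sketch of the direct-call path: define MKL_DIRECT_CALL before including mkl.h, or pass -DMKL_DIRECT_CALL on the compile line (MKL_DIRECT_CALL_SEQ is the sequential flavor). Whether MKL actually takes the low-overhead path for a given size and architecture is up to the library, so treat this as an illustration rather than a guarantee:

#define MKL_DIRECT_CALL /* or compile with -DMKL_DIRECT_CALL */
#include <mkl.h>

/* one 16x16 single-precision multiplication, column-major: C = A * B */
static void gemm_16x16(const float* a, const float* b, float* c) {
    const MKL_INT n = 16;
    const float alpha = 1.0f, beta = 0.0f;
    const char notrans = 'N';
    sgemm(&notrans, &notrans, &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
}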
>>> Do you have other suggestions if bandwidth becomes a problem?
As said, it is already a problem with a single core, and OpenMP (or whatever kind of threading runtime) just helps to harvest the aggregate bandwidth of the entire processor/socket. If your 16x16 matrices are batched as in case (A,B,C), this comes down to as little as 1.33 FLOPS/Byte (DP) or 2.67 FLOPS/Byte (SP). This is not exactly an arithmetically intense workload!
FLOPS: 2 * m * n * k = 2 * 16 * 16 * 16 = 8192
Assuming batched matrix multiplications i.e., case (A,B,C):
Bytes (DP): 3 * 16 * 16 * 8 = 6144
Bytes (SP): 3 * 16 * 16 * 4 = 3072
Btw, the above assumes non-temporal stores (NTS) rather than read-for-ownership (RFO). If you read C for ownership, it becomes 8192 Bytes (DP) or 4096 Bytes (SP), i.e., another 33% of memory traffic and a correspondingly lower arithmetic intensity.
>>> shape 23x23 (an arbitrary choice)
It's not too arbitrary but "fair enough", since each operand (matrix) is then definitely farther away than 4KB. Btw, these 4KB are not related to the page size, and increasing the page size to 2MB will not bypass this property. You may rather think of this value as a number of cache lines (instead of "x KB"). However, a single 16x16 matrix fits well within 64 cache lines (4KB), such that the "next" operand is nicely within reach (assuming a packed batch).
Another plus you get with LIBXSMM is that you don't need to store the index set explicitly; you may compute it on the fly (no need to store it upfront).
///////////////////////////////////////////////////////////////////////////////
// 16384 x 16384 - Processing using MCDRAM

Strassen HBI
Matrix Size           : 16384 x 16384
Matrix Size Threshold : 8192 x 8192
Matrix Partitions     : 8
Degree of Recursion   : 1
Result Sets Reflection: N/A
Calculating...
Strassen HBI - Pass 01 - Completed: 4.88600 secs
Strassen HBI - Pass 02 - Completed: 4.27700 secs
Strassen HBI - Pass 05 - Completed: 4.24900 secs
Strassen HBI - Pass 03 - Completed: 4.24000 secs
Strassen HBI - Pass 04 - Completed: 4.24800 secs
ALGORITHM_STRASSENHBI - Passed

Strassen HBC
Matrix Size           : 16384 x 16384
Matrix Size Threshold : 8192 x 8192
Matrix Partitions     : 8
Degree of Recursion   : 1
Result Sets Reflection: Disabled
Calculating...
Strassen HBC - Pass 01 - Completed: 5.03100 secs
Strassen HBC - Pass 03 - Completed: 4.96100 secs
Strassen HBC - Pass 05 - Completed: 4.94200 secs
Strassen HBC - Pass 03 - Completed: 4.96200 secs
Strassen HBC - Pass 04 - Completed: 4.95400 secs
ALGORITHM_STRASSENHBC - 1 - Passed

Cblas xGEMM
Matrix Size           : 16384 x 16384
Matrix Size Threshold : N/A
Matrix Partitions     : N/A
Degree of Recursion   : N/A
Result Sets Reflection: N/A
Calculating...
Cblas xGEMM - Pass 01 - Completed: 5.57000 secs
Cblas xGEMM - Pass 02 - Completed: 5.61600 secs
Cblas xGEMM - Pass 03 - Completed: 5.56600 secs
Cblas xGEMM - Pass 04 - Completed: 5.61800 secs
Cblas xGEMM - Pass 05 - Completed: 5.60700 secs
Cblas xGEMM - Passed
///////////////////////////////////////////////////////////////////////////////
// 24576 x 24576 - Processing using MCDRAM

Strassen HBI
Matrix Size           : 24576 x 24576
Matrix Size Threshold : 12288 x 12288
Matrix Partitions     : 8
Degree of Recursion   : 1
Result Sets Reflection: N/A
Calculating...
Strassen HBI - Pass 01 - Completed: 11.90400 secs
Strassen HBI - Pass 02 - Completed: 11.54500 secs
Strassen HBI - Pass 03 - Completed: 11.47500 secs
Strassen HBI - Pass 04 - Completed: 11.54300 secs
Strassen HBI - Pass 05 - Completed: 11.43400 secs
ALGORITHM_STRASSENHBI - Passed

Strassen HBC - Not Tested ( doesn't fit into MCDRAM )

Cblas xGEMM
Matrix Size           : 24576 x 24576
Matrix Size Threshold : N/A
Matrix Partitions     : N/A
Degree of Recursion   : N/A
Result Sets Reflection: N/A
Calculating...
Cblas xGEMM - Pass 01 - Completed: 15.26800 secs
Cblas xGEMM - Pass 02 - Completed: 15.30900 secs
Cblas xGEMM - Pass 03 - Completed: 15.32700 secs
Cblas xGEMM - Pass 04 - Completed: 15.26900 secs
Cblas xGEMM - Pass 05 - Completed: 15.29700 secs
Cblas xGEMM - Passed
// Summary of Test Results
// KNL Server Modes: Cluster=All2All / MCDRAM=Cache
// KMP_AFFINITY = scatter
// Number of OpenMP threads = 64
// MAS=DDR4:DDR4:DDR4
// MKL cblas_sgemm vs. CMMA vs. SHBI vs. SHBC
// CMMA with LPS=1:1:1 and CS=ij:ik:jk
// Measurements are in seconds

    N          MKL            CMMA           SHBI           SHBC
  256    0.0106379      0.0008685      0.0010000      0.0020000
  512    0.0110626      0.0011320      0.0020000      0.0030000
 1024    0.0124671      0.0152496      0.0040000      0.0060000
 2048    0.0329953      0.1201248      0.0140000      0.0250000
 4096    0.1085401      0.9715966      0.0730000      0.1150000
 8192    0.6367596      9.4611956      0.6110000      0.7640000
16384    5.9910029     64.1694818      4.3020000      4.9670000
32768   49.3818010    525.0617163     41.7650000     44.4220000
65536  372.3268384   4459.1382627    361.8030100    377.5080000
// Summary of Test Results
// KNL Server Modes: Cluster=All2All / MCDRAM=Flat
// KMP_AFFINITY = scatter
// Number of OpenMP threads = 64
// MAS=DDR4:DDR4:DDR4
// MKL cblas_sgemm vs. CMMA vs. SHBI vs. SHBC
// CMMA with LPS=1:1:1 and CS=ij:ik:jk
// Measurements are in seconds

    N          MKL             CMMA           SHBI           SHBC
  256    0.0241061       0.0000882      0.0010000      0.0010000
  512    0.0234587       0.0010301      0.0010000      0.0020000
 1024    0.0248575       0.0151657      0.0040000      0.0060000
 2048    0.0352349       0.1201408      0.0150000      0.0250000
 4096    0.1296450       0.9940334      0.1030000      0.1450000
 8192    0.9212241      11.3463882      0.7460000      0.8980000
16384    6.5647681      99.9722107      6.2860000      6.7630000
32768   51.2515874     866.5838490     48.1560000     51.0280000
65536  399.2602713   14912.4903906    383.6749900    393.2439900
Hi Hans,
I have tried the prefetch method; LIBXSMM_PREFETCH_AL1_BL1_CL1 gives me better performance. One question about LIBXSMM: does it depend on MKL? I come across the following errors when I try to compile the code; after adding the flag -mkl, the errors disappear.
lib/libxsmm.a(libxsmm_gemm.o): In function `libxsmm_original_sgemm':
libxsmm_gemm.c:(.text+0xb88): undefined reference to `sgemm_'
lib/libxsmm.a(libxsmm_gemm.o): In function `libxsmm_original_dgemm':
libxsmm_gemm.c:(.text+0xfd8): undefined reference to `dgemm_'
What do you mean by "multiplication kernels are direct kernels and not copy-kernels"? Could you explain more?
And could you also explain more about the 4KB case? I don't think I get it.
Thanks!
Hans P. (Intel) wrote:
One more comment about Intel MKL and small matrix multiplications: it is not true that MKL is bad at small matrix multiplication (as was said at the beginning of this discussion). There has been MKL_DIRECT since v11.0, and there is even a classic batch interface, which, for instance, saves you from writing a parallelized loop when processing a batch (among other optimizations). I haven't tried the latter, but MKL_DIRECT is definitely effective (although I don't know its status on MIC/KNL). In fact, MKL_DIRECT not only means that the multiplication kernels are reached with lower overhead (compared to a full-blown, error-checked [xerbla] BLAS/GEMM call); it also implies omitting any tiling and hence does not rely on copy-kernels (which would just eat up memory bandwidth in the case of small matrices, with no chance of hiding it behind a sufficient amount of compute).
Hi Hans,
I will try MKL_DIRECT and see how it performs on KNL. Could you also share some information about MKL_DIRECT if you happen to have any?
Thanks!
Sergey Kostrov wrote:
Hi Zhen,
>>In my understanding of your opinion, there are two ways to achieve better performance for MM...
>>...
>>One is to unroll completely and prefetch the data manually.
For 16x16 matrices that should have the lowest degree of overhead, but applying software prefetches too aggressively could degrade performance.
>>The other is to block the data, with the block size equal to one cache line. Right?
Yes, something like that, but some time will be spent on data transformation, that is, on rearranging the data in memory. The Intel C++ compiler could do that; take a look at its compiler options.
Also, take into account that an advanced Matrix Multiplication ( MM ) algorithm implemented in pure C can outperform MKL's highly optimized sgemm and dgemm functions. Here are some performance results for two versions of the Strassen matrix multiplication algorithm on a KNL system:
Thanks Sergey for the reply and the information. Is the Strassen MM open source? Or is there an implementation of these advanced algorithms that I can use directly to achieve better performance than MKL?
Thanks!
