- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi

I am working on processing a batch of small matrix multiplications on Knights Landing. Since the MKL's performance is not good for small matrix, I use libxsmm. However I find there are a lot of cache misses by using Vtune. The L1 cache miss rate is about 10%, and L2 cache miss rate is about 17%.

The gflops it achieves is less than 20 for single thread. I also write a sample program to test the performance under a ideal condition (no or very few cache miss), it can achieve 50 gflops for single thread.

The code is:

for(i = 0; i < batch; i++){ const float * inA = A + i*Arows*Acols; const float * inB = B + i*Bcols*Brows; float * oto = out + i*Orows*Ocols; libxsmm_smm_16_16_16(inA,inB,oto); // libxsmm_smm_16_16_16(A,B,out); //ideal condition, no or very few cache miss }

In this case, it seems that the hardware prefetchers do not work well. So I am very curious about why the hardware prefetchers can not always prefetch the next matrix? Each matrix has size of 16*16. So for each gemm, the input matrices and the result can fit into L1 cache. The memory access is consecutive, so the prefether should prefetch data for the next matrix multiplication. However, the prefetchers are not, otherwise there should not be so many cache misses.

According to the Intel Xeon Phi Book, the hardware prefetcher will not stream across a 4KB boundary. Is that the problem?

Does the 4KB boundary mean the page size boundary? I also used the huge page, i.e., 2M page, through hugetlbfs. However, huge page is not going to help.

I also check the assembly code, even though the compiler’s software prefetching is enable, I do not see prefetch instruction in the assembly code. So I think I may need to trigger the software prefecther manually.

Any idea to optimize the program?

Thanks！

Best regards,

Zhen

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi Sergey,

Thanks for the reply. I just see your reply in my Half Precision Floats post. My bad, and also thanks very much for that reply. This one is a different problem from that.

In my understanding of your opinion, there are two ways to achieve better performance for MM. One is completely unrolled and manual prefetches data. Another is blocking the data, and the block size should equal to one cache line size. Right?

If the blocking method is used, is that to say the hardware prefetcher will streaming prefecth the data. For the unrolling method, the data will not be prefecthed?

Zhen

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Why do you think that your memory accesses are consecutive?

I suggest you to dump the addresses of those memory accesses to have a better look of the problem.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi Jeremy,

I do a batch gemm. So the addresses of A1 and A2 are consecutive in my code. So do B1 and B2; C1 and C2. C1 = A1*B1.

A, B, C represent matrices.

BTW, could you explain more about how to dump the addresses?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Zhen J. wrote:

Hi Jeremy,

I do a batch gemm. So the addresses of A1 and A2 are consecutive in my code. So do B1 and B2; C1 and C2. C1 = A1*B1.

A, B, C represent matrices.

BTW, could you explain more about how to dump the addresses?

For your case, it would be straightforward to obtain the addresses with a debugger and step through each instruction. (there are only 256 elements in each 16x16 matrix...). For each memory access, you need the instruction address and the memory address. In addition, remember to obtain the addresses of your pointers so that you know which matrix element is being accessed in those memory accesses.

To my understanding, an implementation of hardware prefetcher takes instruction address as one of its input. Therefore, instruction addresses of memory accesses could be a critical information to explain your observation.

Finally, you'd better confirm that whether your code is memory-bound or cpu-bound first.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Thanks Jeremy, that is a good suggestion.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Dear Zhen,

perhaps you can have a look at https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133. Btw, you do not need to use C++ as used in this particular code sample. The mentioned code section does the following:

- Dispatch a matrix kernel, which applies for the entire batch ("xmm", see line #130).
- The kernel in the sample is requested *without* prefetches, but that's easy see libxsmm_dmmdispatch.
- An OpenMP parallelized loop sweeps over the batch of matrices.
- There are several cases of which operands you want to stream: (A,B,C), (A,B), (A,C), (B,C).
- The term "batched matrix multiplication" usually applies to the case (A,B,C).
- The signature of an SMM kernel in LIBXSMM is: kernel(a, b, c).
- However, when you dispatch a kernel with prefetch support, the signature is: kernel(a,b,c, pa,pb,pc).
- With the latter you can do: kernel(a
*, b**, c**, a[i+1], b[i+1], c[i+1]), which prefetches the "next" operands.* - You may request/try different prefetch strategies, see here.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hans P. (Intel) wrote:

perhaps you can have a look at https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133. Btw, you do not need to use C++ as used in this particular code sample. The mentioned code section does the following:

Thanks Hans. I will try. If I use multiple OMP threads to perform small gemms, the bandwidth may become a problem. Like Jeremy suggested, a batch of small gemms for parallel version is memory-bound. Do you have other suggestions if the bandwidth becomes a problem.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Zhen J. wrote:

Quote:Hans P. (Intel)wrote:

perhaps you can have a look at https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133. Btw, you do not need to use C++ as used in this particular code sample. The mentioned code section does the following:

Thanks Hans. I will try. If I use multiple OMP threads to perform small gemms, the bandwidth may become a problem. Like Jeremy suggested, a batch of small gemms for parallel version is memory-bound. Do you have other suggestions if the bandwidth becomes a problem.

I was only thinking about small matrix multiplication within the batch (memory-bound). This has nothing to do with using OpenMP or not. In fact, it is recommended to use multiple threads scattered across the machine in order to harvest the aggregated memory bandwidth. You want at least as many as you need to "visit" all memory controllers. There are two ingredients in LIBXSMM that allow for a higher sustained bandwidth: (1) multiplications kernels are direct kernels and not "copy-kernels", and (2) effectively prefetching the "next" operands while performing the "current" multiplication.

Btw, for some (other) reason I added a new variety of our classic SMM samples, which compares with the Eigen C++ library:

https://github.com/hfp/libxsmm/tree/master/samples/eigen

If you don't mind the C++ focus of this sample code, you may give it a try. You can clone Eigen from the original Mercurial repository, grab a regular release or package, or clone from GitHub (e.g., "git clone https://github.com/hfp/eigen.git"). If you install/clone into your $HOME, you don't even need to say "make EIGENROOT=/path/to/eigen" but simply "make" it. I checked with 2 GB of small matrices of shape 23x23 (an arbitrary choice), and found performance on KNL using LIBXSMM is **3x higher** than using the kernels as inlined via Eigen. This speedup is for the streaming case (A,B,C) which is typically referred as "batch multiplication". For fairness I have to say, one can hard-code the matrix dimensions to make inlining/compiler-codegen even more effective (Eigen has some compile-time shaped matrices). Anyhow, those are somewhat inconvenient compared to the "dynamically sized" matrix class. Last note here: the performance of inlining an implementation written in a high-level language (e.g. C/++) depends on the compiler used whereas LIBXSMM's performance does not depend on the compiler used.

One more comment about Intel MKL and small matrix multiplications: it is not true that MKL is bad with small matrix multiplication (as said at the begin of this discussion). There is MKL_DIRECT since v11.0, and there is even a classic batch interface, which gets you for instance around of writing a parallelized loop when processing a batch (and other optimizations). I haven't tried the latter, but MKL_DIRECT is definitely effective (although I don't know the status on MIC/KNL). In fact, MKL_DIRECT not only indicates that multiplication kernels are accessed with lower overhead (when compared to a full-blown and error-checked [xerbla] BLAS/GEMM call), but implies to omit any tiling and hence does not rely on copy-kernels (which would just eat-up memory bandwidth in case of small matrices with no chance of hiding it behind a sufficient amount of compute).

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

>>> Do you have other suggestions if the bandwidth becomes a problem.

As said, it's already a problem with single core and OpenMP (or whatever kind of TRT) just helps to harvest the aggregated bandwidth of the entire processor/socket. If your 16x16 matrices are batched like (A,B,C), this comes down as low as 1.33 FLOPS/Byte (DP) or 2.67 FLOPS/Byte (SP). This is not exactly an arithmetically intense workload!

FLOPS: 2 * m * n * k = 2 * 16 * 16 * 16 = 8192

Assuming batched matrix multiplications i.e., case (A,B,C):

Bytes (DP): 3 * 16 * 16 * 8 = 6144

Bytes (SP): 3 * 16 * 16 * 4 = 3072

Btw, the above assumes NTS rather than RFO. If you read for ownership it yields 8192 Bytes (DP) and 4096 Bytes (SP), or basically drops arithmetic intensity by another 33%.

>>> shape 23x23 (an arbitrary choice)

It's not too arbitrary but "fair enough" since each operand (matrix) is definitely farther away than 4KB. Btw, these 4KB are not related to the page size, or if you increase the page to 2MB -- you will not bypass this property. You may rather see this value measured in "number of cachelines" (instead of "xKB"). However, at least a single 16x16 matrix fits into 64 cachelines (4KB) such that the "next operand" is nicely in reach (assuming a packed batch).

Another PLUS+ you get with LIBXSMM is that you don't need the index set explicitly; you may compute it somehow (no need to store it upfront).

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

**advanced algorithm**for Matrix Multiplication ( MM ) implemented in pure C language could outperform MKL's sgemm and dgemm highly optimized functions. Here are some performance results for two versions of Strassen Matrix Multiplication algorithm on a KNL system: Abbreviations used in reports: MKL - Math Kernel Library CMMA - Classic Matrix Multiplication Algorithm SHBI - Strassen Heap Based Incomplete SHBC - Strassen Heap Based Complete

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

/////////////////////////////////////////////////////////////////////////////// // 16384 x 16384 - Processing using MCDRAM Strassen HBI Matrix Size : 16384 x 16384 Matrix Size Threshold : 8192 x 8192 Matrix Partitions : 8 Degree of Recursion : 1 Result Sets Reflection: N/A Calculating... Strassen HBI - Pass 01 - Completed: 4.88600 secs Strassen HBI - Pass 02 - Completed: 4.27700 secs Strassen HBI - Pass 05 - Completed: 4.24900 secs Strassen HBI - Pass 03 - Completed: 4.24000 secs Strassen HBI - Pass 04 - Completed: 4.24800 secs ALGORITHM_STRASSENHBI - Passed Strassen HBC Matrix Size : 16384 x 16384 Matrix Size Threshold : 8192 x 8192 Matrix Partitions : 8 Degree of Recursion : 1 Result Sets Reflection: Disabled Calculating... Strassen HBC - Pass 01 - Completed: 5.03100 secs Strassen HBC - Pass 03 - Completed: 4.96100 secs Strassen HBC - Pass 05 - Completed: 4.94200 secs Strassen HBC - Pass 03 - Completed: 4.96200 secs Strassen HBC - Pass 04 - Completed: 4.95400 secs ALGORITHM_STRASSENHBC - 1 - Passed Cblas xGEMM Matrix Size : 16384 x 16384 Matrix Size Threshold : N/A Matrix Partitions : N/A Degree of Recursion : N/A Result Sets Reflection: N/A Calculating... Cblas xGEMM - Pass 01 - Completed: 5.57000 secs Cblas xGEMM - Pass 02 - Completed: 5.61600 secs Cblas xGEMM - Pass 03 - Completed: 5.56600 secs Cblas xGEMM - Pass 04 - Completed: 5.61800 secs Cblas xGEMM - Pass 05 - Completed: 5.60700 secs Cblas xGEMM - Passed

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

/////////////////////////////////////////////////////////////////////////////// // 24576 x 24576 - Processing using MCDRAM Strassen HBI Matrix Size : 24576 x 24576 Matrix Size Threshold : 12288 x 12288 Matrix Partitions : 8 Degree of Recursion : 1 Result Sets Reflection: N/A Calculating... Strassen HBI - Pass 01 - Completed: 11.90400 secs Strassen HBI - Pass 02 - Completed: 11.54500 secs Strassen HBI - Pass 03 - Completed: 11.47500 secs Strassen HBI - Pass 04 - Completed: 11.54300 secs Strassen HBI - Pass 05 - Completed: 11.43400 secs ALGORITHM_STRASSENHBI - Passed Strassen HBC - Not Tested ( doesn't fit into MCDRAM ) Cblas xGEMM Matrix Size : 24576 x 24576 Matrix Size Threshold : N/A Matrix Partitions : N/A Degree of Recursion : N/A Result Sets Reflection: N/A Calculating... Cblas xGEMM - Pass 01 - Completed: 15.26800 secs Cblas xGEMM - Pass 02 - Completed: 15.30900 secs Cblas xGEMM - Pass 03 - Completed: 15.32700 secs Cblas xGEMM - Pass 04 - Completed: 15.26900 secs Cblas xGEMM - Pass 05 - Completed: 15.29700 secs Cblas xGEMM - Passed

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

// Summary of Test Results // KNL Server Modes: Cluster=All2All / MCDRAM=Cache // KMP_AFFINITY = scatter // Number of OpenMP threads = 64 // MAS=DDR4:DDR4:DDR4 // MKL cblas_sgemm vs. CMMA vs. SHBI vs. SHBC // CMMA with LPS=1:1:1 and CS=ij:ik:jk // Measurements are in seconds N MMA Scatter 256 MKL 0.0106379 CMMA 0.0008685 SHBI 0.0010000 SHBC 0.0020000 512 MKL 0.0110626 CMMA 0.0011320 SHBI 0.0020000 SHBC 0.0030000 1024 MKL 0.0124671 CMMA 0.0152496 SHBI 0.0040000 SHBC 0.0060000 2048 MKL 0.0329953 CMMA 0.1201248 SHBI 0.0140000 SHBC 0.0250000 4096 MKL 0.1085401 CMMA 0.9715966 SHBI 0.0730000 SHBC 0.1150000 8192 MKL 0.6367596 CMMA 9.4611956 SHBI 0.6110000 SHBC 0.7640000 16384 MKL 5.9910029 CMMA 64.1694818 SHBI 4.3020000 SHBC 4.9670000 32768 MKL 49.3818010 CMMA 525.0617163 SHBI 41.7650000 SHBC 44.4220000 65536 MKL 372.3268384 CMMA 4459.1382627 SHBI 361.8030100 SHBC 377.5080000

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

// Summary of Test Results // KNL Server Modes: Cluster=All2All / MCDRAM=Flat // KMP_AFFINITY = scatter // Number of OpenMP threads = 64 // MAS=DDR4:DDR4:DDR4 // MKL cblas_sgemm vs. CMMA vs. SHBI vs. SHBC // CMMA with LPS=1:1:1 and CS=ij:ik:jk // Measurements are in seconds N MMA Scatter 256 MKL 0.0241061 CMMA 0.0000882 SHBI 0.0010000 SHBC 0.0010000 512 MKL 0.0234587 CMMA 0.0010301 SHBI 0.0010000 SHBC 0.0020000 1024 MKL 0.0248575 CMMA 0.0151657 SHBI 0.0040000 SHBC 0.0060000 2048 MKL 0.0352349 CMMA 0.1201408 SHBI 0.0150000 SHBC 0.0250000 4096 MKL 0.1296450 CMMA 0.9940334 SHBI 0.1030000 SHBC 0.1450000 8192 MKL 0.9212241 CMMA 11.3463882 SHBI 0.7460000 SHBC 0.8980000 16384 MKL 6.5647681 CMMA 99.9722107 SHBI 6.2860000 SHBC 6.7630000 32768 MKL 51.2515874 CMMA 866.5838490 SHBI 48.1560000 SHBC 51.0280000 65536 MKL 399.2602713 CMMA 14912.4903906 SHBI 383.6749900 SHBC 393.2439900

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

**Performance of Classic Matrix Multiplication algorithm on a KNL system**article https://software.intel.com/en-us/articles/performance-of-classic-matrix-multiplication-algorithm-on-...

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi Hans,

I have tried the prefetch method. LIBXSMM_PREFETCH_AL1_BL1_CL1 gives me better performance. One question about Libxsmm, does it depends on MKL? I come across the following errors when I try to compile the code, after I using the flag -mkl, the errors disappear.

lib/libxsmm.a(libxsmm_gemm.o): In function `libxsmm_original_sgemm':

libxsmm_gemm.c:(.text+0xb88): undefined reference to `sgemm_'

lib/libxsmm.a(libxsmm_gemm.o): In function `libxsmm_original_dgemm':

libxsmm_gemm.c:(.text+0xfd8): undefined reference to `dgemm_'

What do you mean: "multiplications kernels are direct kernels and not copy-kernels", could you explain more?

And could you also explain more about the 4KB case. I do not think I get it.

Thanks!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hans P. (Intel) wrote:

One more comment about Intel MKL and small matrix multiplications: it is not true that MKL is bad with small matrix multiplication (as said at the begin of this discussion). There is MKL_DIRECT since v11.0, and there is even a classic batch interface, which gets you for instance around of writing a parallelized loop when processing a batch (and other optimizations). I haven't tried the latter, but MKL_DIRECT is definitely effective (although I don't know the status on MIC/KNL). In fact, MKL_DIRECT not only indicates that multiplication kernels are accessed with lower overhead (when compared to a full-blown and error-checked [xerbla] BLAS/GEMM call), but implies to omit any tiling and hence does not rely on copy-kernels (which would just eat-up memory bandwidth in case of small matrices with no chance of hiding it behind a sufficient amount of compute).

Hi Hans,

I will try MKL_DIRECT and see the performance on KNL. Could you also share some info about MKL_DIRECT if you happen to have.

Thanks!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Sergey Kostrov wrote:

Hi Zhen,

>>In my understanding of your opinion, there are two ways to achieve better performance for MM...

>>...

>>One is completely unrolled and manual prefetches data.For 16x16 matrices it should have a lowest degree of overheads. But too excessive application of software prefetches could degrade performance.

>>Another is blocking the data, and the block size should equal to one cache line size. Right?

Yes, something like that but some time will be spent on data mining, that is on transformation of data in memory. Intel C++ compiler could do that and take a look at compiler options.

Also, take into account that

advanced algorithmfor Matrix Multiplication ( MM ) implemented in pure C language could outperform MKL's sgemm and dgemm highly optimized functions. Here are some performance results for two versions of Strassen Matrix Multiplication algorithm on a KNL system:

Thanks Sergey for the reply and the info shared. Is the Strassen MM open sourced? Or is there any implementation of the advanced algorithms, which I can use directly so as to achieve better performance than MKL.

Thanks!

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page