Hi,
I am working on processing a batch of small matrix multiplications on Knights Landing. Since MKL's performance is not good for small matrices, I use LIBXSMM. However, VTune shows a lot of cache misses: the L1 cache miss rate is about 10%, and the L2 cache miss rate is about 17%.
A single thread achieves less than 20 GFLOPS. I also wrote a sample program to test the performance under an ideal condition (no or very few cache misses); it achieves 50 GFLOPS on a single thread.
The code is:
for (i = 0; i < batch; i++) {
    const float *inA = A + i * Arows * Acols;
    const float *inB = B + i * Bcols * Brows;
    float *oto = out + i * Orows * Ocols;
    libxsmm_smm_16_16_16(inA, inB, oto);
    // libxsmm_smm_16_16_16(A, B, out); // ideal condition: no or very few cache misses
}
In this case it seems that the hardware prefetchers do not work well, so I am very curious why they cannot simply keep prefetching the next matrix. Each matrix is 16x16, so for each GEMM the input matrices and the result fit into the L1 cache. The memory accesses are consecutive, so the prefetcher should be able to fetch the data for the next matrix multiplication ahead of time. Apparently it does not; otherwise there would not be so many cache misses.
According to the Intel Xeon Phi book, the hardware prefetcher will not stream across a 4KB boundary. Is that the problem? Does the 4KB boundary mean the page-size boundary? I also tried huge pages (2MB pages via hugetlbfs), but they do not help.
I also checked the assembly code: even though the compiler's software prefetching is enabled, I do not see any prefetch instructions in the assembly. So I think I may need to issue the software prefetches manually.
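Something like the following minimal sketch is what I have in mind (untested; it reuses the variables from the loop above, only touches the first cache line of each "next" operand, and the prefetch distance and hints would need tuning):

#include <xmmintrin.h> /* _mm_prefetch */

for (i = 0; i < batch; i++) {
    const float *inA = A + i * Arows * Acols;
    const float *inB = B + i * Bcols * Brows;
    float *oto = out + i * Orows * Ocols;
    if (i + 1 < batch) {
        /* software-prefetch the beginning of the next matrices in the batch */
        _mm_prefetch((const char*)(inA + Arows * Acols), _MM_HINT_T0);
        _mm_prefetch((const char*)(inB + Bcols * Brows), _MM_HINT_T0);
        _mm_prefetch((const char*)(oto + Orows * Ocols), _MM_HINT_T0);
    }
    libxsmm_smm_16_16_16(inA, inB, oto);
}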
Any ideas on how to optimize the program?
Thanks!
Best regards,
Zhen
Hi Sergey,
Thanks for the reply. I just saw your reply in my Half Precision Floats post; my bad, and thanks very much for that reply as well. This one is a different problem from that one.
If I understand your opinion correctly, there are two ways to achieve better performance for MM. One is to unroll completely and prefetch the data manually. The other is to block the data, with the block size equal to one cache-line size. Right?
If the blocking method is used, does that mean the hardware prefetcher will stream-prefetch the data, whereas with the unrolling method the data will not be prefetched?
Zhen
Why do you think that your memory accesses are consecutive?
I suggest you dump the addresses of those memory accesses to get a better look at the problem.
Hi Jeremy,
I do a batched GEMM, so the addresses of A1 and A2 are consecutive in my code, and the same holds for B1 and B2, and for C1 and C2 (C1 = A1*B1, where A, B, and C denote matrices).
BTW, could you explain more about how to dump the addresses?
Zhen J. wrote:
Hi Jeremy,
I do a batched GEMM, so the addresses of A1 and A2 are consecutive in my code, and the same holds for B1 and B2, and for C1 and C2 (C1 = A1*B1, where A, B, and C denote matrices).
BTW, could you explain more about how to dump the addresses?
For your case, it would be straightforward to obtain the addresses with a debugger and step through each instruction (there are only 256 elements in each 16x16 matrix). For each memory access you need the instruction address and the memory address. In addition, remember to record the values of your pointers so that you know which matrix element each memory access touches.
To my understanding, a hardware prefetcher implementation takes the instruction address as one of its inputs. Therefore, the instruction addresses of the memory accesses could be critical information for explaining your observation.
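If stepping through in a debugger is too tedious, a rough sketch like the following (reusing the loop and variable names from your first post) would at least dump the data addresses touched in each iteration:

#include <stdio.h>

for (i = 0; i < batch; i++) {
    const float *inA = A + i * Arows * Acols;
    const float *inB = B + i * Bcols * Brows;
    float *oto = out + i * Orows * Ocols;
    /* print the base address of each operand for this batch entry */
    printf("iter %d: A=%p B=%p C=%p\n", i, (const void*)inA, (const void*)inB, (const void*)oto);
    libxsmm_smm_16_16_16(inA, inB, oto);
}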
Finally, you had better first confirm whether your code is memory-bound or compute-bound.
Thanks Jeremy, that is a good suggestion.
Dear Zhen,
Perhaps you can have a look at https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133. Btw, you do not need to use C++ as used in this particular code sample. The mentioned code section does the following:
- Dispatch a matrix kernel that applies to the entire batch ("xmm", see line #130).
- The kernel in the sample is requested *without* prefetches, but enabling them is easy; see libxsmm_dmmdispatch.
- An OpenMP-parallelized loop sweeps over the batch of matrices.
- There are several cases depending on which operands you want to stream: (A,B,C), (A,B), (A,C), (B,C).
- The term "batched matrix multiplication" usually applies to the case (A,B,C).
- The signature of an SMM kernel in LIBXSMM is: kernel(a, b, c).
- However, when you dispatch a kernel with prefetch support, the signature is: kernel(a, b, c, pa, pb, pc).
- With the latter you can do: kernel(a, b, c, a[i+1], b[i+1], c[i+1]), which prefetches the "next" operands (see the sketch right after this list).
- You may request/try different prefetch strategies; see the LIBXSMM documentation.
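In plain C, the dispatch-and-call pattern looks roughly like the sketch below (single precision, 16x16x16, packed batch; NULL arguments select the defaults). This is only an illustration: please check the exact dispatch signature, kernel type, and prefetch enumerators against the LIBXSMM version you use.

#include <libxsmm.h>

void batch_multiply(const float* a, const float* b, float* c, int batch)
{
    const libxsmm_blasint m = 16, n = 16, k = 16;
    /* one possible strategy; LIBXSMM offers several prefetch enumerators */
    const int prefetch = LIBXSMM_PREFETCH_AL1_BL1_CL1;
    /* NULL picks defaults for lda/ldb/ldc, alpha/beta, and flags */
    const libxsmm_smmfunction kernel = libxsmm_smmdispatch(m, n, k,
        NULL, NULL, NULL, NULL, NULL, NULL, &prefetch);
    int i;
    for (i = 0; i < batch; ++i) {
        const int j = (i + 1 < batch ? i + 1 : i); /* "next" operands to prefetch */
        kernel(a + i * m * k, b + i * k * n, c + i * m * n,
               a + j * m * k, b + j * k * n, c + j * m * n);
    }
}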
Hans P. (Intel) wrote:
Perhaps you can have a look at https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133. Btw, you do not need to use C++ as used in this particular code sample. The mentioned code section does the following:
Thanks Hans, I will try it. If I use multiple OMP threads to perform the small GEMMs, bandwidth may become a problem. As Jeremy suggested, a batch of small GEMMs in the parallel version is memory-bound. Do you have other suggestions if bandwidth becomes a problem?
Zhen J. wrote:
Quote:
Hans P. (Intel) wrote:
Perhaps you can have a look at https://github.com/hfp/libxsmm/blob/master/samples/smm/specialized.cpp#L133. Btw, you do not need to use C++ as used in this particular code sample. The mentioned code section does the following:
Thanks Hans, I will try it. If I use multiple OMP threads to perform the small GEMMs, bandwidth may become a problem. As Jeremy suggested, a batch of small GEMMs in the parallel version is memory-bound. Do you have other suggestions if bandwidth becomes a problem?
I was only thinking about the small matrix multiplications within the batch (memory-bound). This has nothing to do with using OpenMP or not. In fact, it is recommended to use multiple threads scattered across the machine in order to harvest the aggregate memory bandwidth; you want at least as many threads as are needed to "visit" all memory controllers. There are two ingredients in LIBXSMM that allow for a higher sustained bandwidth: (1) the multiplication kernels are direct kernels rather than "copy-kernels", and (2) the "next" operands are effectively prefetched while the "current" multiplication is performed.
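As a sketch (reusing kernel, a, b, c, and batch from the earlier dispatch example, so the same caveats about the exact LIBXSMM signatures apply), the parallel sweep over the batch is simply:

/* scattering threads across the chip, e.g. KMP_AFFINITY=scatter or
   OMP_PROC_BIND=spread, helps to reach all memory controllers */
#pragma omp parallel for
for (int i = 0; i < batch; ++i) {
    const int j = (i + 1 < batch ? i + 1 : i);
    kernel(a + i * 16 * 16, b + i * 16 * 16, c + i * 16 * 16,  /* current operands  */
           a + j * 16 * 16, b + j * 16 * 16, c + j * 16 * 16); /* "next" (prefetch) */
}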
Btw, for some (other) reason I added a new variety of our classic SMM samples, which compares with the Eigen C++ library:
https://github.com/hfp/libxsmm/tree/master/samples/eigen
If you don't mind the C++ focus of this sample code, you may give it a try. You can clone Eigen from the original Mercurial repository, grab a regular release or package, or clone from GitHub (e.g., "git clone https://github.com/hfp/eigen.git"). If you install/clone into your $HOME, you don't even need to say "make EIGENROOT=/path/to/eigen"; a plain "make" is enough. I checked with 2 GB of small matrices of shape 23x23 (an arbitrary choice) and found that performance on KNL using LIBXSMM is 3x higher than with the kernels inlined via Eigen. This speedup is for the streaming case (A,B,C), which is typically referred to as "batch multiplication". For fairness I have to say that one can hard-code the matrix dimensions to make inlining/compiler code generation even more effective (Eigen has some compile-time-shaped matrices), although those are somewhat inconvenient compared to the "dynamically sized" matrix class. Last note here: the performance of inlining an implementation written in a high-level language (e.g., C/C++) depends on the compiler used, whereas LIBXSMM's performance does not.
One more comment about Intel MKL and small matrix multiplications: it is not true that MKL is bad at small matrix multiplications (as said at the beginning of this discussion). There is MKL_DIRECT since v11.0, and there is even a classic batch interface, which for instance saves you from writing a parallelized loop when processing a batch (among other optimizations). I haven't tried the latter, but MKL_DIRECT is definitely effective (although I don't know its status on MIC/KNL). In fact, MKL_DIRECT not only means that the multiplication kernels are reached with lower overhead (compared to a full-blown, error-checked [xerbla] BLAS/GEMM call), it also implies omitting any tiling and hence does not rely on copy-kernels (which would just eat up memory bandwidth in the case of small matrices, with no chance of hiding it behind a sufficient amount of compute).
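As a rough illustration (not a complete program; please consult the MKL documentation for details and for the batch interface), MKL_DIRECT is enabled with a preprocessor define, after which the usual GEMM call is routed to a low-overhead direct kernel for small sizes:

/* define before including mkl.h, or compile with -DMKL_DIRECT_CALL_SEQ
   (MKL_DIRECT_CALL is the variant for threaded MKL) */
#define MKL_DIRECT_CALL_SEQ
#include <mkl.h>

void small_sgemm_16(const float* a, const float* b, float* c)
{
    const MKL_INT n = 16;
    const float alpha = 1.0f, beta = 1.0f;
    /* column-major C += A * B for a single 16x16x16 multiplication */
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, alpha, a, n, b, n, beta, c, n);
}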
>>> Do you have other suggestions if bandwidth becomes a problem?
As said, it is already a problem with a single core, and OpenMP (or whatever threading runtime) just helps to harvest the aggregate bandwidth of the entire processor/socket. If your 16x16 matrices are batched as (A,B,C), this comes down to as little as 1.33 FLOPS/Byte (DP) or 2.67 FLOPS/Byte (SP). This is not exactly an arithmetically intense workload!
FLOPS: 2 * m * n * k = 2 * 16 * 16 * 16 = 8192
Assuming batched matrix multiplications i.e., case (A,B,C):
Bytes (DP): 3 * 16 * 16 * 8 = 6144
Bytes (SP): 3 * 16 * 16 * 4 = 3072
Btw, the above assumes non-temporal stores (NTS) rather than read-for-ownership (RFO). If C is read for ownership, the traffic grows to 8192 Bytes (DP) and 4096 Bytes (SP), which lowers the arithmetic intensity further, to 1.0 and 2.0 FLOPS/Byte respectively.
>>> shape 23x23 (an arbitrary choice)
It's not too arbitrary but "fair enough", since each operand (matrix) is then definitely farther away than 4KB. Btw, these 4KB are not related to the page size, and increasing the page size to 2MB will not bypass this property; you may rather think of this value as a number of cache lines (instead of "x KB"). However, a single 16x16 matrix easily fits within those 64 cache lines (4KB), so the "next" operand is nicely in reach (assuming a packed batch).
Another plus you get with LIBXSMM is that you do not need the index set explicitly; you may compute it on the fly (no need to store it upfront).
///////////////////////////////////////////////////////////////////////////////
// 16384 x 16384 - Processing using MCDRAM

Strassen HBI
Matrix Size           : 16384 x 16384
Matrix Size Threshold : 8192 x 8192
Matrix Partitions     : 8
Degree of Recursion   : 1
Result Sets Reflection: N/A
Calculating...
Strassen HBI - Pass 01 - Completed: 4.88600 secs
Strassen HBI - Pass 02 - Completed: 4.27700 secs
Strassen HBI - Pass 05 - Completed: 4.24900 secs
Strassen HBI - Pass 03 - Completed: 4.24000 secs
Strassen HBI - Pass 04 - Completed: 4.24800 secs
ALGORITHM_STRASSENHBI - Passed

Strassen HBC
Matrix Size           : 16384 x 16384
Matrix Size Threshold : 8192 x 8192
Matrix Partitions     : 8
Degree of Recursion   : 1
Result Sets Reflection: Disabled
Calculating...
Strassen HBC - Pass 01 - Completed: 5.03100 secs
Strassen HBC - Pass 03 - Completed: 4.96100 secs
Strassen HBC - Pass 05 - Completed: 4.94200 secs
Strassen HBC - Pass 03 - Completed: 4.96200 secs
Strassen HBC - Pass 04 - Completed: 4.95400 secs
ALGORITHM_STRASSENHBC - 1 - Passed

Cblas xGEMM
Matrix Size           : 16384 x 16384
Matrix Size Threshold : N/A
Matrix Partitions     : N/A
Degree of Recursion   : N/A
Result Sets Reflection: N/A
Calculating...
Cblas xGEMM - Pass 01 - Completed: 5.57000 secs
Cblas xGEMM - Pass 02 - Completed: 5.61600 secs
Cblas xGEMM - Pass 03 - Completed: 5.56600 secs
Cblas xGEMM - Pass 04 - Completed: 5.61800 secs
Cblas xGEMM - Pass 05 - Completed: 5.60700 secs
Cblas xGEMM - Passed
///////////////////////////////////////////////////////////////////////////////
// 24576 x 24576 - Processing using MCDRAM

Strassen HBI
Matrix Size           : 24576 x 24576
Matrix Size Threshold : 12288 x 12288
Matrix Partitions     : 8
Degree of Recursion   : 1
Result Sets Reflection: N/A
Calculating...
Strassen HBI - Pass 01 - Completed: 11.90400 secs
Strassen HBI - Pass 02 - Completed: 11.54500 secs
Strassen HBI - Pass 03 - Completed: 11.47500 secs
Strassen HBI - Pass 04 - Completed: 11.54300 secs
Strassen HBI - Pass 05 - Completed: 11.43400 secs
ALGORITHM_STRASSENHBI - Passed

Strassen HBC - Not Tested ( doesn't fit into MCDRAM )

Cblas xGEMM
Matrix Size           : 24576 x 24576
Matrix Size Threshold : N/A
Matrix Partitions     : N/A
Degree of Recursion   : N/A
Result Sets Reflection: N/A
Calculating...
Cblas xGEMM - Pass 01 - Completed: 15.26800 secs
Cblas xGEMM - Pass 02 - Completed: 15.30900 secs
Cblas xGEMM - Pass 03 - Completed: 15.32700 secs
Cblas xGEMM - Pass 04 - Completed: 15.26900 secs
Cblas xGEMM - Pass 05 - Completed: 15.29700 secs
Cblas xGEMM - Passed
// Summary of Test Results
// KNL Server Modes: Cluster=All2All / MCDRAM=Cache
// KMP_AFFINITY = scatter
// Number of OpenMP threads = 64
// MAS=DDR4:DDR4:DDR4
// MKL cblas_sgemm vs. CMMA vs. SHBI vs. SHBC
// CMMA with LPS=1:1:1 and CS=ij:ik:jk
// Measurements are in seconds
    N          MKL            CMMA           SHBI          SHBC
  256     0.0106379      0.0008685      0.0010000     0.0020000
  512     0.0110626      0.0011320      0.0020000     0.0030000
 1024     0.0124671      0.0152496      0.0040000     0.0060000
 2048     0.0329953      0.1201248      0.0140000     0.0250000
 4096     0.1085401      0.9715966      0.0730000     0.1150000
 8192     0.6367596      9.4611956      0.6110000     0.7640000
16384     5.9910029     64.1694818      4.3020000     4.9670000
32768    49.3818010    525.0617163     41.7650000    44.4220000
65536   372.3268384   4459.1382627    361.8030100   377.5080000
// Summary of Test Results
// KNL Server Modes: Cluster=All2All / MCDRAM=Flat
// KMP_AFFINITY = scatter
// Number of OpenMP threads = 64
// MAS=DDR4:DDR4:DDR4
// MKL cblas_sgemm vs. CMMA vs. SHBI vs. SHBC
// CMMA with LPS=1:1:1 and CS=ij:ik:jk
// Measurements are in seconds
    N          MKL            CMMA           SHBI          SHBC
  256     0.0241061      0.0000882      0.0010000     0.0010000
  512     0.0234587      0.0010301      0.0010000     0.0020000
 1024     0.0248575      0.0151657      0.0040000     0.0060000
 2048     0.0352349      0.1201408      0.0150000     0.0250000
 4096     0.1296450      0.9940334      0.1030000     0.1450000
 8192     0.9212241     11.3463882      0.7460000     0.8980000
16384     6.5647681     99.9722107      6.2860000     6.7630000
32768    51.2515874    866.5838490     48.1560000    51.0280000
65536   399.2602713  14912.4903906    383.6749900   393.2439900
Hi Hans,
I have tried the prefetch method; LIBXSMM_PREFETCH_AL1_BL1_CL1 gives me better performance. One question about LIBXSMM: does it depend on MKL? I came across the following errors when compiling my code; after adding the flag -mkl, the errors disappear.
lib/libxsmm.a(libxsmm_gemm.o): In function `libxsmm_original_sgemm':
libxsmm_gemm.c:(.text+0xb88): undefined reference to `sgemm_'
lib/libxsmm.a(libxsmm_gemm.o): In function `libxsmm_original_dgemm':
libxsmm_gemm.c:(.text+0xfd8): undefined reference to `dgemm_'
What do you mean by "multiplication kernels are direct kernels and not copy-kernels"? Could you explain a bit more?
And could you also explain more about the 4KB case? I do not think I get it.
Thanks!
Hans P. (Intel) wrote:
One more comment about Intel MKL and small matrix multiplications: it is not true that MKL is bad at small matrix multiplications (as said at the beginning of this discussion). There is MKL_DIRECT since v11.0, and there is even a classic batch interface, which for instance saves you from writing a parallelized loop when processing a batch (among other optimizations). I haven't tried the latter, but MKL_DIRECT is definitely effective (although I don't know its status on MIC/KNL). In fact, MKL_DIRECT not only means that the multiplication kernels are reached with lower overhead (compared to a full-blown, error-checked [xerbla] BLAS/GEMM call), it also implies omitting any tiling and hence does not rely on copy-kernels (which would just eat up memory bandwidth in the case of small matrices, with no chance of hiding it behind a sufficient amount of compute).
Hi Hans,
I will try MKL_DIRECT and see how it performs on KNL. Could you also share some info about MKL_DIRECT if you happen to have any?
Thanks!
Sergey Kostrov wrote:
Hi Zhen,
>>If I understand your opinion correctly, there are two ways to achieve better performance for MM...
>>...
>>One is to unroll completely and prefetch the data manually.
For 16x16 matrices this should have the lowest overhead, but applying software prefetches too aggressively could degrade performance.
>>The other is to block the data, with the block size equal to one cache-line size. Right?
Yes, something like that, but some time will be spent on data movement, that is, on transforming the data in memory. The Intel C++ compiler can do that; take a look at the compiler options.
Also, take into account that an advanced Matrix Multiplication (MM) algorithm implemented in pure C could outperform MKL's highly optimized sgemm and dgemm functions. Here are some performance results for two versions of the Strassen Matrix Multiplication algorithm on a KNL system:
Thanks Sergey for the reply and the info you shared. Is the Strassen MM open source? Or is there an implementation of these advanced algorithms that I could use directly to achieve better performance than MKL?
Thanks!