Software Archive
Read-only legacy content

Knights Landing Cache Prefetchers question

Zhen
Beginner

Hi

I am working on processing a batch of small matrix multiplications on Knights Landing. Since MKL's performance is not good for small matrices, I use libxsmm. However, VTune shows a lot of cache misses: the L1 cache miss rate is about 10%, and the L2 cache miss rate is about 17%.
It achieves less than 20 GFLOPS on a single thread. I also wrote a sample program to test the performance under an ideal condition (no or very few cache misses); it achieves 50 GFLOPS on a single thread.

The code is:

for (i = 0; i < batch; i++) {
    const float *inA = A + i*Arows*Acols;
    const float *inB = B + i*Bcols*Brows;
    float *oto = out + i*Orows*Ocols;
    libxsmm_smm_16_16_16(inA, inB, oto);
    // libxsmm_smm_16_16_16(A, B, out); // ideal condition: no or very few cache misses
}

In this case, it seems that the hardware prefetchers do not work well. So I am very curious: why can the hardware prefetchers not always prefetch the next matrix? Each matrix has a size of 16x16, so for each GEMM the input matrices and the result fit into the L1 cache. The memory access is consecutive, so the prefetcher should be able to prefetch the data for the next matrix multiplication. However, it apparently does not, otherwise there would not be so many cache misses.

According to the Intel Xeon Phi book, the hardware prefetcher will not stream across a 4KB boundary. Is that the problem?
Does the 4KB boundary mean the page-size boundary? I also used huge pages, i.e., 2M pages, through hugetlbfs. However, huge pages did not help.
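For reference, the 2M-page buffers were mapped roughly like this (a minimal sketch; the mount point and size are placeholders and error checking is omitted):

#include <fcntl.h>
#include <sys/mman.h>

/* "/mnt/huge" is a hugetlbfs mount point; "bytes" must be a multiple of 2M */
int fd = open("/mnt/huge/batchA", O_CREAT | O_RDWR, 0600);
float *A = (float*)mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
/* ... use A ..., then munmap/close/unlink */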

I also checked the assembly code: even though the compiler's software prefetching is enabled, I do not see any prefetch instructions in the assembly. So I think I may need to trigger software prefetching manually.
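If I do it manually, I assume it would look something like adding _mm_prefetch hints for the next matrices before each call (a sketch only; a complete version would prefetch every cache line of the 16x16 operands):

#include <xmmintrin.h> /* _mm_prefetch */

for (i = 0; i < batch; i++) {
    const float *inA = A + i*Arows*Acols;
    const float *inB = B + i*Bcols*Brows;
    float *oto = out + i*Orows*Ocols;
    /* hint the first cache line of each next-iteration operand into L1 */
    _mm_prefetch((const char*)(inA + Arows*Acols), _MM_HINT_T0);
    _mm_prefetch((const char*)(inB + Bcols*Brows), _MM_HINT_T0);
    _mm_prefetch((const char*)(oto + Orows*Ocols), _MM_HINT_T0);
    libxsmm_smm_16_16_16(inA, inB, oto);
}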

Any ideas on how to optimize the program?

 

Thanks!

Best regards,

Zhen

40 Replies
Hans_P_Intel
Employee

Zhen J. wrote:

Hi Hans,

I have tried the prefetch method. LIBXSMM_PREFETCH_AL1_BL1_CL1 gives me better performance. One question about Libxsmm: does it depend on MKL? I come across the following errors when I try to compile the code; after I use the flag -mkl, the errors disappear.

lib/libxsmm.a(libxsmm_gemm.o): In function `libxsmm_original_sgemm':
libxsmm_gemm.c:(.text+0xb88): undefined reference to `sgemm_'
lib/libxsmm.a(libxsmm_gemm.o): In function `libxsmm_original_dgemm':
libxsmm_gemm.c:(.text+0xfd8): undefined reference to `dgemm_'

What do you mean by "multiplication kernels are direct kernels and not copy-kernels"? Could you explain more?
And could you also explain more about the 4KB case? I do not think I get it.

Thanks!

(Sorry for my late reply!) 

The LIBXSMM_PREFETCH_AL1_BL1_CL1 strategy is pretty new and I have not investigated its performance on KNL. What's probably best is to rely on our pseudo-strategy LIBXSMM_PREFETCH_AUTO, which is internally CPUID-dispatched. Right now this strategy maps to LIBXSMM_PREFETCH_SIGONLY (effectively LIBXSMM_PREFETCH_NONE) on BigCore but to something else on KNL. I can update this mapping to LIBXSMM_PREFETCH_AL1_BL1_CL1 for KNL if it is generally beneficial. In any case, you can always request a particular strategy (as you already did).

Regarding the dependency on MKL, there is no such thing. LIBXSMM does *not* depend on Intel MKL, but it falls back to any kind of BLAS/LAPACK you wish (e.g., OpenBLAS or ATLAS). If you know (for sure) that you never perform an unsupported GEMM call, you can *remove* any BLAS/LAPACK dependency entirely. For the latter you may link against libxsmmnoblas, which simply satisfies the symbols that you guessed to be an "MKL dependency". Regarding "unsupported GEMM calls", you can have a look at our Q&A (Alpha, Beta, TransA, and TransB are limited to 1, {1, 0}, 'N', and 'N' respectively); this may change in the future (for the better).

Regarding the "multiplication kernels are direct kernels and not copy-kernels" question: most BLAS/LAPACK libraries focus on sufficiently sized matrices and copy tiles from the (supposedly big) operand matrices into a (thread-)local buffer. This copy operation removes the leading dimension of the source matrices and perhaps ensures proper vector alignment. LIBXSMM, however, uses direct kernels, which embed the leading dimension into the generated (JIT) code and effectively omit any copy operation. The latter is beneficial for small matrix multiplications, where there is no chance to hide the copy operation behind the compute. Remember, compute grows with O(N^3) while the matrix size only grows with O(N^2), which means arithmetic intensity grows linearly with the operand sizes. In turn, this gives a cross-over point for when it is beneficial to stick with direct kernels or to go for copy-kernels. The latter, for instance, avoid certain cache effects with power-of-two leading dimensions.
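For concreteness, back-of-the-envelope numbers for the 16x16 single-precision case discussed in this thread (my own counting, not measured):

flops per 16x16x16 SGEMM:  2 * 16^3            = 8192
bytes streamed (A, B, C):  3 * 16*16 * 4 bytes = 3072
arithmetic intensity:      8192 / 3072         ≈ 2.7 flops/byte

At that intensity, each kernel call is memory bound as soon as the operands have to come from DDR or MCDRAM.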

A few more things...
First, if you're interested in batched multiplication of small matrices, it's completely irrelevant to look at Strassen or such. Strassen, Winograd, and similar transforms are only beneficial for large(r) matrix multiplications, where O(N^3) vs. O(N^2.7xxx) makes an absolute difference. Of course, using Strassen for the tiling approach generates a smaller number of small multiplications. Secondly, unrolling a small multiplication does *not* matter much if you stream one (or more) operands -- it's memory bound, and the effect of unrolling is only measurable when you run the multiplication kernel entirely in the L1/L2 cache. The latter is pretty useless as no useful/new data is processed. Remember, when you need to load/store data in main memory (DDR or MCDRAM) you will see the real difference between good and better implementations.

Zhen
Beginner

Hi Hans,

Thanks very much for the reply. In my case LIBXSMM_PREFETCH_AL1_BL1_CL1 is better than LIBXSMM_PREFETCH_AUTO. Could you also help double-check the way I invoke the function?

int prefetch = LIBXSMM_PREFETCH_AUTO; // or LIBXSMM_PREFETCH_AL1_BL1_CL1
int flags = 0; /* LIBXSMM_FLAGS */
float alpha = 1, beta = 0;
libxsmm_smmfunction xmm = libxsmm_smmdispatch(m/*m*/, n/*n*/, k/*k*/,
        NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/,
        &alpha, &beta, &flags, &prefetch);

xmm(A, B, C, LIBXSMM_PREFETCH_A(A + i*S), LIBXSMM_PREFETCH_B(B + i*S), LIBXSMM_PREFETCH_C(C + i*S));

For the dependency, I only use the code above. I think I do not have BLAS/LAPACK dependency but still need to link libxsmmnoblas.a without mkl.

Thanks again for the explanations.
  
Best regards,
Zhen

 

Hans_P_Intel
Employee

Zhen J. wrote:

Could you also help to double check the way I invoke the function?

int prefetch = LIBXSMM_PREFETCH_AUTO; // or LIBXSMM_PREFETCH_AL1_BL1_CL1
int flags = 0; /* LIBXSMM_FLAGS */
float alpha = 1, beta = 0;
libxsmm_smmfunction xmm = libxsmm_smmdispatch(m/*m*/, n/*n*/, k/*k*/,
        NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/,
        &alpha, &beta, &flags, &prefetch);

xmm(A, B, C, LIBXSMM_PREFETCH_A(A + i*S), LIBXSMM_PREFETCH_B(B + i*S), LIBXSMM_PREFETCH_C(C + i*S));

Hmm, perhaps you can share the enclosing loop (or the relevant part of it)?

To me the call looks wrong. For instance, if A is the current A-matrix it is likely a pointer calculated like A=a_array+i*S but then the prefetch location PA should be something like PA=A+S (or PA=a_array+(i+1)*S if you calculate again using the basepointer "a_array"). I am assuming that "S" is the size of your matrices. This implies that m, n, and k are equal! If not, you may want separate sizes for A, B, and C. Typically, you can calculate these sizes as SA=m*k, SB=k*n, and SC=m*n.
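A small sketch of that indexing for the general case (hypothetical base pointers a_array/b_array/c_array; SA/SB/SC counted in elements):

const int SA = m * k, SB = k * n, SC = m * n;  /* elements per matrix */
const float *A = a_array + i * SA, *B = b_array + i * SB;
float *C = c_array + i * SC;
/* prefetch hints point at the operands of iteration i+1 */
xmm(A, B, C, A + SA, B + SB, C + SC);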

Please also note that you can omit using LIBXSMM_PREFETCH_x; it only elides the enclosed calculation when LIBXSMM is configured statically for a particular prefetch strategy. By default it is LIBXSMM_PREFETCH_AUTO, meaning that the expression given to LIBXSMM_PREFETCH_x is always taken. This macro is somewhat related to LIBXSMM's legacy, where static (non-JIT) code generation was more important (JIT code generation has been in production for a long time and many releases).

Zhen J. wrote:

For the dependency, I only use the code above. I think I do not have BLAS/LAPACK dependency but still need to link libxsmmnoblas.a without mkl.

Yes, if you do not call libxsmm_?gemm functions with e.g., 'T' (or other unsupported things where you rely on fallback code internal to LIBXSMM), you will not need any BLAS/LAPACK as far as LIBXSMM is concerned. Therefore, you may want to link with libxsmmnoblas.a to simply give some dummy-symbols that pretend to be "BLAS" and thereby avoid the unresolved symbols (sgemm_ and dgemm_). On the other hand you can also install the reference BLAS and/or LAPACK library in its static flavor (or OpenBLAS if you like). The latter depends a bit on your Linux distribution (package names). Note, the static flavor often comes with the "dev" packages (which is somewhat weird).

SergeyKostrov
Valued Contributor II
>>...One question about Libxsmm, does it depend on MKL? I come across the following errors when I try to compile the code; after I use the flag -mkl, the errors disappear.

It looks like Yes, and to confirm that you need to review the source code of the Libxsmm library.
SergeyKostrov
Valued Contributor II
>>...Is the Strassen MM open sourced?

No, and it won't be due to some issues. It is part of the Linear Algebra domain of the ScaLib for BDP library and all codes are proprietary.

>>Or is there any implementation of the advanced algorithms, which I can use directly so as to achieve better performance than MKL?

A very tuned implementation of the Classic Matrix Multiplication Algorithm (CMMA) could be considered advanced. I was very surprised to learn that the Libxsmm library uses MKL internally, and there is nothing wrong with that. The only issue is that, based on my R&D, CMMA for very small matrices outperforms MKL's sgemm. So you need to complete a series of tests with CMMA that uses streaming stores, since they boost performance, and the Intel C++ compiler (see options) could do that for you. Once again, MKL is a "heavy player" and it is recommended for larger matrices, that is, significantly greater than 16x16.
SergeyKostrov
Valued Contributor II
>>...Once again, MKL is a "heavy player" and it is recommended to use for larger matrices, that is significantly greater than 16x16.

Please review the test reports in Posts #16 and #17. I didn't do tests for matrices smaller than 256x256, but you can do it using the tests uploaded with the article I mentioned.
Zhen
Beginner

Hi Hans,
Thanks for your quick reply. In the last post, the i*S was a typo; it should be S, not i*S. I wanted to keep it simple but introduced a mistake, sorry.
Here is my code. In this case, the matrices are all 16x16, so m, n, and k are equal.


int prefetch = LIBXSMM_PREFETCH_AUTO; // or LIBXSMM_PREFETCH_AL1_BL1_CL1
int flags = 0; /* LIBXSMM_FLAGS */
float alpha = 1, beta = 0;
libxsmm_smmfunction xmm = libxsmm_smmdispatch(m/*m*/, n/*n*/, k/*k*/,
        NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/,
        &alpha, &beta, &flags, &prefetch);
int S = 16*16;
for (t = 0; t < batch; t++) {
    for (int i = 0; i < 216; i++) {
        A = inputA + t*216*S + i*S;
        B = inputB + i*S;
        C = out + t*216*S + i*S;
        xmm(A, B, C, LIBXSMM_PREFETCH_A(A + S), LIBXSMM_PREFETCH_B(B + S), LIBXSMM_PREFETCH_C(C + S));
    }
}

For the prefetch strategies, does that mean that no matter which one I choose, I can always use A+S, B+S, C+S as the last three operands?

Hans_P_Intel
Employee

Zhen J. wrote:

For the prefetch strategies, does that mean that no matter which one I choose, I can always use A+S, B+S, C+S as the last three operands?

Yes, absolutely. The idea is to allow changing the strategy without changing the call-site. The only minor thing (not a real disadvantage) is that you may calculate a "next location" for some of the operands (e.g., C) even though the chosen prefetch strategy does not use that hint (i.e., a strategy which does not prefetch all of the given hints). In such a case, you waste a few integer operations calculating this "next location". If you run on KNL (a two-wide CPU architecture), only target KNL, and have already found the best strategy, you may gain something by not calculating some of the hints. However, I think the latter is almost artificial.

Of course, your prefetch expressions should actually model the "next location" accurately (this depends on your application). For example, if you had Beta=1 and accumulate into the same C, then the "next C" just stays the same (C+0). Just as a side note, you can also prefetch from "past the end" without fearing an out-of-bounds (OOB) access: it is architecturally guaranteed that prefetching from an invalid memory location will not fault. Such a prefetch can still cost some time (e.g., a page walk on a TLB miss), which should not matter if your batch size is able to amortize it. If you are (conceptually) worried about an access "past the end", you can peel off the last iteration (loop only up to N-1 instead of N) and call the same kernel with some (valid) dummy location outside of the loop (the Nth call). Of course, you can also dispatch a kernel without prefetch and call that one, but that's almost paranoid and not worth the effort.
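A minimal sketch of that peel-off (hypothetical base pointers a_array/b_array/c_array, S elements per matrix, N entries in the batch):

for (int i = 0; i < N - 1; ++i) {
  const float *a = a_array + i*S, *b = b_array + i*S;
  float *c = c_array + i*S;
  xmm(a, b, c, a + S, b + S, c + S); /* hint at the operands of iteration i+1 */
}
/* last entry: reuse the current (valid) locations as dummy prefetch hints */
xmm(a_array + (N-1)*S, b_array + (N-1)*S, c_array + (N-1)*S,
    a_array + (N-1)*S, b_array + (N-1)*S, c_array + (N-1)*S);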

Hans_P_Intel
Employee

Sergey Kostrov wrote:

I was very surprised to learn that the Libxsmm library uses MKL internally and there is nothing wrong with that.

No, LIBXSMM does not "use Intel MKL internally". The statement suggests LIBXSMM's functionality is somehow implemented using MKL, which is not true. LIBXSMM implements small matrix multiplication kernels where Alpha, Beta, TransA, and TransB are limited to {1}, {1, 0}, {'N'}, and {'N'} respectively. Since LIBXSMM (alternatively) offers this functionality with a GEMM compatible interface, it needs to do something about calls that are not supported (fallback). Also, a choice is given to only handle small problem sizes (below a certain threshold) and to fallback for larger problems. Larger problems can be handled by tiling the multiplication and using LIBXSMM's kernels for the tiles, or by falling back to BLAS. Anyhow, the fallback can be any BLAS/LAPACK implementation or more technically anything providing _dgemm or _sgemm symbols. The compatibility with BLAS/LAPACK goes as far as allowing an application to link against LIBXSMM, and to intercept all calls to _dgemm or _sgemm.

Zhen
Beginner

Thanks Hans for the reply. It is very useful for me.

Zhen
Beginner

Sergey Kostrov wrote:

>>...Is the Strassen MM open sourced?

No and it won't be due to some issues. It is a part of a Linear Algebra domain of the ScaLib for BDP library and all codes are proprietary.

>>Or is there any implementation of the advanced algorithms, which I can use directly so as to achieve better performance than MKL

A very tuned implementation of the Classic Matrix multiplication algorithm could be considered as advanced. I was very surprised to learn that the Libxsmm library uses MKL internally and there is nothing wrong with that. The only issue is that based on my R&D CMMA for very small matrices outperforms MKL's sgemm. So, you need to complete a series of tests with CMMA that uses streaming stores since they boost performance and Intel C++ compiler ( see options ) could do that for you.

Once again, MKL is a "heavy player" and it is recommended to use for larger matrices, that is significantly greater than 16x16.

Hi Sergey,

Thanks for the reply. I have read the article. For streaming stores, do you mean that I can only rely on ICC to generate them? I think that in the CMMA code there are no explicit streaming stores. I will benchmark the code using small matrix sizes. It is really exciting that CMMA can outperform MKL, since CMMA is not a very complicated implementation.

Best regards,

Zhen

Hans_P_Intel
Employee

Zhen J. wrote:

I will try MKL_DIRECT and see the performance on KNL. Could you also share some info about MKL_DIRECT if you happen to have any?

You can rely on Intel MKL's reference guide, or more specifically:
https://software.intel.com/en-us/mkl-linux-developer-guide-improving-performance-for-small-size-problems
https://software.intel.com/en-us/mkl-linux-developer-guide-using-mkl-direct-call-in-c-applications

The important part is to actually include the MKL header file, as in the introductory code snippet at the aforementioned URLs; one cannot rely on plain old C and call gemm without a prototype. The reason is that MKL_DIRECT_CALL relies on the preprocessor, which turns any gemm symbol into something else.
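A minimal sketch of what that looks like (assuming column-major 16x16 operands; compile with e.g. -DMKL_DIRECT_CALL and link against MKL):

#include <mkl.h> /* provides the prototype that MKL_DIRECT_CALL redirects */

void gemm_16x16(const float *A, const float *B, float *C)
{
  const MKL_INT n = 16;
  const float alpha = 1.0f, beta = 0.0f;
  /* with MKL_DIRECT_CALL defined, the preprocessor can map this call to the
     small-size fast path; a plain C call without the prototype cannot benefit */
  sgemm("N", "N", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);
}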

Hans_P_Intel
Employee

Zhen J. wrote:

Thanks Hans for the reply. It is very useful for me.

As Sergey also mentioned, NTS (non-temporal streaming stores) can be beneficial. Your case (the code snippet where you call LIBXSMM) indeed looks like NTS makes sense and may improve performance compared to RFO (by the way, LIBXSMM does not currently offer a way to choose NTS over RFO). To make use of NTS, you also need aligned memory; your particular case (16x16 matrices) always ends up with aligned accesses, provided the batch is stored consecutively and the base pointer is aligned. Regardless of whether you get the compiler to emit NTS or not, I recommend aligning your memory buffers to 64 Bytes (512 bits). Aligned memory will help for sure on KNL (but you do not need explicitly aligned load/store instructions).
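A minimal sketch of such an allocation (using posix_memalign; the extent is a placeholder for your batch):

#include <stdlib.h>

float *A = NULL; /* 64-byte aligned buffer for a batch of 16x16 SP matrices */
if (0 != posix_memalign((void**)&A, 64, (size_t)batch * 16 * 16 * sizeof(float))) {
  /* allocation failed */
}
/* ... */
free(A);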

Zhen
Beginner

Hans P. (Intel) wrote:

As Sergey also mentioned, NTS can be beneficial. Your case (code snippet where you call LIBXSMM) looks indeed like NTS makes sense, and may improve performance when compared to RFO (btw, LIBXSMM does not allow at the moment to harvest NTS vs. RFO). To make use of NTS, you also need aligned memory. Moreover, your particular case (16x16 matrices) always ends up with aligned accesses (your batch is consecutively stored). Regardless of getting the compiler to emit NTS or not, I recommend to align your memory buffers to 64 Byte (512 Bit). Aligned memory will help for sure on KNL (but you don't need explicitly aligned load/store instructions).

Hi Hans,

Thanks for the suggestion. For NTS, do you think a compiler flag like opt-streaming-stores would be helpful? Or should I use intrinsics or a pragma to rewrite the gemm?

Best regards,

Zhen

Hans_P_Intel
Employee

Zhen J. wrote:

For NTS, do you think the compiler flag like opt-streaming-stores helpful? Or should I use the intrinsic or pragma to rewrite the gemm.

I recommend relying on #pragma vector nontemporal(dst) and #pragma vector aligned. This gives you much better control (at loop level) and embeds the intent into the source code. Unless your code/loop structure does not vectorize, I do not think it is necessary to go for intrinsics. Relying on the compiler flag gives you at best the granularity of a single translation unit.

Put the aforementioned directives in front of the direct loop that performs the writes:

for (...) {
# pragma vector aligned
# pragma vector nontemporal(dst)
  for (...) {
    dst = ...
  }
}

Please also keep in mind that NTS can be counter-productive on KNL; put differently, you cannot bypass MCDRAM when it is used as a cache (KNL cache mode).

Zhen
Beginner

Hans P. (Intel) wrote:

Put the aforementioned directives in front of the direct loop that performs the writes:

for (...) {
# pragma vector aligned
# pragma vector nontemporal(dst)
  for (...) {
    dst = ...
  }
}

Please also keep in mind that NTS can be counter-productive on KNL, or better said you cannot bypass MCDRAM as a cache (KNL cache mode).

Thanks Hans. You mean the NTS will be stored in MCDRAM, right?

Hans_P_Intel
Employee

Zhen J. wrote:

Thanks Hans. You mean the NTS will be stored in MCDRAM, right?

Yes. Normally, non-temporal streaming stores (NTS) are meant to bypass the cache hierarchy (L1, L2), and one might expect that MCDRAM used as a cache (KNL's "cache mode") would be bypassed as well. However, this is not the case, since it is implemented as a memory-side cache (MSC), which cannot be bypassed by NTS. Therefore, when your KNL system is configured in cache mode, values stored via NTS still end up in that cache, and you will not see lower latency than the regular cache-mode latency.

Hans_P_Intel
Employee

Hello Zhen,

I guess you are checking all options to achieve the best performance for your use case. If it is possible, perhaps you can share some of your findings, performance results, and also tell us a bit more about your application?

Zhen
Beginner

Hans P. (Intel) wrote:

Hello Zhen,

I guess you are checking all options to achieve the best performance for your use case. If it is possible, perhaps you can share some of your findings, performance results, and also tell us a bit more about your application?

Hi Hans,

Thanks very much, and sorry for my late reply. We are working on a batch of GEMMs. I chose small GEMMs because the whole batch can fit in cache, but the small ones may suffer from memory-bandwidth problems. So we are trying larger GEMMs, i.e., not dividing the work into too many small ones.

One lesson I have learned is that prefetching is helpful and can improve performance; I am still working on finding the best way to use it. I would like to share more findings when I have them.

Thanks for your help, and thanks to the others for providing invaluable suggestions.

Best,

Zhen

Hans_P_Intel
Employee

Zhen J. wrote:

Hi Hans,

Thanks very much and sorry for my late reply. We are working on a batch of gemms. I choose small gemms because the whole batch can fit in cache. But the small one may suffer from memory bandwidth problems. So we are trying larger gemms, i.e., not dividing into too many small ones.

One lesson I've learned is that prefetching is helpful and it can improve the performance, but still working on finding a best way to use it. I would like to share more findings when I have.

Thanks for your help and also thanks others for providing invaluable suggestions.

Best,

Zhen

Thank you!
