
Knights Landing Cache Prefetchers question


I am working on processing a batch of small matrix multiplications on Knights Landing. Since MKL's performance is poor for small matrices, I use libxsmm. However, VTune shows a lot of cache misses: the L1 miss rate is about 10% and the L2 miss rate is about 17%.
The code achieves less than 20 GFLOPS on a single thread. I also wrote a sample program to test performance under an ideal condition (no, or very few, cache misses); it achieves 50 GFLOPS on a single thread.

The code is:

 for (i = 0; i < batch; i++) {
         const float *inA = A + i * Arows * Acols;
         const float *inB = B + i * Bcols * Brows;
         float *oto = out + i * Orows * Ocols;
         libxsmm_smm_16_16_16(inA, inB, oto);
         // libxsmm_smm_16_16_16(A, B, out); // ideal condition: same operands every iteration, so no or very few cache misses
 }

In this case, it seems the hardware prefetchers do not work well, so I am curious why they cannot always prefetch the next matrix. Each matrix is 16x16, so for each GEMM the two inputs and the result fit in the L1 cache. The memory access pattern is consecutive, so the prefetcher should be fetching the data for the next matrix multiplication. Apparently it is not, otherwise there would not be so many cache misses.

According to the Intel Xeon Phi book, the hardware prefetcher will not stream across a 4 KB boundary. Is that the problem?
Does the 4 KB boundary mean the page-size boundary? I also tried huge pages (2 MB pages) through hugetlbfs, but they did not help.

I also checked the assembly code: even though the compiler's software prefetching is enabled, I do not see any prefetch instructions. So I think I may need to issue the software prefetches manually.

Any ideas for optimizing the program?



Best regards,



Zhen J. wrote:

Hi Hans,

Thanks very much, and sorry for my late reply. We are working on a batch of GEMMs. I chose small GEMMs because the whole batch can fit in cache, but the small ones may become memory-bandwidth bound. So we are now trying larger GEMMs, i.e., not dividing the work into too many small ones.

One lesson I have learned is that prefetching is helpful and can improve performance, but I am still working out the best way to use it. I will share more findings when I have them.

Thanks for your help, and thanks to the others for their invaluable suggestions.



Thank you!
