MKL cannot reach the expected benchmark performance on MIC

Wei_W_2 · ‎06-26-2013

I am trying to use Cholesky factorization on the Intel MIC, but I am not able to get the expected performance.

This is how I run the code:

[root@bunsen-mic0 /tmp]# env USE_2MB_BUFFERS=3000 MKL_NUM_THREADS=240 KMP_AFFINITY=proclist=[1-240],granularity=fine,explicit ./testing_native_dpotrf -N 9600 -L 5
time 1.201130, gflops 245.567177 >>>>>>>>> this is warm up
time 0.865080, gflops 340.960515
time 0.865288, gflops 340.878499
time 0.864819, gflops 341.063349
time 0.864337, gflops 341.253577
time 0.863623, gflops 341.535639

The correct performance should be about 500 GFLOPS when the matrix size is 9600 x 9600.
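As a sanity check on the figures above, here is a minimal sketch of how the GFLOPS column can be reproduced from the timings. It assumes the test program uses the standard LAPACK operation count for Cholesky, n^3/3 + n^2/2 + n/6; the attached source is not shown in the thread, so that formula and the helper names `dpotrf_flops` and `dpotrf_gflops` are assumptions made for illustration.

```c
#include <assert.h>

/* Standard LAPACK operation count for an n-by-n Cholesky factorization:
 * n^3/3 + n^2/2 + n/6.
 * Assumption: the benchmark uses this count; its source is not shown. */
static double dpotrf_flops(double n)
{
    return n * n * n / 3.0 + n * n / 2.0 + n / 6.0;
}

/* Convert a measured wall-clock time in seconds into GFLOPS. */
static double dpotrf_gflops(double n, double seconds)
{
    assert(seconds > 0.0);
    return dpotrf_flops(n) / seconds / 1e9;
}
```

With N = 9600 and a time of 0.865080 s this gives roughly 341 GFLOPS, which agrees with the log, so the reported numbers look internally consistent. The run is reaching about 68% of the expected 500 GFLOPS.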

Here are the strange results when I use only 1 core (4 threads):

[root@bunsen-mic0 /tmp]# env USE_2MB_BUFFERS=3000 MKL_NUM_THREADS=4 KMP_AFFINITY=proclist=[1-4],granularity=fine,explicit ./testing_native_dpotrf -N 9600 -L 5
time 0.902745, gflops 326.734658
time 0.871131, gflops 338.592037
time 0.870778, gflops 338.729428
time 0.868808, gflops 339.497416
time 0.866140, gflops 340.543143
time 0.864064, gflops 341.361391

This is clearly not correct: 4 threads give the same performance as 240, so it looks like the program is messing up the core assignment. Can anyone help me figure out where the problem is?

My code is attached. There is really nothing in it; it just calls LAPACKE_dpotrf.

BTW, this is how I compile my code:

testing_native_dpotrf: testing_native_dpotrf.c
icc -O3 -mmic -mkl $< -o $@

Wei_W_2 · ‎06-26-2013

I also checked the dgetrf routine in MKL, and I see the same problem:

[root@bunsen-mic0 /tmp]# env USE_2MB_BUFFERS=3000 OMP_NUM_THREADS=240 KMP_AFFINITY=proclist=[1-240],granularity=fine,explicit ./testing_native_dgetrf -N 9600 -L 5
time 3.637088, gflops 162.156626
time 2.682708, gflops 219.844248
time 2.696129, gflops 218.749905
time 2.664388, gflops 221.355876
time 2.685536, gflops 219.612731
time 2.702955, gflops 218.197464

According to the MKL website, http://software.intel.com/en-us/intel-mkl, the benchmark should not be that slow.

I use the same core mapping and have no problem with MKL BLAS, but I run into this trouble when I use MKL LAPACK.

Zhang_Z_Intel · ‎07-12-2013

Hello,

I've seen similar problems before. It's probably a data alignment issue. Cholesky and LU performance is very sensitive to alignment. Please try aligning your memory on 64-byte boundaries. For example, you can use the mkl_malloc() routine to allocate your memory and pass 64 as the alignment. You also need huge pages (2MB pages) for memory allocation. See this article for instructions.
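A minimal sketch of the aligned allocation suggested above. `mkl_malloc(size, alignment)` and `mkl_free()` are the MKL routines. The `USE_MKL` build flag, the helper names, and the `posix_memalign` fallback are hypothetical additions so the sketch also builds on a machine without MKL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#ifdef USE_MKL            /* hypothetical build flag, not an MKL macro */
#include <mkl.h>
#endif

/* Allocate an n-by-n matrix of doubles on a 64-byte boundary.
 * Returns NULL if the allocation fails. */
static double *alloc_matrix_64(size_t n)
{
    assert(n > 0);
#ifdef USE_MKL
    return (double *)mkl_malloc(n * n * sizeof(double), 64);
#else
    /* Portable stand-in so the sketch compiles without MKL. */
    void *p = NULL;
    if (posix_memalign(&p, 64, n * n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;
#endif
}

/* Release a matrix obtained from alloc_matrix_64(). */
static void free_matrix_64(double *a)
{
#ifdef USE_MKL
    mkl_free(a);
#else
    free(a);
#endif
}

/* Returns 1 if the pointer sits on a 64-byte boundary, 0 otherwise. */
static int is_aligned_64(const void *p)
{
    return ((uintptr_t)p % 64) == 0;
}
```

The returned pointer would then be passed to the factorization, for example `LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, a, n)`, in place of whatever buffer the test program currently allocates. Checking `is_aligned_64(a)` on the existing buffer is a quick way to see whether alignment is likely to be the issue at all.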