I am trying to run Cholesky factorization on Intel MIC, but I am not getting the expected performance.
This is how I run the code:
[root@bunsen-mic0 /tmp]# env USE_2MB_BUFFERS=3000 MKL_NUM_THREADS=240 KMP_AFFINITY=proclist=[1-240],granularity=fine,explicit ./testing_native_dpotrf -N 9600 -L 5
time 1.201130, gflops 245.567177 >>>>>>>>> this is warm up
time 0.865080, gflops 340.960515
time 0.865288, gflops 340.878499
time 0.864819, gflops 341.063349
time 0.864337, gflops 341.253577
time 0.863623, gflops 341.535639
The correct performance should be 500 gflops when the matrix size is 9600*9600.
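For reference, the gflops figures in the output are consistent with the standard LAPACK operation count for Cholesky, n^3/3 + n^2/2 + n/6, divided by the measured time. The attached test program is not shown here, so the following is only a minimal sketch under that assumption; the helper names `potrf_flops` and `gflops` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Standard LAPACK operation count for Cholesky (dpotrf) on an n x n matrix:
 * n^3/3 + n^2/2 + n/6 floating-point operations. */
static double potrf_flops(double n)
{
    return n * n * n / 3.0 + n * n / 2.0 + n / 6.0;
}

/* Convert an operation count and a wall-clock time in seconds to gflops. */
static double gflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds / 1e9;
}

/* Print one line in the same layout as the benchmark output. */
static void report(double n, double seconds)
{
    printf("time %f, gflops %f\n", seconds, gflops(potrf_flops(n), seconds));
}
```

Calling `report(9600.0, 0.865080)` gives a rate of about 341 gflops, in line with the second timing line above. By the same count, reaching 500 gflops at this size would need a run time of roughly 0.59 seconds.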
Here are the strange results when I use only 1 core (4 threads):
[root@bunsen-mic0 /tmp]# env USE_2MB_BUFFERS=3000 MKL_NUM_THREADS=4 KMP_AFFINITY=proclist=[1-4],granularity=fine,explicit ./testing_native_dpotrf -N 9600 -L 5
time 0.902745, gflops 326.734658
time 0.871131, gflops 338.592037
time 0.870778, gflops 338.729428
time 0.868808, gflops 339.497416
time 0.866140, gflops 340.543143
time 0.864064, gflops 341.361391
This is clearly not correct: the rate with 4 threads is the same as with 240 threads, so it looks like the program is messing up the core mapping. Can anyone help me figure out where the problem is?
My code is attached. There is really nothing in it, it just calls LAPACKE_dpotrf.
BTW, this is how I compile my code:
testing_native_dpotrf: testing_native_dpotrf.c
icc -O3 -mmic -mkl $< -o $@
I also checked the dgetrf routine in MKL, and I have the same problem:
[root@bunsen-mic0 /tmp]# env USE_2MB_BUFFERS=3000 OMP_NUM_THREADS=240 KMP_AFFINITY=proclist=[1-240],granularity=fine,explicit ./testing_native_dgetrf -N 9600 -L 5
time 3.637088, gflops 162.156626
time 2.682708, gflops 219.844248
time 2.696129, gflops 218.749905
time 2.664388, gflops 221.355876
time 2.685536, gflops 219.612731
time 2.702955, gflops 218.197464
According to the benchmarks on the MKL website http://software.intel.com/en-us/intel-mkl, it should not be that slow.
I use the same core mapping and have no problem with MKL BLAS, but I run into this trouble when I use MKL LAPACK.
Hello,
I've seen similar problems before. It's probably a data alignment issue. Cholesky and LU performance is very sensitive to alignment. Please try aligning your memory on 64-byte boundaries. For example, you should use the mkl_malloc() routine to allocate your memory and pass 64 as the alignment. You also need huge pages (2MB pages) for memory allocation. See this article for instructions.
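As a rough illustration of the alignment advice, here is a minimal sketch of allocating the matrix on a 64-byte boundary. It is not the original poster's code. The `USE_MKL` macro and the helper names (`alloc_matrix`, `free_matrix`, `is_aligned`, `fill_spd`, `factor`) are made up for this example, and the `posix_memalign` branch is only a stand-in so the sketch builds on a machine without MKL.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef USE_MKL            /* hypothetical macro: define it when building with MKL */
#include <mkl.h>          /* mkl_malloc, mkl_free, LAPACKE_dpotrf */
#endif

#define ALIGNMENT 64      /* bytes, as suggested in the reply */

/* Allocate an n x n matrix of doubles on a 64-byte boundary. */
static double *alloc_matrix(size_t n)
{
#ifdef USE_MKL
    return (double *)mkl_malloc(n * n * sizeof(double), ALIGNMENT);
#else
    /* Stand-in for mkl_malloc so the sketch builds without MKL. */
    void *p = NULL;
    if (posix_memalign(&p, ALIGNMENT, n * n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;
#endif
}

/* Release a matrix obtained from alloc_matrix. */
static void free_matrix(double *a)
{
#ifdef USE_MKL
    mkl_free(a);
#else
    free(a);
#endif
}

/* Return 1 if the pointer sits on an ALIGNMENT-byte boundary. */
static int is_aligned(const void *p)
{
    return ((uintptr_t)p % ALIGNMENT) == 0;
}

/* Fill a column-major n x n matrix with a symmetric positive definite
 * test matrix: n on the diagonal and 0.5 elsewhere (diagonally dominant). */
static void fill_spd(double *a, size_t n)
{
    assert(a != NULL);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i)
            a[i + j * n] = (i == j) ? (double)n : 0.5;
}

#ifdef USE_MKL
/* Cholesky factorization of the lower triangle; returns the LAPACK info code. */
static int factor(double *a, int n)
{
    return (int)LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, a, n);
}
#endif
```

In the test program, the matrix would be allocated with `alloc_matrix(9600)`, checked with `is_aligned`, filled, and passed to `factor` inside the timing loop. With the build line from the first post, this sketch would need `-DUSE_MKL` added so the `mkl_malloc` branch is taken.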