I have been using MKL to replace LAPACK and FFTW3 when running OpenMX, a materials simulation code.
OpenMX itself is built with gcc -msse4.2 (OpenMX cannot be built with icc; if you do, you get wrong results).
case 1: ifort -msse4.2 for LAPACK and BLAS, icc -msse2 -openmp for FFTW3
case 2: MKL
My CPU is an i3 330M, which has 2 physical cores and 4 logical cores. The input files are in the work folder.
case 1 with 4 threads:
Met.dat: 41 s, GaAs: 347 s, C60: 81 s
case 1 with 2 threads:
Met.dat: 41 s, GaAs: 290 s
case 2 with 4 threads:
Met.dat: 40 s, GaAs: 327 s
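To put the GaAs numbers side by side, here is a small Python sketch that computes the relative differences from the timings listed above. The dictionary layout and the function name percent_slower are my own, for illustration only; the values are just the measured times.

```python
# Wall-clock timings in seconds, copied from the runs listed above.
timings = {
    ("case1", 4): {"Met.dat": 41, "GaAs": 347, "C60": 81},
    ("case1", 2): {"Met.dat": 41, "GaAs": 290},
    ("case2", 4): {"Met.dat": 40, "GaAs": 327},
}

def percent_slower(slow, fast):
    """Return how much longer `slow` takes than `fast`, in percent."""
    return 100.0 * (slow - fast) / fast

# case 1: 4 threads compared with 2 threads
gaas_4t_vs_2t = percent_slower(timings[("case1", 4)]["GaAs"],
                               timings[("case1", 2)]["GaAs"])

# case 2 (MKL, 4 threads) compared with case 1 (2 threads)
gaas_mkl_vs_2t = percent_slower(timings[("case2", 4)]["GaAs"],
                                timings[("case1", 2)]["GaAs"])

print(f"GaAs, case 1: 4 threads vs 2 threads: {gaas_4t_vs_2t:.1f}% slower")
print(f"GaAs, case 2 (4 threads) vs case 1 (2 threads): {gaas_mkl_vs_2t:.1f}% slower")
```

For GaAs this gives about 19.7% and 12.8%, so in both builds the 4-thread runs are slower than case 1 with 2 threads.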
I do not understand why MKL with 4 threads (GaAs: 327 s) is slower than the auto-vectorized build with 2 threads (GaAs: 290 s). MKL's LAPACK/BLAS routines are vectorized by hand rather than by the compiler, so I would expect them to be faster than compiler-generated code. Does this mean that an application using MKL should use the number of physical cores instead of the number of logical cores?
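If the answer is yes, one way I could test it is to cap the thread count at the number of physical cores through the OMP_NUM_THREADS and MKL_NUM_THREADS environment variables. The Python sketch below only builds such an environment; the helper name thread_env and the threads_per_core parameter are hypothetical, and the value 2 is an assumption that matches a Hyper-Threading CPU like mine. I have not confirmed that this removes the slowdown.

```python
import os

def thread_env(logical_cpus, threads_per_core=2):
    """Build an environment that caps threads at the physical core count.

    threads_per_core=2 is an assumption for a Hyper-Threading CPU
    such as the i3 330M (2 physical cores, 4 logical cores).
    """
    physical = max(1, logical_cpus // threads_per_core)
    env = dict(os.environ)                  # copy, do not modify the real environment
    env["OMP_NUM_THREADS"] = str(physical)  # OpenMP thread count
    env["MKL_NUM_THREADS"] = str(physical)  # thread count used inside MKL
    return env

env = thread_env(logical_cpus=4)
print(env["OMP_NUM_THREADS"], env["MKL_NUM_THREADS"])
```

The returned dictionary could then be passed as the env argument of subprocess.run when launching the program, so the case 2 run can be repeated with 2 threads and compared with the 290 s result.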