I am implementing a code (an iterative matrix solver, in Fortran) which basically works like this (pseudocode, with a rough Fortran sketch below):



while (error > tolerance AND iterations < max_iterations)
{
    a few DGEMV calls
    a few DDOT calls
    a few DNRM2 calls
}
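In Fortran, the body of that loop looks roughly like the sketch below (variable names are illustrative and the update is simplified -- the real QMR recurrences are more involved than this; it just shows where the BLAS calls sit):

subroutine solver_sketch(n, A, b, x, tol, itmax)
  implicit none
  integer, intent(in) :: n, itmax
  double precision, intent(in)    :: A(n,n), b(n), tol
  double precision, intent(inout) :: x(n)
  double precision :: r(n), Ax(n), err, rho
  double precision, external :: ddot, dnrm2
  integer :: it

  it  = 0
  err = huge(err)
  do while (err > tol .and. it < itmax)
     call dgemv('N', n, n, 1.0d0, A, n, x, 1, 0.0d0, Ax, 1)   ! Ax = A*x        (DGEMV)
     r   = b - Ax                                             ! residual
     rho = ddot(n, r, 1, r, 1)                                ! an inner product (DDOT)
     err = dnrm2(n, r, 1)                                     ! residual norm    (DNRM2)
     ! ... the actual QMR search-direction and solution updates go here ...
     it  = it + 1
  end do
end subroutine solver_sketch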
The problem is as follows:
1. The time required to solve increases erratically as I increase the number of threads.
I put it in a loop that works like this (not the actual script):
export MKL_DYNAMIC=FALSE
for i in 1 2 3 4 5 6 7 8; do
    export MKL_NUM_THREADS=$i
    time ./a.out
done
CPU usage always matches the number of threads (for instance, 200% for 2 threads, 800% for 8 threads), but the lowest time (and hence the best performance) is observed at 2 threads, which is bothering me.
2. I am linking my program as:
ifort -fp-model source ParaQMR.f90 -xSSE3 -lmkl_blas95 -lmkl_intel -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -m32 && time ./a.out
(I can attach the source code but it will be of no real use since the code reads data from a 2 GB text file which obviously I can't upload)
My hardware is an Intel Xeon with 8 cores / 8 threads, running 32-bit Ubuntu 10.04 (I know 32-bit is a mistake, but I doubt it makes a difference here).
3. I tried VTune, and although I am not a computer science major or anything, I did my best to interpret the results. What I gather from the graph below is that most of the time is spent creating and destroying threads each time a new iteration starts, but I am not sure. Is there any way to circumvent this problem?
I tried everything, but nothing seems to work. I have tried:
1. Using my own matrix-vector multiplication code. It performs far worse than MKL BLAS.
2. Intel Inspector. I couldn't make head or tail of what was happening.
3. Experimenting with different flags and with a different compiler (gfortran with -fopenmp instead). It didn't work.
(By "didn't work" I mean it didn't produce the decrease in time I expected.)




Please help me out and let me know if I should upload more VTune Graphs.
8 Replies
The following comments might be more topical, and might get more expert suggestions, on the MKL forum:
By setting MKL_DYNAMIC=FALSE, I believe you are disabling MKL's own effort to place threads efficiently, so you should be setting KMP_AFFINITY (or the MKL equivalent) explicitly for each number of threads, according to your platform: probably 1 thread per CPU on a dual-CPU box when running 2 threads, 1 thread per cache on a CPU with split caches, and never 2 threads per core while other cores are idle (see the example at the end of this post).
As you are using dgemv, you might expect performance to drop as soon as you run 2 threads each on 1 or more cores; are you trying to quantify that?
If your cache footprint is very large, it is possible to see your performance peak as soon as you have enough threads to use all of the cache. In such a case, VTune cache events could clarify it.
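As a starting point for the affinity setting, something along these lines (exact syntax depends on your compiler and OpenMP runtime version, so treat it as a sketch):
export KMP_AFFINITY=verbose,granularity=fine,scatter
Here "scatter" places successive threads on different cores before doubling up on any core, and "verbose" prints the resulting binding so you can check the placement.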
I used KMP_AFFINITY and set it as:
export MKL_DYNAMIC=FALSE
export MKL_NUM_THREADS=8
export KMP_AFFINITY="verbose,granularity=fine,proclist=[0,1,2,3,4,5,6,7],explicit"
ifort blah blah
It didn't help. As far as I know, cache was being managed optimally and a second thread was not started on any core if another core was sitting idle.
I still believe that what is happening is:
it = 0
create threads > carry out dgemv > destroy threads
create threads > carry out ddot > destroy threads
create threads > carry out dnrm > destroy threads
it = 1
create threads > carry out dgemv > destroy threads
.
.
.
Can this be the problem? If yes, can it be avoided?
Hi,
Have you tried export MKL_DYNAMIC=TRUE?
It lets MKL choose a good number of threads for the problem. As Tim noted, for the DGEMV and DDOT functions, increasing the number of threads may not improve performance. If MKL_DYNAMIC is FALSE, MKL is forced to use exactly the thread count you set.
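(You can also control this from inside the program via the MKL service routines instead of environment variables; a rough, untested sketch:)
      call mkl_set_dynamic(1)          ! same effect as MKL_DYNAMIC=TRUE: let MKL pick the thread count
      ! or, to force a fixed count:
      ! call mkl_set_dynamic(0)        ! MKL_DYNAMIC=FALSE
      ! call mkl_set_num_threads(2)    ! MKL_NUM_THREADS=2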
Thanks,
Chao
Hi,
The overhead of forking and joining OpenMP threads can be significant if the data volume in the MKL functions is not big enough.
Is SMT (Hyper-Threading) on? In that case 8 CPUs would mean 4 cores with Hyper-Threading. Please clarify.
I didn't get your point on joining OMP threads. I have a 10000x10000 matrix being multiplied with a 10000x1 vector.
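(If I have the arithmetic right, that matrix alone is 10000 x 10000 x 8 bytes, i.e. about 800 MB of double-precision data, so each DGEMV has to stream far more data than any cache can hold.)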
I don't think SMT is on (Googling it only gave me results about BIOS settings, which I have sworn never to fiddle with). The only settings I have tweaked are the ones mentioned: MKL_NUM_THREADS, MKL_DYNAMIC, and more recently KMP_AFFINITY.
Also, MKL_DYNAMIC=TRUE uses 2 threads most of the time, which is irritating, since my goal is a code that uses the maximum system resources.
To see the SMT setting from libiomp5, please add `verbose' to your KMP_AFFINITY setting:
KMP_AFFINITY=verbose,$KMP_AFFINITY
If a logical thread is bound to more than one CPU (likely two), then SMT is ON.
Also please look at related MKL articles/discussions:
http://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-intel-mkl-100-threading/
http://software.intel.com/en-us/articles/setting-thread-affinity-on-smt-or-ht-enabled-systems/
http://software.intel.com/en-us/forums/showthread.php?t=68141
BTW, if Hyper-Threading Technology is enabled on the system, it is recommended to set the number of threads equal to the number of real processors or cores, i.e. only half the number of logical processors.
Just one more question:
Will changing my code to C benefit the speed in any way?
Quoting nunoxic
Will changing my code to C benefit the speed in any way?
Unlikely from what you have said so far.
