Concurrency Problem with Intel MKL BLAS

nunoxic · ‎09-14-2011

http://software.intel.com/en-us/forums/showthread.php?t=86020

I didn't intend to multi-post, but there seems to be no choice since I need input from both MKL experts and VTune experts.

Thanks

barragan_villanueva_ · ‎09-15-2011

Please use the following environment settings while using libiomp5:

1) Set KMP_VERSION (for example, KMP_VERSION=1) to see which version of the OpenMP run-time library you are using.

2) Set KMP_AFFINITY=verbose,$KMP_AFFINITY to see the affinity in use.

3) Try KMP_AFFINITY=granularity=fine,compact,1,0. This is the affinity recommended in the MKL documentation when SMT (HT) is enabled.

4) Experiment with KMP_BLOCKTIME. It sets the time, in milliseconds, that a thread waits after completing the execution of a parallel region before sleeping (the default is 200 milliseconds).
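
For a POSIX shell, the settings above could be put into a small wrapper script roughly like the one below. This is only a sketch: the binary name in the last comment is hypothetical, and 200 is simply the default block time, used as a starting point for experiments.

```shell
#!/bin/sh
# 1) Ask the Intel OpenMP run-time to print its version at start-up.
export KMP_VERSION=1

# 3) Affinity suggested for machines with SMT (HT) enabled.
KMP_AFFINITY="granularity=fine,compact,1,0"

# 2) Prepend "verbose" so the run-time reports the bindings it actually uses.
export KMP_AFFINITY="verbose,${KMP_AFFINITY}"

# 4) Time in milliseconds a thread spins after a parallel region before sleeping.
export KMP_BLOCKTIME=200

echo "KMP_AFFINITY=$KMP_AFFINITY KMP_BLOCKTIME=$KMP_BLOCKTIME"

# Run the application with these settings (hypothetical binary name):
# ./my_mkl_app
```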

nunoxic · ‎09-15-2011

Thanks for your input, but none of the above made any difference to the code.
I played with KMP_BLOCKTIME for an hour or more. I set it to 0, 200, inf and so on, but it led nowhere. Sometimes it sped up execution for a given set of input data, but when the data changed, the gain was lost.

What is the difference between linking using -libomp5 and -openmp?
From my experiments, I found -libomp5 to be much, much faster than -openmp.

I tried to read up on KMP here:
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2009/compiler_c/optaps/common/optaps_openmp_thread_affinity.htm
but it is going over my head. Are KMP and OMP different things, or the same thing?

barragan_villanueva_ · ‎09-15-2011

Quoting nunoxic

What is the difference between linking using -libomp5 and -openmp?
From my experiments, I found -libomp5 to be much, much faster than -openmp.

That's strange :( With the Intel compiler and the mkl_intel_thread library there should be no difference.
So, what link command are you using?

TimP · ‎09-15-2011

-libomp5 shouldn't work; did you mean -liomp5? The latter is set by ifort -openmp, but you would need to specify the library explicitly if you were using some other command for linking.
The KMP environment variables are specific to Intel OpenMP, while the OMP ones are defined by the OpenMP standard.
One purpose of increasing KMP_BLOCKTIME would be to maintain KMP_AFFINITY settings across a gap of more than 0.2 seconds between OpenMP parallel regions. It's entirely possible that KMP_BLOCKTIME has little effect in normal circumstances.
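
To illustrate the linking point, here is a rough sketch of the two kinds of link command. It only prints the commands rather than running them. The object and executable names are hypothetical, gcc is just an example of "some other command", and library search paths and any Fortran run-time libraries that a non-ifort link would also need are left out.

```shell
#!/bin/sh
# Hypothetical object file and executable names.
OBJ="solver.o"
EXE="solver"

# MKL libraries for the LP64 interface with the Intel threading layer.
MKL_LIBS="-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core"

# Linking through ifort: -openmp brings in libiomp5 automatically.
LINK_IFORT="ifort -openmp $OBJ $MKL_LIBS -lpthread -lm -o $EXE"

# Linking through another driver: libiomp5 has to be named explicitly.
LINK_OTHER="gcc $OBJ $MKL_LIBS -liomp5 -lpthread -lm -o $EXE"

# Print both commands for comparison; nothing is compiled or linked here.
echo "$LINK_IFORT"
echo "$LINK_OTHER"
```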

nunoxic · ‎09-16-2011

Yes! My bad, I meant -liomp5.

So, as I see it:

1. Threading might help, but not much.

2. There is no point in adding threads to BLAS Level 2 operations.

Is there any way at all to speed up this code?

(Unless I move on to GPU computing?)

TimP · ‎09-17-2011

If your application doesn't have enough inherent parallelism to benefit from threading, a GPU is not a likely solution. It's true that BLAS level 2 operations, which would normally be vectorized, need to operate on extremely large data sets to benefit from threaded parallelism internal to those operations. It is therefore normal to apply parallelism at a higher level, with each thread performing entire, independent level 2 operations.
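
On the environment side, running parallelism at the higher level usually means keeping each MKL call single-threaded while the application's own OpenMP threads each perform independent level 2 operations. A minimal sketch, assuming the application already has such an outer OpenMP loop (the binary name is hypothetical, and the core count is taken from getconf with a fallback to 1):

```shell
#!/bin/sh
# Number of online cores as reported by getconf.
NCORES=$(getconf _NPROCESSORS_ONLN 2>/dev/null)

# Fall back to 1 if the value is empty or not a plain number.
case "$NCORES" in
  ''|*[!0-9]*) NCORES=1 ;;
esac

# Threads for the application's own (outer) OpenMP parallel regions.
export OMP_NUM_THREADS="$NCORES"

# Keep every MKL call single-threaded, so each outer thread performs
# a whole level 2 operation on its own.
export MKL_NUM_THREADS=1

echo "outer OpenMP threads: $OMP_NUM_THREADS, MKL threads per call: $MKL_NUM_THREADS"

# Run the application with these settings (hypothetical binary name):
# ./my_mkl_app
```

Whether this helps depends on the application actually having many independent level 2 operations to distribute.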