Profiling MKL using Amplifier XE 2011 and _kmp_wait_sleep

AndrewC · ‎10-26-2010

I am profiling with Amplifier XE 2011 on a 4-core Windows 64-bit machine and trying to optimize our use of MKL.
Amplifier shows that a significant amount of time is spent in _kmp_wait_sleep, called by BaseThreadStart. Our code uses MKL extensively. I am trying to understand what this means and how to improve it. We use MKL essentially as a black box; a lot of MKL time is spent in [dz]gemm3.

BTW, Amplifier XE 2011 is excellent - a worthy replacement for the late lamented Rational Quantify.

VipinKumar_E_Intel · ‎10-26-2010

_kmp_wait_sleep is related to the OpenMP library that MKL uses for threading. Your computation may not be big enough for the number of threads you use.

VipinKumar_E_Intel · ‎10-26-2010

Could you also provide the following?

1. Problem size
2. time spent on [dz]gemm
3. time spent on _kmp_wait_sleep

AndrewC · ‎10-26-2010

The matrix sizes are probably 512x512 double precision complex.
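To put that size in perspective, here is a rough sketch of the work in one 512x512 complex multiply. It assumes the usual count of about 8 real flops per complex multiply-add; the helper names are made up for illustration and are not part of MKL.

```python
def zgemm_real_flops(m: int, n: int, k: int) -> int:
    """Approximate real floating-point operations for one complex GEMM.

    Assumption: each complex multiply-add costs about 8 real flops
    (4 multiplies + 4 additions), and C = A*B needs m*n*k of them.
    """
    return 8 * m * n * k


def matrix_bytes(rows: int, cols: int, bytes_per_element: int = 16) -> int:
    """Storage for a dense matrix; 16 bytes is one double-precision complex value."""
    return rows * cols * bytes_per_element


n = 512
flops = zgemm_real_flops(n, n, n)
size = matrix_bytes(n, n)
print(f"{flops / 1e9:.2f} GFLOP per call, {size / 2**20:.0f} MiB per matrix")
```

So each call is roughly 1 GFLOP on 4 MiB operands, which is a fairly short burst of work to spread across several threads.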

Function         CPU Time   Overhead Time  Wait Time   Spin Time  Module          Function (Full)
_kmp_wait_sleep  248.483s   0usec          1072.405s   893.734s   libguide40.dll  _kmp_wait_sleep

The stack shows that this calling sequence, not zgemm, is where the "time" is spent. I was wrong about that.

Call Stack                                                   CPU Time   Overhead Time  Wait Time   Spin Time  Module          Function (Full)
_kmpc_invoke_task_func<-_kmp_launch_worker<-BaseThreadStart  248.399s   0usec          1072.038s   [Unknown]  libguide40.dll  _kmp_wait_sleep

There is nothing in the stack "above" BaseThreadStart.

The Summary says:
CPU 1476s
Elapsed 636s
Total thread count 6
Spin time 960s
Overhead 0

Top Hot spots
[libguide40.dll] 278
NtDelayExecution 277
_kmp_wait_sleep 248
daxpy 215
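For what it's worth, a couple of ratios fall straight out of the summary figures. The sketch below only does the arithmetic; how Amplifier relates spin time to CPU time is not stated in this thread, so treat the spin ratio as a rough indicator rather than an exact share.

```python
# Figures copied from the Amplifier summary above (seconds).
cpu_time = 1476.0
elapsed = 636.0
spin_time = 960.0
cores = 4  # the 4-core machine mentioned in the first post

avg_busy_cores = cpu_time / elapsed   # average number of cores charged with CPU time
utilization = avg_busy_cores / cores  # fraction of the machine kept busy
spin_to_cpu = spin_time / cpu_time    # spin time relative to total CPU time

print(f"average busy cores: {avg_busy_cores:.2f} of {cores}")
print(f"utilization: {utilization:.0%}")
print(f"spin time / CPU time: {spin_to_cpu:.2f}")
```

On these numbers, a bit over two of the four cores are busy on average, and the reported spin time is about two thirds the size of the total CPU time.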

Gennady_F_Intel · ‎10-26-2010

Please try adjusting the KMP_BLOCKTIME environment variable or calling the kmp_set_blocktime() function. Either lets you control how long threads wait before sleeping. The default value is 200 ms. You could try, say, 100 ms; it may offer better overall performance.
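One way to try different values without touching the code is to set the variable in the environment of the process being launched. This is a minimal sketch: env_with_blocktime is a hypothetical helper, and the child command is a stand-in that only echoes the variable, to be replaced with the real executable.

```python
import os
import subprocess
import sys


def env_with_blocktime(blocktime_ms: int) -> dict:
    """Return a copy of the current environment with KMP_BLOCKTIME set.

    The value is in milliseconds. The variable must be in place before
    the profiled process starts, hence the modified environment copy.
    """
    env = os.environ.copy()
    env["KMP_BLOCKTIME"] = str(blocktime_ms)
    return env


# Stand-in for the real application: a child Python process that just
# echoes the variable it was started with.
child = [sys.executable, "-c", "import os; print(os.environ['KMP_BLOCKTIME'])"]
result = subprocess.run(child, env=env_with_blocktime(100),
                        capture_output=True, text=True, check=True)
print("child sees KMP_BLOCKTIME =", result.stdout.strip())
```

Running the same workload with a few values (for example 0, 100 and 200) and comparing elapsed and CPU time should show whether the setting matters for your case.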

barragan_villanueva_ · ‎10-26-2010

Hi,

Please try using the libiomp5 library instead of libguide...

TimP · ‎10-26-2010

If you set affinity, e.g. with KMP_AFFINITY, libiompprof5 can show you whether certain threads spend extra time idle (work imbalance). You'll have to decide what you want to do. Do you want idle threads to yield sooner, according to KMP_BLOCKTIME, or do you want to optimize threading for a number of threads which doesn't fit the way a function is threaded in MKL, by providing your own source code?

Certain commercial applications provide for logging the problem sizes submitted to ?gemm. For example, it seems that a large N is required for the threading built into MKL ?gemm to work efficiently. Large matrices with a transposing argument set would seem, according to public source, to depend more on tiling according to the dimensions. If loops are skipped according to zero elements, that could produce idle time.
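Logging the problem sizes can be as simple as wrapping the multiply call and counting the shapes. The sketch below is only an illustration of the idea: naive_gemm is a hypothetical pure-Python stand-in for the real ?gemm call, and log_gemm_sizes is a made-up wrapper, not an MKL facility.

```python
from collections import Counter
from functools import wraps

gemm_sizes = Counter()  # (m, n, k) -> number of calls


def log_gemm_sizes(gemm):
    """Wrap a matrix-multiply routine and count the shapes it is called with."""
    @wraps(gemm)
    def wrapper(a, b):
        m, k, n = len(a), len(b), len(b[0])
        gemm_sizes[(m, n, k)] += 1
        return gemm(a, b)
    return wrapper


@log_gemm_sizes
def naive_gemm(a, b):
    """Hypothetical stand-in for the real ?gemm call (row-major lists of lists)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Example: two 2x2 products and one (1x3)*(3x1) product.
naive_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])
naive_gemm([[1, 0], [0, 1]], [[2, 3], [4, 5]])
naive_gemm([[1, 2, 3]], [[1], [2], [3]])

# Most frequent shapes first.
for shape, calls in gemm_sizes.most_common():
    print(shape, calls)
```

A histogram like this shows quickly whether most calls are small enough that threading overhead could dominate.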

AndrewC · ‎11-09-2010

I'm not really getting anywhere with this. I switched to libiomp5 (quite a hassle due to some older libraries) and have played with KMP_BLOCKTIME. A smaller KMP_BLOCKTIME resulted in less overall process CPU time, but no change in actual elapsed time.
The profiler shows a lot of time spent in:

RtlTryEnterCriticalSection 167.424s ntdll.dll
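The "less CPU, same elapsed time" result is what one would expect if threads are simply waiting either way: a spinning wait is charged as CPU time, a sleeping wait is not, but both take just as long. A small sketch of that difference, using only generic timers (nothing here is specific to the OpenMP runtime, and the function names are made up):

```python
import time


def measure(work):
    """Run work() and return (elapsed_seconds, cpu_seconds) for this process."""
    t0, c0 = time.perf_counter(), time.process_time()
    work()
    return time.perf_counter() - t0, time.process_time() - c0


def sleeping_wait(duration=0.2):
    # Blocked in the OS: elapsed time passes, almost no CPU is charged.
    time.sleep(duration)


def spinning_wait(duration=0.2):
    # Busy-wait: polls the clock until the deadline, burning CPU throughout.
    deadline = time.perf_counter() + duration
    while time.perf_counter() < deadline:
        pass


for name, work in [("sleep", sleeping_wait), ("spin", spinning_wait)]:
    elapsed, cpu = measure(work)
    print(f"{name}: elapsed {elapsed:.2f}s, cpu {cpu:.2f}s")
```

Both waits last about as long, but only the spinning one shows up as CPU time.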

VipinKumar_E_Intel · ‎12-03-2010

We have escalated this issue to our compiler engineering team and we will update you very soon.

Petros · ‎10-06-2011

Did we have any results?

Thanks,

Petros

AndrewC · ‎10-06-2011

Essentially, it is an issue, not surprisingly, when working with small matrices.