topic Profiling MKL using Amplifier XE 2011 and _kmp_wait_sleep in Intel® oneAPI Math Kernel Library

Profiling MKL using Amplifier XE 2011 and _kmp_wait_sleep

AndrewC — Tue, 26 Oct 2010 16:44:39 GMT

I am profiling with Amplifier XE 2011 on a 4-core Windows 64-bit machine and trying to optimize our use of MKL.
Amplifier shows that a significant amount of time is spent in _kmp_wait_sleep, called by BaseThreadStart. Our code uses MKL extensively. I am trying to understand what this means and how to improve it. We use MKL essentially as a black box; a lot of MKL time is spent in [dz]gemm3.

BTW, Amplifier XE 2011 is excellent - a worthy replacement for the late lamented Rational Quantify.

VipinKumar_E_Intel — Tue, 26 Oct 2010 17:13:44 GMT

_kmp_wait_sleep is related to the OpenMP library which MKL uses for threading. Your computation may not be big enough for the number of threads you are using.

VipinKumar_E_Intel — Tue, 26 Oct 2010 17:20:36 GMT

Could you also please provide the following?

1. Problem size
2. time spent on [dz]gemm
3. time spent on _kmp_wait_sleep

AndrewC — Tue, 26 Oct 2010 18:54:19 GMT

The matrix sizes are probably 512x512 double precision complex.

                 CPU Time  Overhead Time  Wait Time  Spin Time  Module          Function (Full)
_kmp_wait_sleep  248.483s  0usec          1072.405s  893.734s   libguide40.dll  _kmp_wait_sleep

The stack shows that the following calling sequence is where the "time" is spent, not zgemm; I was wrong about that.

                                                             CPU Time  Overhead Time  Wait Time  Spin Time  Module          Function (Full)
_kmpc_invoke_task_func<-_kmp_launch_worker<-BaseThreadStart  248.399s  0usec          1072.038s  [Unknown]  libguide40.dll  _kmp_wait_sleep

There is nothing in the stack "above" BaseThreadStart.

The Summary says
CPU 1476s
Elapsed 636s
Total thread count 6
Spin time 960s
Overhead 0

Top Hot spots
[libguide40.dll] 278
NtDelayExecution 277
_kmp_wait_sleep 248
daxpy 215
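As a rough reading of the summary figures above, the arithmetic can be sketched as follows. This assumes that Spin Time is counted as part of CPU Time in the Amplifier summary; the function and variable names are illustrative only.

```python
def summarize(cpu_time, elapsed, spin_time):
    """Derive simple utilization ratios from profiler summary times (seconds)."""
    avg_busy_cores = cpu_time / elapsed        # cores kept busy on average
    spin_fraction = spin_time / cpu_time       # share of CPU time spent spinning
    # Assumption: spin time is included in CPU time, so the remainder is non-spin work.
    useful_cores = (cpu_time - spin_time) / elapsed
    return avg_busy_cores, spin_fraction, useful_cores


# Figures copied from the summary quoted in this post.
avg_busy_cores, spin_fraction, useful_cores = summarize(
    cpu_time=1476.0, elapsed=636.0, spin_time=960.0
)

print(f"average busy cores:     {avg_busy_cores:.2f}")
print(f"spin share of CPU time: {spin_fraction:.0%}")
print(f"cores doing other work: {useful_cores:.2f}")
```

Under that assumption, roughly two thirds of the CPU time is spinning, and the non-spin work averages less than one core over the run, which would be consistent with mostly serial execution between short threaded regions.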

Gennady_F_Intel — Wed, 27 Oct 2010 04:51:16 GMT

Please try adjusting the KMP_BLOCKTIME environment variable, or call the kmp_set_blocktime() function. This lets you control how long threads wait before sleeping. The default value is 200 ms. You could try setting it to, say, 100 ms; it may offer better overall performance.
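One way to experiment with this is to set KMP_BLOCKTIME in the environment of the launched process and time each run. The sketch below is a minimal launcher under that assumption; make_env, run_with_blocktime, and ECHO_CMD are hypothetical names, and ECHO_CMD is only a stand-in child process that prints the value it inherited.

```python
import os
import subprocess
import sys
import time


def make_env(blocktime_ms, base_env=None):
    """Return a copy of the environment with KMP_BLOCKTIME set (milliseconds)."""
    env = dict(os.environ if base_env is None else base_env)
    env["KMP_BLOCKTIME"] = str(blocktime_ms)
    return env


def run_with_blocktime(cmd, blocktime_ms):
    """Run cmd once with the given block time; return (wall seconds, stdout)."""
    start = time.perf_counter()
    result = subprocess.run(
        cmd, env=make_env(blocktime_ms),
        capture_output=True, text=True, check=True,
    )
    return time.perf_counter() - start, result.stdout


# Stand-in command: a child Python process that echoes the inherited setting.
# Replace it with the command line of the application being profiled.
ECHO_CMD = [sys.executable, "-c", "import os; print(os.environ['KMP_BLOCKTIME'])"]

for ms in (200, 100, 10, 0):
    seconds, out = run_with_blocktime(ECHO_CMD, ms)
    print(f"KMP_BLOCKTIME={out.strip():>3}  wall time {seconds:.3f}s")
```

Setting the variable in the parent environment means it is already in place when the OpenMP runtime initializes in the child. Comparing wall time rather than CPU time matters here, since a shorter block time can reduce CPU usage without shortening the run.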

barragan_villanueva_ — Wed, 27 Oct 2010 05:19:41 GMT

Hi,

Please try using libiomp5-library instead of libguide...

TimP — Wed, 27 Oct 2010 05:34:38 GMT

If you set affinity, e.g. by KMP_AFFINITY, libiompprof5 can show you if certain threads spend extra time at idle (work imbalance). You'll have to decide what you want to do. Do you want idle threads to yield sooner, according to KMP_BLOCKTIME, or do you want to optimize threading for a number of threads which doesn't fit with the way a function is threaded in MKL, by providing your own source code? Certain commercial applications provide for logging the problem sizes submitted to ?gemm. For example, it seems that large N is required for efficient working of the threading built into MKL ?gemm. Large matrices, with A transposing argument set, would seem, according to public source, to be more dependent on tiling according to the dimensions. If loops are skipped according to zero elements, that could produce idle time.
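If the application does not already log the problem sizes submitted to ?gemm, a thin wrapper can collect them. The following is only a sketch of the idea: log_gemm_sizes is a hypothetical decorator, and naive_gemm is a pure-Python stand-in for the real BLAS call, not an MKL interface.

```python
from collections import Counter
from functools import wraps


def log_gemm_sizes(gemm):
    """Wrap a gemm-like callable and count the (m, n, k) shapes it is called with."""
    sizes = Counter()

    @wraps(gemm)
    def wrapper(a, b):
        # A is m x k, B is k x n, both given as nested lists.
        m, k, n = len(a), len(b), len(b[0])
        sizes[(m, n, k)] += 1
        return gemm(a, b)

    wrapper.sizes = sizes
    return wrapper


@log_gemm_sizes
def naive_gemm(a, b):
    """Stand-in for the real BLAS call: C = A * B on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


product = naive_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
print(naive_gemm.sizes.most_common())
```

A histogram of the recorded shapes shows whether the calls are dominated by sizes too small for the built-in threading to pay off.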

AndrewC — Tue, 09 Nov 2010 19:06:18 GMT

Not really getting anywhere with this. I changed over to libiomp5 (quite a hassle due to some older libraries) and have played with KMP_BLOCKTIME. A smaller KMP_BLOCKTIME resulted in less overall process CPU time, but no change in actual elapsed time.
The profiler shows a lot of time spent in

RtlTryEnterCriticalSection 167.424s ntdll.dll

VipinKumar_E_Intel — Fri, 03 Dec 2010 12:46:01 GMT

We have escalated this issue to our compiler engineering team and we will update you very soon.

Petros — Thu, 06 Oct 2011 14:50:02 GMT

Did we have any results?

Thanks,

Petros

AndrewC — Thu, 06 Oct 2011 14:54:24 GMT

Essentially, it is an issue, not surprisingly, when working with small matrices.
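A back-of-the-envelope estimate illustrates why small matrices leave the worker threads waiting. It uses the conventional count of 8 real floating-point operations per complex multiply-add, and an assumed sustained rate of 5 GFLOP/s per core, which is a hypothetical figure and not a measurement from this thread.

```python
DEFAULT_BLOCKTIME_S = 0.200  # default KMP_BLOCKTIME mentioned earlier in the thread


def zgemm_real_flops(n):
    """Real flops for an n x n complex matrix multiply (8 per complex multiply-add)."""
    return 8 * n ** 3


def estimated_seconds(n, cores, gflops_per_core):
    """Idealized run time, assuming perfect scaling at the given per-core rate."""
    return zgemm_real_flops(n) / (cores * gflops_per_core * 1e9)


# gflops_per_core=5.0 is an assumption chosen for illustration only.
for n in (64, 128, 512):
    t = estimated_seconds(n, cores=4, gflops_per_core=5.0)
    print(f"n={n:4d}  ~{t * 1e3:8.3f} ms per call  "
          f"({t / DEFAULT_BLOCKTIME_S:.1%} of the default block time)")
```

With these assumptions, even a 512x512 call finishes in a fraction of the default block time, so after each call the workers can spin for longer than the call itself took unless another threaded region starts promptly.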