Intel® oneAPI Math Kernel Library

Profiling MKL using Amplifier XE 2011 and _kmp_wait_sleep

AndrewC
New Contributor III
I am profiling with Amplifier XE 2011 on a 4-core, 64-bit Windows machine and trying to optimize our use of MKL.
Amplifier shows that a significant amount of time is spent in _kmp_wait_sleep, called by BaseThreadStart. Our code uses MKL extensively. I am trying to understand what this means and how to improve it. We use MKL essentially as a black box; a lot of MKL time is spent in [dz]gemm3.

BTW, Amplifier XE 2011 is excellent - a worthy replacement for the late lamented Rational Quantify.

10 Replies
VipinKumar_E_Intel

_kmp_wait_sleep belongs to the OpenMP runtime library, which MKL uses for threading. Your computation may not be big enough for the number of threads you use.

VipinKumar_E_Intel

Could you also please provide the following?

1. Problem size
2. time spent on [dz]gemm
3. time spent on _kmp_wait_sleep

AndrewC
New Contributor III

The matrix sizes are probably 512x512 double precision complex.

Function          CPU Time   Overhead Time   Wait Time   Spin Time   Module           Function (Full)
_kmp_wait_sleep   248.483s   0usec           1072.405s   893.734s    libguide40.dll   _kmp_wait_sleep

The stack shows that the following calling sequence is where the "time" is spent, not zgemm. I was wrong about that.

Call Stack                                                    CPU Time   Overhead Time   Wait Time   Spin Time   Module           Function (Full)
_kmpc_invoke_task_func<-_kmp_launch_worker<-BaseThreadStart   248.399s   0usec           1072.038s   [Unknown]   libguide40.dll   _kmp_wait_sleep

There is nothing in the stack "above" BaseThreadStart.

The Summary says
CPU 1476s
Elapsed 636s
Total thread count 6
Spin time 960s
Overhead 0

Top Hot spots
[libguide40.dll] 278
NtDelayExecution 277
_kmp_wait_sleep 248
daxpy 215
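
The Summary figures above can be turned into a few utilization ratios to see how much of the CPU time is spinning. The following is a minimal sketch, not something from the original post: it assumes that Amplifier's Spin Time is counted as part of CPU Time, and the function name `summarize` and its dictionary keys are made up for illustration.

```python
# Figures copied from the Amplifier XE Summary quoted above (seconds).
CPU_TIME_S = 1476.0
ELAPSED_S = 636.0
SPIN_TIME_S = 960.0
CORES = 4  # the poster's machine has 4 cores


def summarize(cpu, elapsed, spin, n_cores):
    """Derive simple utilization ratios from profiler summary totals.

    Assumption: spin time is included in the CPU time total.
    """
    return {
        # Average number of cores kept busy (spinning or working).
        "avg_busy_cores": cpu / elapsed,
        # Share of the machine's total core-seconds that were consumed.
        "machine_utilization": cpu / (n_cores * elapsed),
        # Share of CPU time attributed to spinning.
        "spin_fraction": spin / cpu,
        # Average number of cores doing non-spin work.
        "avg_working_cores": (cpu - spin) / elapsed,
    }


if __name__ == "__main__":
    for name, value in summarize(CPU_TIME_S, ELAPSED_S, SPIN_TIME_S, CORES).items():
        print(f"{name:>20}: {value:.2f}")
```

Under that assumption, roughly 2.3 cores are busy on average, about 65% of the CPU time is spin time, and only about 0.8 cores' worth of time goes to non-spin work, which would be consistent with worker threads mostly waiting rather than computing.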
Gennady_F_Intel
Moderator

Please try adjusting the blocktime, either with the KMP_BLOCKTIME environment variable or with the kmp_set_blocktime() function. This lets you control how long threads wait before sleeping. The default value is 200 ms. You could try, say, 100 ms; it may offer better overall performance.
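
One way to try several KMP_BLOCKTIME values without rebuilding anything is to launch the application from a small script that sets the variable in the child's environment and times each run. This is only a sketch: the helper name `run_with_blocktime` is hypothetical, and the demo command is a stand-in that just echoes the variable, to be replaced with the real MKL-based executable.

```python
import os
import subprocess
import sys
import time


def run_with_blocktime(cmd, blocktime_ms):
    """Run cmd with KMP_BLOCKTIME set; return (elapsed_seconds, stdout)."""
    env = dict(os.environ)
    # The variable only affects processes that load the Intel OpenMP runtime.
    env["KMP_BLOCKTIME"] = str(blocktime_ms)
    start = time.perf_counter()
    result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    return time.perf_counter() - start, result.stdout


if __name__ == "__main__":
    # Stand-in workload (an assumption): prints the value the child process sees.
    demo_cmd = [sys.executable, "-c", "import os; print(os.environ['KMP_BLOCKTIME'])"]
    for ms in (200, 100, 0):
        elapsed, out = run_with_blocktime(demo_cmd, ms)
        print(f"KMP_BLOCKTIME={ms} ms: elapsed {elapsed:.3f}s, child saw {out.strip()}")
```

Comparing elapsed time alongside the profiler's CPU time for each setting helps separate "less spinning" from "actually faster".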

barragan_villanueva_
Valued Contributor I

Hi,

Please try using the libiomp5 library instead of libguide...

TimP
Honored Contributor III
If you set affinity, e.g. with KMP_AFFINITY, libiompprof5 can show you whether certain threads spend extra time idle (work imbalance). You'll have to decide what you want to do: do you want idle threads to yield sooner, as controlled by KMP_BLOCKTIME, or do you want to optimize threading for a number of threads that doesn't fit the way a function is threaded in MKL, by providing your own source code? Certain commercial applications provide for logging the problem sizes submitted to ?gemm. For example, it seems that large N is required for the threading built into MKL ?gemm to work efficiently. Large matrices with a transpose argument set would seem, according to public source, to be more dependent on tiling according to the dimensions. If loops are skipped according to zero elements, that could produce idle time.
AndrewC
New Contributor III
I'm not really getting anywhere with this. I changed over to the libiomp5 DLL (quite a hassle due to some older libraries) and have played with KMP_BLOCKTIME. A smaller KMP_BLOCKTIME resulted in less overall process CPU time, but no change in actual elapsed time.
The profiler shows a lot of time spent in:

RtlTryEnterCriticalSection 167.424s ntdll.dll

VipinKumar_E_Intel
We have escalated this issue to our compiler engineering team and we will update you very soon.
Petros
Beginner
Did we have any results?
Thanks,
Petros
AndrewC
New Contributor III
Essentially, it is an issue, not surprisingly, when working with small matrices.