AndrewC
New Contributor I
87 Views

Profiling MKL using Amplifier XE 2011 and _kmp_wait_sleep

I am profiling with Amplifier XE 2011 on a 4-core, 64-bit Windows machine and trying to optimize our use of MKL.
Amplifier shows that a significant amount of time is spent in _kmp_wait_sleep, called by BaseThreadStart. Our code uses MKL extensively. I am trying to understand what this means and how to improve it. We use MKL essentially as a black box; a lot of MKL time is spent in [dz]gemm3.

BTW, Amplifier XE 2011 is excellent - a worthy replacement for the late lamented Rational Quantify.

87 Views

_kmp_wait_sleep belongs to the OpenMP library that MKL uses for threading. Maybe your computation is not big enough for the number of threads you use.

87 Views

Could you also please provide the following?

1. Problem size
2. time spent on [dz]gemm
3. time spent on _kmp_wait_sleep

AndrewC
New Contributor I
87 Views


The matrix sizes are probably 512x512 double precision complex.
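
To put that size in perspective, here is a rough sketch of how much arithmetic one such multiply involves. It assumes square 512x512 complex matrices and the usual count of 8 real flops per complex multiply-add. The per-core rate `ASSUMED_GFLOPS_PER_CORE` is a hypothetical figure chosen for illustration, not something measured in this thread, and perfect scaling across threads is an optimistic assumption.

```python
# Rough work estimate for one n x n complex matrix multiply (zgemm).
# A complex multiply-add costs 8 real flops (4 multiplies + 4 additions),
# and a dense n x n product needs n**3 of them.


def zgemm_flops(n: int) -> int:
    """Real floating-point operations for C = A * B with n x n complex matrices."""
    return 8 * n ** 3


def seconds_per_call(n: int, gflops_per_core: float, threads: int) -> float:
    """Ideal run time, assuming perfect scaling across threads (optimistic)."""
    return zgemm_flops(n) / (gflops_per_core * 1e9 * threads)


ASSUMED_GFLOPS_PER_CORE = 10.0  # hypothetical rate, not measured in this thread

for threads in (1, 2, 4):
    t = seconds_per_call(512, ASSUMED_GFLOPS_PER_CORE, threads)
    print(f"{threads} thread(s): {t * 1e3:.1f} ms per 512x512 zgemm")
```

Under these assumptions a single call takes tens of milliseconds, so even short stretches of serial code between calls can leave the worker threads waiting for a noticeable share of the run.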

Function         CPU Time  Overhead Time  Wait Time  Spin Time  Module          Function (Full)
_kmp_wait_sleep  248.483s  0usec          1072.405s  893.734s   libguide40.dll  _kmp_wait_sleep

The stack shows that this calling sequence, not zgemm, is where the time is spent. I was wrong about that.

Call Stack: _kmpc_invoke_task_func<-_kmp_launch_worker<-BaseThreadStart
CPU Time  Overhead Time  Wait Time  Spin Time  Module          Function (Full)
248.399s  0usec          1072.038s  [Unknown]  libguide40.dll  _kmp_wait_sleep

There is nothing in the stack "above" BaseThreadStart

The Summary says
CPU 1476s
Elapsed 636s
Total thread count 6
Spin time 960s
Overhead 0

Top Hot spots
[libguide40.dll] 278
NtDelayExecution 277
_kmp_wait_sleep 248
daxpy 215
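
A quick way to read the summary figures is to turn them into average core usage. The sketch below uses only the numbers quoted above. It assumes that spin time is counted as part of CPU time, which is how I understand the Amplifier metrics but which is not stated in this thread.

```python
# Figures copied from the Amplifier XE summary above.
cpu_time_s = 1476.0
elapsed_s = 636.0
spin_time_s = 960.0
cores = 4  # "4 core machine" from the original post

# Assumption: spin time is included in CPU time, so the remainder is real work.
useful_cpu_s = cpu_time_s - spin_time_s

avg_busy_cores = cpu_time_s / elapsed_s      # cores kept busy, spinning included
avg_useful_cores = useful_cpu_s / elapsed_s  # cores doing non-spin work
spin_fraction = spin_time_s / cpu_time_s     # share of CPU time spent spinning

print(f"Average busy cores:   {avg_busy_cores:.2f} of {cores}")
print(f"Average useful cores: {avg_useful_cores:.2f} of {cores}")
print(f"Spin share of CPU:    {spin_fraction:.0%}")
```

If that assumption holds, roughly two thirds of the CPU time is spinning, and the non-spin work averages less than one core over the run, which is consistent with threads mostly waiting for work rather than computing.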
Gennady_F_Intel
Moderator
87 Views

Please try adjusting the KMP_BLOCKTIME environment variable, or call the kmp_set_blocktime() function. This lets you control how long threads wait before going to sleep. The default value is 200 ms. You could try setting it to, say, 100 ms; it may give better overall performance.
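
One way to experiment with this without touching the application source is to launch it from a small driver script that sets the variable for the child process only. This is a minimal sketch: the helper names `env_with_blocktime` and `run_with_blocktime` are made up for illustration, and `placeholder_cmd` is a stand-in that merely echoes the variable; you would substitute the command line of your own MKL application. The variable needs to be in the environment before the OpenMP runtime initializes, which launching a fresh process guarantees.

```python
import os
import subprocess
import sys


def env_with_blocktime(blocktime_ms: int) -> dict:
    """Copy the current environment and set KMP_BLOCKTIME (in milliseconds)."""
    env = dict(os.environ)
    env["KMP_BLOCKTIME"] = str(blocktime_ms)
    return env


def run_with_blocktime(cmd: list, blocktime_ms: int) -> subprocess.CompletedProcess:
    """Run cmd in a child process that sees the given KMP_BLOCKTIME."""
    return subprocess.run(
        cmd,
        env=env_with_blocktime(blocktime_ms),
        capture_output=True,
        text=True,
        check=True,
    )


# Stand-in for the real application (hypothetical): a child Python process
# that only prints the value it received.
placeholder_cmd = [sys.executable, "-c", "import os; print(os.environ['KMP_BLOCKTIME'])"]

for ms in (200, 100, 0):
    result = run_with_blocktime(placeholder_cmd, ms)
    print(f"KMP_BLOCKTIME={ms} -> child saw {result.stdout.strip()}")
```

Timing each run and comparing both elapsed time and total CPU time shows whether a shorter wait only reduces spinning or also shortens the run.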

barragan_villanueva_
Valued Contributor I
87 Views

Hi,

Please try using the libiomp5 library instead of libguide...

TimP
Black Belt
87 Views

If you set affinity, e.g. by KMP_AFFINITY, libiompprof5 can show you if certain threads spend extra time at idle (work imbalance). You'll have to decide what you want to do. Do you want idle threads to yield sooner, according to KMP_BLOCKTIME, or do you want to optimize threading for a number of threads which doesn't fit with the way a function is threaded in MKL, by providing your own source code? Certain commercial applications provide for logging the problem sizes submitted to ?gemm. For example, it seems that large N is required for efficient working of the threading built into MKL ?gemm. Large matrices, with A transposing argument set, would seem, according to public source, to be more dependent on tiling according to the dimensions. If loops are skipped according to zero elements, that could produce idle time.
AndrewC
New Contributor I
87 Views

Not really getting anywhere with this. I switched to libiomp5 (quite a hassle due to some older libraries) and have experimented with KMP_BLOCKTIME. A smaller KMP_BLOCKTIME reduced the overall process CPU time but did not change the actual elapsed time.
The profiler shows a lot of time spent in:

RtlTryEnterCriticalSection 167.424s ntdll.dll

87 Views

We have escalated this issue to our compiler engineering team and we will update you very soon.
Petros
Beginner
87 Views

Were there any results?
Thanks,
Petros
AndrewC
New Contributor I
87 Views

Essentially, it is an issue, not surprisingly, when working with small matrices.