Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

Huge kmp_* overhead

Matt_S_
Beginner

I'm checking out my code's "CPU Usage Histogram" and I have a fairly significant overhead in the function kmp_launch_thread. Is there any way to reduce this time? I have tried varying KMP_AFFINITY with only minor improvements. I'm at a loss. I realize I'm not giving a whole lot of information, so just let me know what you need (system info, code snippets, etc.).

Thanks in advance!

Peter_W_Intel
Employee

I have a fairly significant overhead in the function kmp_launch_thread

Usually this indicates that many threads were invoked (so kmp_launch_thread's CPU time accumulated) but each ran only briefly, since the tasks in these threads were tiny. If using KMP_AFFINITY doesn't help, please control the number of OMP threads and extend the threads' lifetimes.
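For example, a minimal sketch of that idea (the thread count, loop bounds, and loop body are placeholders, not from this thread): create one long-lived parallel region and reuse its threads for every phase of work, instead of spawning a fresh team for each short task.

#include <omp.h>

int main(void)
{
    omp_set_num_threads(4);   /* placeholder: match your physical core count */

    /* One long-lived parallel region: threads are launched once
       (one kmp_launch_thread call per worker) and then reused. */
    #pragma omp parallel
    {
        for (int phase = 0; phase < 1000; ++phase) {
            #pragma omp for
            for (int i = 0; i < 100000; ++i) {
                /* ... real work on element i ... */
            }
        }
    }
    return 0;
}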

Bernard
Valued Contributor I

 >>>Is there any way to reduce this time? I have tried varying KMP_AFFINITY with only minor improvements>>>

I am not sure anything significant can be done. AFAIK kmp_launch_thread calls OS threading routines further down the call chain in order to spawn new threads, so it can have significant overhead at your app's start-up. Bear in mind that thread creation is costly in terms of CPU cycles.

Dmitry_P_Intel1
Employee

Which VTune analysis do you use? If you use a collection with stacks, could you please switch the Call Stack Mode on the filter bar to "User/system functions" and see whether the CPU distribution is different, and which function consumes the time when it is not attributed to _kmp_launch...

Matt_S_
Beginner

Peter, no matter the thread count, the kmp overhead is large. I've varied it from 1x to 4x the number of cores.

Dmitry, I've just been using the "Basic Hotspots Analysis." I'm not sure where the filter bar you're talking about is located. I've attached a screenshot snippet of what my CPU Usage Histogram looks like.

Thanks for the help!

Bernard
Valued Contributor I

@Matt S

Is there an option to disassemble the main$omp$parallel function? Maybe there is a call to an OS system function and the perceived delay is on the OS kernel side.

Peter_W_Intel
Employee

Matt S

You had 48 threads running on a 4-core box, so the overhead was high. Much of the time went to spin locks and context switches.

You need to reduce the thread count to the number of logical cores and keep tasks running longer on those threads. Another thought is to reduce spin locks, or confine each spin lock to a small code region, etc., to reduce the wait time or wait count.
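For instance, a minimal sketch of both ideas (kmp_set_blocktime is an Intel OpenMP runtime extension, declared in the Intel compiler's omp.h; the values here are assumptions to tune, not from this thread):

#include <omp.h>

int main(void)
{
    /* Standard OpenMP: size the team to the logical core count
       (same effect as setting OMP_NUM_THREADS). */
    omp_set_num_threads(omp_get_num_procs());

    /* Intel extension: let idle threads sleep immediately instead of
       spinning for the default 200 ms (same effect as KMP_BLOCKTIME=0). */
    kmp_set_blocktime(0);

    #pragma omp parallel
    {
        /* ... work ... */
    }
    return 0;
}

A shorter blocktime trades spin-wait CPU time for wake-up latency, so it helps most when threads genuinely idle between parallel regions.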

Bernard
Valued Contributor I

@Peter

Good point.

Dmitry_P_Intel1
Employee

Before jumping to conclusions, we need to be sure that VTune's attribution of the overhead to kmp_launch is correct.

Matt, the picture below shows the knob that will allow you to disable re-attribution from system to user functions.

Matt_S_
Beginner

Dmitry, here are two screenshots: the first in "User functions + 1" mode and the second in "User/system functions" mode.

Iliya, I've also attached a screenshot of the expanded main functions for you to look at.

Dmitry_P_Intel1
Employee

Hello Matt,

Thank you for the screenshots - they explain things. The spin time actually happened on a barrier (kmp_wait_sleep...), but VTune attributed it to kmp_launch_thread - we will think about how to improve that so it is not misleading.

So you have an OpenMP imbalance: some threads finished their portion of the work earlier than others and then spun on a barrier.

You can try a dynamic schedule if you have a parallel loop, like:

#pragma omp parallel for schedule(dynamic, <chunk_size>)

where chunk_size is the number of iterations assigned to a worker thread at a time. You can omit chunk_size (it defaults to 1), but with a big number of iterations you might then see overhead from work scheduling. Experiment with this.
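As an illustration, a minimal sketch of such a loop (the array size, chunk size of 16, and the uneven per-iteration work are hypothetical): under a static schedule, threads that draw the cheap iterations would spin on the end-of-loop barrier, while schedule(dynamic, 16) hands chunks of 16 iterations to whichever thread is free next.

#include <stdlib.h>
#include <omp.h>

int main(void)
{
    enum { N = 1 << 20 };
    double *data = calloc(N, sizeof *data);
    if (!data) return 1;

    /* Dynamic scheduling smooths out imbalance caused by
       uneven per-iteration cost. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < N; ++i) {
        int reps = (i % 64) + 1;          /* hypothetical uneven work */
        for (int r = 0; r < reps; ++r)
            data[i] = data[i] * 0.5 + 1.0;
    }

    free(data);
    return 0;
}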

I also second the advice to limit the number of threads to the number of cores.

Thanks & Regards, Dmitry
