Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Nested parallelism with OpenMP and MKL

poodlediagram
Beginner
Hi!

I maintain the Elk code (elk.sourceforge.net) and I need some help with parallelism.

The Elk code contains nested MPI and OpenMP regions (down four levels in places), and within these are calls to LAPACK.

The code runs fine on our new Intel Xeon X5650 cluster (I've tested it with up to 240 cores running across 20 nodes, each with 12 cores). The problem is that using threaded MKL together with OpenMP spawns many more threads than there are cores ('top -H' reports some of them running at 5%), which makes it run more slowly than non-threaded MKL in some cases. I've tried many combinations of the MKL and OpenMP environment variables, but nothing seems to work properly.

Here is the most successful combination of variables:

[bash]export OMP_NUM_THREADS=12
export OMP_NESTED=true
export OMP_MAX_ACTIVE_LEVELS=4
export OMP_DYNAMIC=true

export MKL_NUM_THREADS=12
export MKL_DYNAMIC=false[/bash]
...and here are the Fortran linker command-line options:

[bash]-L/cluster/intel/mkl/lib/intel64/ /cluster/intel/mkl/lib/intel64/libmkl_solver_lp64.a  -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -openmp -lpthread[/bash]

What would be ideal is if MKL created new threads only when there are idle cores.

Is there some way of doing this?

Thanks,
Kay Dewhurst
(Max Planck Institute, Halle)
TimP
Honored Contributor III
I'm a little confused by your description. If you are calling MKL from a threaded region which already uses all 12 cores on each node, and you don't want over-subscription, why ask MKL to generate additional threads?
OpenMP isn't well adapted to choosing the number of threads dynamically. The Intel C++ threading models (TBB, Cilk Plus, and ArBB) aim to do that, but it's not clear that it would be an advantage in your situation.
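If the intent is simply for each OpenMP thread to do its LAPACK work serially, the usual arrangement is to keep MKL at one thread while the OpenMP team owns the cores. A minimal sketch, not taken from Elk (the worker routine is a hypothetical placeholder; mkl_set_num_threads is the standard MKL service routine):

[fortran]! Sketch: the OpenMP team owns the cores, each thread calls a serial MKL/LAPACK.
! do_lapack_work is a hypothetical placeholder for the real computation.
program serial_mkl_inside_omp
  use omp_lib
  implicit none
  integer :: i

  call mkl_set_num_threads(1)   ! MKL runs serially inside each OpenMP thread

!$omp parallel do
  do i = 1, 12
     call do_lapack_work(i)     ! hypothetical worker that calls LAPACK/MKL
  end do
!$omp end parallel do
end program serial_mkl_inside_omp
[/fortran]

Linking the sequential MKL layer (libmkl_sequential) instead of libmkl_intel_thread has the same effect without any calls in the source.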
jimdempseyatthecove
Honored Contributor III
Kay,

I agree with TimP's assessment but have an additional hint to offer to you.

You can set OpenMP to undersubscribe its threads, and you can set MKL to undersubscribe its threads as well. You may find that some degree of undersubscription of each yields better overall performance. Note that the best undersubscription is not necessarily the point where the sum of the subscriptions equals the number of hardware threads. You may also need to experiment with KMP_BLOCKTIME and/or OMP_WAIT_POLICY and the MKL equivalents. MKL has better runtime control over these characteristics.
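As one illustration of what an undersubscription experiment could look like from inside the code (the 8/4 split is a made-up starting point for a 12-core node, not a recommendation):

[fortran]! Sketch of undersubscribing both runtimes on a 12-core node.
! The 8/4 split is only an example starting point for experiments.
program undersubscribe_sketch
  use omp_lib
  implicit none

  call omp_set_num_threads(8)   ! OpenMP regions get 8 threads
  call mkl_set_num_threads(4)   ! threaded MKL calls get 4 threads

  ! ... the rest of the program runs its OpenMP regions and MKL calls
  ! as before; KMP_BLOCKTIME / OMP_WAIT_POLICY can still be varied
  ! from the environment between runs.
end program undersubscribe_sketch
[/fortran]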

There are many programming "hacks" you can use to avoid adverse interaction, but the right strategy depends on too many factors to offer sensible advice without you providing them. To give you a glimpse of one of the possibilities:

Assume you do not wish to use a KMP_BLOCKTIME of 0, since this adversely affects the OpenMP performance (at the expense of the OpenMP threads interfering with the MKL threads at the transition).

Assume that, in this particular instance, your program's main loop performs OpenMP work and then MKL work:

loop:
doOpenMPStuff();
doMKLStuff();
end loop

With a non-zero KMP_BLOCKTIME, all threads except the main thread will compete with the threads managed by MKL (for the duration of the block time).

Under the above circumstances, the hack is (in C-style pseudo code, cleaned up so that it compiles):

#include <omp.h>      // omp_get_thread_num
#include <windows.h>  // Sleep; on Linux use sched_yield() from <sched.h>

void doOpenMPStuff(void);   // your existing OpenMP-threaded phase
void doMKLStuff(void);      // your existing MKL phase

void driver(void)
{
    volatile int Done;      // must be volatile (or atomic) so the spin-wait sees the store

    for (;;) {              // main loop
        Done = 0;
        doOpenMPStuff();

        #pragma omp parallel shared(Done)
        {
            if (omp_get_thread_num() == 0) {
                doMKLStuff();     // only the master thread calls MKL
                Done = 1;
            } else {
                while (!Done)
                    Sleep(0);     // release the time slice to the MKL threads
            }
        } // end parallel
    }
}

You can adapt this to your own code; the important detail is that Done has to be volatile or atomic.

The above is only appropriate under the circumstances described above.
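Since Elk is Fortran, roughly the same idea in Fortran might look like the sketch below (do_mkl_stuff is a hypothetical placeholder for the real MKL/LAPACK call; sched_yield is pulled in through ISO_C_BINDING to release the time slice):

[fortran]! Sketch: thread 0 runs the MKL call while the rest of the team yields
! its time slice instead of spin-waiting at full speed.
! do_mkl_stuff is a placeholder for the real MKL/LAPACK call.
subroutine mkl_on_master_only
  use omp_lib
  use iso_c_binding, only: c_int
  implicit none
  interface
     function sched_yield() bind(c, name="sched_yield")  ! POSIX yield
       import :: c_int
       integer(c_int) :: sched_yield
     end function sched_yield
  end interface
  logical, volatile :: done   ! volatile so the spin-wait sees the update
  integer(c_int) :: rc
  done = .false.

!$omp parallel default(shared) private(rc)
  if (omp_get_thread_num() == 0) then
     call do_mkl_stuff()       ! placeholder for the real MKL phase
     done = .true.
  else
     do while (.not. done)
        rc = sched_yield()     ! release the core to the MKL threads
     end do
  end if
!$omp end parallel
end subroutine mkl_on_master_only
[/fortran]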

Jim Dempsey
poodlediagram
Beginner
Hi again,

Thanks for both replies.

I've been doing some more testing, and the oversubscription problem occurs even without MKL. So maybe I'm not understanding something, or something is not set up correctly.

I compiled the code with "ifort -O3 -ip -unroll -no-prec-div -openmp" and set
[bash]export OMP_NUM_THREADS=4
export OMP_NESTED=true
export OMP_DYNAMIC=true[/bash]
Then I ran it on my four-core desktop PC. At times "top -H" was giving:
[bash] 3008 dewhurst  20   0  435m 165m 2744 R   72  2.1   1:01.07 elk
 3010 dewhurst  20   0  435m 165m 2744 R   26  2.1   0:42.09 elk
 3011 dewhurst  20   0  435m 165m 2744 S   25  2.1   0:07.60 elk
 3023 dewhurst  20   0  435m 165m 2744 S   23  2.1   0:18.94 elk
 3038 dewhurst  20   0  435m 165m 2744 R   11  2.1   0:04.82 elk
 3029 dewhurst  20   0  435m 165m 2744 R    9  2.1   0:03.80 elk
 3036 dewhurst  20   0  435m 165m 2744 R    9  2.1   0:02.96 elk
 3039 dewhurst  20   0  435m 165m 2744 R    9  2.1   0:04.82 elk
 3031 dewhurst  20   0  435m 165m 2744 R    9  2.1   0:04.10 elk
 3016 dewhurst  20   0  435m 165m 2744 R    7  2.1   0:03.88 elk
 3030 dewhurst  20   0  435m 165m 2744 R    7  2.1   0:02.62 elk
 3037 dewhurst  20   0  435m 165m 2744 R    7  2.1   0:03.66 elk
 3035 dewhurst  20   0  435m 165m 2744 R    7  2.1   0:03.96 elk
 3032 dewhurst  20   0  435m 165m 2744 R    6  2.1   0:04.64 elk
 3040 dewhurst  20   0  435m 165m 2744 R    6  2.1   0:04.72 elk
 3013 dewhurst  20   0  435m 165m 2744 R    5  2.1   0:04.36 elk
 3033 dewhurst  20   0  435m 165m 2744 R    5  2.1   0:04.66 elk
 3034 dewhurst  20   0  435m 165m 2744 R    5  2.1   0:04.62 elk
 3020 dewhurst  20   0  435m 165m 2744 S    3  2.1   0:18.54 elk
 3014 dewhurst  20   0  435m 165m 2744 S    2  2.1   0:04.32 elk
 3017 dewhurst  20   0  435m 165m 2744 S    2  2.1   0:03.76 elk
 3018 dewhurst  20   0  435m 165m 2744 S    2  2.1   0:04.86 elk
 3019 dewhurst  20   0  435m 165m 2744 S    2  2.1   0:18.52 elk
 3024 dewhurst  20   0  435m 165m 2744 S    2  2.1   0:12.82 elk[/bash]

To me this seems very inefficient (even if it is standards-compliant): I would hope never to see more threads than cores being used. [On Sun systems, there is a variable, SUNW_MP_MAX_POOL_THREADS, which apparently prevents this problem.]

I'll try the same thing with gfortran later today, but I suspect it will do the same thing.

Cheers,
Kay.

poodlediagram
Beginner
...apparently I'm not the first with this problem:

http://software.intel.com/en-us/forums/showthread.php?t=64571

Kay.
TimP
Honored Contributor III
If you're submitting multiple OpenMP jobs on a single memory image, it's up to you to assign each job to a different group of cores, or to use a scheduler to sequence them. Even if you used a system-load-aware scheme like TBB, scheduling them yourself should give better throughput.
poodlediagram
Beginner
I'm not. It's a single job viewed with 'top -H' to see the threads. K.
jimdempseyatthecove
Honored Contributor III
Kay,

On an unloaded system, run TOP in one console window (with appropriate report options) to see what your background load is. Let TOP run continuously, updating every 5 seconds or at an interval you select via an option.

Next, while TOP is running, open a second console window (observe any changes in TOP). Then launch your 4-thread OpenMP app once with nesting off, and a second time with nesting on. Do this for one run of the application each time.

Note, if you use a script that performs
(pseudo script)

loop:
yourApp
goto loop

Then the accumulated PIDs and PPIDs may change, even though the process name "yourApp" will not.
Depending on how you did your scripting, you might be launching "yourApp" multiple times before the first instance completes. The column to the near right, which I assume was run time, seems to indicate varying degrees of load (as if multiple copies of "yourApp" were running, as opposed to the individual threads within one instance of "yourApp").

Also, your TOP report did not include column headings, so the data you presented is left open to interpretation.

Jim Dempsey
poodlediagram
Beginner
Hi Jim,

There is no script. Just one executable, run once.

If, with nesting, I run top I get
[bash]  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND                                                             
11435 dewhurst  20   0  365m 226m 2452 R  394  2.9   0:53.31 elk[/bash]
In other words, one task running at 394 %CPU.


If, with nesting, I run top -H I get

[bash]11527 dewhurst  20   0  587m 239m 2584 R   17  3.1   0:00.70 elk                                                                       
11552 dewhurst  20   0  587m 239m 2584 R   17  3.1   0:02.74 elk                                                                       
11505 dewhurst  20   0  587m 239m 2584 R   16  3.1   0:07.08 elk                                                                       
11533 dewhurst  20   0  587m 239m 2584 R   16  3.1   0:02.74 elk                                                                       
11540 dewhurst  20   0  587m 239m 2584 R   16  3.1   0:03.06 elk                                                                       
11541 dewhurst  20   0  587m 239m 2584 R   16  3.1   0:02.88 elk                                                                       
11542 dewhurst  20   0  587m 239m 2584 R   16  3.1   0:02.94 elk                                                                       
11502 dewhurst  20   0  587m 239m 2584 R   15  3.1   0:21.72 elk                                                                       
11504 dewhurst  20   0  587m 239m 2584 R   15  3.1   0:07.34 elk                                                                       
11521 dewhurst  20   0  587m 239m 2584 R   15  3.1   0:00.62 elk                                                                       
11531 dewhurst  20   0  587m 239m 2584 R   15  3.1   0:02.78 elk                                                                       
11553 dewhurst  20   0  587m 239m 2584 R   15  3.1   0:02.70 elk                                                                       
11506 dewhurst  20   0  587m 239m 2584 R   15  3.1   0:06.28 elk                                                                       
11519 dewhurst  20   0  587m 239m 2584 R   15  3.1   0:01.72 elk                                                                       
11520 dewhurst  20   0  587m 239m 2584 R   15  3.1   0:00.62 elk                                                                       
11532 dewhurst  20   0  587m 239m 2584 R   15  3.1   0:02.66 elk                                                                       
11554 dewhurst  20   0  587m 239m 2584 R   15  3.1   0:02.74 elk                                                                       
11518 dewhurst  20   0  587m 239m 2584 R   14  3.1   0:01.74 elk                                                                       
11539 dewhurst  20   0  587m 239m 2584 R    8  3.1   0:01.36 elk                                                                       
11523 dewhurst  20   0  587m 239m 2584 R    7  3.1   0:02.50 elk                                                                       
11530 dewhurst  20   0  587m 239m 2584 R    7  3.1   0:00.24 elk                                                                       
11537 dewhurst  20   0  587m 239m 2584 R    7  3.1   0:01.36 elk                                                                       
11538 dewhurst  20   0  587m 239m 2584 R    7  3.1   0:01.34 elk                                                                       
11545 dewhurst  20   0  587m 239m 2584 R    7  3.1   0:01.36 elk                                                                       
11546 dewhurst  20   0  587m 239m 2584 R    7  3.1   0:01.30 elk                                                                       
11547 dewhurst  20   0  587m 239m 2584 R    7  3.1   0:01.28 elk                                                                       
11548 dewhurst  20   0  587m 239m 2584 R    7  3.1   0:01.26 elk                                                                       
11551 dewhurst  20   0  587m 239m 2584 R    7  3.1   0:01.24 elk[/bash]


If, without nesting, I run top -H I get
[bash]  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                   
11617 dewhurst  20   0  365m 193m 2452 R  100  2.5   0:22.22 elk                                                                       
11621 dewhurst  20   0  365m 193m 2452 R  100  2.5   0:06.00 elk                                                                       
11620 dewhurst  20   0  365m 193m 2452 R   98  2.5   0:07.74 elk                                                                       
11619 dewhurst  20   0  365m 193m 2452 R   97  2.5   0:08.88 elk[/bash]


Cheers,
Kay.
jimdempseyatthecove
Honored Contributor III
Thanks for running the reports.

Now I have a question:

Is your code written to use nested parallelism?

If so, then you will need to control the degree of parallelism by restricting the number of threads in the "next" parallel region (and/or the outer region(s)) as you drill down into the code; see the sketch at the end of this reply.

If your code is not written to use nested parallelism, then experiment with setting KMP_BLOCKTIME to a larger number (as an experiment, not as a practice).

Your report seems to indicate that either more threads are created due to nested levels, or thread teams are disbanded after a parallel region and a new team is created at the next region (when the code is written without nesting).

I think the OpenMP spec implies (and may require) that when nesting is enabled, and as each thread nests deeper on the first nested call, the thread created will be the same thread reused the next time the nesting traverses the same path. IOW, if you view the call/nest history as a tree, the same threads are used at each branch/leaf on each iteration. There may be an option switch to override this and make it act more like a pool (e.g. for OMP tasking). It looks like you will need to tinker with option settings or read a more informative document on this topic.
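As a rough illustration of capping the team size at each level (the 4x3 split is a made-up example for a 12-core node, not taken from Elk; inner_work is a hypothetical placeholder), the nested regions can be given explicit num_threads clauses so the product over the levels matches the core count:

[fortran]! Sketch: cap each nesting level explicitly so 4 (outer) x 3 (inner) = 12
! threads on a 12-core node.  The counts are illustrative only.
program nested_caps
  use omp_lib
  implicit none
  integer :: i, j

  call omp_set_nested(.true.)
  call omp_set_max_active_levels(2)
  call omp_set_dynamic(.false.)

!$omp parallel do num_threads(4) private(j)
  do i = 1, 8
!$omp parallel do num_threads(3)
     do j = 1, 12
        call inner_work(i, j)   ! hypothetical placeholder for the real work
     end do
!$omp end parallel do
  end do
!$omp end parallel do
end program nested_caps
[/fortran]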

Jim Dempsey