Hi!
I maintain the Elk code (elk.sourceforge.net) and I need some help with parallelism.
The Elk code contains nested MPI and OpenMP regions (down four levels in places), and within these are calls to LAPACK.
The code runs fine on our new Intel X5650 cluster (I've tested it with up to 240 cores running across 20 nodes each with 12 cores). The problem is that using threaded MKL together with OpenMP spawns many more threads than there are cores (with 'top -H' reporting some running at 5%), making it run more slowly than non-threaded MKL in some cases. I've tried many combinations of the MKL and OpenMP environment variables but nothing seems to work properly.
Here is the most successful combination of variables:
[bash]export OMP_NUM_THREADS=12
export OMP_NESTED=true
export OMP_MAX_ACTIVE_LEVELS=4
export OMP_DYNAMIC=true
export MKL_NUM_THREADS=12
export MKL_DYNAMIC=false[/bash]
...and here are the Fortran linker command line options:
[bash]-L/cluster/intel/mkl/lib/intel64/ /cluster/intel/mkl/lib/intel64/libmkl_solver_lp64.a -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -openmp -lpthread[/bash]
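To illustrate how the counts compound, here is a minimal standalone sketch (my illustration, not Elk code): with nesting enabled every level multiplies the team size, so a 12-thread outer region whose workers each call 12-thread MKL can ask for 144 threads on a 12-core node.
[cpp]#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1);                       /* same effect as OMP_NESTED=true */
    #pragma omp parallel num_threads(4)      /* level 1: 4 threads             */
    {
        #pragma omp parallel num_threads(4)  /* level 2: 4 x 4 = 16 threads    */
        {
            #pragma omp critical
            printf("thread %d of %d at nesting level %d\n",
                   omp_get_thread_num(), omp_get_num_threads(),
                   omp_get_level());         /* 16 lines on a 4-core box       */
        }
    }
    return 0;
}[/cpp]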
What would be ideal is if MKL creates new threads only if there are idle cores.
Is there some way of doing this?
Thanks,
Kay Dewhurst
(Max Planck Institute, Halle)
I'm a little confused by your description. If you are calling MKL from a threaded region which already uses all 12 cores on each node, and you don't want over-subscription, why ask MKL to generate additional threads?
OpenMP isn't well adapted to dynamic choice of number of threads. The Intel C++ TBB, Cilk+, and ArBB threading models aim to do that, but it's not clear that it would be an advantage in your situation.
Kay,
I agree with TimP's assessment but have an additional hint to offer to you.
You can set OpenMP to undersubscribe its threads, and you can set MKL to undersubscribe its threads as well. You may find that some degree of undersubscription of each yields better overall performance; note that the sweet spot is not necessarily where the two subscriptions sum to the number of hardware threads. You may also need to experiment with KMP_BLOCKTIME and/or OMP_WAIT_POLICY and the MKL equivalents. MKL has better runtime control over these characteristics.
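A minimal sketch of the undersubscription idea, assuming Intel's kmp_set_blocktime() extension and MKL's service functions (check your omp.h and mkl.h for these; the even split below is only a starting point for tuning):
[cpp]#include <omp.h>
#include <mkl.h>

void configure_threading(void)
{
    int ncores = omp_get_num_procs();   /* e.g. 12 on an X5650 node     */

    omp_set_num_threads(ncores / 2);    /* undersubscribe OpenMP ...    */
    mkl_set_num_threads(ncores / 2);    /* ... and MKL; tune this split */

    kmp_set_blocktime(20);              /* ms an idle worker spins before
                                           sleeping; the default is 200 */
}[/cpp]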
There are many programming "hacks" you can do to avoid the adverse interaction, but the strategy will depend on too many factors to offer sensible advice (without you providing those factors). To give you a glimpse of one of the possibilities:
Assume you do not wish to set KMP_BLOCKTIME to 0, since a zero block time adversely affects OpenMP performance at the transitions between regions.
Assume, in this particular instance, that your program's main loop performs OpenMP work and then MKL work:
[cpp]loop:
    doOpenMPStuff();
    doMKLStuff();
end loop[/cpp]
With a non-zero KMP_BLOCKTIME, all threads except the main thread will compete with the threads managed by MKL for the duration of the block time.
Under the above circumstance, the hack is:
[cpp]#include <omp.h>
#include <sched.h>                  /* sched_yield()                */

extern void doOpenMPStuff(void);    /* as in the pseudocode above   */
extern void doMKLStuff(void);

void mainLoop(void)
{
    volatile int Done;              /* must be volatile or atomic   */
    for (;;) {
        Done = 0;
        doOpenMPStuff();
        #pragma omp parallel shared(Done)
        {
            if (omp_get_thread_num() == 0) {
                doMKLStuff();       /* only the master enters MKL   */
                Done = 1;
            } else {
                while (!Done)
                    sched_yield();  /* release the time slice;
                                       Sleep(0) on Windows          */
            }
        }
    }
}[/cpp]
You can clean up the pseudo code. Done has to be volatile or atomic.
The above is only appropriate under the circumstances indicated above.
Jim Dempsey
Hi again,
Thanks for both replies.
I've been doing some more testing and the problem of oversubscription occurs even without MKL. So maybe I'm not understanding something, or something is not set up correctly.
I compiled the code with "ifort -O3 -ip -unroll -no-prec-div -openmp" and set
[bash]export OMP_NUM_THREADS=4
export OMP_NESTED=true
export OMP_DYNAMIC=true[/bash]
Then I ran it on my four-core desktop PC. At times "top -H" was giving:
[bash] 3008 dewhurst 20  0 435m 165m 2744 R 72 2.1 1:01.07 elk
 3010 dewhurst 20  0 435m 165m 2744 R 26 2.1 0:42.09 elk
 3011 dewhurst 20  0 435m 165m 2744 S 25 2.1 0:07.60 elk
 3023 dewhurst 20  0 435m 165m 2744 S 23 2.1 0:18.94 elk
 3038 dewhurst 20  0 435m 165m 2744 R 11 2.1 0:04.82 elk
 3029 dewhurst 20  0 435m 165m 2744 R  9 2.1 0:03.80 elk
 3036 dewhurst 20  0 435m 165m 2744 R  9 2.1 0:02.96 elk
 3039 dewhurst 20  0 435m 165m 2744 R  9 2.1 0:04.82 elk
 3031 dewhurst 20  0 435m 165m 2744 R  9 2.1 0:04.10 elk
 3016 dewhurst 20  0 435m 165m 2744 R  7 2.1 0:03.88 elk
 3030 dewhurst 20  0 435m 165m 2744 R  7 2.1 0:02.62 elk
 3037 dewhurst 20  0 435m 165m 2744 R  7 2.1 0:03.66 elk
 3035 dewhurst 20  0 435m 165m 2744 R  7 2.1 0:03.96 elk
 3032 dewhurst 20  0 435m 165m 2744 R  6 2.1 0:04.64 elk
 3040 dewhurst 20  0 435m 165m 2744 R  6 2.1 0:04.72 elk
 3013 dewhurst 20  0 435m 165m 2744 R  5 2.1 0:04.36 elk
 3033 dewhurst 20  0 435m 165m 2744 R  5 2.1 0:04.66 elk
 3034 dewhurst 20  0 435m 165m 2744 R  5 2.1 0:04.62 elk
 3020 dewhurst 20  0 435m 165m 2744 S  3 2.1 0:18.54 elk
 3014 dewhurst 20  0 435m 165m 2744 S  2 2.1 0:04.32 elk
 3017 dewhurst 20  0 435m 165m 2744 S  2 2.1 0:03.76 elk
 3018 dewhurst 20  0 435m 165m 2744 S  2 2.1 0:04.86 elk
 3019 dewhurst 20  0 435m 165m 2744 S  2 2.1 0:18.52 elk
 3024 dewhurst 20  0 435m 165m 2744 S  2 2.1 0:12.82 elk[/bash]
To me this seems very inefficient (even if it is standards-compliant): I would hope never to see more threads than cores being used. [On Sun systems, there is a variable, SUNW_MP_MAX_POOL_THREADS, which apparently prevents this problem.]
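For the record, OpenMP 3.0 defines OMP_THREAD_LIMIT to cap the total thread pool, and I gather Intel's runtime accepts KMP_ALL_THREADS for the same purpose (an assumption worth checking against the compiler documentation). A minimal check:
[cpp]/* Run e.g. as: OMP_THREAD_LIMIT=4 OMP_NESTED=true ./a.out */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("thread limit      = %d\n", omp_get_thread_limit());
    printf("max active levels = %d\n", omp_get_max_active_levels());
    return 0;
}[/cpp]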
I'll try the same thing with gfortran later today, but I suspect it will do the same thing.
Cheers,
Kay.
...apparently I'm not the first with this problem:
http://software.intel.com/en-us/forums/showthread.php?t=64571
Kay.
If you're submitting multiple OpenMP jobs on a single memory image, it's up to you to assign each job to a different group of cores, or to use a scheduler to sequence them. Even if you used a system-load-aware scheme like TBB, scheduling the jobs yourself should give better throughput.
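On Linux, one way to give each job its own group of cores from inside the process is sched_setaffinity(); here is a sketch under that assumption (taskset or KMP_AFFINITY achieve the same from outside the process):
[cpp]/* Sketch: restrict this process -- and the OpenMP threads it creates
 * afterwards -- to cores `first` through `last`, leaving the rest for
 * another job. Linux-specific. */
#define _GNU_SOURCE
#include <sched.h>

int pin_to_cores(int first, int last)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = first; c <= last; ++c)
        CPU_SET(c, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = this process */
}[/cpp]
For example, pin_to_cores(0, 5) for one job and pin_to_cores(6, 11) for the other on a 12-core node.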
I'm not. It's a single job viewed with 'top -H' to see the threads. K.
Kay,
On an unloaded system, run top in one console window (with appropriate report options) to see your background load. Let top run continuously, updating every 5 seconds or at whatever interval you select.
Next, while top is running, open a second console window (observe any changes in top). Then launch your 4-thread OpenMP app, once with nesting off and a second time with nesting on. Do this for one run of the application.
Note that if you use a script that performs
[bash]while true; do
    ./yourApp
done[/bash]
then the accumulated PIDs and PPIDs may change, though the process name "yourApp" will not.
Depending on how you did your scripting, you might be launching "yourApp" multiple times before the first instance completes. I assume the second column from the right was run time; it seems to indicate varying degrees of load, as if multiple copies of "yourApp" were running rather than the individual threads within one instance of "yourApp".
Also, your top report did not include column headings, so the data you presented is left open to interpretation.
Jim Dempsey
Hi Jim,
There is no script. Just one executable, run once.
If, with nesting, I run top I get
[bash]  PID USER     PR NI VIRT RES  SHR  S %CPU %MEM   TIME+ COMMAND
11435 dewhurst 20  0 365m 226m 2452 R  394  2.9 0:53.31 elk[/bash]
In other words, one task running at 394 %CPU.
If, with nesting, I run top -H I get
[bash]11527 dewhurst 20  0 587m 239m 2584 R 17 3.1 0:00.70 elk
11552 dewhurst 20  0 587m 239m 2584 R 17 3.1 0:02.74 elk
11505 dewhurst 20  0 587m 239m 2584 R 16 3.1 0:07.08 elk
11533 dewhurst 20  0 587m 239m 2584 R 16 3.1 0:02.74 elk
11540 dewhurst 20  0 587m 239m 2584 R 16 3.1 0:03.06 elk
11541 dewhurst 20  0 587m 239m 2584 R 16 3.1 0:02.88 elk
11542 dewhurst 20  0 587m 239m 2584 R 16 3.1 0:02.94 elk
11502 dewhurst 20  0 587m 239m 2584 R 15 3.1 0:21.72 elk
11504 dewhurst 20  0 587m 239m 2584 R 15 3.1 0:07.34 elk
11521 dewhurst 20  0 587m 239m 2584 R 15 3.1 0:00.62 elk
11531 dewhurst 20  0 587m 239m 2584 R 15 3.1 0:02.78 elk
11553 dewhurst 20  0 587m 239m 2584 R 15 3.1 0:02.70 elk
11506 dewhurst 20  0 587m 239m 2584 R 15 3.1 0:06.28 elk
11519 dewhurst 20  0 587m 239m 2584 R 15 3.1 0:01.72 elk
11520 dewhurst 20  0 587m 239m 2584 R 15 3.1 0:00.62 elk
11532 dewhurst 20  0 587m 239m 2584 R 15 3.1 0:02.66 elk
11554 dewhurst 20  0 587m 239m 2584 R 15 3.1 0:02.74 elk
11518 dewhurst 20  0 587m 239m 2584 R 14 3.1 0:01.74 elk
11539 dewhurst 20  0 587m 239m 2584 R  8 3.1 0:01.36 elk
11523 dewhurst 20  0 587m 239m 2584 R  7 3.1 0:02.50 elk
11530 dewhurst 20  0 587m 239m 2584 R  7 3.1 0:00.24 elk
11537 dewhurst 20  0 587m 239m 2584 R  7 3.1 0:01.36 elk
11538 dewhurst 20  0 587m 239m 2584 R  7 3.1 0:01.34 elk
11545 dewhurst 20  0 587m 239m 2584 R  7 3.1 0:01.36 elk
11546 dewhurst 20  0 587m 239m 2584 R  7 3.1 0:01.30 elk
11547 dewhurst 20  0 587m 239m 2584 R  7 3.1 0:01.28 elk
11548 dewhurst 20  0 587m 239m 2584 R  7 3.1 0:01.26 elk
11551 dewhurst 20  0 587m 239m 2584 R  7 3.1 0:01.24 elk[/bash]
If, without nesting, I run top -H I get
[bash]  PID USER     PR NI VIRT RES  SHR  S %CPU %MEM   TIME+ COMMAND
11617 dewhurst 20  0 365m 193m 2452 R  100  2.5 0:22.22 elk
11621 dewhurst 20  0 365m 193m 2452 R  100  2.5 0:06.00 elk
11620 dewhurst 20  0 365m 193m 2452 R   98  2.5 0:07.74 elk
11619 dewhurst 20  0 365m 193m 2452 R   97  2.5 0:08.88 elk[/bash]
Cheers,
Kay.
Thanks for running the reports.
Now I have a question:
Is your code written to use nested parallelism?
If so, then you will need to control the degree of parallelism by restricting the number of threads in the "next" (inner) parallel region, and/or in the outer region(s), as you drill down into the code; see the sketch below.
If your code is not written to use nested parallelism, then experiment with setting KMP_BLOCKTIME to a larger number (as an experiment, not as a practice).
Your report seems to indicate that either more threads are created due to the nested levels, or thread teams are disbanded after each parallel region and a new team is created at the next region (when the code is written without nesting).
I think the OpenMP spec implies (and may require) that when nesting is enabled, the thread created the first time a thread nests deeper is the same thread reused the next time the nesting traverses the same path. In other words, if you view the call/nest history as a tree, the same threads are used at each branch/leaf on each iteration. There may be an option switch to override this and behave more like a pool (e.g. for OMP tasking). It looks like you will need to tinker with option settings or find a more informative document on this topic.
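A minimal sketch of the first suggestion (restricting the inner teams so that outer times inner stays at or below the core count; the numbers are illustrative for a 4-core desktop, not taken from Elk):
[cpp]#include <omp.h>

void nested_work(void)
{
    omp_set_nested(1);              /* allow nesting ...          */
    omp_set_max_active_levels(2);   /* ... but only two deep      */

    #pragma omp parallel num_threads(2)      /* outer team: 2     */
    {
        #pragma omp parallel num_threads(2)  /* inner: 2 x 2 = 4  */
        {
            /* ... work ... */
        }
    }
}[/cpp]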
Jim Dempsey