I'm trying to improve the scalability of an OpenMP application that has two main calculations which take approximately the same time and are independent of each other.
Each of these two calculations is in turn OpenMP-parallelized and assumed to run fine. So far the two parts have executed one after the other, with each of them using all the threads available to the program.
The idea now is to run both parts in parallel using omp sections, with each section taking half of the thread pool, in order to improve scalability. To do so, I have to enable nested parallelism (please correct me if that is not true), because otherwise the inner parallelization in each part will not take effect.
If I enable nested parallelism, however, some parts that were never intended to run nested start running nested, and I end up with 40 threads instead of the 8 intended. While the result is correct and the scalability improves, the application uses as much as four times the memory it uses when nested parallelism is not enabled.
Setting KMP_ALL_THREADS=8 does not help either, because then threads go to work on parts where they weren't intended, and the scalability drops abruptly.
Searching the internet, I found that the Sun compiler offers a SUNW_MP_MAX_NESTED_LEVELS variable, which limits the number of nested levels available. Ideally this would work for me: just set the maximum number of nested levels to 2, and each part should work as intended (I hope).
Is there any way to do that with the Intel compiler?
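To make the intended setup concrete, here is a minimal sketch of the two-section layout with nesting capped at two levels. It uses the standard omp_set_nested / omp_set_max_active_levels runtime calls rather than any compiler-specific variable; part_sum and the array contents are placeholders for the two real calculations, and the fallback stubs are only there so the sketch also compiles without OpenMP.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback stubs so this sketch also compiles without OpenMP */
static void omp_set_nested(int n) { (void)n; }
static void omp_set_max_active_levels(int n) { (void)n; }
#endif

/* Stand-in for one of the two independent calculations.
   The inner parallel region is the second nesting level. */
static double part_sum(const double *a, int n)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s) num_threads(4)
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}

int main(void)
{
    enum { N = 1000 };
    double a[N], b[N];
    for (int i = 0; i < N; ++i) { a[i] = 1.0; b[i] = 2.0; }

    omp_set_nested(1);            /* allow the inner regions to fork */
    omp_set_max_active_levels(2); /* ...but stop nesting below 2 levels */

    double sa = 0.0, sb = 0.0;
    /* Outer level: two independent calculations, one section each */
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        sa = part_sum(a, N);
        #pragma omp section
        sb = part_sum(b, N);
    }
    printf("%.0f %.0f\n", sa, sb);
    return 0;
}

With 8 cores this gives 2 outer threads, each forking 4 inner threads, and the max-active-levels cap keeps any further parallel regions from forking again.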
This sounds like your parallel sections are entered from within a parallel section.
Since you know you have 2 separate main processes, specify the parallel sections as using 2 threads.
For each process you will want to experiment with the number of threads in order to get the best performance. If the two main processes take the same time to compute, then setting each one's thread count to half the number of available cores might yield less context switching (assuming nothing else is running on the system). If other things are running on the system, or if the two main processes take different amounts of computation time, then limit each process to some number between half the cores and the full number of cores.
Unless your code is performing I/O, it makes little sense to use more threads than you have cores.
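A short sketch of the advice above: the outer parallel sections gets exactly 2 threads, and each section's inner region gets a tunable count that starts at half the cores. The function and section names (half_the_cores, first_calculation, second_calculation) are illustrative placeholders, not part of any API.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_num_procs(void) { return 1; } /* fallback without OpenMP */
#endif

/* Start each main process with half the cores; tune upward toward
   the full core count if the two halves are unbalanced. */
static int half_the_cores(int ncores)
{
    int t = ncores / 2;
    return t < 1 ? 1 : t;
}

static void first_calculation(int nthreads)
{
    #pragma omp parallel num_threads(nthreads)
    { /* ... first main process ... */ }
}

static void second_calculation(int nthreads)
{
    #pragma omp parallel num_threads(nthreads)
    { /* ... second main process ... */ }
}

int main(void)
{
    int inner = half_the_cores(omp_get_num_procs());

    /* Outer level: exactly two threads, one per section */
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        first_calculation(inner);
        #pragma omp section
        second_calculation(inner);
    }
    printf("inner threads per section: %d\n", inner);
    return 0;
}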
Jim Dempsey
I more or less managed to do what I wanted by downloading version 11 of the compilers and setting the maximum number of nesting levels to 2.
However, by analysing how the threads are distributed with the (very useful) "show openmp" command of idb 11, I find that threads 0, 2, 3, 4 (using 0-based numbering) are working on the first set and threads 1, 5, 6, 7 on the second. This does not give the memory locality I was hoping to enforce with the
KMP_AFFINITY=granularity=fine,compact,1,0
environment variable,
so that threads 0 and 4 would work on the first portion of the dataset (each one performing a different process), which is local to them, threads 1 and 5 on the second portion, and so on.
I guess there is no way to designate which threads should be used in any given parallel region.
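For reference, the environment described above might look like the following. The KMP_AFFINITY value is the one quoted in this post; OMP_NESTED and OMP_MAX_ACTIVE_LEVELS are the standard OpenMP equivalents of enabling nesting and capping it at two levels (exact variable support depends on the compiler version).

export OMP_NESTED=TRUE
export OMP_MAX_ACTIVE_LEVELS=2
export KMP_AFFINITY=granularity=fine,compact,1,0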
Mambru37,
If you do not specify the number of threads for the parallel sections, then the parallel sections region begins with the default number of available threads. Each section within the parallel sections in turn runs on one of those threads; in your case, one thread from the team will run each of the two sections. Any additional threads will wait at the end of the parallel sections (assuming you do not use NOWAIT).
Pseudo code for you to consider:
...
int NumberOfPackages;                  // set in main()
void* PackageMask[] = {NULL, NULL};    // one affinity mask per package
bool FirstTime[] = {TRUE, TRUE};

void DoWork0(void)
{
    if(FirstTime[0])
    {
        // Pin this section's inner team to package 0, once
        #pragma omp parallel
        {
            YourSetAffinity(PackageMask[0]);
        }
        FirstTime[0] = FALSE;
    }
    ...
}
// DoWork1 similar to above, using PackageMask[1] and FirstTime[1]

int main(...)
{
    ...
    NumberOfPackages = YourGetNumberOfPackages();
    PackageMask[0] = YourGetAffinityMaskForPackage(0);
    if(NumberOfPackages > 1)
    {
        PackageMask[1] = YourGetAffinityMaskForPackage(1);
    } else {
        PackageMask[1] = PackageMask[0]; // single package: share the mask
    }
    // Outer level: exactly two threads, one per main process
    #pragma omp parallel num_threads(2)
    {
        int ThreadNum = omp_get_thread_num();
        while(ProcessMainLoop)
        {
            if(FirstTime[ThreadNum])
            {
                YourSetAffinity(PackageMask[ThreadNum]);
            }
            DoMainIterationPreamble(ThreadNum); // e.g. read data
            #pragma omp barrier
            if(ThreadNum == 0)
            {
                DoWork0();
            } else {
                DoWork1();
            }
            #pragma omp barrier // keep the two halves in step
        }
    }
}
Jim Dempsey