I'm trying to improve the scalability of an OpenMP application that has two main calculations which take approximately the same time and are independent of each other.
Each of these two calculations is in turn OpenMP-parallelized and assumed to run fine. So far the two parts have executed one after the other, with each of them using all the threads available to the program.
The idea now is to run both parts in parallel using omp sections, with each section taking half of the thread pool, in order to improve scalability. To do so, I have to enable nested parallelism (please correct me if that is not true), because otherwise the inner parallelization in each part will not take effect.
If I enable nested parallelism, however, some parts that were never intended to run nested start running nested, and I end up with 40 threads instead of the 8 intended. While the result is correct and the scalability improves, the application uses as much as four times the memory it uses when nested parallelism is not enabled.
Setting KMP_ALL_THREADS=8 does not help either, because then threads go to work on parts where they weren't intended, and the scalability drops abruptly.
Searching the internet, I found that the Sun compiler offers a SUNW_MP_MAX_NESTED_LEVELS variable, which limits the number of nested levels available. Ideally this would work for me: just set the maximum number of nested levels to 2, and each part should work as intended (I hope).
Is there any way to do that with the Intel compiler?
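To make the intended setup concrete, here is a minimal sketch of the two-section layout with nesting capped at two levels. It uses the standard omp_set_nested / omp_set_max_active_levels runtime calls rather than any compiler-specific variable; part_sum and the array contents are placeholders for the two real calculations, and the fallback stubs are only there so the sketch also compiles without OpenMP.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback stubs so this sketch also compiles without OpenMP */
static void omp_set_nested(int n) { (void)n; }
static void omp_set_max_active_levels(int n) { (void)n; }
#endif

/* Stand-in for one of the two independent calculations.
   The inner parallel region is the second nesting level. */
static double part_sum(const double *a, int n)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s) num_threads(4)
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}

int main(void)
{
    enum { N = 1000 };
    double a[N], b[N];
    for (int i = 0; i < N; ++i) { a[i] = 1.0; b[i] = 2.0; }

    omp_set_nested(1);            /* allow the inner regions to fork */
    omp_set_max_active_levels(2); /* ...but stop nesting below 2 levels */

    double sa = 0.0, sb = 0.0;
    /* Outer level: two independent calculations, one section each */
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        sa = part_sum(a, N);
        #pragma omp section
        sb = part_sum(b, N);
    }
    printf("%.0f %.0f\n", sa, sb);
    return 0;
}

With 8 cores this gives 2 outer threads, each forking 4 inner threads, and the max-active-levels cap keeps any further parallel regions from forking again.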
This sounds like your parallel sections are entered from within a parallel section.
Since you know you have 2 separate main processes, specify the parallel sections as using 2 threads.
For each process you will want to experiment with the number of threads in order to get the best performance. If the two main processes take the same time to compute, then setting each one's thread count to half the number of available cores might yield less context switching (assuming nothing else is running on the system). If other things are running on the system, or if the two main processes take different amounts of computation time, then limit each process to some number between half the cores and the full number of cores.
Unless your code is performing I/O, it makes little sense to use more threads than you have cores.
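A short sketch of the advice above: the outer parallel sections gets exactly 2 threads, and each section's inner region gets a tunable count that starts at half the cores. The function and section names (half_the_cores, first_calculation, second_calculation) are illustrative placeholders, not part of any API.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_num_procs(void) { return 1; } /* fallback without OpenMP */
#endif

/* Start each main process with half the cores; tune upward toward
   the full core count if the two halves are unbalanced. */
static int half_the_cores(int ncores)
{
    int t = ncores / 2;
    return t < 1 ? 1 : t;
}

static void first_calculation(int nthreads)
{
    #pragma omp parallel num_threads(nthreads)
    { /* ... first main process ... */ }
}

static void second_calculation(int nthreads)
{
    #pragma omp parallel num_threads(nthreads)
    { /* ... second main process ... */ }
}

int main(void)
{
    int inner = half_the_cores(omp_get_num_procs());

    /* Outer level: exactly two threads, one per section */
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        first_calculation(inner);
        #pragma omp section
        second_calculation(inner);
    }
    printf("inner threads per section: %d\n", inner);
    return 0;
}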
Jim Dempsey
I more or less managed to do what I wanted by downloading version 11 of the compilers and setting the maximum number of nesting levels to 2.
However, by analysing how the threads are distributed with the (very useful) "show openmp" command of idb 11, I find that threads 0, 2, 3, 4 (using 0-based numbering) are working on the first set and threads 1, 5, 6, 7 on the second. This does not give the memory locality I was hoping to enforce with the
KMP_AFFINITY=granularity=fine,compact,1,0
environment variable,
so that threads 0 and 4 would work on the first portion of the dataset (each one performing a different process), which is local to them, threads 1 and 5 on the second portion, and so on.
I guess there is no way to designate which threads should be used in any given parallel region.
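For reference, the environment described above might look like the following. The KMP_AFFINITY value is the one quoted in this post; OMP_NESTED and OMP_MAX_ACTIVE_LEVELS are the standard OpenMP equivalents of enabling nesting and capping it at two levels (exact variable support depends on the compiler version).

export OMP_NESTED=TRUE
export OMP_MAX_ACTIVE_LEVELS=2
export KMP_AFFINITY=granularity=fine,compact,1,0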
Mambru37,
If you do not specify the number of threads for the parallel sections, then the parallel sections region begins with the default number of available threads. Each section within the parallel sections in turn runs on one of those threads; in your case, one thread from the team will run each of the two sections. Any additional threads will wait at the end of the parallel sections (assuming you do not use NOWAIT).
Pseudo code for you to consider:
...
int NumberOfPackages;                  // set in main()
void* PackageMask[] = {NULL, NULL};    // one affinity mask per package
bool FirstTime[] = {TRUE, TRUE};

void DoWork0(void)
{
    if(FirstTime[0])
    {
        // Pin this section's inner team to package 0, once
        #pragma omp parallel
        {
            YourSetAffinity(PackageMask[0]);
        }
        FirstTime[0] = FALSE;
    }
    ...
}
// DoWork1 similar to above, using PackageMask[1] and FirstTime[1]

int main(...)
{
    ...
    NumberOfPackages = YourGetNumberOfPackages();
    PackageMask[0] = YourGetAffinityMaskForPackage(0);
    if(NumberOfPackages > 1)
    {
        PackageMask[1] = YourGetAffinityMaskForPackage(1);
    } else {
        PackageMask[1] = PackageMask[0]; // single package: share the mask
    }
    // Outer level: exactly two threads, one per main process
    #pragma omp parallel num_threads(2)
    {
        int ThreadNum = omp_get_thread_num();
        while(ProcessMainLoop)
        {
            if(FirstTime[ThreadNum])
            {
                YourSetAffinity(PackageMask[ThreadNum]);
            }
            DoMainIterationPreamble(ThreadNum); // e.g. read data
            #pragma omp barrier
            if(ThreadNum == 0)
            {
                DoWork0();
            } else {
                DoWork1();
            }
            #pragma omp barrier // keep the two halves in step
        }
    }
}
Jim Dempsey