If I set OMP_THREAD_LIMIT=8, in an application with one level of nested parallelism, then I receive this message:
OMP: Warning #96: Cannot form a team with 8 threads, using 1 instead.
OMP: Hint: Consider unsetting KMP_ALL_THREADS and OMP_THREAD_LIMIT (if either is set).
So there is no nested parallelism. The thing is: it can't form a team of 8 threads, but why doesn't it use 4 or 5 (the free ones), for example? Meanwhile there are threads waiting to enter a critical section (in the outer level), without doing anything.
Thanks in advance!
Use OMP_THREAD_LIMIT to set the thread-limit-var internal control variable. thread-limit-var indicates the maximum number of OpenMP threads to be used for the whole program. The function omp_get_thread_limit can be used to retrieve this value at run time. The value for OMP_THREAD_LIMIT is a positive integer. If a value is chosen that is more than the number of threads that can be supported, or is not a positive integer, the runtime will set a default value for thread-limit-var of OMP_NUM_THREADS or the number of available processors, whichever is greater. Note: if thread-limit-var is set, the default value of the nthreads-var internal control variable is equal to thread-limit-var or the number of available processors, whichever is less.
Therefore you may need to set OMP_NUM_THREADS to oversubscribe the number of threads.
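To see what the runtime actually derived from your environment, you can query those control variables at startup. This is only a minimal sketch: the struct and function names (thread_limits, query_thread_limits, print_thread_limits) are made up for illustration, and the serial fallback values are my own assumption so that it still builds when OpenMP is not enabled.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Snapshot of the limits the runtime derived from the environment. */
struct thread_limits {
    int thread_limit;  /* thread-limit-var, set by OMP_THREAD_LIMIT */
    int max_threads;   /* nthreads-var: team size the next parallel region may get */
    int num_procs;     /* processors visible to the OpenMP runtime */
};

/* Hypothetical helper: read the three values through the standard query calls. */
struct thread_limits query_thread_limits(void)
{
    struct thread_limits l = {1, 1, 1};  /* assumed serial values without OpenMP */
#ifdef _OPENMP
    l.thread_limit = omp_get_thread_limit();
    l.max_threads  = omp_get_max_threads();
    l.num_procs    = omp_get_num_procs();
#endif
    assert(l.thread_limit >= 1 && l.max_threads >= 1);
    return l;
}

/* Hypothetical helper: print the snapshot, e.g. as the first thing in main(). */
void print_thread_limits(void)
{
    struct thread_limits l = query_thread_limits();
    printf("thread limit = %d, max threads = %d, processors = %d\n",
           l.thread_limit, l.max_threads, l.num_procs);
}
```

Call print_thread_limits() once before your first parallel region and run with different OMP_THREAD_LIMIT and OMP_NUM_THREADS settings to compare. Build with your compiler's OpenMP switch (for example -fopenmp with GCC).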
Also (without seeing your program), warning #96 seems to imply that you are attempting to nest parallel regions with nested parallelism disabled.
With two threads waiting at critical section or working elsewhere and third thread creating a parallel region you've already maxed out your thread limit of 3 (set by OMP_THREAD_LIMIT). Therefore only one thread will be used by the new parallel region.
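You can check this directly by measuring the team the nested region really gets. A rough sketch, not taken from your program: nested_team_size is a hypothetical helper, and it assumes an OpenMP 3.0 or later runtime so that omp_set_max_active_levels is available.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/*
 * Open an outer region of `outer` threads, let one of them request a nested
 * team of `inner` threads, and return the size of the team it actually got.
 * The other outer threads wait at the barrier that ends the single construct,
 * so they still count against the thread limit while the inner team runs.
 */
int nested_team_size(int outer, int inner)
{
    int got = 0;
    assert(outer >= 1 && inner >= 1);
#ifdef _OPENMP
    omp_set_max_active_levels(2);  /* allow one level of nesting */
#endif
    #pragma omp parallel num_threads(outer) shared(got)
    {
        #pragma omp single
        {
            int size = 0;
            #pragma omp parallel num_threads(inner) reduction(+:size)
            size += 1;             /* each inner thread counts itself */
            got = size;
        }
    }
    return got;
}
```

By the reasoning above I would expect nested_team_size(3, 3) to come back as 1 when OMP_THREAD_LIMIT=3, and as 3 once the limit is raised to 5, but verify that on your own runtime.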
If you want only 3 hardware threads to be used by 3 software threads in parallel region level-0 and have one of those team members (software thread) create a nested parallel region with 3 threads (itself plus two additional threads) then consider:
at beginning of process (before start of OpenMP) affinitize the process to restrict it to 3 hardware threads
or, depending on the O/S, set an attribute on the executable to restrict it to 3 logical processors
Then set OMP_THREAD_LIMIT=5 or OMP_NUM_THREADS=5, and in your first (level-0) parallel region limit the team to 3 threads; do the same for the next level. Note that you are oversubscribed here. When the two threads blocked at the critical section are released, they will compete (context switch) with the two additional threads running in the nested region. Also, critical sections tend to have a short run of spinlock before thread suspension. The two additional threads may run timesliced with those in spinlock at the critical section (i.e. you may have inefficiencies). Is this what you want?
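The nested layout described above could look roughly like this. It is a sketch under my own assumptions, since I have not seen your code: run_nested_layout is a hypothetical name, the counter update stands in for whatever your critical section protects, and the sum stands in for your 3-way work. Run it with OMP_THREAD_LIMIT=5 (or larger) so the inner team can form.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns 1 + 2 + ... + n, computed by the nested team. */
long run_nested_layout(int n)
{
    long critical_hits = 0;  /* stand-in for the data guarded by the critical section */
    long nested_sum = 0;
#ifdef _OPENMP
    omp_set_max_active_levels(2);  /* allow one level of nesting */
#endif
    #pragma omp parallel num_threads(3)  /* level-0 team limited to 3 threads */
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                #pragma omp critical
                critical_hits += 1;
            }
            #pragma omp section
            {
                #pragma omp critical
                critical_hits += 1;
            }
            #pragma omp section
            {
                long s = 0;
                /* nested team: this thread plus up to two additional threads */
                #pragma omp parallel for num_threads(3) reduction(+:s)
                for (int i = 1; i <= n; i++)
                    s += i;
                nested_sum = s;
            }
        }
    }
    assert(critical_hits == 2);
    return nested_sum;
}
```

Which thread picks up which section is up to the runtime, so treat the mapping of sections to threads as illustrative only.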
An alternate route is to use the task construct, available in OpenMP 3.0 and later:
create a parallel region with 3 threads
create the three "level-0" tasks
task-0 that eventually reaches critical section
task-1 that eventually reaches critical section
task-2 that eventually spawns two additional tasks and participates in 3-way method
(or spawns three tasks and does not (directly) participate in 3-way method)
end parallel region
The thread running task-0, when complete will be available to run task generated by task-2 (assuming task not already run)
The thread running task-1, when complete will be available to run task generated by task-2 (assuming task not already run)
The thread running task-2, when complete will be available to run task generated by task-2 (assuming task not already run)
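The task outline above might be sketched like this. It uses the variant where task-2 spawns three child tasks and does not take part in the 3-way work itself. The function name run_with_tasks, the counter, and the strided sum are placeholders I made up for illustration; substitute your own work.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns 1 + 2 + ... + n, computed by three child tasks of task-2. */
long run_with_tasks(int n)
{
    long counter = 0;           /* stand-in for data guarded by the critical section */
    long part[3] = {0, 0, 0};   /* one partial result per child task */

    #pragma omp parallel num_threads(3)
    #pragma omp single
    {
        /* task-0 and task-1: each eventually reaches the critical section */
        for (int t = 0; t < 2; t++) {
            #pragma omp task shared(counter)
            {
                #pragma omp critical
                counter += 1;
            }
        }
        /* task-2: spawns the three tasks of the 3-way method */
        #pragma omp task shared(part)
        {
            for (int k = 0; k < 3; k++) {
                #pragma omp task shared(part) firstprivate(k)
                {
                    long s = 0;
                    for (int i = k + 1; i <= n; i += 3)
                        s += i;         /* every third term, starting at k + 1 */
                    part[k] = s;
                }
            }
            #pragma omp taskwait        /* wait for the three child tasks */
        }
    }   /* barrier at the end of the region: all tasks have completed */

    assert(counter == 2);
    return part[0] + part[1] + part[2];
}
```

No nested parallel region is created here, so the team stays at 3 threads and any thread that finishes its own task can pick up one of the child tasks.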