I'm using OpenMP, the Intel C++ Compiler v9, and nested parallelism. I've read that each parallelized "for" takes some threads from the thread pool into its thread team.
I need to know whether it is possible to adjust the number of threads in the global thread pool. In my current setting (3 nested parallel fors - I know that's a lot and it could be reduced, but that's not the point; I'm thinking through multiple solutions to the problem and trying to learn some know-how :) ) a few thousand threads get created, which leads to a load of about 1000. That is quite a lot, because the machine has "just" 72 CPU cores. I found that Sun's OpenMP implementation has a variable to control the number of threads in the thread pool. Does Intel's implementation have something like this?
Thanks in advance
Tim, I think cermi does not want to use more than 72 threads. He is concerned that he creates more than 72 threads, because he uses nested parallel regions.
Cermi, have you verified that there are indeed that many threads? The documentation of the Intel compiler sounds like the thread pool is limited in size. You should be able to limit it further using the environment variable OMP_NUM_THREADS or the function void omp_set_num_threads(int nthreads).
From my use of OpenMP on Windows-based systems with nested OpenMP applications, the OpenMP thread pools are static pools within static pools within static pools, ...
This means that when the main thread enters the first layer, the number of threads for the outermost layer is either specified or assumed, and that number of threads is allocated/created and bound to the team member numbers (0, 1, 2, ...) of the outermost layer. Whenever you exit the outermost layer and then re-enter it, the same team member numbers of that outermost layer receive (run on) the same system threads as they did on the first run. I assume (but have not confirmed) that if the second entry into the outermost layer specifies more threads than the first entry did, new threads are allocated (i.e., threads created at nested levels during the first pass are not reused by the outermost layer; new ones are allocated instead).
On the first nest level, each thread from the team of the outermost layer can (may) specify the number of threads it wants to use for that nest level (external circumstances can deny the requested number of threads or even deny nesting altogether). Upon entry into the new nest level, the former team member number n of the outer level becomes team member 0 of the inner nest level, and the remaining threads become team members 1, 2, ... The system threads created for this team member from the outer level, once created, remain static for this nest level and are re-used upon re-entry into it.
In cermi's situation with 72 cores (72 hardware threads), if each of his 3 nest levels specified "take all threads", it would require 72 threads at the main level, plus 72*71 new threads at the next level, plus 72*72*71 more at the next - 72^3 = 373,248 threads in total. Clearly an unworkable thread explosion.
When using nested levels you must pay close attention to how you distribute the number of threads at each level. Too many threads will slow you down. Cermi apparently is doing this, but his thread count is unmanageably large.
A better approach would be to use a thread tasking system such as TBB (Intel Threading Building Blocks) or the one I am working on, QuickThread (although at this time QuickThread only runs on Windows; later revisions will support Linux).
In thread tasking systems such as TBB or QT, the 72 threads (or optionally a few more) would be created and placed into a general pool. Then, no matter how deep the nesting (task-level nesting), each level can utilize a subset of the total set of threads. As an example, the TBB parallel_for can specify a granularity size and thereby indirectly restrict the number of threads used for the construct. The task-pooling nature of TBB (and QT) makes more effective use of your application threads.
I would suggest that cermi consider using TBB for his application. Although there will be considerable work involved, the payback will be worth the effort.
For example, on a multi-socket, multi-core machine, the outer level might set one thread per socket, with the inner level allocating a thread to each core or logical processor. Intel MPI already provides an affinity mechanism for combining an outer MPI level with an inner OpenMP level of parallelism.
There is some documentation about the persistence of Intel OpenMP thread pools. Normally, the default persistence is large enough that they behave as Jim describes.