Tim, I think cermi does not want to use more than 72 threads. He is concerned that he creates more than 72 threads, because he uses nested parallel regions.
Cermi, have you verified that there are indeed that many threads? The Intel compiler documentation suggests that the thread pool is limited in size. You should be able to limit it further using the environment variable OMP_NUM_THREADS or the function void omp_set_num_threads(int nthreads).
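One way to answer the "have you verified" question is simply to count the threads that reach the innermost region. The following is a minimal sketch, not cermi's code: the function name count_nested_threads and the two-level layout are assumptions made for illustration. It uses omp_set_num_threads for the outer team and a num_threads clause for the inner one, and it falls back to a serial run if the compiler is not invoked with OpenMP enabled.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: returns how many threads actually executed the
// innermost region of a two-level nest. The runtime may grant fewer
// threads than requested, so the result is at most outer * inner.
int count_nested_threads(int outer, int inner) {
    assert(outer > 0 && inner > 0);
    int total = 0;
#ifdef _OPENMP
    omp_set_max_active_levels(2);  // permit two active nesting levels
    omp_set_num_threads(outer);    // team size for the next parallel region
#endif
    #pragma omp parallel
    {
        #pragma omp parallel num_threads(inner)
        {
            // Every thread of every inner team counts itself once.
            #pragma omp atomic
            ++total;
        }
    }
    return total;
}
```

Calling count_nested_threads(2, 3) reports 6 when the runtime grants both requests, and a smaller number when nesting is disabled or the pool is capped.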
In my experience with nested OpenMP applications on Windows-based systems, the OpenMP thread pools are static pools within static pools within static pools, ...
This means that when main enters the first layer, the number of threads for the outermost layer is either specified or assumed, and that many threads are allocated/created and bound to the team member numbers (0, 1, 2, ...) of the outermost layer. Whenever you exit the outermost layer and then reenter it, the same team member numbers of that layer run on the same system threads as they did on the first run. I assume (but have not confirmed) that if the second entry into the outermost layer specifies more threads than the first entry did, new threads are allocated (i.e. threads created for the nested level on the first iteration are not used by the outermost layer; new threads are allocated instead).
On the first nest level, each thread from the team of the outermost layer can (may) specify the number of threads it wants to use for the nest level (external circumstances can deny the requested number of threads or even deny nesting). Upon entry into the new nest level, the former team member number n of the outer level becomes team member 0 of the inner nest level, and the remaining threads become team members 1, 2, ... The system threads created for this team member from the outer level, once created, remain static for this nest level and get reused upon reentry to this nest level.
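The renumbering described above (outer member n becomes inner member 0) can be checked directly. This is a minimal sketch under stated assumptions: the function name outer_thread_is_inner_member_zero is made up for illustration, and system threads are identified with std::this_thread::get_id(). It does not test whether threads are reused across reentry, only the member-0 rule.

```cpp
#include <cassert>
#include <thread>
#ifdef _OPENMP
#include <omp.h>
#endif

// Team member number of the calling thread (0 when built without OpenMP).
static int team_member_number() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// Hypothetical check: true if, in every inner team, member 0 runs on the
// same system thread as the outer team member that opened that team.
bool outer_thread_is_inner_member_zero(int outer, int inner) {
    assert(outer > 0 && inner > 0);
    int mismatches = 0;
#ifdef _OPENMP
    omp_set_max_active_levels(2);  // permit two active nesting levels
#endif
    #pragma omp parallel num_threads(outer)
    {
        // Private to each outer thread, shared by its inner team.
        const std::thread::id outer_id = std::this_thread::get_id();
        #pragma omp parallel num_threads(inner)
        {
            if (team_member_number() == 0 &&
                std::this_thread::get_id() != outer_id) {
                #pragma omp atomic
                ++mismatches;
            }
        }
    }
    return mismatches == 0;
}
```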
In cermi's situation with 72 cores (72 hardware threads), if each of his 3 nest levels specified that it take all threads, it would require 72 threads on the main level, + 71*72 on the next level, + 71*72*72 on the next, i.e. 72^3 = 373,248 threads in all. Clearly an unworkable thread explosion.
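The count can be checked with a few lines of arithmetic. This is a sketch of the counting model only, assuming every thread at every level opens a full team, that the runtime grants every request, and that the opening thread reuses itself as member 0 of the new team. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Threads newly created at a nesting level (1 = outermost), assuming each
// existing thread opens a full team and serves as its member 0.
std::uint64_t new_threads_at_level(std::uint64_t team_size, unsigned level) {
    assert(team_size >= 1 && level >= 1);
    if (level == 1) return team_size;  // outermost team, initial thread included
    std::uint64_t existing = 1;
    for (unsigned i = 1; i < level; ++i) existing *= team_size;  // team_size^(level-1)
    return existing * (team_size - 1);  // each existing thread adds team_size-1
}

// Total threads alive once all `depth` levels are active.
std::uint64_t total_threads(std::uint64_t team_size, unsigned depth) {
    std::uint64_t total = 0;
    for (unsigned level = 1; level <= depth; ++level)
        total += new_threads_at_level(team_size, level);
    return total;
}
```

Under this model, with teams of 72, the levels add 72, then 5,112 (71*72), then 368,064 (71*72*72) threads, and total_threads(72, 3) is 373,248, which is 72^3.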
When using nested levels you must pay close attention to how you distribute the number of threads at each level. Too many threads will slow you down. Cermi apparently is doing this, but his thread count is unmanageably large.
A better approach would be to use a thread tasking system such as TBB (Intel Threading Building Blocks) or the thing I am working on, QuickThread (although at this time QuickThread runs only on Windows; later revisions will support Linux).
In thread tasking systems such as TBB or QT, the 72 threads (or optionally a few more) would be created and placed into a general pool. Then, no matter how deep the nesting (task-level nesting), each level can use a subset of the total set of threads. As an example, the TBB parallel_for can specify a grain size and so indirectly restrict the number of threads used on the construct. The task-pooling nature of TBB (and QT) makes more effective use of your application threads.
I would suggest that cermi consider using TBB for his application. Although there will be considerable work involved, the payback will be worth the effort.
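To illustrate how a grain size indirectly caps the thread count, here is a rough model in plain C++ rather than real TBB code. It assumes the range is split the way TBB's blocked_range is documented to split under simple_partitioner: halve the range while its size exceeds the grain size. The names count_chunks and max_busy_threads are made up for this sketch, and TBB's default partitioner may produce fewer, larger chunks than this model predicts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Number of leaf chunks produced by halving a range of n iterations
// until each piece is no larger than `grain` (assumed splitting rule).
std::size_t count_chunks(std::size_t n, std::size_t grain) {
    assert(grain >= 1);
    if (n <= grain) return 1;  // not divisible any further
    const std::size_t left = n / 2;
    return count_chunks(left, grain) + count_chunks(n - left, grain);
}

// Upper bound on threads that can be busy on this loop at once:
// no more than the pool holds, and no more than there are chunks.
std::size_t max_busy_threads(std::size_t n, std::size_t grain,
                             std::size_t pool_size) {
    return std::min(pool_size, count_chunks(n, grain));
}
```

With 1000 iterations and a grain size of 250 the model yields 4 chunks, so at most 4 of the 72 pooled threads can work on that loop, leaving the rest free for other nesting levels. With a grain size of 1 the pool size becomes the limit.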