OpenMP can support nested parallelism, so please try one of the following:
omp_set_nested routine:
If the argument to omp_set_nested evaluates to true, nested parallelism is enabled.
OMP_NESTED environment variable:
If the environment variable is set to true, nested parallelism is enabled.
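As a minimal sketch of the first option (the program name is only for illustration), this requests nesting before any parallel region and then checks what the runtime reports, since a runtime may ignore the request:

```fortran
program enable_nested
  use omp_lib
  implicit none

  ! Request nested parallelism before the first parallel region.
  call omp_set_nested(.true.)

  ! The runtime may not honor the request, so check what it reports.
  if (omp_get_nested()) then
    write(*,*) 'nested parallelism enabled'
  else
    write(*,*) 'nested parallelism NOT available in this runtime'
  end if
end program enable_nested
```

Build it with your compiler's OpenMP option. Note that OpenMP 5.0 deprecates omp_set_nested in favor of omp_set_max_active_levels, so a newer runtime may print a deprecation warning.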
>>At the moment, in order to have a working copy of the code, I am making sure to only call these routines from a single thread (i.e. by doing so within a '!$omp master' construct or '!$omp single' construct). This means that I get the correct behavior, but the other (nthreads - 1) threads are basically sitting idle while this is happening.
Your concern should not be whether these (outer app-level) threads are idle; your concern is whether the hardware threads are idle. Assume a 4-core system with HT, so you have 8 hardware threads. If the outer layer of your app has 8 software threads and nested parallel regions are enabled, MKL may use up to 7 additional software threads (15 threads on the system).
The other issue to resolve is how to keep the software threads from burning up unnecessary CPU time. This applies in particular to your app threads on the way into MKL and to MKL's additional threads on the way out of MKL.
The most efficient way is if your use of MKL can be performed in independent slices. In this setup you do not use nesting. Instead, your app's outer-level threads call MKL concurrently, slice by slice, with MKL running single-threaded (one thread per calling thread). This won't work if your calls have data dependencies among themselves, or if MKL uses some global data structures that inhibit reentrancy (hard to imagine this would be so).
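A rough sketch of the slice-by-slice approach follows. The array shapes, the slice count, and the use of dgemm as the MKL routine are assumptions made for illustration; substitute your own routine and data layout.

```fortran
program mkl_slices
  use omp_lib
  implicit none
  integer, parameter :: n = 256, nslices = 8   ! illustrative sizes
  double precision, allocatable :: a(:,:,:), b(:,:,:), c(:,:,:)
  integer :: i

  allocate(a(n,n,nslices), b(n,n,nslices), c(n,n,nslices))
  a = 1.0d0
  b = 2.0d0
  c = 0.0d0

  ! One MKL thread per calling thread; the outer loop supplies the parallelism.
  call mkl_set_num_threads(1)

  !$omp parallel do schedule(static)
  do i = 1, nslices
    ! Each slice is independent: no iteration touches another slice's data.
    call dgemm('N', 'N', n, n, n, 1.0d0, a(:,:,i), n, b(:,:,i), n, &
               0.0d0, c(:,:,i), n)
  end do
  !$omp end parallel do

  write(*,*) 'c(1,1,1) =', c(1,1,1)
end program mkl_slices
```

This needs to be linked against MKL. The design choice is that all hardware threads stay busy in your own parallel loop, so no nested teams are created at all.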
If you are unable to do the above, then adjust the block time around the MKL call:
OldBlockTime = KMP_GET_BLOCKTIME()
!$OMP BARRIER ! if necessary
!$OMP END MASTER
!$OMP BARRIER ! probably required
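The fragment above is incomplete, so here is one possible way the pieces could fit together. Everything other than the surviving lines is an assumption: the PRIVATE clause, setting the block time to 0 on every thread, the dgemm call standing in for the real MKL call, and the placement of the restore. KMP_GET_BLOCKTIME and KMP_SET_BLOCKTIME are Intel OpenMP runtime extensions, so this sketch assumes the Intel compiler and runtime.

```fortran
program blocktime_around_mkl
  use omp_lib            ! Intel's omp_lib also declares the kmp_* extensions
  implicit none
  integer, parameter :: n = 512          ! illustrative size
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: OldBlockTime

  allocate(a(n,n), b(n,n), c(n,n))
  a = 1.0d0
  b = 2.0d0
  c = 0.0d0

  call omp_set_nested(.true.)            ! nesting requested, may be ignored

  !$OMP PARALLEL PRIVATE(OldBlockTime)
  ! ... application work on all threads would go here ...

  ! Assumption: each thread saves its block time and sets it to 0 so that
  ! a thread with nothing to do sleeps right away instead of spin-waiting.
  OldBlockTime = KMP_GET_BLOCKTIME()
  call KMP_SET_BLOCKTIME(0)
  !$OMP BARRIER          ! if necessary
  !$OMP MASTER
  ! Stand-in for the real MKL call.
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  !$OMP END MASTER
  !$OMP BARRIER          ! probably required
  ! Restore the previous block time for the rest of the application.
  call KMP_SET_BLOCKTIME(OldBlockTime)
  !$OMP END PARALLEL

  write(*,*) 'c(1,1) =', c(1,1)
end program blocktime_around_mkl
```

Whether the block time needs to be changed on every thread or only on the master is something to verify against your runtime's documentation and by profiling.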
Enable nesting early, e.g. make the omp_set_nested call the first statement in PROGRAM (or main).
Nested parallelism is implementation dependent, so check which library and version you are linking with.
If this fails, you have one other alternative:
! before 1st parallel region
if (.not. omp_get_nested()) then
  ! nesting not available, use Plan-B
  global_omp_num_threads = omp_get_num_procs()
  call omp_set_num_threads(global_omp_num_threads * 2)
  !$OMP PARALLEL
  ! assure parallel region not optimized out
  ! issue innocuous NOOP
  if (omp_get_thread_num() == -9999) write(*,*) 'Impossible'
  !$OMP END PARALLEL
  ! drop back to one thread per logical processor
  call omp_set_num_threads(global_omp_num_threads)
end if
The above establishes a single thread pool of 2x the number of logical processors, then sets the number of threads to use back to the number of logical processors.
Your code (sans MKL) will use global_omp_num_threads threads, leaving that many idle threads in the OpenMP thread pool.
When MKL is called while global_omp_num_threads threads are busy, another global_omp_num_threads threads are still available, and that is the number of threads available for team building.
You still have the KMP_BLOCKTIME issue, as discussed in the prior email.
You might want to experiment with mkl_set_dynamic(0) and mkl_set_dynamic(1).
Without experimenting, I would suspect mkl_set_dynamic(1) is preferable when MKL is called from within a parallel region, provided the parallel region is running with a reduced number of threads.
The particular setup would depend on:
a) whether MKL is called from a serial region only, from a parallel region only, or from a mixture of both
b) if called from a parallel region, whether all threads are in use or some are available.
I do not think there is one setting that will work best for all situations.
Some conditional code can be used during testing and profiling.
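As a starting point for that kind of test code, a sketch like the following times the same call under both mkl_set_dynamic settings. The matrix size and the choice of dgemm are assumptions for illustration, and this version only covers the call-from-serial-region case; the same loop would need to be repeated inside a parallel region to cover the other cases listed above.

```fortran
program compare_mkl_dynamic
  use omp_lib
  implicit none
  integer, parameter :: n = 1024         ! illustrative size
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)
  double precision :: t0, t1
  integer :: flag

  allocate(a(n,n), b(n,n), c(n,n))
  a = 1.0d0
  b = 2.0d0
  c = 0.0d0

  ! Warm-up call so one-time start-up cost does not skew the first timing.
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

  do flag = 0, 1
    ! 0 disables, 1 enables dynamic adjustment of the MKL thread count.
    call mkl_set_dynamic(flag)
    t0 = omp_get_wtime()
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
    t1 = omp_get_wtime()
    write(*,*) 'mkl_set_dynamic(', flag, '):', t1 - t0, 's'
  end do
end program compare_mkl_dynamic
```

A single timing per setting is noisy, so run several repetitions and compare on the actual problem sizes before settling on one setting.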