want to make threaded calls to Intel MKL routines from within an openmp parallel region

cheradenine · ‎09-29-2010

Dear All,

I think the answer to the following question is "no" but I hope that someone can give me confirmation of it.

I am calling a couple of MKL (LAPACK) routines from *within* an OpenMP parallel region. The routines are ssptrd and ssteqr (I'm trying to get the eigenvalues of a large symmetric matrix). I know from elsewhere on the Intel website that both of these routines are multi-threaded, so calling them from *outside* a parallel region and having them multi-thread would be a piece of cake.

But I'm trying to call them from *within* a very large parallel region. At the moment, in order to have a working copy of the code, I am making sure to call these routines from a single thread only (i.e. within a '!$omp master' or '!$omp single' construct). This means that I get the correct behavior, but the other (nthreads - 1) threads are basically sitting idle while this is happening.

I understand that I could - in principle - solve this issue by rewriting the code so that it temporarily ends the parallel region, calls ssptrd and ssteqr, and then reopens the parallel region. This is probably just about doable but would be a royal pain in the butt for two reasons. First, there are very many private variables that would have to be redefined from scratch (I think) every time I reopened the parallel region. Second, since I plan on calling these routines thousands, if not millions, of times during the execution of the program, there'll be a significant performance hit associated with recalculating all of these variables.

So I would like to be able to make a multi-threaded call to the MKL routines from within the parallel region. My question is can this be done? Is it possible to make a (single) call to the ssptrd and ssteqr routines from a single thread and have that thread spawn its own set of temporary threads to speed up the work while the other, original threads spin their wheels? Or is this just not possible currently?

Any ideas or help would be much appreciated. Even if the answer is "you need to close the parallel region first... etc": it would be painful to be told this, but if it's what I must do it's what I must do...

Thanks!

barragan_villanueva_ · ‎09-29-2010

Hi,

OpenMP can support nested parallelism. So please try using

the omp_set_nested function:
if the argument to omp_set_nested evaluates to true, nested parallelism is enabled

or

setting the OMP_NESTED environment variable:
if the environment variable is set to true, nested parallelism is enabled
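A minimal sketch of how one might check whether the request for nesting actually took effect. The program name and the team sizes of 2 are made up for illustration; it assumes the code is compiled with OpenMP enabled and that the runtime supports nested regions at all, which is implementation dependent.

```fortran
program check_nested
  use omp_lib                 ! OpenMP runtime interfaces
  implicit none
  integer :: inner_size

  ! Request nested parallelism before the first parallel region.
  call omp_set_nested(.true.)
  print *, 'nested enabled: ', omp_get_nested()

  inner_size = 0
  !$omp parallel num_threads(2)
  !$omp master
  ! Only the master thread opens an inner region, as in the question.
  !$omp parallel num_threads(2)
  !$omp single
  inner_size = omp_get_num_threads()
  !$omp end single
  !$omp end parallel
  !$omp end master
  !$omp end parallel

  ! A value above 1 means the inner region really got its own team.
  print *, 'inner team size: ', inner_size
end program check_nested
```

If the inner team size comes back as 1, the inner region was serialized and a threaded library called from that spot would presumably run on one thread as well.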

jimdempseyatthecove · ‎09-30-2010

Following Victor's advice about nested parallel regions will (should) permit MKL to fan out as a nested level of the thread making the MKL call.

>>At the moment, in order to have a working copy of the code, I am making sure to call these routines from a single thread only (i.e. within a '!$omp master' or '!$omp single' construct). This means that I get the correct behavior, but the other (nthreads - 1) threads are basically sitting idle while this is happening

Your concern should not be whether these (outer, app-level) threads are idle; your concern is whether the hardware threads are idle. Assume a 4-core system with Hyper-Threading, so you have 8 hardware threads. The outer layer of your app has 8 software threads. With nested parallel regions enabled, MKL may use up to 7 additional software threads (15 threads on the system).

The other issue to resolve is how to keep the software threads from burning up unnecessary CPU time. In particular your app threads on the way into MKL and the MKL additional threads on the way out of MKL.

The most efficient way is if your use of MKL can be performed in independent slices. In this setup you do not use nesting; instead, your outer-level app threads call MKL concurrently, slice by slice, with MKL running single-threaded (per calling thread). This won't work if your calls have data dependencies among themselves, or if MKL uses some global data structures that inhibit reentrancy (hard to imagine this would be so).
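The slice-by-slice idea could look roughly like the sketch below, assuming the work really does split into independent matrices. The sizes n and nmat, the random test data, and the program name are hypothetical; the sketch also assumes a LAPACK implementation such as MKL is linked in and that each call runs only on the thread that makes it.

```fortran
program slice_eigen
  implicit none
  integer, parameter :: n = 4, nmat = 8   ! hypothetical sizes
  real :: ap(n*(n+1)/2, nmat)             ! packed upper triangles, one matrix per column
  real :: d(n, nmat)                      ! eigenvalues of matrix k end up in d(:,k)
  real :: e(n-1), tau(n-1), z(1,1), work(1)
  integer :: info(nmat), k

  ! Any packed values define a symmetric matrix, so random data will do here.
  call random_number(ap)

  ! Each outer thread handles whole matrices; scratch arrays are private.
  !$omp parallel do private(e, tau, z, work) schedule(dynamic)
  do k = 1, nmat
    ! Reduce packed symmetric matrix k to tridiagonal form ...
    call ssptrd('U', n, ap(:,k), d(:,k), e, tau, info(k))
    ! ... then compute eigenvalues only ('N'), so z and work are not referenced.
    if (info(k) == 0) call ssteqr('N', n, d(:,k), e, z, 1, work, info(k))
  end do
  !$omp end parallel do

  print *, 'info per matrix: ', info
  print *, 'eigenvalues of matrix 1: ', d(:,1)
end program slice_eigen
```

The point of the design is that no thread waits on another: all outer threads stay busy inside LAPACK at the same time, so no nested teams are needed.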

If you are unable to do the above then

OldBlockTime = KMP_GET_BLOCKTIME()
call KMP_SET_BLOCKTIME(0)       ! waiting threads go to sleep immediately
!$OMP BARRIER ! if necessary
!$OMP MASTER
call MKL(...)
!$OMP END MASTER
!$OMP BARRIER ! probably required
call KMP_SET_BLOCKTIME(OldBlockTime)

Jim Dempsey

cheradenine · ‎09-30-2010

Hi Victor (and Jim),

Thanks both for your input on this - it's very helpful.

I've just tried using the "omp_set_nested" function within the code as follows:

call omp_set_nested(1)

But when I call "omp_get_nested" it always comes back with '0'.

(I've used similar calls in the past - e.g. to "mkl_set_dynamic" so I think I've got the call syntax correct).

I've also tried setting the OMP_NESTED environment variable to true but still get a '0' on return from "omp_get_nested".

I'm using the 11.0/083 release of ifort.

Any ideas where I'm going wrong?

Thanks for your patience!

jimdempseyatthecove · ‎09-30-2010

Call omp_set_nested(.true.) (once) before the first parallel region,
e.g. put it in as the first statement in PROGRAM (or main).

Nesting is implementation dependent, so check which OpenMP library and version you are linking with.

If this fails, you have one other alternative:

program foo
use omp_lib
integer :: global_omp_num_threads
...
! before 1st parallel region
call omp_set_nested(.true.)
if (.not. omp_get_nested()) then
  ! nesting not available, use Plan-B
  global_omp_num_threads = omp_get_num_procs()
  call omp_set_num_threads(global_omp_num_threads * 2)
  !$OMP PARALLEL
  ! assure parallel region not optimized out
  ! issue innocuous NOOP
  if (omp_get_thread_num() == -9999) write(*,*) 'Impossible'
  !$OMP END PARALLEL
  call omp_set_num_threads(global_omp_num_threads)
endif
...

What the above does is establish a single thread pool of 2x the number of logical processors.
It then sets the number of threads to use back to the number of logical processors.

Your code (sans MKL) will use global_omp_num_threads threads, leaving that many idle threads in the OpenMP thread pool.
When MKL is called while you have global_omp_num_threads threads busy, you will have another global_omp_num_threads available, and this is the number of threads available for team building.

You still have the issue of KMP_BLOCKTIME, as discussed in my earlier post.

Jim Dempsey

cheradenine · ‎10-04-2010

Thanks very much for all your help on this Jim - I wasn't able to get the omp_set_nested to work but I think your plan B is very doable so I'll give it a go when I'm next able to get back to the coding. Thanks again.

Konstantin_A_Intel · ‎10-19-2010

The above comments are a bit incomplete.

If you would like MKL to work in multithreaded mode within a parallel region, you should do at least the following:

1) Enable nested mode in OpenMP: omp_set_nested(1); or set OMP_NESTED=true, as Victor and Jim said.

2) Disable MKL dynamic mode: mkl_set_dynamic(0); or set MKL_DYNAMIC=false

3) Set a number of threads specifically for MKL: mkl_set_num_threads(procs);

Without item 2) MKL will run in single-threaded mode. And note that 1) must be done outside the parallel region, before the omp parallel statement.
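Put together in Fortran, the three steps might look like the sketch below. The matrix order n, the random data, and the program name are assumptions made for illustration, and the program is assumed to be built with OpenMP enabled and linked against the threaded MKL libraries; whether MKL then really fans out depends on the OpenMP runtime supporting nesting.

```fortran
program nested_mkl
  use omp_lib
  implicit none
  integer, parameter :: n = 200            ! hypothetical matrix order
  real :: ap(n*(n+1)/2), d(n), e(n-1), tau(n-1), z(1,1), work(1)
  integer :: info

  ! Random packed upper triangle standing in for the real matrix.
  call random_number(ap)

  ! 1) Enable nested OpenMP parallelism, outside any parallel region.
  call omp_set_nested(.true.)
  ! 2) Turn off MKL's dynamic adjustment of its thread count.
  call mkl_set_dynamic(0)
  ! 3) Ask MKL for a specific number of threads.
  call mkl_set_num_threads(omp_get_num_procs())

  !$omp parallel
  !$omp master
  ! One outer thread makes the calls; MKL may open a nested team inside them.
  call ssptrd('U', n, ap, d, e, tau, info)
  if (info == 0) call ssteqr('N', n, d, e, z, 1, work, info)
  !$omp end master
  !$omp barrier
  !$omp end parallel

  ! ssteqr returns the eigenvalues in ascending order in d.
  print *, 'info =', info, ' smallest eigenvalue =', d(1)
end program nested_mkl
```

While the master thread is inside MKL, the other outer threads wait at the barrier, so the blocktime setting discussed earlier in the thread still matters.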

Regards,

Konstantin

jimdempseyatthecove · ‎10-20-2010

Thanks Konstantin,

You might want to experiment with both mkl_set_dynamic(0); and mkl_set_dynamic(1);
Without experimenting, I would suspect mkl_set_dynamic(1); would be preferred when MKL is called from within a parallel region, but only where that parallel region is running with a reduced number of threads.

The particular setup would depend on:

a) if mkl is called from serial region only, or parallel region only or mixture
b) if from parallel region, then if all threads in use or some available.

I do not think there is one setting that will work best for all situations.
Some conditional code can be used during testing and profiling.

Jim Dempsey