Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

OpenMP, MATMULs, and the MKL

Dishaw__Jim
Beginner
1,744 Views
I've been reading up on the nested parallelism rules for OpenMP and was wondering how they relate to MATMULs and MKL calls.

Specifically, will the Fortran compiler parallelize a MATMUL using OpenMP? Does it make a difference if the MATMUL is inside an OpenMP parallel block? How would the threads be allocated between the containing block and the MATMUL?

The MKL supports OpenMP, but does that apply to all the routines or to some subset (e.g. the direct sparse solvers but not the BLAS routines)? Again, how would the threads be allocated between the containing block and the MKL call?

3 Replies
TimP
Honored Contributor III

I think you've touched on some controversial subjects. To narrow it down a little:

Intel OpenMP doesn't support nested parallelism. The MKL ?GEMM routines do include OpenMP parallelism. So, if you invoke one of those MKL functions in a serial region, MKL will start up its own threads, according to the problem size and the OMP_NUM_THREADS environment variable.
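To illustrate (a sketch, assuming MKL is linked through the standard BLAS interface; the matrix size is arbitrary), a DGEMM call in a serial region looks like this, with MKL picking its own thread count from the problem size and OMP_NUM_THREADS:

```fortran
! Sketch: calling MKL DGEMM from a serial region. MKL threads the
! multiply internally; the caller does nothing OpenMP-specific.
program dgemm_serial
  implicit none
  integer, parameter :: n = 512
  double precision :: a(n,n), b(n,n), c(n,n)

  call random_number(a)
  call random_number(b)
  c = 0.0d0

  ! C = 1.0*A*B + 0.0*C  (standard BLAS dgemm signature)
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

  print *, 'c(1,1) = ', c(1,1)
end program dgemm_serial
```

Run with, e.g., OMP_NUM_THREADS=4 set in the environment to cap how many threads MKL may spawn.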

If you wanted MATMUL to invoke threaded parallelism on Linux, you could write your MATMUL in a gfortran subroutine compiled with -fexternal-blas, and link against MKL. To me, this seems more practical than putting worksharing directives around MATMUL (which you are welcome to try, if you don't raise your hopes too high). If -fexternal-blas became popular enough, after those versions of gfortran are released, you could submit a feature request to implement it in ifort. I couldn't figure out how to make gfortran work with MKL on Windows, so I'm stuck with plain BLAS calls in ifort.
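A minimal sketch of that build on Linux; the MKL library names and path are illustrative and vary by MKL version:

```
# Compile with gfortran so that large MATMULs are dispatched to an
# external BLAS, then link against MKL (library names are assumptions):
gfortran -O2 -fexternal-blas -c mymatmul.f90
gfortran mymatmul.o -L$MKLROOT/lib -lmkl_gf_lp64 -lmkl_gnu_thread \
         -lmkl_core -lgomp -lpthread -o mymatmul
```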

A lot more discussion than action appears to have occurred around the OpenMP nested parallelism specification. A possible implementation would be to parallelize the inner loop only when the number of threads doesn't already exhaust the limit specified in OMP_NUM_THREADS. I guess some would argue this won't happen often enough to be worth implementing. If it were implemented, you would want some control over thread affinity, in order to optimize performance.
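The idea of parallelizing the inner level only when threads remain can be sketched with an IF clause on the directive (a sketch only, not how any runtime actually implements nesting; the subroutine and its body are made up for illustration):

```fortran
! Sketch of the idea: thread this loop only if the encountering
! level has not already claimed all available threads.
subroutine scaled_add(a, b, n)
  use omp_lib
  implicit none
  integer, intent(in) :: n
  double precision, intent(inout) :: a(n)
  double precision, intent(in) :: b(n)
  integer :: i

  ! omp_get_num_threads() returns the current team size; if it already
  ! equals omp_get_max_threads(), the region executes serially.
!$omp parallel do if (omp_get_num_threads() < omp_get_max_threads())
  do i = 1, n
     a(i) = a(i) + 2.0d0 * b(i)
  end do
!$omp end parallel do
end subroutine scaled_add
```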

Dishaw__Jim
Beginner
Tim, thanks for your reply--your response has given me some ideas on the way I need to proceed.

> Intel OpenMP doesn't support nested parallelism.
The part that confuses me is the following, from the Intel documentation under "OpenMP nesting and binding rules":

> A PARALLEL directive dynamically inside another PARALLEL directive logically establishes a new team, which is composed of only the current thread unless nested parallelism is enabled.
Based on your comment, does that imply that nested parallelism is not enabled in Intel Fortran and cannot be enabled?

> MKL ?GEMM do include OpenMP parallelism.
Is there a listing of which MKL functions are OpenMP aware?

> If you wanted MATMUL to invoke threaded parallelism on linux...
I assume that means the Intel Fortran compiler does not use OpenMP to thread MATMUL (which is the behaviour I would prefer) when OpenMP is enabled.

> A lot more discussion than action appears to have occurred
> around the OpenMP nested parallelism specification...
I can understand why. Part of my task is to parallelize two chunks of code, and not knowing how gemv, ddiamv, and MATMUL behave when OpenMP is enabled makes the task trickier. At face value, nested parallelism seems like a good idea; however, when you consider real architectures, the payoff isn't much. Maybe when there are hundreds or thousands of cores in one compute node the payoff will be there.

TimP
Honored Contributor III

That's my understanding, that nested parallelism is not enabled within Intel OpenMP. You could call MKL DGEMM within a Windows threaded region, and still have it generate additional threads, since it would not be aware of the Windows threading. It would have no way of taking into account pre-existing threads when deciding how many threads to use. You could limit it by setting OMP_NUM_THREADS.
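You can see the behavior directly with a small probe (a sketch; whether omp_set_nested actually enables nesting is implementation-dependent, which is the point):

```fortran
! Sketch: with nesting disabled (the usual default), the inner
! PARALLEL region runs with a team of only one thread.
program nest_probe
  use omp_lib
  implicit none

  ! call omp_set_nested(.true.)  ! request nesting, if the runtime allows it
!$omp parallel num_threads(2)
  !$omp parallel num_threads(2)
    print *, 'inner team size:', omp_get_num_threads()
  !$omp end parallel
!$omp end parallel
end program nest_probe
```

If nesting is not enabled, each inner region reports a team size of 1 regardless of the num_threads clause.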

You can combine OpenMP with auto-parallel by using -Qopenmp -Qparallel. Auto-parallelization will be disabled within OpenMP parallel regions, but it might do some good for a MATMUL in a serial region. You may have to experiment with -Qpar_threshold to get much benefit from -Qparallel. It is unlikely to perform as well as MKL, even if it does go parallel, since the MATMUL code is not designed for threading.
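For reference, a sketch of that combination on the Windows ifort command line (the threshold value is arbitrary):

```
ifort -Qopenmp -Qparallel -Qpar_threshold:75 mycode.f90
```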

In my own experiments, I was able to write Fortran code which outperformed MATMUL and equalled single-thread MKL DGEMM performance, when the problem is small enough not to need cache blocking. Even though I wrote it so that it could be parallelized effectively with OpenMP, it still did not perform as well as MKL for the cases where MKL goes parallel. The MKL team expended additional effort in 8.1 and 9.0 to make matrix multiplication efficient on the more recent Intel CPUs.

The general recommendation is to use a professionally optimized threaded library for threading of matrix multiplication. Hence the attraction of the schemes of other compilers that make MATMUL invoke BLAS automatically beyond a size threshold, and use their own relatively simple code within the threshold.
