Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Parallelizing with MKL: how to do it?

Paul_Margosian
Beginner
984 Views
Using MKL 10.3, Visual Studio 2008 C++, Windows 7 Pro 64-bit, on a Dell T5500 with 8 processors and 12 GB RAM.
Using LAPACKE_zgesv and vslzConvExec with OK results, but single threaded.

Questions:
* Can these routines be multi-threaded, and if so, how do I make that happen? Link to a different library? Specify a number of threads? zgesv does fairly well as is; vslzConv is in desperate need of speeding up.
* Will this work OK using the compiler that comes with VS2008 C++ or must I use the Intel compiler?
* Does MKL contain tools to parallelize my code in a more general way? (The problem is reconstruction of several channels of data, a good candidate for parallel operation, e.g. one processor per channel.)

I have studied the documentation but wasn't able to figure this out for my specific application. I wasn't able to distinguish between what happens by default and what requires more specific manual setup.

Just being directed to some focused writeups, and maybe even a coding example, would be extremely helpful.

Paul Margosian
9 Replies
Konstantin_A_Intel
Hi Paul,
You can link MKL either for pure single-threaded use (with the mkl_sequential library) or for multi-threading (with the mkl_intel_thread library). For more details, please refer to the MKL link line advisor here:
It will let you choose any linking configuration, including the MS compiler.
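As an illustration (a sketch only; the exact library list should come from the link line advisor for your MKL version and interface choice), a statically linked, ILP64, multi-threaded configuration with the Microsoft compiler on Windows might look something like:

```shell
:: ILP64 interface requires 8-byte integers, hence /DMKL_ILP64
cl /DMKL_ILP64 myapp.cpp ^
   mkl_intel_ilp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
```

Here myapp.cpp is a placeholder for your own source file.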
By default (when the program is linked with the multi-threaded MKL), all MKL routines will use all physical cores of your system. However, you may explicitly set the number of threads using the MKL_NUM_THREADS environment variable.
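For example, to cap MKL at four threads before launching your program from a Windows command prompt (the executable name here is a placeholder):

```shell
set MKL_NUM_THREADS=4
myapp.exe
```

The same limit can also be set from code with mkl_set_num_threads(4).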
As for your last question, MKL doesn't provide a mechanism for automatic code parallelization; it provides ready-made threaded functionality that can be used in users' applications.
Regards,
Konstantin
Paul_Margosian
Beginner
Hi Konstantin,

Thanks for the advice. I used the recommended link line advisor, changed a link to mkl_intel_ilp64.lib, and added the indicated compiler option. I chose static linking.

Windows Task Manager showed all processors doing something at once... but the total execution time of this part of the work (many calls to vslzConvExec) did not change. I'm in a situation where the "DIRECT" mode (direct convolution) is slightly faster than the "FFT" mode because the convolution kernel is fairly small (no bigger than 13x5). I tried both modes: DIRECT was the same as before; FFT was a little faster than before.

I'm guessing that this means I'm limited by other overhead work. Any practical hints on what to do next?

Paul Margosian
Gennady_F_Intel
Moderator
This is the expected behaviour because the task size is very small. There is no sense in using the threaded library in such cases; moreover, using the serial version may well be better.
Andrew_Smith
New Contributor III
If one's own code is running inside an !$OMP PARALLEL region and we let the team enter an MKL function together, is MKL able to use that team instead of creating its own? I would think that would speed up the use of MKL for small tasks, since the overhead of team creation and destruction is gone.
TimP
Honored Contributor III
With the default KMP_BLOCKTIME=200, if you leave your own parallel region (which uses the same libiomp5) and enter the MKL parallel region within 0.200 sec, the thread team will be reused by MKL.
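As a sketch, if your code leaves longer gaps between its own parallel regions and the MKL calls, the block time (in milliseconds; 200 is the default) could be raised before launching the program:

```shell
set KMP_BLOCKTIME=500
```

Threads then spin-wait that long before sleeping, so the team is still alive when MKL asks for it.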
Andrew_Smith
New Contributor III
Say I have code like the one below; where is the delay handling required?

real v1(10), v2(10), m(10,10)

m = 0.0
!$OMP PARALLEL

!$OMP DO
do i = 1, 10
   v1(i) = 1.0
   m(i,i) = 1.0
end do
!$OMP END DO

call gemm(m, v1, v2)

!$OMP END PARALLEL

TimP
Honored Contributor III
You would clearly require the Intel OpenMP library for the code you have proposed, on the assumption that the gemm call resolves to MKL's dgemm. Still, I don't see any reason for including the gemm call in your own OpenMP parallel region. The default KMP_BLOCKTIME would ensure that the thread team persists beyond the end of your parallel region into the one created by MKL.
On the other hand, with a loop length of only 10, particularly with partial vectorization available, it makes no sense to use OpenMP at all, and MKL gemm would almost certainly choose a single thread, yet still not approach the performance of MATMUL. So it may not even produce a measurable loss, supposing that your code implies running the same single-threaded gemm call on several threads.
Andrew_Smith
New Contributor III
I made up the simple example on the spur of the moment just to show what program flow I was talking about. My software does many hundreds of different do loops and calls to pardiso, dgemv, and dgemm. Currently many of the larger do loops use omp parallel do. When I call stuff from MKL, my code is mostly single-threaded, and I am incurring the overhead of starting and stopping the teams. I wondered if I could make use of directives such as master and single as required so that I could make my whole code one parallel region. So from what has been said, it seems I can treat calls to MKL functions just like calls to my own functions and allow all threads to remain active.
Konstantin_A_Intel
Hi Andrew,
If you call an MKL function from inside an OMP parallel region, it will run serially by default. In order to enable MKL threading under your own parallel region, please refer to this article:
In short, you at least need to set MKL_DYNAMIC=false and OMP_NESTED=true.
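As a sketch, on Windows those two settings could be made before launching the program:

```shell
set MKL_DYNAMIC=FALSE
set OMP_NESTED=TRUE
```

The same can be done from code via mkl_set_dynamic(0) and omp_set_nested(1).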
Regards,
Konstantin