Using MKL 10.3, Visual Studio 2008 C++, Windows 7 Pro 64-bit, on a Dell T5500 with 8 processor cores and 12 GB RAM. Using LAPACKE_zgesv and vslzConvExec with correct results, but single-threaded.
Questions:
* Can these routines be multi-threaded, and if so, how do I make that happen? Link against a different library? Specify a number of threads? zgesv does fairly well as is; vslzConvExec is in desperate need of speeding up.
* Will this work with the compiler that comes with VS2008 C++, or must I use the Intel compiler?
* Does MKL contain tools to parallelize my code in a more general way? The problem is reconstruction of several channels of data, a good candidate for parallel operation (e.g. one processor per channel).
I have studied the documentation but wasn't able to figure out how it applies to my specific application. In particular, I couldn't distinguish between what happens by default and what requires explicit manual setup.
Being directed to some focused write-ups, and maybe even a coding example, would be extremely helpful.
You can choose any linking configuration, including with the Microsoft compiler.
By default (when the program is linked with the multi-threaded MKL libraries), all MKL routines will use all physical cores of your system. However, you may explicitly set the number of threads using the MKL_NUM_THREADS environment variable.
As for your last question, MKL doesn't provide a mechanism for automatic code parallelization - it provides ready-threaded functionality that can be used in users' applications.
Thanks for the advice. I used the recommended "link line advisor", changed a link to mkl_intel_ilp64.lib, and added the indicated compiler option. I chose static linking.
Windows Task Manager showed all processors doing something at once... but the total execution time of this part of the work (many calls to vslzConvExec) did not change. I'm in a situation where the "DIRECT" mode (direct convolution) is slightly faster than the "FFT" mode because the convolution kernel is fairly small (no bigger than 13x5). I tried both modes: "DIRECT" was the same as before; "FFT" was a little faster than before.
I'm guessing this means I'm limited by other overhead. Any practical hints on what to do next?
If one's own code is running inside an !$OMP PARALLEL region and we let the team enter an MKL function together, can MKL use that team instead of creating its own? I would think that would speed up MKL for small tasks, since the overhead of team creation and destruction is gone.
You would clearly require the Intel OpenMP library for the code you have proposed, assuming the dgemm call goes to MKL. That said, I don't see any reason for including the gemm call inside your own OpenMP parallel region. The default KMP_BLOCKTIME would ensure that the thread team persists beyond the end of your parallel region into the one created by MKL. On the other hand, with a loop length of only 10, particularly with partial vectorization available, it makes no sense to use OpenMP at all, and MKL gemm would almost certainly choose a single thread, yet still not approach the performance of MATMUL. So it may not even produce a measurable loss, assuming your code implies running the same single-threaded gemm call on several threads.
I made up the simple example on the spur of the moment just to show what program flow I was talking about. My software has many hundreds of different do loops and calls to pardiso, dgemv, and dgemm. Currently many of the larger do loops use omp parallel do. When I call MKL routines my code is mostly single-threaded, so I am incurring the overhead of starting and stopping the teams. I wondered if I could use directives such as master and single as required, so that I could make my whole code one parallel region. From what has been said, it seems I can treat calls to MKL functions just like calls to my own functions and allow all threads to remain active.