I have some confusion regarding how MKL executes in parallel. The problem is that after making some changes to a program, calls to DGBTRS, DGETRS, DGETRF and DGBTRF are no longer executed in parallel by MKL, even though I am using the compiler option /Qmkl:parallel.
Let me explain a little more. I have the following code structure for solving a medium-sized set of ODEs (~5000 differential equations).
SUBROUTINE ODE_SOLVER
-Allocates space, etc
-save, cleanup, etc
END SUBROUTINE ODE_SOLVER
SUBROUTINE USER_ODES
-this is where I made some changes, in particular larger vector/matrix multiplications
-Note, however, the ODE system size has not changed, so ODE_SOLVER sees the same system size;
that is, no change has occurred that ODE_SOLVER sees.
END SUBROUTINE USER_ODES
For the initial version of the program, the above LAPACK calls made within ODE_SOLVER were executed in parallel by MKL, and I got a very good speedup in execution time (across 8 cores). I then made some changes to USER_ODES, but I did not change the size of the ODE system, so ODE_SOLVER was effectively solving the same problem. However, USER_ODES did allocate larger matrices to compute the ODEs.
The problem is that, after making the changes to USER_ODES, calls to the LAPACK routines stopped executing in parallel (I only get serial execution). If I use the Intel Fortran compiler option /Qparallel, all cores become busy, but performance is terrible.
Sorry this is not much to go on. My guess is that USER_ODES is now generating multiple threads, and this prevents MKL from producing parallel threads for the LAPACK calls. Any suggestions?
Sorry, disregard the above post (I don't see a way of removing it). It turns out my changes to USER_ODES were more computationally expensive than I had thought. MKL is running in parallel; it just doesn't spend much time running. I need to optimize/parallelize my own code.
How are you checking whether the MKL functions are spawning threads? In some cases, Intel MKL functions may not create additional threads. For example, if the high-level code is threaded with Intel OpenMP and an MKL function finds that it is being called from inside an OpenMP parallel region, MKL may not create its own threads (to avoid over-threading).
In your case, it looks like the high-level code is not threaded. True? Also, /Qparallel and /Qmkl:parallel are totally different. With /Qparallel, the Intel compiler may thread some of your source code with OpenMP, while /Qmkl:parallel enables MKL's internal threading.
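One way to check the threading state directly, rather than inferring it from core activity, is to print what OpenMP and MKL report just before the LAPACK calls. The following is only a minimal sketch: the program name is made up for illustration, and it assumes you build with /Qopenmp and /Qmkl:parallel so that omp_lib and the MKL service functions mkl_get_max_threads and mkl_get_dynamic are available.

```fortran
program check_mkl_threads
  ! Hypothetical diagnostic program; in a real code these prints would go
  ! right before the DGBTRF/DGBTRS/DGETRF/DGETRS calls.
  use omp_lib, only: omp_get_max_threads, omp_in_parallel
  implicit none

  ! MKL service functions, declared explicitly here (assumption: the
  ! program is linked against MKL, e.g. via /Qmkl:parallel).
  integer, external :: mkl_get_max_threads, mkl_get_dynamic

  ! Number of threads OpenMP would use for a new parallel region.
  print '(a,i0)', 'OpenMP max threads : ', omp_get_max_threads()

  ! Upper limit on the number of threads MKL may use internally.
  print '(a,i0)', 'MKL max threads    : ', mkl_get_max_threads()

  ! 1 if MKL is allowed to reduce its thread count dynamically.
  print '(a,i0)', 'MKL dynamic (1=on) : ', mkl_get_dynamic()

  ! T here would mean the caller is already inside a parallel region.
  print '(a,l1)', 'In parallel region : ', omp_in_parallel()
end program check_mkl_threads
```

If the last line prints T at the point where the LAPACK routine is called, the call is being made from inside an OpenMP parallel region, which is the situation described above where MKL may choose not to create more threads.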
Thanks for your comment. I was crudely checking thread creation by just watching core activity. As I noted above, MKL was generating threads as expected; it is just that my modified code, which has a lot of serial execution, was taking much longer than I had anticipated. At first glance, I thought MKL was also executing serially, but that was incorrect. MKL did execute in parallel; it just did so quickly enough that I missed it at first.
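For anyone who runs into the same thing, timing the two parts separately is more reliable than watching core activity. Below is a rough sketch of that idea using only the standard system_clock intrinsic. The routines user_work and solver_work are made-up stand-ins for USER_ODES and the LAPACK calls inside ODE_SOLVER, and the loop counts are arbitrary, so treat it as an illustration rather than my actual code.

```fortran
program time_split
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer(kind=8) :: t0, t1, rate
  real(dp) :: t_user, t_solver, t_total
  real(dp) :: sink   ! accumulates results so the loops are not optimized away
  integer :: step

  t_user = 0.0_dp
  t_solver = 0.0_dp
  sink = 0.0_dp
  call system_clock(count_rate=rate)

  do step = 1, 10
    ! Time the (serial) user code, standing in for USER_ODES.
    call system_clock(t0)
    call user_work()
    call system_clock(t1)
    t_user = t_user + real(t1 - t0, dp) / real(rate, dp)

    ! Time the solver part, standing in for the LAPACK calls.
    call system_clock(t0)
    call solver_work()
    call system_clock(t1)
    t_solver = t_solver + real(t1 - t0, dp) / real(rate, dp)
  end do

  ! Guard against a zero total on very coarse timers.
  t_total = max(t_user + t_solver, tiny(1.0_dp))
  print '(a,f10.4,a)', 'user code : ', t_user, ' s'
  print '(a,f10.4,a)', 'solver    : ', t_solver, ' s'
  print '(a,f6.1,a)',  'user share: ', 100.0_dp * t_user / t_total, ' %'
  print '(a,es12.4)',  'checksum  : ', sink

contains

  subroutine user_work()
    ! Hypothetical placeholder for the expensive serial work.
    integer :: i
    do i = 1, 2000000
      sink = sink + sin(real(i, dp))
    end do
  end subroutine user_work

  subroutine solver_work()
    ! Hypothetical placeholder for the (much shorter) solver calls.
    integer :: i
    do i = 1, 200000
      sink = sink + cos(real(i, dp))
    end do
  end subroutine solver_work

end program time_split
```

If the user share comes out close to 100 %, the parallel MKL calls simply account for too little of the wall-clock time to be visible in a core activity monitor, which is what happened in my case.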