I am inverting a matrix and see that when my program uses both OpenMP (OMP) and the LAPACK library, CPU usage does not reach 100%, only about 10%. I also found that OMP and the MKL library cannot reach 100% CPU usage when they are used together.
I have attached all the options I chose. Did I make a mistake somewhere?
program testinv
  implicit none
  integer i,j,N
  double precision, allocatable,dimension(:,:):: A,invA1

  N=100000
  allocate(A(N,N),invA1(N,N))

  !$omp parallel do
  do j=1,N
    do i=1,N
      A(i,j)=1d0
    enddo
  enddo
  !$omp end parallel do

  call InverseMatrixD(N, A, invA1)

contains

  subroutine InverseMatrixD(N, A, invA)
    implicit none
    integer N, IPIV(N), INFO
    double precision A(N,N), invA(N,N), WORK(N)

    invA(:,:) = A(:,:)
    call DGETRF (N, N, invA(:,:), N, IPIV(:), INFO)
    call DGETRI (N, invA(:,:), N, IPIV(:), WORK(:), N, INFO)
  end subroutine InverseMatrixD

end program testinv
There are some complicating issues for you to be aware of.
MKL has two libraries: serial/sequential and threaded/parallel.
MKL threaded/parallel internally uses OpenMP for parallelization.
Both MKL libraries are thread-safe (both can be used from threaded and non-threaded applications).
Note that this runs counter to the intuitive rule of linking the threaded library with a threaded program and the sequential library with a sequential program.
When a predominantly threaded application calls MKL from within parallel regions, the better choice is the sequential MKL library. The reason is that if the threaded MKL library is called from within a parallel region (or, really, from several different threads), it instantiates a separate OpenMP thread pool for each calling thread. For example, on a system capable of 16 hardware threads, each of the 16 application threads calling into the threaded MKL library could instantiate its own pool of 16 threads (256 threads in total), in other words gross oversubscription.
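The arithmetic behind that example can be sketched in a few lines of shell. The thread counts are the hypothetical ones from the paragraph above, and the compile line is only echoed, not run; it assumes an Intel compiler (ifx) that accepts -qopenmp and -qmkl=sequential, plus a source file named testinv.f90, so adjust it to your own toolchain.

```shell
# Hypothetical figures taken from the 16-hardware-thread example above.
APP_THREADS=16            # threads in the application's parallel region
MKL_THREADS_PER_CALL=16   # pool size threaded MKL may create per caller

# Threaded MKL called from every application thread: one pool per caller.
TOTAL_THREADS=$((APP_THREADS * MKL_THREADS_PER_CALL))
echo "threaded MKL inside the parallel region:   $TOTAL_THREADS threads"

# Sequential MKL: each call runs on the thread that made it.
SEQ_TOTAL=$((APP_THREADS * 1))
echo "sequential MKL inside the parallel region: $SEQ_TOTAL threads"

# Assumed Intel compiler option that selects the sequential MKL library.
# The command is printed only; the file name testinv.f90 is an assumption.
MKL_LINK_FLAG="-qmkl=sequential"
echo "ifx -qopenmp $MKL_LINK_FLAG testinv.f90 -o testinv"
```

With the sequential library the thread count stays at what the application itself requested, which is why it is the safer default when MKL is called from many threads at once.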
If you have a parallel application but only call MKL from the master thread, you can link in the threaded MKL library and set the environment variable KMP_BLOCKTIME=0 (or some small value you determine is best). With this setting there are still two thread pools, but the spin-wait time at the end of each parallel region (your application's and MKL's) is 0. This means that at the end of a parallel region the non-instantiating threads suspend immediately, making those hardware threads available to the other domain's parallel regions or to other processes on the system.
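A minimal sketch of that setting in a POSIX shell, assuming the program was built against the threaded MKL library and that the executable is called testinv (a hypothetical name, so the run line is left as a comment):

```shell
# Make worker threads sleep as soon as a parallel region ends instead of
# spin-waiting, so the other thread pool can use the hardware threads.
export KMP_BLOCKTIME=0
echo "KMP_BLOCKTIME=$KMP_BLOCKTIME"

# Then start the program from the same shell, for example:
#   ./testinv
```

The value 0 is the one suggested above; a small positive value may work better for some workloads, which is something to measure rather than assume.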
There are other times when you might want to specifically tune the number of threads used by the main application and by each caller into the threaded MKL library (this gets complicated).
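One simple starting point for that kind of tuning, offered as a sketch and not a recommendation, is to split the hardware threads between the two domains with the OMP_NUM_THREADS and MKL_NUM_THREADS environment variables. The 16-thread machine and the 4-by-4 split below are hypothetical, and whether MKL actually uses more than one thread when called from inside a parallel region depends on further runtime settings not covered here.

```shell
# Hypothetical machine and split; pick values that match your own system.
HW_THREADS=16
APP_THREADS=4
MKL_THREADS=$((HW_THREADS / APP_THREADS))

# Threads for the application's own OpenMP parallel regions.
export OMP_NUM_THREADS=$APP_THREADS
# Threads the threaded MKL library may use per call.
export MKL_NUM_THREADS=$MKL_THREADS

echo "application threads: $OMP_NUM_THREADS"
echo "MKL threads per call: $MKL_NUM_THREADS"
echo "worst-case total: $((OMP_NUM_THREADS * MKL_NUM_THREADS)) of $HW_THREADS hardware threads"
```

Keeping the product of the two values at or below the hardware thread count avoids the oversubscription described earlier.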