Beginner
213 Views

Lapack95 function doesn't run fast

I am using the MKL LAPACK95 function "HEEVR" to solve large complex eigenvalue problems. Recently I set up a new PC equipped with a 32-core Intel CPU. I could compile the program including "HEEVR" and run it smoothly on the new PC. The resource monitor showed 32 threads working, but the computing time was almost the same as on the old computer with just a 4-core CPU. For simple matrix multiplication, the new PC runs more than 10 times faster than the old one.

How should I improve the compilation so that "HEEVR" runs faster on the new PC?
The command used to compile the test program is as follows:
ifort /Qopenmp /Qmkl main.f90 /exe:testL95.exe mkl_intel_thread.lib mkl_core.lib libiomp5md.lib mkl_lapack95_lp64.lib
I tried "set MKL_NUM_THREADS=32" before running the program, but the result didn't change.

8 Replies
Moderator
197 Views

Hello @kotochan

Thank you for posting on the Intel® communities.


Based on the description in your post, it seems that this is related to "Intel® Math Kernel Library - Fortran". We have a dedicated forum for those products and questions, so we are moving this thread to the Intel® Fortran Compiler forum, where it can be answered more quickly: https://community.intel.com/t5/Intel-Fortran-Compiler/bd-p/fortran-compiler


Best regards,

Andrew G.

Intel Customer Support Technician


Moderator
113 Views

You built the code correctly, and you don't need to set the MKL_NUM_THREADS environment variable explicitly; MKL will choose the number of threads automatically.

Probably the HEEVR routine is not sufficiently threaded.

Could you set MKL_VERBOSE=1 and share the log of the heevr routine when executing the same code on both of your machines?
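
Not part of the original reply, but a minimal self-contained sketch of the kind of test being asked for, assuming MKL's lapack95 and mkl_service modules and a random Hermitian test matrix (the matrix size and program name are placeholders). Build it like the test program in the question (ifort /Qmkl plus mkl_lapack95_lp64.lib), set MKL_VERBOSE=1 in the environment, and compare the HEEVR lines MKL prints on each machine:

program heevr_verbose_check
    use lapack95,    only: heevr
    use mkl_service, only: mkl_get_max_threads
    implicit none
    integer, parameter :: n = 2000
    complex(8), allocatable :: a(:,:), z(:,:)
    real(8),    allocatable :: w(:), re(:,:), im(:,:)
    integer :: m

    allocate(a(n,n), z(n,n), w(n), re(n,n), im(n,n))
    call random_number(re)
    call random_number(im)
    a = cmplx(re, im, kind=8)
    a = 0.5d0*(a + conjg(transpose(a)))        ! make the test matrix Hermitian
    print *, 'MKL max threads:', mkl_get_max_threads()
    call heevr(a, w, z=z, il=1, iu=10, m=m)    ! same call pattern as in the question
    print *, 'eigenvalues found:', m
end program heevr_verbose_check

The NThr field in the MKL_VERBOSE output line shows how many threads MKL actually used for the call.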


100 Views

>>ifort /Qopenmp /Qmkl main.f90 /exe:testL95.exe mkl_intel_thread.lib ...

Is your program using OpenMP?
If yes
    are you calling MKL from within OpenMP parallel regions?
    If yes
        then link with the serial version of the MKL library
    else
        link with the threaded version of the MKL library
        set KMP_BLOCKTIME=0
    endif
else
    remove /Qopenmp
endif

In other words, when the application uses OpenMP and calls MKL from multiple threads, each application (OpenMP) thread should call the serial MKL library. Calling the ..._thread.lib from within a parallel region would instantiate 32 * 32 (1024) threads.
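
Not from the original reply, but a minimal sketch of the "MKL called inside an OpenMP parallel region" case, assuming the threaded mkl_intel_thread.lib from the question is still linked and the code is built with /Qopenmp. Here each application thread forces MKL down to one thread with mkl_set_num_threads_local (an MKL service function), which is one way to avoid the 32 * 32 oversubscription described above without relinking against the sequential MKL; the matrix size, problem count, and program name are placeholders:

program openmp_outer_mkl_inner
    use lapack95,    only: heevr
    use mkl_service, only: mkl_set_num_threads_local
    use omp_lib,     only: omp_get_thread_num
    implicit none
    integer, parameter :: n = 1000, nprob = 4
    complex(8), allocatable :: a(:,:,:), f(:,:), z(:,:)
    real(8),    allocatable :: w(:), re(:,:), im(:,:)
    integer :: k, m, old

    allocate(a(n,n,nprob), re(n,n), im(n,n))
    do k = 1, nprob                              ! build nprob random Hermitian matrices
        call random_number(re)
        call random_number(im)
        a(:,:,k) = cmplx(re, im, kind=8)
        a(:,:,k) = 0.5d0*(a(:,:,k) + conjg(transpose(a(:,:,k))))
    end do

    !$omp parallel do private(f, z, w, m, old)
    do k = 1, nprob
        old = mkl_set_num_threads_local(1)       ! serial MKL inside this OpenMP thread
        allocate(f(n,n), z(n,n), w(n))
        f = a(:,:,k)
        call heevr(f, w, z=z, il=1, iu=10, m=m)
        print *, 'thread', omp_get_thread_num(), 'found', m, 'eigenvalues'
        deallocate(f, z, w)
        old = mkl_set_num_threads_local(old)     ! restore the previous setting
    end do
    !$omp end parallel do
end program openmp_outer_mkl_inner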

Jim Dempsey

Beginner
85 Views

> Is your program using OpenMP?

I don't use OpenMP explicitly in my test program, other than an "omp_set_num_threads" call just before "call HEEVR".

>     remove /Qopenmp

I compared the calculation times of the test program, compiled with and without /Qopenmp, for a 5000 x 5000 problem and various numbers of threads (NT).

NT      with /Qopenmp   without /Qopenmp
 1        32.30 sec        45.75 sec
 2        23.41 sec        23.38 sec
 4        15.90 sec        17.83 sec
 8        15.27 sec        17.62 sec
16        17.68 sec        16.23 sec

The resource monitor showed a reasonable number of threads running for each NT.

Beginner
77 Views

First, I ran my test program testl95 in PowerShell without calling MKL_SET_NUM_THREADS.

clock_start = dclock()
call HEEVR(F1,WR,Z=VR,ISUPPZ=ISUPPZ,IL=1,IU=10,M=M)
clock_end = dclock()

PS C:\Users\中川克己\source\repos\testL95\testL95> set MKL_VERBOSE=1
PS C:\Users\中川克己\source\repos\testL95\testL95> ./testl95
Input matrix size ! 5000
...............................
calculation time = 18.3676567077637 [sec]

The resource monitor showed 32-35 threads running.

Secondly, I ran it with MKL_SET_NUM_THREADS.

clock_start = dclock()
call MKL_SET_NUM_THREADS(8) <<<<<< New!
call HEEVR(F1,WR,Z=VR,ISUPPZ=ISUPPZ,IL=1,IU=10,M=M)
clock_end = dclock()

PS C:\Users\中川克己\source\repos\testL95\testL95> set MKL_VERBOSE=1
PS C:\Users\中川克己\source\repos\testL95\testL95> ./testl95
Input matrix size ! 5000
...............................
calculation time = 17.6669311523438 [sec]

The resource monitor showed 8-11 threads running.
In both cases, no log was output.
Did I use "MKL_VERBOSE=1" in the wrong way?

Moderator
74 Views

Example: cheevrx.f (from mklroot\examples\lapackf\source\)

Windows, command line:

ifort /Qmkl cheevrx.f

set MKL_VERBOSE=1


Running the example, we should see the MKL version, the number of OpenMP threads, the code branch, the execution time, and many other details.

>cheevrx.exe

CHEEVR Example Program Results

MKL_VERBOSE Intel(R) MKL 2020.0 Update 4 Product build 20200917 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Win 2.60GHz cdecl intel_thread
MKL_VERBOSE CHEEVR(V,V,L,4,00007FF71AE0B000,4,00000011E495FC8C,00000011E495FC90,0,0,00000011E495FC88,0,00007FF71AE28300,00007FF71AE25E80,4,00007FF71AE28310,00007FF71AE23C80,-1,00007FF71AE25F00,-1,00007FF71AE27100,-1,0) 10.16ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:2


You may also set MKL_NUM_THREADS=1 and run this code once again, then compare the execution times.

Having this information may help us understand whether the threading of this function is inefficient.



Beginner
53 Views

<Policy Change>
Thank you for your suggestion about the parallel efficiency of "(C)HEEVR". HEEVR is a very useful function, but I have changed my policy for using it. So far I have tried to speed up a single HEEVR calculation, but in my application I have to solve several to several tens of eigenvalue problems. So I will try an OpenMP program like the following:

!$ call omp_set_num_threads(4)
!$OMP parallel default(none) shared(...) private(...)
!$OMP do
do K = 1, 4
.........................
call HEEVR(F,WR,Z=VR,ISUPPZ=ISUPPZ,IL=1,IU=10,M=M)
.........................
end do
!$OMP end do
!$OMP end parallel

Here each HEEVR uses its own input matrix F, independent of the others. I expect to solve 4 eigenvalue problems in the same time as a single one. One drawback of this method is that it consumes more memory, but my machine is equipped with 64 GB, which may be enough for a while. I will report the results of my trial.
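
(A rough memory estimate, assuming complex(8) storage at 16 bytes per element: one 5000 x 5000 matrix occupies 5000 * 5000 * 16 bytes, about 0.4 GB, so F plus a full-size VR is roughly 0.8 GB per problem, and four concurrent problems need only a few GB of the 64 GB available.)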

Beginner
18 Views

I gave up trying to make a single HEEVR run faster. My next strategy is to run several HEEVRs in parallel. I expected all HEEVR calculations to finish within the time needed for a single HEEVR. I ran the HEEVR test program for 5000 x 5000 matrices with various numbers of threads (NT).
clock_start0 = dclock()
!$ call omp_set_num_threads(NT)
!$OMP parallel do shared(NT,NS,F1) &
!$OMP private(ISUPPZ,WR,Z,F,VR,IL,IU,M,R,clock_start1,clock_end)
do K = 1, NT
   allocate(ISUPPZ(2*NS), F(NS,NS))
   F(:,:) = F1(K,:,:)
   clock_start1 = dclock()
   allocate(WR(NS), VR(NS,NS))
   call HEEVR(F, WR, Z=VR, ISUPPZ=ISUPPZ, IL=1, IU=10, M=M)
   clock_end = dclock()
   deallocate(ISUPPZ, F, WR, VR)
end do
!$OMP end parallel do
clock_end = dclock()
The calculation time of each thread and the overall time for NT threads were measured. Curiously, the results lack reproducibility: in case A one thread finishes fast and the others finish slowly, while in case B every thread finishes at the same time but very slowly. The overall times for NT threads and the time reduction ratios (= (overall time / NT) / single calculation time) are as follows.
NT    case A: overall time [sec] (ratio)    case B: overall time [sec] (ratio)
 1            32.3009 (1.00)
 2            40.8067 (0.63)
 4            55.5171 (0.43)                        72.3351 (0.56)
 8           129.9633 (0.50)                       148.3593 (0.57)
Unfortunately, using many threads speeds up the calculation by only about a factor of two. I thought each HEEVR would not interfere with the others even though HEEVR itself isn't well threaded, but that may not be the case. It cannot be a problem of HEEVR alone.
So far my new machine with 32 cores and 64 GB does not perform any more powerfully than the old one with 4 cores and 8 GB.
