Beginner
213 Views

Lapack95 function doesn't run fast

I am using the MKL LAPACK95 function "HEEVR" to solve large complex eigenvalue problems. Recently I set up a new PC equipped with a 32-core Intel CPU. I could compile the program including "HEEVR" and run it smoothly on the new PC. The resource monitor showed 32 threads working, but the computing time was almost the same as on the old computer with just a 4-core CPU. For simple matrix multiplication, the new PC runs more than 10 times faster than the old one.

How should I improve the compilation so that "HEEVR" runs faster on the new PC?
The command used to compile the test program is as follows:
ifort /Qopenmp /Qmkl main.f90 /exe:testL95.exe mkl_intel_thread.lib mkl_core.lib libiomp5md.lib mkl_lapack95_lp64.lib
I tried "set MKL_NUM_THREADS=32" before running the program, but the result didn't change.

8 Replies
Moderator
197 Views

Hello @kotochan

Thank you for posting on the Intel® communities.


Based on the description in your post, it seems that this is related to "Intel® Math Kernel Library - Fortran". We have a dedicated forum for those products and questions, so we are moving this thread to the Intel® Fortran Compiler forum, where it can be answered more quickly: https://community.intel.com/t5/Intel-Fortran-Compiler/bd-p/fortran-compiler


Best regards,

Andrew G.

Intel Customer Support Technician


Moderator
113 Views

You built the code correctly, and you don't need to set the MKL_NUM_THREADS environment variable explicitly; MKL will choose the number of threads automatically.

Probably the HEEVR routine is not sufficiently threaded.

Could you set MKL_VERBOSE=1 and share the log of the heevr routine when executing the same code on both of your machines?
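
Not part of the original reply, but a minimal self-contained sketch of the kind of test being asked for, assuming MKL's lapack95 and mkl_service modules and a random Hermitian test matrix (the matrix size and program name are placeholders). Build it like the test program in the question (ifort /Qmkl plus mkl_lapack95_lp64.lib), set MKL_VERBOSE=1 in the environment, and compare the HEEVR lines MKL prints on each machine:

program heevr_verbose_check
    use lapack95,    only: heevr
    use mkl_service, only: mkl_get_max_threads
    implicit none
    integer, parameter :: n = 2000
    complex(8), allocatable :: a(:,:), z(:,:)
    real(8),    allocatable :: w(:), re(:,:), im(:,:)
    integer :: m

    allocate(a(n,n), z(n,n), w(n), re(n,n), im(n,n))
    call random_number(re)
    call random_number(im)
    a = cmplx(re, im, kind=8)
    a = 0.5d0*(a + conjg(transpose(a)))        ! make the test matrix Hermitian
    print *, 'MKL max threads:', mkl_get_max_threads()
    call heevr(a, w, z=z, il=1, iu=10, m=m)    ! same call pattern as in the question
    print *, 'eigenvalues found:', m
end program heevr_verbose_check

The NThr field in the MKL_VERBOSE output line shows how many threads MKL actually used for the call.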


100 Views

>>ifort /Qopenmp /Qmkl main.f90 /exe:testL95.exe mkl_intel_thread.lib ...

Is your program using OpenMP?
If yes
    are you calling MKL from within OpenMP parallel regions?
    If yes
        then link with the serial version of the MKL library
    else
        link with the threaded version of the MKL library
        set KMP_BLOCKTIME=0
    endif
else
    remove /Qopenmp
endif

In other words, when the application uses OpenMP and calls MKL from multiple threads, each application (OpenMP) thread should call the serial MKL library. Calling the ..._thread.lib from within a parallel region would instantiate 32 * 32 (1024) threads.
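
Not from the original reply, but a minimal sketch of the "MKL called inside an OpenMP parallel region" case, assuming the threaded mkl_intel_thread.lib from the question is still linked and the code is built with /Qopenmp. Here each application thread forces MKL down to one thread with mkl_set_num_threads_local (an MKL service function), which is one way to avoid the 32 * 32 oversubscription described above without relinking against the sequential MKL; the matrix size, problem count, and program name are placeholders:

program openmp_outer_mkl_inner
    use lapack95,    only: heevr
    use mkl_service, only: mkl_set_num_threads_local
    use omp_lib,     only: omp_get_thread_num
    implicit none
    integer, parameter :: n = 1000, nprob = 4
    complex(8), allocatable :: a(:,:,:), f(:,:), z(:,:)
    real(8),    allocatable :: w(:), re(:,:), im(:,:)
    integer :: k, m, old

    allocate(a(n,n,nprob), re(n,n), im(n,n))
    do k = 1, nprob                              ! build nprob random Hermitian matrices
        call random_number(re)
        call random_number(im)
        a(:,:,k) = cmplx(re, im, kind=8)
        a(:,:,k) = 0.5d0*(a(:,:,k) + conjg(transpose(a(:,:,k))))
    end do

    !$omp parallel do private(f, z, w, m, old)
    do k = 1, nprob
        old = mkl_set_num_threads_local(1)       ! serial MKL inside this OpenMP thread
        allocate(f(n,n), z(n,n), w(n))
        f = a(:,:,k)
        call heevr(f, w, z=z, il=1, iu=10, m=m)
        print *, 'thread', omp_get_thread_num(), 'found', m, 'eigenvalues'
        deallocate(f, z, w)
        old = mkl_set_num_threads_local(old)     ! restore the previous setting
    end do
    !$omp end parallel do
end program openmp_outer_mkl_inner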

Jim Dempsey

Beginner
85 Views

> Is your program using OpenMP?

I don't use OpenMP explicitly in my test program, other than an "omp_set_num_threads" call just before "call HEEVR".

>     remove /Qopenmp

I compared the calculation times of the test program, compiled with and without /Qopenmp, for a 5000 x 5000 problem and various numbers of threads (NT).

NT      with /Qopenmp   without /Qopenmp
 1        32.30 sec        45.75 sec
 2        23.41 sec        23.38 sec
 4        15.90 sec        17.83 sec
 8        15.27 sec        17.62 sec
16        17.68 sec        16.23 sec

The resource monitor showed a reasonable number of threads running for each NT.

Beginner
77 Views

First, I ran my test program testl95 in PowerShell without calling MKL_SET_NUM_THREADS.

clock_start = dclock()
call HEEVR(F1,WR,Z=VR,ISUPPZ=ISUPPZ,IL=1,IU=10,M=M)
clock_end = dclock()

PS C:\Users\中川克己\source\repos\testL95\testL95> set MKL_VERBOSE=1
PS C:\Users\中川克己\source\repos\testL95\testL95> ./testl95
Input matrix size ! 5000
...............................
calculation time = 18.3676567077637 [sec]

The resource monitor showed 32-35 threads running.

Secondly, I ran it with MKL_SET_NUM_THREADS.

clock_start = dclock()
call MKL_SET_NUM_THREADS(8) <<<<<< New!
call HEEVR(F1,WR,Z=VR,ISUPPZ=ISUPPZ,IL=1,IU=10,M=M)
clock_end = dclock()

PS C:\Users\中川克己\source\repos\testL95\testL95> set MKL_VERBOSE=1
PS C:\Users\中川克己\source\repos\testL95\testL95> ./testl95
Input matrix size ! 5000
...............................
calculation time = 17.6669311523438 [sec]

The resource monitor showed 8-11 threads running.
In both cases, no log was output.
Did I use "MKL_VERBOSE=1" in the wrong way?

Moderator
74 Views

Example: cheevrx.f (from mklroot\examples\lapackf\source\)

Windows, command line:

ifort /Qmkl cheevrx.f

set MKL_VERBOSE=1


Running the example, we should see the MKL version, the number of OpenMP threads, the code branch, the execution time, and many other details.

>cheevrx.exe

CHEEVR Example Program Results

MKL_VERBOSE Intel(R) MKL 2020.0 Update 4 Product build 20200917 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Win 2.60GHz cdecl intel_thread
MKL_VERBOSE CHEEVR(V,V,L,4,00007FF71AE0B000,4,00000011E495FC8C,00000011E495FC90,0,0,00000011E495FC88,0,00007FF71AE28300,00007FF71AE25E80,4,00007FF71AE28310,00007FF71AE23C80,-1,00007FF71AE25F00,-1,00007FF71AE27100,-1,0) 10.16ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:2


You may also set MKL_NUM_THREADS=1 and run this code once again, then compare the execution times.

Having this information may help us understand whether the threading of this function is inefficient.



Beginner
53 Views

<Policy Change>
Thank you for your suggestion about the parallel efficiency of "(C)HEEVR". HEEVR is a very useful function, but I have changed my policy for using it. So far I have tried to speed up a single HEEVR calculation, but in my application I have to solve several to several tens of eigenvalue problems. So I will try an OpenMP program like the following:

!$ call omp_set_num_threads(4)
!$OMP parallel default(none) shared(...) private(...)
!$OMP do
do K = 1, 4
.........................
call HEEVR(F,WR,Z=VR,ISUPPZ=ISUPPZ,IL=1,IU=10,M=M)
.........................
end do
!$OMP end do
!$OMP end parallel

Here each HEEVR uses its own input matrix F, independent of the others. I expect to solve 4 eigenvalue problems in the same time as a single one. One drawback of this method is that it consumes more memory, but my machine is equipped with 64 GB, which may be enough for a while. I will report the results of my trial.
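
(A rough memory estimate, assuming complex(8) storage at 16 bytes per element: one 5000 x 5000 matrix occupies 5000 * 5000 * 16 bytes, about 0.4 GB, so F plus a full-size VR is roughly 0.8 GB per problem, and four concurrent problems need only a few GB of the 64 GB available.)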

Beginner
18 Views

I gave up trying to make a single HEEVR run faster. My next strategy is to run several HEEVRs in parallel. I expected all HEEVR calculations to finish within the time needed for a single HEEVR. I ran the HEEVR test program for 5000 x 5000 matrices with various numbers of threads (NT).
clock_start0 = dclock()
!$ call omp_set_num_threads(NT)
!$OMP parallel do shared(NT,NS,F1) &
!$OMP private(ISUPPZ,WR,Z,F,VR,IL,IU,M,R,clock_start1,clock_end)
do K = 1, NT
   allocate(ISUPPZ(2*NS), F(NS,NS))
   F(:,:) = F1(K,:,:)
   clock_start1 = dclock()
   allocate(WR(NS), VR(NS,NS))
   call HEEVR(F, WR, Z=VR, ISUPPZ=ISUPPZ, IL=1, IU=10, M=M)
   clock_end = dclock()
   deallocate(ISUPPZ, F, WR, VR)
end do
!$OMP end parallel do
clock_end = dclock()
The calculation time of each thread and the overall time for NT threads were measured. Curiously, the results lack reproducibility: in case A one thread finishes fast and the others finish slowly, while in case B every thread finishes at the same time but very slowly. The overall times for NT threads and the time reduction ratios (= (overall time / NT) / single calculation time) are as follows.
NT    case A: overall time [sec] (ratio)    case B: overall time [sec] (ratio)
 1            32.3009 (1.00)
 2            40.8067 (0.63)
 4            55.5171 (0.43)                        72.3351 (0.56)
 8           129.9633 (0.50)                       148.3593 (0.57)
Unfortunately, using many threads speeds up the calculation by only about a factor of two. I thought each HEEVR would not interfere with the others even though HEEVR itself isn't well threaded, but that may not be the case. It cannot be a problem of HEEVR alone.
So far my new machine with 32 cores and 64 GB does not perform any more powerfully than the old one with 4 cores and 8 GB.
