Lapack95 function doesn't run fast.

kotochan · ‎11-04-2020

I am using the MKL Lapack95 function "HEEVR" to solve a large complex eigenvalue problem. Recently I set up a new PC equipped with a 32-core CPU! I could compile a program that calls "HEEVR" and run it without problems on the new PC. The resource monitor shows that 32 threads are working, but the computing time is almost the same as on the old computer with just a 4-core CPU.
For simple matrix multiplication, the new PC runs more than 10 times faster than the old one. How should I change the compilation process so that "HEEVR" runs faster on the new PC?
The command used to compile the test program is as follows:
ifort /Qopenmp /Qmkl main.f90 /exe:testL95.exe mkl_intel_thread.lib mkl_core.lib libiomp5md.lib mkl_lapack95_lp64.lib
I tried "set MKL_NUM_THREADS=32" before running the program, but the result didn't change.

Steve_Lionel · ‎11-05-2020

Duplicate and posted in the wrong forum.

kotochan · ‎11-09-2020

Thank you for your advice. At first I posted this message in the wrong forum, and the moderator suggested that I re-submit it to the "Intel Fortran Compiler forum", so I followed his suggestion. But I couldn't withdraw the first message. Does that cause any problem?

I am hoping for advice from this forum to solve my problem, and I would appreciate it if you could tell me how to improve my message.

mecej4 · ‎11-10-2020

I think that the Intel forum moderator, faced with a choice between this forum and the MKL forum, made a slight misjudgement: moving it to the MKL forum would have been more appropriate.

The BLAS95 and Lapack95 routines add interface layers in which the users' calls (often with no work arrays in argument lists and with a mix of required and optional arguments) are converted to calls to the underlying routines that do the actual work, with additional processing for allocating and de-allocating temporary variables and arrays, copying values between the two sets of arguments, error trapping, etc.

Given these circumstances, it is possible, though unlikely, that bottlenecks can arise in these conversions/software shims. To establish whether or not this is happening in your case, it would be helpful to prepare a version of your code in which calls to Lapack95 and BLAS95 are replaced by calls to Lapack and BLAS routines, with care being taken to reduce the number of times that work arrays are allocated and de-allocated, the amount of work done in copying into and out of array arguments to these routines, error checking, etc.

If you perform the investigation that I just now suggested, and find a significant speed-up, you could open a new topic in the MKL forum to have the issues addressed there.

Sigolaev__Yuriy · ‎11-11-2020

This is because of Relatively Robust Representation. I advise against using Relatively Robust Representation.

kotochan · ‎11-11-2020

Thank you for your advice! But I am a beginner. Could you explain what "Relatively Robust Representation" means?

Sigolaev__Yuriy · ‎11-12-2020

"Relatively Robust Representation" - this is a new algorithm for finding eigenvectors and eigenvalues of tridiagonal matrices, which was initiated by the Russian academician Godunov.

kotochan · ‎11-12-2020

Thank you for your detailed explanation. But today I found a regrettable fact!
I prepared a new PC with two 16-core CPUs to speed up my research, but it is not as powerful as I expected! Am I misunderstanding the results of the following experiment? Is there any method to improve the efficiency of the parallel calculation?

<Experiment>
I measured the calculation times of 3 MKL functions.
1) HEEVR: complex eigenvalue problem, f95 interface
2) ZHEEVR: complex eigenvalue problem, f77 interface
3) SYEVR: real eigenvalue problem, f95 interface
For HEEVR and SYEVR,
USE lapack95, ONLY: HEEVR(or SYEVR)
USE f95_precision, ONLY: WP => SP
are used.

The ifort commands are as follows:
1) ifort /Qopenmp /Qmkl main.f90 /exe:testL95.exe mkl_lapack95_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib /module:"%MKLROOT%"\include\intel64/lp64 -I"%MKLROOT%"\include
2) ifort /Qopenmp /Qmkl main.f90 /exe:testL95.exe mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib -I"%MKLROOT%"\include
3) ifort /Qopenmp /Qmkl main.f90 /exe:testSYEVR.exe mkl_lapack95_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib /module:"%MKLROOT%"\include\intel64/lp64 -I"%MKLROOT%"\include
The link and include options were chosen according to the Intel® Math Kernel Library Link Line Advisor.

The number of threads (NT) was set in each program, like this:
call OMP_SET_NUM_THREADS(NT)
clock_start = dclock()
call HEEVR(F1,WR,Z=VR,ISUPPZ=ISUPPZ,IL=1,IU=10,M=M)
clock_end = dclock()

The calculation times for the same 5000 x 5000 eigenvalue problem are as follows:
         1) HEEVR    2) ZHEEVR   3) SYEVR
NT =  1  31.35 sec   84.91 sec   19.38 sec
NT =  2  20.44 sec   45.56 sec    8.42 sec
NT =  4  16.25 sec   41.94 sec    5.61 sec
NT =  8  15.50 sec   42.01 sec    5.87 sec
NT = 16  16.56 sec   42.07 sec    6.26 sec
NT = 32  17.71 sec   42.22 sec    6.92 sec
The benefit of parallel calculation saturated at around NT = 4!

<My machine and software>
SuperMicro 7039A-i
CPU: Intel Xeon(R) Gold 6226R, 16 cores x 2
Windows 10 Pro 1909
Intel Parallel Studio XE 2020 Composer for Fortran Windows
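
The saturation reported in the timing table above can be put into numbers as speedup (1-thread time divided by NT-thread time) and parallel efficiency (speedup divided by NT). The following is a minimal Python sketch that only post-processes the times quoted in this post; the helper name `scaling` is made up for illustration and nothing here calls MKL.

```python
# Timings in seconds copied from the table in this post; keys are thread counts (NT).
timings = {
    "HEEVR":  {1: 31.35, 2: 20.44, 4: 16.25, 8: 15.50, 16: 16.56, 32: 17.71},
    "ZHEEVR": {1: 84.91, 2: 45.56, 4: 41.94, 8: 42.01, 16: 42.07, 32: 42.22},
    "SYEVR":  {1: 19.38, 2: 8.42, 4: 5.61, 8: 5.87, 16: 6.26, 32: 6.92},
}


def scaling(times):
    """Return {NT: (speedup, efficiency)} relative to the 1-thread time.

    `scaling` is a hypothetical helper written for this illustration.
    """
    t1 = times[1]
    return {nt: (t1 / t, t1 / (t * nt)) for nt, t in sorted(times.items())}


results = {name: scaling(times) for name, times in timings.items()}

for name, rows in results.items():
    print(name)
    for nt, (speedup, efficiency) in rows.items():
        print(f"  NT={nt:2d}  speedup={speedup:5.2f}  efficiency={efficiency:6.1%}")
```

With these numbers the best speedup is about 2.0 for HEEVR and ZHEEVR and about 3.5 for SYEVR, and the HEEVR efficiency at NT = 32 drops to roughly 5.5%.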

Arjen_Markus · ‎11-12-2020

That is, unfortunately, not an uncommon phenomenon: adding more CPUs does not necessarily mean that the program will run faster. The reasons for that are diverse and so are the solutions. Common reasons:

The processors need to use the same physical memory
Processor 1 updates some memory and processor 2 must make sure it gets the right values from that piece of memory

The only way to solve that, if at all possible, is to carefully design the algorithms and the memory access. And that is not a trivial task.

kotochan · ‎11-12-2020

Thank you for your comment. For simple matrix multiplication, the new PC runs 10 times faster than the old one, so I optimistically assumed that the situation would be the same for the eigenvalue problem. The only way you suggested may be far beyond my ability. I now deeply appreciate the difficulty of parallel calculation!

Sigolaev__Yuriy · ‎11-13-2020

The fact is that when multiplying matrices, the processor's cache is optimally used (BLAS Level 3), so it is easy to implement an algorithm that is very well parallelized. All fast linear algebra algorithms (and diagonalization too) use matrix multiplication. But there are a lot of tricky things about diagonalization that Intel still hasn't dealt with. For example, tridiagonalization of packed symmetric matrices works for me twice as fast as the best Intel MKL algorithms.
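
One rough way to see the point about BLAS Level 3 and the cache is to compare arithmetic intensity, that is, floating-point operations per byte of memory traffic, for a matrix-matrix product and a matrix-vector product. The Python sketch below is an idealized estimate, not a measurement: it assumes 8-byte real elements and that every matrix or vector element crosses the memory bus exactly once. The function names are made up for illustration. The classical one-stage reduction to tridiagonal form is commonly described as spending about half of its flops in matrix-vector products, which is why this comparison is relevant to diagonalization.

```python
def gemm_intensity(n, bytes_per_element=8):
    """Flops per byte for C = A*B with n x n matrices (idealized traffic)."""
    flops = 2 * n**3                          # n^2 dot products of length n
    traffic = 3 * n**2 * bytes_per_element    # read A and B, write C once
    return flops / traffic


def gemv_intensity(n, bytes_per_element=8):
    """Flops per byte for y = A*x with an n x n matrix (idealized traffic)."""
    flops = 2 * n**2                              # n dot products of length n
    traffic = (n**2 + 2 * n) * bytes_per_element  # read A and x, write y once
    return flops / traffic


n = 5000  # matrix size used in this thread
print(f"matrix-matrix: {gemm_intensity(n):8.2f} flops/byte")
print(f"matrix-vector: {gemv_intensity(n):8.2f} flops/byte")
```

Under these assumptions the matrix-matrix product performs hundreds of flops per byte moved at n = 5000, while the matrix-vector product stays near 0.25 flops per byte at any size, so the latter tends to be limited by memory bandwidth rather than by the number of cores.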

Sigolaev__Yuriy · ‎11-12-2020

Try "HEEVD" (more RAM required).

kotochan · ‎11-12-2020

Thank you for recommending "HEEVD". I will try it later.
Here I'd like to report another experiment. Yesterday I wondered why ZHEEVR (f77) runs so much slower than HEEVR (f95). Your comment gave me a hint about the reason. The dimensions of the workspace arrays (lwork, lrwork, liwork) of ZHEEVR can be defined explicitly. Yesterday I set them to the minimums (Min) that the MKL manual requires. Today I multiplied them and ran the calculation at NT (number of threads) = 32 or 16.

ZHEEVR: NT = 32
Min x 1 42.19 sec
Min x 2 33.26 sec
Min x 4 23.11 sec
Min x 6 19.57 sec
Min x 7 didn't calculate

ZHEEVR: NT = 16
Min x 7 17.28 sec
Min x 8 didn't calculate

ref) HEEVR: NT =32
17.71 sec

The more workspace was given, the shorter the calculation time became. But with too much workspace, the program seems to skip the calculation. The best result (NT = 16, Min x 7) is equivalent to the result of HEEVR. To calculate faster, more RAM seems to be required. HEEVR may adjust the workspace size automatically.
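
To put the workspace sizes into perspective, the Python sketch below estimates how many bytes the ZHEEVR work arrays occupy at the documented LAPACK minimums (lwork >= 2n, lrwork >= 24n, liwork >= 10n) and at the multiples tried above. It assumes n = 5000 as in the earlier experiment, 16-byte double-complex elements for work, 8-byte reals for rwork, and 4-byte integers (LP64) for iwork; the function name is made up for illustration. Note that LAPACK routines, ZHEEVR included, also support a workspace query: calling with lwork = -1 (and lrwork = -1, liwork = -1) returns the recommended sizes without doing the computation, which avoids guessing a multiplier.

```python
def zheevr_min_workspace_bytes(n, factor=1):
    """Bytes used by work, rwork and iwork at `factor` times the documented minimums.

    Hypothetical helper for illustration; element sizes are assumptions
    (complex*16, real*8, 4-byte integers).
    """
    lwork = factor * max(1, 2 * n)     # complex*16 elements
    lrwork = factor * max(1, 24 * n)   # real*8 elements
    liwork = factor * max(1, 10 * n)   # 4-byte integers
    return lwork * 16 + lrwork * 8 + liwork * 4


n = 5000
matrix_bytes = n * n * 16  # one 5000 x 5000 double-complex matrix

for factor in (1, 2, 4, 6, 7):
    ws = zheevr_min_workspace_bytes(n, factor)
    print(f"Min x {factor}: {ws / 1e6:6.2f} MB "
          f"({ws / matrix_bytes:.2%} of the matrix)")
```

Under these assumptions even Min x 7 is only about 9 MB next to a 400 MB matrix, so the speedup may come less from RAM capacity than from the routine being able to use blocked code once lwork is large enough (the LAPACK documentation recommends lwork >= (NB+1)*n for best efficiency, where NB is a block size).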

kotochan · ‎11-12-2020

The f95 function "HEEVD" for the complex eigenvalue problem was investigated. HEEVD calculates all eigenvalues and, optionally, all eigenvectors. The ifort command is the same as that for "HEEVR", which calculates all eigenvalues and selected eigenvectors.
The calculation times for the same 5000 x 5000 problem are listed below.
HEEVD(A): all eigenvalues only
HEEVD(B): all eigenvalues and all eigenvectors
HEEVR: all eigenvalues and the lowest 10 eigenvectors
         HEEVD(A)    HEEVD(B)    ref) HEEVR
NT =  1  14.20 sec   60.09 sec   31.35 sec
NT =  2   9.25 sec   29.81 sec   20.44 sec
NT =  4   5.01 sec   23.16 sec   16.25 sec
NT =  8   4.03 sec   22.04 sec   15.50 sec
NT = 16   3.68 sec   24.45 sec   16.56 sec
NT = 32   3.78 sec   25.25 sec   17.71 sec

The parallel scaling of HEEVD(A) is still insufficient but better than that of HEEVR. In my application, only the lowest several tens of eigenvectors are required. If selected eigenvectors can be calculated efficiently from the eigenvalues already obtained, HEEVD could be an alternative for my application.

kotochan · ‎05-24-2021

This is an incomplete but substantial answer of my own to the problem I raised on 11-02-2020. The problem is as follows. I often solve large eigenvalue problems with the Intel MKL function "HEEVR" in my work. Last year I started to use a 32-core machine (Super Micro's 7039A-i with two Xeon Gold 6226R CPUs). But the parallel efficiency was very poor: the function ran fastest with 4-8 threads and became slower with more threads.

After some trials prompted by the kind advice in this forum, I guessed that the function requires a huge amount of memory access and that the lack of memory bandwidth limits the parallel efficiency. Originally the machine was equipped with two 32GB PC4-23400 memory modules, which were sufficient in terms of capacity. But I replaced them with twelve 8GB PC4-2933 memory modules to increase the memory bandwidth. The calculation times for the same problem before and after the memory exchange are as follows.
             32GB x 2    8GB x 12
 1 thread    47.23 sec   49.10 sec
 2 threads   25.42 sec   24.51 sec
 4 threads   17.55 sec   13.67 sec
 8 threads   16.47 sec    7.96 sec
16 threads   18.34 sec    5.65 sec
32 threads   17.03 sec    4.09 sec
I had thought that NUMA (Non-Uniform Memory Access) mode would be advantageous for parallel calculation. In fact, the function runs faster with 1-16 threads with the NUMA=enable setting but slower with 32 threads. Therefore, I now use my machine with NUMA=disable and 32 threads, which gives the fastest calculation.
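
The memory-bandwidth explanation can be checked with a back-of-the-envelope estimate. The Python sketch below computes the theoretical peak bandwidth as channels x transfer rate x 8 bytes and compares the change with the measured times from the table above. It rests on assumptions that are not stated in the post: that both module types run at DDR4-2933 speed (2933 million transfers per second), that each module sits on its own memory channel, and that the two CPUs together offer at least 12 channels. The function name is made up for illustration.

```python
BYTES_PER_TRANSFER = 8  # one 64-bit DDR4 channel moves 8 bytes per transfer


def peak_bandwidth_gb_s(channels, mega_transfers_per_s=2933):
    """Theoretical peak bandwidth in GB/s (hypothetical helper, idealized)."""
    return channels * mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9


before = peak_bandwidth_gb_s(2)    # assumption: 2 modules populate 2 channels
after = peak_bandwidth_gb_s(12)    # assumption: 12 modules populate 12 channels

# Measured times in seconds from the post: threads -> (32GB x 2, 8GB x 12).
measured = {
    1: (47.23, 49.10), 2: (25.42, 24.51), 4: (17.55, 13.67),
    8: (16.47, 7.96), 16: (18.34, 5.65), 32: (17.03, 4.09),
}
gain = {nt: old / new for nt, (old, new) in measured.items()}

print(f"peak bandwidth: {before:.1f} GB/s -> {after:.1f} GB/s")
for nt, g in gain.items():
    print(f"{nt:2d} threads: {g:.2f}x faster after the exchange")
```

Under these assumptions the peak bandwidth rises sixfold, from about 47 GB/s to about 282 GB/s. The measured gain is about 4.2x at 32 threads and essentially nothing at 1 thread, which is consistent with the run being bandwidth-bound only when many threads compete for memory.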