kotochan

Beginner

11-04-2020 10:56 PM

Lapack95 function doesn't run fast.

For simple matrix multiplication, the new PC runs more than 10 times faster than the old one. How should I change the compile and link step so that "HEEVR" also runs faster on the new PC?

The command to compile the test program is as follows:

ifort /Qopenmp /Qmkl main.f90 /exe:testL95.exe mkl_intel_thread.lib mkl_core.lib libiomp5md.lib mkl_lapack95_lp64.lib

I tried "set MKL_NUM_THREADS=32" before running the program, but the result didn't change.

13 Replies

Steve_Lionel

Black Belt Retired Employee

11-05-2020 09:06 AM

Duplicate and posted in the wrong forum.

kotochan

Beginner

11-09-2020 09:24 PM

Thank you for your advice. At first I posted this message in the wrong forum, and the moderator suggested that I re-submit it to the "Intel Fortran Compiler forum", which I did. But I couldn't withdraw the first message. Does that cause any problem?

I am hoping for advice from this forum to solve my problem, and would appreciate it if you could tell me how to improve my message.

mecej4

Black Belt

11-10-2020 03:38 AM

I think that the Intel forum moderator, faced with a choice between this forum and the MKL forum, made a slight misjudgement: moving it to the MKL forum would have been more appropriate.

The BLAS95 and Lapack95 routines add interface layers in which the users' calls (often with no work arrays in argument lists and with a mix of required and optional arguments) are converted to calls to the underlying routines that do the actual work, with additional processing for allocating and de-allocating temporary variables and arrays, copying values between the two sets of arguments, error trapping, etc.

Given these circumstances, it is possible, though unlikely, that bottlenecks can arise in these conversions/*software shims*. To establish whether or not this is happening in your case, it would be helpful to prepare a version of your code in which calls to Lapack95 and BLAS95 are replaced by calls to Lapack and BLAS routines, with care being taken to reduce the number of times that work arrays are allocated and de-allocated, the amount of work done in copying into and out of array arguments to these routines, error checking, etc.

If you perform the investigation that I just now suggested, and find a significant speed-up, you could open a new topic in the MKL forum to have the issues addressed there.

Sigolaev__Yuriy

Beginner

11-11-2020 09:23 AM

This is because of Relatively Robust Representation. I advise against using Relatively Robust Representation.

kotochan

Beginner

11-11-2020 06:03 PM

Thank you for your advice! But I am a beginner. Could you explain what "Relatively Robust Representation" means?

Sigolaev__Yuriy

Beginner

11-12-2020 12:14 AM

"Relatively Robust Representation" is a new algorithm for finding eigenvectors and eigenvalues of tridiagonal matrices, which was initiated by the Russian academician Godunov.

kotochan

Beginner

11-12-2020 02:33 AM

Thank you for your detailed explanation. But today I found something regrettable!

I set up a new PC with two 16-core CPUs to speed up my research, but it is not as powerful as I expected! Am I misunderstanding the results of the following experiment? Is there any way to improve the efficiency of the parallel calculation?

<Experiment>

I measured the calculation times of 3 MKL functions.

1) HEEVR: complex eigenvalue problem for f95

2) ZHEEVR: complex eigenvalue problem for f77

3) SYEVR: real eigenvalue problem for f95

For HEEVR and SYEVR,

USE lapack95, ONLY: HEEVR(or SYEVR)

USE f95_precision, ONLY: WP => SP

are used.

The ifort commands are as follows:

1) ifort /Qopenmp /Qmkl main.f90 /exe:testL95.exe mkl_lapack95_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib /module:"%MKLROOT%"\include\intel64/lp64 -I"%MKLROOT%"\include

2) ifort /Qopenmp /Qmkl main.f90 /exe:testL95.exe mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib -I"%MKLROOT%"\include

3) ifort /Qopenmp /Qmkl main.f90 /exe:testSYEVR.exe mkl_lapack95_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib /module:"%MKLROOT%"\include\intel64/lp64 -I"%MKLROOT%"\include

The link libraries and include paths were chosen following the Intel® Math Kernel Library Link Line Advisor.

The number of threads (NT) was set in each program, like this:

call OMP_SET_NUM_THREADS(NT)

clock_start = dclock()

call HEEVR(F1,WR,Z=VR,ISUPPZ=ISUPPZ,IL=1,IU=10,M=M)

clock_end = dclock()

The calculation times for the same 5000 x 5000 eigenvalue problem are as follows:

| NT | 1) HEEVR | 2) ZHEEVR | 3) SYEVR |
|----|----------|-----------|----------|
| 1 | 31.35 sec | 84.91 sec | 19.38 sec |
| 2 | 20.44 sec | 45.56 sec | 8.42 sec |
| 4 | 16.25 sec | 41.94 sec | 5.61 sec |
| 8 | 15.50 sec | 42.01 sec | 5.87 sec |
| 16 | 16.56 sec | 42.07 sec | 6.26 sec |
| 32 | 17.71 sec | 42.22 sec | 6.92 sec |

The benefit of parallel calculation saturated around NT = 4!
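To make the saturation concrete, the timings above can be turned into speedup (time at NT = 1 divided by time at NT) and parallel efficiency (speedup divided by NT). The following is a minimal Python sketch; the dictionary layout and the function names `speedup`, `efficiency`, and `best_thread_count` are illustrative choices and not part of the original test programs. The numbers are copied from the measurements above.

```python
# Timings in seconds, copied from the measurements above (5000 x 5000 problem).
# Keys of the inner dictionaries are the thread counts NT.
TIMINGS = {
    "HEEVR":  {1: 31.35, 2: 20.44, 4: 16.25, 8: 15.50, 16: 16.56, 32: 17.71},
    "ZHEEVR": {1: 84.91, 2: 45.56, 4: 41.94, 8: 42.01, 16: 42.07, 32: 42.22},
    "SYEVR":  {1: 19.38, 2: 8.42, 4: 5.61, 8: 5.87, 16: 6.26, 32: 6.92},
}


def speedup(times):
    """Return {NT: T(1) / T(NT)} for one routine."""
    base = times[1]
    return {nt: base / t for nt, t in times.items()}


def efficiency(times):
    """Return {NT: speedup / NT}; 1.0 would be perfect scaling."""
    return {nt: s / nt for nt, s in speedup(times).items()}


def best_thread_count(times):
    """Return the NT with the shortest measured time."""
    return min(times, key=times.get)


if __name__ == "__main__":
    for name, times in TIMINGS.items():
        s = speedup(times)
        e = efficiency(times)
        print(f"{name}: fastest at NT = {best_thread_count(times)}")
        for nt in sorted(times):
            print(f"  NT = {nt:2d}  speedup = {s[nt]:5.2f}  efficiency = {e[nt]:5.2f}")
```

With these numbers the fastest runs are at NT = 8 for HEEVR and at NT = 4 for ZHEEVR and SYEVR, and none of the three routines gets beyond a speedup of about 3.5.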

<My Machine and softwares>

SuperMicro 7039A-i

CPU Intel Xeon(R) Gold 6226R 16cores x 2

Windows 10 Pro 1909

Intel Parallel Studio XE 2020 Composer for Fortran Windows

Arjen_Markus

Valued Contributor II

11-12-2020 02:47 AM

That is, unfortunately, not an uncommon phenomenon: adding more CPUs does not necessarily mean that the program will run faster. The reasons for that are diverse and so are the solutions. Common reasons:

- The processors need to use the same physical memory
- Processor 1 updates some memory and processor 2 must make sure it gets the right values from that piece of memory

The only way to solve that, if at all possible, is to carefully design the algorithms and the memory access. And that is not a trivial task.
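One common way to reason about this limit is Amdahl's law, which is not mentioned in the thread and is used here only as an illustration: if a fraction s of the run time cannot be parallelized, the speedup on n threads is at most 1 / (s + (1 - s) / n). The sketch below is in Python; the function names and the serial fraction of 0.5 are assumptions chosen for the example, not measured values. The model ignores the memory contention described above, so real timings can be worse than it predicts.

```python
def amdahl_speedup(serial_fraction, threads):
    """Upper bound on speedup under Amdahl's law.

    serial_fraction: share of the single-thread run time that cannot be
                     parallelized (0.0 to 1.0).
    threads:         number of threads (at least 1).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)


def speedup_limit(serial_fraction):
    """Speedup limit as the thread count grows without bound."""
    if serial_fraction == 0.0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    s = 0.5  # hypothetical serial fraction, for illustration only
    for n in (1, 2, 4, 8, 16, 32):
        print(f"threads = {n:2d}  speedup <= {amdahl_speedup(s, n):.2f}")
    print(f"limit for many threads: {speedup_limit(s):.2f}")
```

Under this model, if half of the work is serial, even 32 threads cannot reach a speedup of 2.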

Sigolaev__Yuriy

Beginner

11-12-2020 03:56 AM

Try "HEEVD" (more RAM required).

kotochan

Beginner

11-12-2020 05:18 PM

Thank you for your recommendation of "HEEVD". I will try it later.

Here I'd like to introduce another experiment. Yesterday, I wondered why ZHEEVR (f77) runs so much slower than HEEVR (f95). Your comment gave me a hint about the reason. The dimensions of the workspace arrays (lwork, lrwork, liwork) of ZHEEVR can be set explicitly. Yesterday I set them to the minimums (Min) that the MKL manual requires. Today I multiplied them and ran the calculation at NT (number of threads) = 32 or 16.

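ZHEEVR: NT = 32

| Workspace | Time |
|-----------|------|
| Min x 1 | 42.19 sec |
| Min x 2 | 33.26 sec |
| Min x 4 | 23.11 sec |
| Min x 6 | 19.57 sec |
| Min x 7 | didn't calculate |

ZHEEVR: NT = 16

| Workspace | Time |
|-----------|------|
| Min x 7 | 17.28 sec |
| Min x 8 | didn't calculate |

ref) HEEVR: NT = 32

17.71 sec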

The more workspace was given, the shorter the calculation time became. But with too much workspace, the program seems to skip the calculation. The best result (NT = 16, Min x 7) is comparable to the result of HEEVR. To calculate faster, more RAM seems to be required. Maybe HEEVR adjusts the workspace size automatically.

kotochan

Beginner

11-12-2020 10:45 PM

I investigated the f95 function "HEEVD" for the complex eigenvalue problem. HEEVD calculates all eigenvalues and, optionally, all eigenvectors. The ifort command is the same as that for "HEEVR", which calculates all eigenvalues and selected eigenvectors.

Their calculation times for the same 5000 x 5000 problem are listed below.

HEEVD(A): all eigenvalues only

HEEVD(B): all eigenvalues and all eigenvectors

HEEVR : all eigenvalues and the lowest 10 eigenvectors.

| NT | HEEVD(A) | HEEVD(B) | ref) HEEVR |
|----|----------|----------|------------|
| 1 | 14.20 sec | 60.09 sec | 31.35 sec |
| 2 | 9.25 sec | 29.81 sec | 20.44 sec |
| 4 | 5.01 sec | 23.16 sec | 16.25 sec |
| 8 | 4.03 sec | 22.04 sec | 15.50 sec |
| 16 | 3.68 sec | 24.45 sec | 16.56 sec |
| 32 | 3.78 sec | 25.25 sec | 17.71 sec |

The parallel scaling of HEEVD(A) is still far from linear, but it is better than that of HEEVR. In my application, only the lowest several tens of eigenvectors are required. If selected eigenvectors can be calculated efficiently from the eigenvalues already obtained, HEEVD could be an alternative for my application.
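A rough way to quantify this comparison is to estimate, from the measured times, what fraction of each run behaves as if it were serial. The Python sketch below uses the Karp-Flatt metric, e = (1/S - 1/p) / (1 - 1/p), where S is the measured speedup on p threads. This metric is not used in the thread and is introduced here only as an illustration, and the function name is hypothetical. The timings are the NT = 1 and NT = 16 values from the table above.

```python
def serial_fraction_estimate(t1, tp, p):
    """Karp-Flatt estimate of the serial fraction.

    t1: measured time on 1 thread
    tp: measured time on p threads
    p:  number of threads (must be greater than 1)
    """
    if p <= 1:
        raise ValueError("p must be greater than 1")
    if t1 <= 0 or tp <= 0:
        raise ValueError("times must be positive")
    inv_speedup = tp / t1  # this is 1 / S
    return (inv_speedup - 1.0 / p) / (1.0 - 1.0 / p)


if __name__ == "__main__":
    # (time at NT = 1, time at NT = 16) in seconds, copied from the table above
    measured = {
        "HEEVD(A)": (14.20, 3.68),
        "HEEVD(B)": (60.09, 24.45),
        "HEEVR": (31.35, 16.56),
    }
    for name, (t1, t16) in measured.items():
        e = serial_fraction_estimate(t1, t16, 16)
        print(f"{name}: estimated serial fraction at NT = 16 = {e:.2f}")
```

A smaller value means better scaling. For these measurements the estimate is about 0.21 for HEEVD(A) and about 0.50 for HEEVR, which is consistent with HEEVD(A) scaling better.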

kotochan

Beginner

11-12-2020 11:08 PM

Thank you for your comment. For simple matrix multiplication, the new PC runs 10 times faster than the old one, so I optimistically assumed that the same would hold for the eigenvalue problem. The only way you suggested may be far beyond my ability. I now deeply appreciate the difficulty of parallel calculation!

Sigolaev__Yuriy

Beginner

11-13-2020 01:57 AM

The fact is that when multiplying matrices, the processor's cache is optimally used (BLAS Level 3), so it is easy to implement an algorithm that is very well parallelized. All fast linear algebra algorithms (and diagonalization too) use matrix multiplication. But there are a lot of tricky things about diagonalization that Intel still hasn't dealt with. For example, tridiagonalization of packed symmetric matrices works for me twice as fast as the best Intel MKL algorithms.
