Solved: MKL Pardiso shows different behavior on Windows and Linux

Hassan-Ebrahimi · ‎03-17-2023

I use the multi-threaded MKL Pardiso (Intel Fortran 2022) in the structural analysis code of a commercial package.
The code for Linux and Windows is exactly the same, and the Linux machine has even better specifications than the Windows machine, but the Linux module is much slower, as the tables below show. The slow single-thread performance bothers me most.

Matrix Factorization time ( seconds)

	single-thread	2-thread	4-thread	6-thread	8-thread
Windows	294.6	169.7	103.4	87.8	88.3
Linux	1013.3	534.0	271.9	201.9	159.0

Speedup:

	single-thread	2-thread	4-thread	6-thread	8-thread
Windows	1.0	1.7	2.8	3.4	3.3
Linux	1.0	1.9	3.7	5.0	6.4

Is there any explanation for this different behavior?

The speeds get closer as the number of threads increases. This gives a higher speedup ratio on Linux, but Linux is still slower than Windows in absolute terms.
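For readers who want to check the tables, here is a minimal sketch of the arithmetic, assuming speedup is defined as the single-thread time divided by the n-thread time on the same machine. The timings are copied from the factorization table above; the function names are illustrative only.

```python
# Timings copied from the factorization table above (seconds).
THREADS = [1, 2, 4, 6, 8]
TIMES = {
    "Windows": [294.6, 169.7, 103.4, 87.8, 88.3],
    "Linux": [1013.3, 534.0, 271.9, 201.9, 159.0],
}


def speedup(times):
    """Speedup relative to the single-thread run: T(1) / T(n)."""
    return [times[0] / t for t in times]


def efficiency(times, threads):
    """Parallel efficiency: speedup divided by the thread count."""
    return [s / n for s, n in zip(speedup(times), threads)]


def slowdown(slow, fast):
    """Element-wise ratio of two timing rows at equal thread counts."""
    return [a / b for a, b in zip(slow, fast)]


if __name__ == "__main__":
    for name, times in TIMES.items():
        print(name, [round(s, 1) for s in speedup(times)])
    ratio = slowdown(TIMES["Linux"], TIMES["Windows"])
    print("Linux/Windows", [round(r, 1) for r in ratio])
```

The Linux-to-Windows time ratio shrinks from about 3.4 with one thread to about 1.8 with eight threads, which is why Linux shows the better speedup column while still being slower at every thread count.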

ShanmukhS_Intel · ‎03-20-2023

Hi Hassan,

Thanks for posting in Intel Communities.

Could you please share a sample reproducer, your environment details, and the steps to reproduce (if any)? This will help us reproduce the issue on our end and assist you accordingly.

Best Regards,

Shanmukh.SS

Hassan-Ebrahimi · ‎03-21-2023

Hi Shanmukh,

Thanks for the reply.

It is not a particular case. All large matrices that I have tried are solved slower on Linux.
In the above example, the number of unknowns is 806301.

There is a difference in the < Factors L and U > statistics between Windows and Linux:
the number of columns for each panel is 80 on Linux but 128 on Windows. How is this set?

Windows:
===========
Parallel Direct Factorization is running on 1 OpenMP

< Linear system Ax = b >
number of equations: 806301
number of non-zeros in A: 51330744
number of non-zeros in A (%): 0.007896

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 128
number of independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
number of supernodes: 63570
size of largest supernode: 10290
number of non-zeros in L: 1593820616
number of non-zeros in U: 1
number of non-zeros in L+U: 1593820617
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===

Linux:
===========
Parallel Direct Factorization is running on 1 OpenMP

< Linear system Ax = b >
number of equations: 806301
number of non-zeros in A: 51330744
number of non-zeros in A (%): 0.007896

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 80
number of independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
number of supernodes: 65848
size of largest supernode: 15003
number of non-zeros in L: 1586972470
number of non-zeros in U: 1
number of non-zeros in L+U: 1586972471
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===
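As a rough cross-check of the two statistics blocks, the sketch below recomputes the reported density and derives the fill-in ratio and an in-core size estimate for the factor. It assumes the percentage in the log is nnz / n^2 * 100 and that each factor entry is one 8-byte double; index arrays and workspace are ignored, so the size is only a lower bound. The function names are illustrative.

```python
def density_percent(n, nnz):
    """Stored non-zeros as a percentage of the full n-by-n matrix."""
    return 100.0 * nnz / (n * n)


def fill_ratio(nnz_factor, nnz_a):
    """How many times more entries the factor holds than the input matrix."""
    return nnz_factor / nnz_a


def factor_gib(nnz_factor, bytes_per_value=8):
    """Size of the factor values alone, in GiB (assumes 8-byte reals)."""
    return nnz_factor * bytes_per_value / 2**30


if __name__ == "__main__":
    n, nnz_a = 806301, 51330744
    for name, nnz_l in (("Windows", 1593820616), ("Linux", 1586972470)):
        print(name,
              round(density_percent(n, nnz_a), 6),
              round(fill_ratio(nnz_l, nnz_a), 2),
              round(factor_gib(nnz_l), 2))
```

Both platforms produce a factor of nearly the same size (a fill-in ratio of roughly 31 and just under 12 GiB of values), so the slightly different reordering alone is unlikely to account for a threefold difference in factorization time.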

I have compared the factorization time of the Linux machine with that of two Windows machines.

Linux is much slower than both of them.

Matrix Factorization time (seconds)

	single-thread	2-thread	4-thread	6-thread	8-thread
Windows1	294.6	169.7	103.4	87.8	88.3
Windows2	491.8	284.4	177.7	151.5	108.5
Linux	1013.3	534.0	271.9	201.9	159.0

Speedup

	single-thread	2-thread	4-thread	6-thread	8-thread
Windows1	1.0	1.7	2.8	3.4	3.3
Windows2	1.0	1.7	2.8	3.2	4.5
Linux	1.0	1.9	3.7	5.0	6.4

Machine specifications:

Windows1: Intel(R) Xeon(R) E5-2640 CPU @ 2.40GHz

Windows2: Intel(R) Xeon(R) W-2133 CPU @ 3.60GHz
Linux: Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz

Please let me know if other details are required.

Thank you.

Hassan

ShanmukhS_Intel · ‎03-27-2023

Hi Hassan,

Thanks for sharing the details.

As the issue is related to performance, we would like to reproduce it on our machines as well. Could you therefore provide a reproducer and the steps to run it (if any)? This will help us recreate the issue on our end and assist you accordingly.

Best Regards,

shanmukh.SS

Hassan-Ebrahimi · ‎03-29-2023

Hi Shanmukh,

The setups are as follows:

**********************
      subroutine mkl_factor(ja, ia, neq, a, na)
      use omp_lib
      IMPLICIT NONE
      include 'mkl_pardiso.f'
      INTEGER*8 pt(64)
      INTEGER maxfct, mnum, mtype, phase, nrhs, error, msglvl
      INTEGER neq, na
      INTEGER iparm(64)
      INTEGER ia(na)
      INTEGER ja(neq+1)
      REAL*8 a(na)
      REAL*8 b
      REAL*8 x
      INTEGER i, j, idum
      INTEGER perm(30)
      REAL*8 tic, toc, ddum

      DATA maxfct /1/, mnum /1/

      pt(:)=0

      nrhs=1
      iparm(1) = 0 ! 0: PARDISO uses default iparm values
      iparm(3) = 1 ! number of processors
      error = 0 ! initialize error flag
      msglvl = 0 ! 0: no statistical output (1 prints statistics)
      mtype = 2 ! real symmetric positive definite
      phase = 11 ! only reordering and symbolic factorization
      iparm(60) = 1

      tic = omp_get_wtime()
C.. Reordering.
      CALL pardiso (pt, maxfct, mnum, mtype, phase, neq, a, ja, ia,
     1 perm, nrhs, iparm, msglvl, b, x, error)

      toc = omp_get_wtime()

      WRITE(*,*) 'Reordering Time [s]: ',(toc-tic)

C.. Factorization.
      tic = omp_get_wtime()
      phase = 22 ! only factorization
      CALL pardiso (pt, maxfct, mnum, mtype, phase, neq, a, ja, ia,
     1 perm, nrhs, iparm, msglvl, ddum, ddum, error)
      !WRITE(*,*) 'Factorization completed ... '
      IF (error .NE. 0) THEN
        WRITE(*,*) 'The following ERROR was detected: ', error
        STOP 1
      ENDIF

      toc = omp_get_wtime()
      WRITE(*,*) 'Factorization Time [s]: ',(toc-tic)

      END

**********************

For the matrix you may use the sparse form of the following dense matrix:

Symmetric positive definite matrix with the lower triangle values defined as follows:

int neq = 500000;
int w = 300;

for (int j = 0; j < neq; j++) {
    for (int i = j; i < min(j + w, neq); i++) {
        M(i, j) = 1.0 / (1.0 + abs(i - j));
    }
}
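The loop above defines a dense band; PARDISO needs it in compressed sparse row form. The following is a minimal sketch of one way to build those arrays, assuming the usual PARDISO convention for symmetric matrices (upper triangle stored row by row, one-based indices). By symmetry, column j of the lower triangle defined above is row j of the upper triangle. The names `banded_upper_csr`, `row_ptr`, `col_idx`, and `values` are illustrative; in the subroutine posted above, the row-pointer array appears to be the one called `ja` and the column-index array the one called `ia`.

```python
def band_nnz(neq, w):
    """Stored entries in one triangle of the band (assumes 1 <= w <= neq)."""
    return neq * w - w * (w - 1) // 2


def banded_upper_csr(neq, w):
    """Upper triangle of M(i,j) = 1 / (1 + |i - j|), |i - j| < w, as CSR.

    Returns one-based row pointers, one-based column indices, and values.
    """
    row_ptr = [1]
    col_idx = []
    values = []
    for j in range(neq):                      # row j of the upper triangle
        for i in range(j, min(j + w, neq)):   # columns j .. j+w-1, clipped
            col_idx.append(i + 1)
            values.append(1.0 / (1.0 + abs(i - j)))
        row_ptr.append(len(col_idx) + 1)
    return row_ptr, col_idx, values


if __name__ == "__main__":
    row_ptr, col_idx, values = banded_upper_csr(5, 3)
    print(row_ptr)
    print(col_idx)
    print(band_nnz(500000, 300))
```

For the full-size case (neq = 500000, w = 300) the closed form gives 149,955,150 stored entries, so a pure-Python list build is only practical for small test sizes; the layout is the point of the sketch, not the performance.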

Results on my machines:

Windows:

Reordering Time [s]: 38.5397152999649
Factorization Time [s]: 7.46051340003032

Linux:

Reordering Time [s]: 38.6944050788879
Factorization Time [s]: 32.8658249378204

It is clear that the Factorization on Linux is much slower than on Windows.

Hassan

VitezslavStembera · ‎03-30-2023

I have exactly the same problem: MKL Pardiso runs 8x slower on Linux! I cannot find any conclusion or solution in the discussion here. The question is whether the problem also occurs with older versions of Intel MKL Pardiso, or whether the original Pardiso 6.0/7.2 has the same issue.

ShanmukhS_Intel · ‎04-05-2023

Hi Hassan,

Thanks for sharing the statistical information. Could you please get back to us with below-mentioned information?

-For Windows environment

Could you please let us know whether any configuration changes are needed for the program to build and run? We are getting errors during compilation.

If the project was run in Visual Studio, it would be a great help if you could share the VS project file with us.

-For Linux environment

The command which was used to compile and run your code.

Hi VitezslavStembera,

Thanks for letting us know. It would be a great help if you could share more details about your issue in a new thread so that we can track the case easily.

Best Regards,

Shanmukh.SS

Hassan-Ebrahimi · ‎04-09-2023

Hi Shanmukh,

For Windows:

I have made a stand-alone Fortran project with only one source file. The source and the project files are attached.

For Linux:
The module is built by using the following command:
ifort pardiso_sym.f -o pardiso_sym -L/XXX/lib -mkl

/XXX/lib is the location of the required Intel Fortran 2022 shared libraries.

The module is then executed by using the following commands:

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/XXX/lib/intel64
export LD_LIBRARY_PATH

./pardiso_sym

With the attached code and the described procedures, I get the following results:

Windows:
Reordering Time [s]: 41.1421489999630
Factorization Time [s]: 12.7260687001981

Linux:
Reordering Time [s]: 38.5098030567169
Factorization Time [s]: 34.8389120101929

Hassan-Ebrahimi · ‎04-09-2023

Hi Shanmukh,

For some reason, the project file cannot be attached. The contents are posted here:

VS project file:

<?xml version="1.0" encoding="UTF-8"?>
<VisualStudioProject ProjectCreator="Intel Fortran" Keyword="Console Application" Version="11.0" ProjectIdGuid="{44C84AB5-4DE1-46FF-A98A-83601CD95682}">
<Platforms>
<Platform Name="Win32"/>
<Platform Name="x64"/></Platforms>
<Configurations>
<Configuration Name="Debug|Win32">
<Tool Name="VFFortranCompilerTool" SuppressStartupBanner="true" DebugInformationFormat="debugEnabled" Optimization="optimizeDisabled" WarnInterfaces="true" Traceback="true" BoundsCheck="true" StackFrameCheck="true" RuntimeLibrary="rtMultiThreadedDebugDLL"/>
<Tool Name="VFLinkerTool" LinkIncremental="linkIncrementalNo" SuppressStartupBanner="true" GenerateDebugInformation="true" SubSystem="subSystemConsole"/>
<Tool Name="VFResourceCompilerTool"/>
<Tool Name="VFMidlTool" SuppressStartupBanner="true"/>
<Tool Name="VFCustomBuildTool"/>
<Tool Name="VFPreLinkEventTool"/>
<Tool Name="VFPreBuildEventTool"/>
<Tool Name="VFPostBuildEventTool"/>
<Tool Name="VFManifestTool" SuppressStartupBanner="true"/></Configuration>
<Configuration Name="Release|Win32" OutputDirectory="..\build\" IntermediateDirectory="..\build\" BuildLogFile="$(TargetDir)BuildLog.htm">
<Tool Name="VFFortranCompilerTool" SuppressStartupBanner="true" DebugInformationFormat="debugEnabled" Optimization="optimizeDisabled" ModulePath="$(TargetDir)" ObjectFile="$(TargetDir)" PdbFile="$(TargetDir)\vc150.pdb" SourceListingFile="$(TargetDir)$(InputName).lst" BuildDependenciesFile="$(TargetDir)$(TargetName).dep" ProfileDirectory="$(TargetDir)" RuntimeLibrary="rtMultiThreadedDLL"/>
<Tool Name="VFLinkerTool" OutputFile="$(TargetDir)$(ProjectName).exe" LinkIncremental="linkIncrementalNo" SuppressStartupBanner="true" ManifestFile="$(TargetDir)$(TargetName)$(TargetExt).intermediate.manifest" GenerateDebugInformation="true" SubSystem="subSystemConsole"/>
<Tool Name="VFResourceCompilerTool"/>
<Tool Name="VFMidlTool" SuppressStartupBanner="true"/>
<Tool Name="VFCustomBuildTool"/>
<Tool Name="VFPreLinkEventTool"/>
<Tool Name="VFPreBuildEventTool"/>
<Tool Name="VFPostBuildEventTool"/>
<Tool Name="VFManifestTool" SuppressStartupBanner="true" OutputManifestFile="" ResourceFile="" DependencyInfoFile=""/></Configuration>
<Configuration Name="Debug|x64">
<Tool Name="VFFortranCompilerTool" SuppressStartupBanner="true" DebugInformationFormat="debugEnabled" WarnInterfaces="true" Traceback="true" BoundsCheck="true" StackFrameCheck="true" RuntimeLibrary="rtMultiThreadedDebugDLL" UseMkl="mklParallel"/>
<Tool Name="VFLinkerTool" LinkIncremental="linkIncrementalNo" SuppressStartupBanner="true" GenerateDebugInformation="true" SubSystem="subSystemConsole"/>
<Tool Name="VFResourceCompilerTool"/>
<Tool Name="VFMidlTool" SuppressStartupBanner="true" TargetEnvironment="midlTargetAMD64"/>
<Tool Name="VFCustomBuildTool"/>
<Tool Name="VFPreLinkEventTool"/>
<Tool Name="VFPreBuildEventTool"/>
<Tool Name="VFPostBuildEventTool"/>
<Tool Name="VFManifestTool" SuppressStartupBanner="true"/></Configuration>
<Configuration Name="Release|x64" BuildLogFile="$(TargetDir)BuildLog.htm">
<Tool Name="VFFortranCompilerTool" SuppressStartupBanner="true" DebugInformationFormat="debugEnabled" ModulePath="$(TargetDir)" ObjectFile="$(TargetDir)" PdbFile="$(TargetDir)\vc150.pdb" SourceListingFile="$(TargetDir)$(InputName).lst" BuildDependenciesFile="$(TargetDir)$(TargetName).dep" ProfileDirectory="$(TargetDir)" RuntimeLibrary="rtMultiThreadedDLL" UseMkl="mklParallel"/>
<Tool Name="VFLinkerTool" OutputFile="$(TargetDir)$(ProjectName).exe" LinkIncremental="linkIncrementalNo" SuppressStartupBanner="true" ManifestFile="$(TargetDir)$(TargetName)$(TargetExt).intermediate.manifest" GenerateDebugInformation="true" SubSystem="subSystemConsole"/>
<Tool Name="VFResourceCompilerTool"/>
<Tool Name="VFMidlTool" SuppressStartupBanner="true" TargetEnvironment="midlTargetAMD64"/>
<Tool Name="VFCustomBuildTool"/>
<Tool Name="VFPreLinkEventTool"/>
<Tool Name="VFPreBuildEventTool"/>
<Tool Name="VFPostBuildEventTool"/>
<Tool Name="VFManifestTool" SuppressStartupBanner="true" OutputManifestFile="" ResourceFile="" DependencyInfoFile=""/></Configuration></Configurations>
<Files>
<Filter Name="Header Files" Filter="fi;fd;h;inc"/>
<Filter Name="Resource Files" Filter="rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe"/>
<Filter Name="Source Files" Filter="f90;for;f;fpp;ftn;def;odl;idl">
<File RelativePath="..\src\pardiso_sym.f"/></Filter></Files>
<Globals/></VisualStudioProject>

ShanmukhS_Intel · ‎04-12-2023

Hi Hassan,

Thanks for the snippet and project file. We were able to reproduce the performance difference you described.

We are discussing your issue internally. We will get back to you soon with an update.

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎04-13-2023

Hi Hassan,

Could you please let us know which Intel MKL version you are using? If you are using an older version, kindly upgrade to the latest version (Intel MKL 2023.1) and let us know if the issue persists.

You can download it from the link below.

https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-download.html

Best Regards,

Shanmukh.SS

Hassan-Ebrahimi · ‎04-14-2023

Hi Shanmukh.SS,

Thanks for your quick action.

The issue is solved in MKL 2023. Linux is even faster than Windows as expected.

Linux ( MKL 2022):
Reordering Time [s]: 38.5098030567169
Factorization Time [s]: 34.8389120101929

Linux ( MKL 2023)
Reordering Time [s]: 39.6438219547272
Factorization Time [s]: 11.8002791404724

With the original problem the improvement is remarkable:

Linux ( MKL 2022):
Reordering Time [s]: 14.6
Factorization Time [s]: 1021.0

Linux ( MKL 2023)
Reordering Time [s]: 15.7
Factorization Time [s]: 266.0

Hassan-Ebrahimi · ‎04-17-2023

Hi Shanmukh.SS,

Thanks for your support all along.

Although the problem is fixed, I would like to clarify something important:

The actual issue was not in the version of the Intel MKL.

It was due to a missing .so file (libmkl_avx512.so.2) in the collection of run-time shared libraries shipped with our module. The collection differed between Windows and Linux.

With the file missing, mkl_get_version() gives the following:

Processor optimization: Intel(R) Streaming SIMD Extensions 2 (Intel(R) SSE2) enabled processors.

With the file added, it becomes:

Processor optimization: Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost)

The latter is significantly faster on our machines.
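A deployment check along these lines could catch the same omission before it shows up as a slowdown. This is only a sketch: it looks for files by name in the directories of a search path and does not ask MKL which kernel it actually loaded (the mkl_get_version() output above remains the authoritative check). Only libmkl_avx512.so.2 is named in this thread; the other two patterns are assumptions about what a typical MKL redistribution contains, and `find_kernel_libs` is a hypothetical helper name.

```python
import glob
import os

# libmkl_avx512 is the library named in this thread; the other two
# patterns are assumed examples of CPU-specific kernel libraries.
KERNEL_PATTERNS = ("libmkl_avx512.so*", "libmkl_avx2.so*", "libmkl_def.so*")


def find_kernel_libs(search_path, patterns=KERNEL_PATTERNS):
    """Map each file pattern to the matching files found on search_path.

    search_path uses the platform separator (':' on Linux), the same
    format as LD_LIBRARY_PATH.
    """
    found = {}
    for pattern in patterns:
        hits = []
        for directory in search_path.split(os.pathsep):
            if directory:
                hits.extend(sorted(glob.glob(os.path.join(directory, pattern))))
        found[pattern] = hits
    return found


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    for pattern, hits in find_kernel_libs(path).items():
        print(pattern, "->", hits if hits else "MISSING")
```

Running it with the same LD_LIBRARY_PATH that the module uses lists each pattern with the files found, or MISSING when a directory set lacks that library.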

Regards,

Hassan

ShanmukhS_Intel · ‎04-18-2023

Hi Hassan,

"It was due to a missing so file (libmkl_avx512.so.2) in the collection of run-time shared libraries used with our module. The collection was different between Windows and Linux."

>>Glad to know that your issue is resolved. The processor optimization that MKL reports depends on the presence of libmkl_avx512.so.2: when the file is available, MKL can use the faster code paths for processors that support AVX-512 and Intel DL Boost; without it, MKL falls back to a less optimized instruction set.

If you need any other information, please post a new query, as this thread will no longer be monitored by Intel.

Best Regards,

Shanmukh.SS