Re: MKL PARDISO - performance question

istefanov · ‎01-10-2022

Hello,

I am using an interior point solver (IpOpt) with MKL PARDISO as the linear solver to solve a series of relatively simple (often quadratic) problems. Until now, PARDISO has been far superior to MUMPS (which I also have for comparison purposes), but recently I encountered some strange behavior. On a purely quadratic problem with 378 696 variables and 378 680 constraints (IpOpt solves it in 82 iterations), each iteration with PARDISO seems to take nearly as much time as the whole solution process with MUMPS. I recently recompiled things on my side, so I may have messed something up, but I am not sure what it could be.

For comparison, the IpOpt/MUMPS combo finds the solution in 520 seconds (that is, 82 iterations of the interior point method, each of which solves a linear system).

The log from the first iteration with the IpOpt/PARDISO combo looks like this:

=== PARDISO: solving a symmetric indefinite system ===
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON
Matching is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.187624 s
Time spent in reordering of the initial matrix (reorder) : 0.006688 s
Time spent in symbolic factorization (symbfct) : 2.648446 s
Time spent in data preparations for factorization (parlist) : 0.038590 s
Time spent in allocation of internal data structures (malloc) : 23.092874 s
Time spent in additional calculations : 1.116051 s
Total time spent : 27.090273 s

Statistics:
===========
Parallel Direct Factorization is running on 4 OpenMP

< Linear system Ax = b >
number of equations: 1096299
number of non-zeros in A: 10792217
number of non-zeros in A (%): 0.000898

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 112
number of independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
number of supernodes: 748178
size of largest supernode: 26102
number of non-zeros in L: 353891709
number of non-zeros in U: 1
number of non-zeros in L+U: 353891710
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===

Percentage of computed non-zeros for LL^T factorization
1 % 2 % 3 % 4 % 7 % 10 % 13 % 16 % 19 % 22 % 25 % 28 % 31 % 34 % 36 % 39 % 42 % 44 % 47 % 49 % 52 % 54 % 56 % 58 % 60 % 62 % 64 % 66 % 68 % 70 % 72 % 74 % 75 % 77 % 79 % 80 % 82 % 83 % 84 % 86 % 87 % 88 % 89 % 90 % 91 % 92 % 93 % 94 % 95 % 96 % 97 % 98 % 99 % 100 %

=== PARDISO: solving a symmetric indefinite system ===
Two-level factorization algorithm is turned ON

Summary: ( factorization phase )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000001 s
Time spent in factorization step (numfct) : 430.960081 s
Time spent in allocation of internal data structures (malloc) : 0.047239 s
Time spent in additional calculations : 0.000080 s
Total time spent : 431.007401 s

Statistics:
===========
Parallel Direct Factorization is running on 4 OpenMP

< Linear system Ax = b >
number of equations: 1096299
number of non-zeros in A: 10792217
number of non-zeros in A (%): 0.000898

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 112
number of independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
number of supernodes: 748178
size of largest supernode: 26102
number of non-zeros in L: 353891709
number of non-zeros in U: 1
number of non-zeros in L+U: 353891710
gflop for the numerical factorization: 6075.185527

gflop/s for the numerical factorization: 14.096864

=== PARDISO: solving a symmetric indefinite system ===

Summary: ( solution phase )
================

Times:
======
Time spent in direct solver at solve step (solve) : 1.280442 s
Time spent in additional calculations : 0.000237 s
Total time spent : 1.280679 s

Statistics:
===========
Parallel Direct Factorization is running on 4 OpenMP

< Linear system Ax = b >
number of equations: 1096299
number of non-zeros in A: 10792217
number of non-zeros in A (%): 0.000898

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 112
number of independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
number of supernodes: 748178
size of largest supernode: 26102
number of non-zeros in L: 353891709
number of non-zeros in U: 1
number of non-zeros in L+U: 353891710
gflop for the numerical factorization: 6075.185527

gflop/s for the numerical factorization: 14.096864

I am wondering whether the problem itself is somehow unusual and that is why it takes so long. It is the biggest one I have had so far, but I originally expected it to behave like the smaller ones, where PARDISO is way faster than anything else.
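As a quick cross-check of the factorization figures in the log above, the reported rate should simply be the flop count divided by the time spent in the factorization step. The sketch below recomputes it; the function name is hypothetical and is not part of MKL.

```cpp
#include <cassert>

// Hypothetical helper (not part of MKL): recompute the sustained rate
// from two numbers printed in the PARDISO factorization summary.
// Rate in GFLOP/s = total GFLOP / time spent in the factorization step.
double gflops_per_second(double gflop, double seconds) {
    assert(seconds > 0.0);  // a zero or negative time would be a parsing error
    return gflop / seconds;
}
```

With the logged values, gflops_per_second(6075.185527, 430.960081) comes out at about 14.1, which agrees with the 14.096864 gflop/s that PARDISO reports. The timing is therefore consistent with the reported flop count, so the amount of work in the factorization is the figure worth examining.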

Thanks in advance for any insights someone may have!

VidyalathaB_Intel · ‎01-11-2022

Hi,

Thanks for reaching out to us.

Could you please provide us with the following details so that we can work on it from our end?

> Complete sample reproducer (and steps to reproduce if any)

> oneMKL version

> OS Details

Regards,

Vidya.

istefanov · ‎01-11-2022

Hi,

It will take some time to build this. I will try to do so in the next few days. I am using the IpOpt solver mostly as a black-box tool, so I will need to print what it sends to PARDISO and create an isolated example of this call only.

I am running the whole thing inside a Docker container (CentOS Linux) as part of a larger codebase that builds the problem and does some post-processing of the results, but I will be able to isolate this part.

I compiled the IpOpt package a long time ago with a rather old MKL 2019.1.144.

Thank you!

VidyalathaB_Intel · ‎01-11-2022

Hi,

Thanks for letting us know.

>> long time ago with a rather old MKL 2019.1.144.

This time, you could try the latest oneMKL library (2022.1), which comes with the oneAPI Base Toolkit (2022.1); the MPI library comes with the oneAPI HPC Toolkit (2022.1). Both are available for download now.

Below are the links to download the latest version of the oneAPI Toolkits:

Link to download oneAPI Base Toolkit:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html

Link to download oneAPI HPC Toolkit:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit-download.html

Please try the latest version, 2022.1, and let us know if you still observe the same behavior, providing the details requested above.

Regards,

Vidya.

VidyalathaB_Intel · ‎01-19-2022

Hi,

Reminder:

Could you please provide us with the sample reproducer so that we can work on it from our end?

Regards,

Vidya.

istefanov · ‎01-19-2022

Hi,

Sorry for the delay. I got stuck trying to replicate the MUMPS setup externally as well (just to validate the runtime difference), but it turned out to be a bit harder to extract.

I am attaching a very simple main.cpp that runs the setup as it arrives in the solver's 1st iteration. I have dumped the 3 matrix arrays into txt files. It takes a while to read them; after that, it runs the solver with the same parameters as IpOpt does.

I also tried with the 2022.1 version, but the runtime was very similar.

Thank you!

VidyalathaB_Intel · ‎01-25-2022

Hi,

Thanks for sharing the reproducer.

Could you please provide us with the steps to reproduce the issue (compilation command in particular)?

Regards,

Vidya.

istefanov · ‎01-25-2022

Hi,

Sorry for not mentioning that. I am running it in a freshly built Docker container, but it can be reproduced anywhere with the following 3 commands:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/compilers_and_libraries_2019.1.144/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin

g++ -I/opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/include main.cpp -o main.o -L/opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin -L/opt/intel/compilers_and_libraries_2019.1.144/linux/compiler/lib/intel64_lin -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl

./main.o

I am still using the 2019 version because that is what I had readily available for this experiment, but the paths can be replaced with whatever is installed (I tried the latest one and the result was the same).

Thank you!

Gennady_F_Intel · ‎01-26-2022

Hello,

We tried to run the example you shared with MKL 2019.1 as well as with the latest MKL 2022, and we see a segmentation fault at the reordering phase.

icc -mkl=parallel main.cpp

ldd a.out

libmkl_intel_lp64.so => /opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so (0x00002b9234a50000)

$ ./a.out

...phase == 11...

Segmentation fault (core dumped)

We see a similar segfault when linking against MKL 2022.0.

Could you check how this example.zip works on your end?

-Gennady

istefanov · ‎01-26-2022

Hi,

Sorry about that. main.cpp reads the A, IA and JA files in 3 places, and due to the specifics of my setup the paths are /build/A.txt, /build/IA.txt and /build/JA.txt. Please change the /build/ part to match wherever you put those files (they should be in the zip file).
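One way to make a reproducer like this fail loudly on a wrong path, instead of crashing later inside the solver, is to check each file as it is read. The sketch below is only an illustration of that idea: read_array and csr_sizes_consistent are hypothetical names and are not taken from the attached main.cpp, and the use of int for the index arrays assumes the LP64 interface.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the attached main.cpp): read a
// whitespace-separated list of numbers from a text file into a vector.
// It throws when the file cannot be opened or holds no numbers, so a
// wrong path such as /build/A.txt is reported before the solver is called.
template <typename T>
std::vector<T> read_array(const std::string& path) {
    std::ifstream in(path);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::vector<T> values;
    T v;
    while (in >> v) {
        values.push_back(v);
    }
    if (values.empty()) {
        throw std::runtime_error("no numeric data in " + path);
    }
    return values;
}

// Basic shape check for 1-based CSR arrays (the log reports 1-based
// indexing): ja and a have the same length, ia starts at 1, and the
// last entry of ia minus 1 equals the number of stored non-zeros.
bool csr_sizes_consistent(const std::vector<int>& ia,
                          const std::vector<int>& ja,
                          const std::vector<double>& a) {
    if (ia.empty() || ja.size() != a.size()) {
        return false;
    }
    return ia.front() == 1 &&
           static_cast<std::size_t>(ia.back() - 1) == ja.size();
}
```

A caller would load the three files with read_array<int> for IA and JA and read_array<double> for A, then stop if csr_sizes_consistent returns false, before any PARDISO phase is run.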

Sorry for the inconvenience.

Gennady_F_Intel · ‎01-27-2022

OK, it works now, but I see completely different performance results on my end:

MKL v. 2019.0.1, Processor optimization: Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors

The most significant statistic:

number of equations: 1096299
number of non-zeros in A: 10792217
number of non-zeros in A (%): 0.000898

Parallel Direct Factorization is running on 4 OpenMP

Reordering -- Total time spent : 5.116731 s

Factorization -- Total time spent : 17.705746 s

With Parallel Direct Factorization running on 44 OpenMP threads:

Reordering -- Total time spent : 4.956248 s

Factorization -- Total time spent : 11.099820 s

The MKL 2022 results on the same machine look very similar.

The full log file (2019_4thr_main2.txt) and a slightly modified main2.txt are attached (the latter is actually main2.cpp, renamed because the forum engine doesn't accept the *.cpp extension).

istefanov · ‎01-28-2022

OK, thanks for checking.

I will keep digging into why this behaves so differently on my machine. One very interesting point is that in your log I see:

number of columns for each panel: 112
number of independent subgraphs: 0
number of supernodes: 1083362
size of largest supernode: 13052
number of non-zeros in L: 96783110
number of non-zeros in U: 1
number of non-zeros in L+U: 96783111

while when I run the same thing I get:

number of columns for each panel: 112
number of independent subgraphs: 0
number of supernodes: 1083362
size of largest supernode: 13052
number of non-zeros in L: 96782581
number of non-zeros in U: 1
number of non-zeros in L+U: 96782582

Note that the number of non-zeros is slightly different.

I am also getting quite variable times: this particular run took 107 seconds instead of the usual 300-something.

I will investigate further on my end, as there seems to be something wrong there. Thanks again for spending time on this!
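To put the non-zero counts from these logs side by side, it can help to express them as a fill-in ratio (non-zeros in L per non-zero in A) and as a relative difference. The sketch below does only that arithmetic; both function names are hypothetical, and it makes no claim about why the counts differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of MKL): fill-in ratio, i.e. how many
// non-zeros the factor L has per non-zero of the input matrix A.
double fill_ratio(std::int64_t nnz_L, std::int64_t nnz_A) {
    assert(nnz_A > 0);
    return static_cast<double>(nnz_L) / static_cast<double>(nnz_A);
}

// Relative difference between two non-zero counts, measured against
// the first one. Negative when the second count is smaller.
double relative_difference(std::int64_t first, std::int64_t second) {
    assert(first > 0);
    return static_cast<double>(second - first) / static_cast<double>(first);
}
```

With 10792217 non-zeros in A, fill_ratio(96783110, 10792217) is about 8.97 for the standalone runs, while fill_ratio(353891709, 10792217) is about 32.8 for the first log in this thread. By contrast, relative_difference(96783110, 96782581) is only about -5.5e-6, so the two standalone runs differ by 529 non-zeros out of roughly 97 million.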

mecej4 · ‎01-28-2022

Did anybody notice this very odd part of the result:

number of non-zeros in L: 353891709
number of non-zeros in U: 1
number of non-zeros in L+U: 353891710

I happened to have a program to read in the IA, JA, A arrays from text files and factor the matrix using Pardiso 6 from www.pardiso-project.org. This is what I got:

             number of nonzeros in L                         97073680
             number of nonzeros in U                         94240201
             number of nonzeros in L+U                       1.91314e+008

Compare the number of non-zeros in U from MKL and from Pardiso 6.

Kirill_V_Intel · ‎01-28-2022

Hi!

@mecej4, if you mean the line "number of non-zeros in U: 1", this is because Intel MKL PARDISO takes advantage of the fact that the matrix type is symmetric, and hence only one factor is stored.

Best,
Kirill

mecej4 · ‎01-28-2022

Thanks, Kirill, I missed that. So U is not the factor U in A = L.D.U, i.e., an upper triangular matrix with unit diagonal, but simply an array that does not need to be stored!

Gennady_F_Intel · ‎02-25-2022

Checking the same problem on an Intel DevCloud machine (Architecture: x86_64, CPU(s): 24, Thread(s) per core: 2, Core(s) per socket: 6, Socket(s): 2, NUMA node(s): 2, Model name: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz), I see about the same performance:

MKL v. 2022.0.0

Processor optimization: Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors

phase == 11

Total time spent : 4.411666 s

phase == 22

Total time spent : 20.710880 s

phase == 33

So, you might get access to the Intel DevCloud as well and run these experiments there.

https://devcloud.intel.com/oneapi/get_started/

All the latest versions of the oneAPI software are installed there, so you don't need to install the software stack yourself. You only need to upload your data and run the code using a Jupyter notebook or through an ssh session, as you prefer.

This thread is now being closed and we will no longer respond to it. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

thanks,

Gennady