PARDISO performance vs MUMPS

canal_g_ · ‎01-15-2017

Hello,

I am running my first Pardiso (cluster) program and benchmarking it against MUMPS. I expected Pardiso to be faster, or at least close, but the results are not very encouraging.

My environment

- Ubuntu 14.04, Intel i3-3240 @ 3.4 GHz, 1 CPU (2 cores), 4 GB RAM
- Latest MUMPS 5.0.2, latest MKL/Pardiso 2017.0.098
- GCC 4.8.4
- MPICH2

- both programs are written in C

My data

- double-precision complex, 9612 × 9612, 206442 non-zeros in total, symmetric

Test result

I set OMP_NUM_THREADS=2 and MKL_NUM_THREADS=2, and ran the program with 1 MPI process:

mpirun -np 1 Program
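A thread-count variable that never reaches the process is silently ignored, so it can be worth confirming what the program actually sees at startup. The following is a minimal sketch using only standard C; the helper names `parse_thread_count`, `thread_count_from_env` and `report_thread_env` are hypothetical and not part of MKL, MUMPS or MPICH, and only plain integer values such as "2" are handled.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse a plain positive integer such as "2".
 * Returns -1 for NULL, empty, non-numeric or out-of-range input. */
int parse_thread_count(const char *value)
{
    if (value == NULL || *value == '\0')
        return -1;
    char *end = NULL;
    long n = strtol(value, &end, 10);
    if (*end != '\0' || n <= 0 || n > 1000000)
        return -1;
    return (int)n;
}

/* Hypothetical helper: look up one variable in the process environment. */
int thread_count_from_env(const char *name)
{
    assert(name != NULL);
    return parse_thread_count(getenv(name));
}

/* Print what the running process sees for the two variables from the post. */
void report_thread_env(void)
{
    const char *names[] = { "OMP_NUM_THREADS", "MKL_NUM_THREADS" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i) {
        int n = thread_count_from_env(names[i]);
        if (n < 0)
            printf("%s is not set (or not a plain positive integer)\n", names[i]);
        else
            printf("%s=%d\n", names[i], n);
    }
}
```

Calling `report_thread_env()` once at the start of `main` in each benchmark program would show whether both solvers were run with the intended thread count.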

- MUMPS took 25 sec to complete the 9612 columns

- Pardiso took around 33 sec to complete only 3000 columns (nrhs = 3000) and 53 sec to complete 4806 columns (nrhs = 4806), so it will likely take more than 100 sec to complete the whole matrix. That is about 4 times what MUMPS needs.
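The estimate for the full matrix can be reproduced with a few lines of C. This is only a sketch: it assumes that solve time grows linearly with the number of right-hand sides, which the two measured points (about 11 ms per column in both cases) suggest but do not prove. The function names are hypothetical.

```c
#include <assert.h>

/* Extrapolate a measured solve time to a larger number of columns,
 * assuming (as a simplification) that time is proportional to nrhs. */
double extrapolate_seconds(double seconds, int cols_done, int cols_total)
{
    assert(cols_done > 0);
    return seconds * (double)cols_total / (double)cols_done;
}

/* Ratio of one solver's time to a reference solver's time. */
double slowdown_ratio(double seconds, double reference_seconds)
{
    assert(reference_seconds > 0.0);
    return seconds / reference_seconds;
}
```

With the numbers from the post, `extrapolate_seconds(53.0, 4806, 9612)` gives 106 sec, since 9612 is exactly twice 4806, and the 3000-column run extrapolates to roughly the same figure. Dividing by the 25 sec measured for MUMPS gives a ratio a little above 4.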

I am not sure what slows Pardiso down. I notice that the direct solver took 12 sec, but "additional calculations" took 24 sec. I am not sure what that is, or whether it can be improved.

Here is the message output. I would appreciate any suggestions on what I can tune (the timings above were from a run without message output):

-------------------------------------------------------------------------------------------------------------

=== PARDISO: solving a complex symmetric system ===
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.000960 s
Time spent in reordering of the initial matrix (reorder) : 0.000016 s
Time spent in symbolic factorization (symbfct) : 0.006562 s
Time spent in data preparations for factorization (parlist) : 0.000174 s
Time spent in allocation of internal data structures (malloc) : 0.043606 s
Time spent in additional calculations : 0.009487 s
Total time spent : 0.060805 s

Statistics:
===========
Parallel Direct Factorization is running on 2 OpenMP

< Linear system Ax = b >
number of equations: 9612
number of non-zeros in A: 206442
number of non-zeros in A (%): 0.223445

number of right-hand sides: 3000

< Factors L and U >
number of columns for each panel: 80
number of independent subgraphs: 0
number of supernodes: 929
size of largest supernode: 570
number of non-zeros in L: 1642020
number of non-zeros in U: 1
number of non-zeros in L+U: 1642021

Reordering/Analysis is completed, the number of iterative steps in solve : 0, peak memory for factorization : 9329 (KB), permanent memory for factorization : 8758 (KB), memory for factorization and solve : 30858 (KB), time used 0...
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===

Percentage of computed non-zeros for LL^T factorization
1 % 2 % 3 % 4 % 5 % 6 % 7 % 8 % 9 % 10 % 11 % 12 % 13 % 14 % 15 % 16 % 17 % 18 % 19 % 20 % 21 % 22 % 23 % 24 % 25 % 26 % 27 % 28 % 29 % 30 % 31 % 32 % 33 % 34 % 35 % 37 % 38 % 39 % 42 % 43 % 44 % 45 % 46 % 50 % 51 % 52 % 56 % 58 % 59 % 60 % 64 % 68 % 75 % 76 % 84 % 87 % 97 % 99 % 100 %
100 %

=== PARDISO: solving a complex symmetric system ===
Single-level factorization algorithm is turned ON

Summary: ( factorization phase )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 0.101974 s
Time spent in allocation of internal data structures (malloc) : 0.000089 s
Time spent in additional calculations : 0.000001 s
Total time spent : 0.102064 s

Statistics:
===========
Parallel Direct Factorization is running on 2 OpenMP

< Linear system Ax = b >
number of equations: 9612
number of non-zeros in A: 206442
number of non-zeros in A (%): 0.223445

number of right-hand sides: 3000

< Factors L and U >
number of columns for each panel: 80
number of independent subgraphs: 0
number of supernodes: 929
size of largest supernode: 570
number of non-zeros in L: 1642020
number of non-zeros in U: 1
number of non-zeros in L+U: 1642021
gflop for the numerical factorization: 1.889010

gflop/s for the numerical factorization: 18.524426

Factorization completed ... time used 0, start solve for 3000 columns
=== PARDISO: solving a complex symmetric system ===

Summary: ( solution phase )
================

Times:
======
Time spent in direct solver at solve step (solve) : 12.161831 s
Time spent in additional calculations : 24.014941 s
Total time spent : 36.176772 s

Statistics:
===========
Parallel Direct Factorization is running on 2 OpenMP

< Linear system Ax = b >
number of equations: 9612
number of non-zeros in A: 206442
number of non-zeros in A (%): 0.223445

number of right-hand sides: 3000

< Factors L and U >
number of columns for each panel: 80
number of independent subgraphs: 0
number of supernodes: 929
size of largest supernode: 570
number of non-zeros in L: 1642020
number of non-zeros in U: 1
number of non-zeros in L+U: 1642021
gflop for the numerical factorization: 1.889010

gflop/s for the numerical factorization: 18.524426

-------------------------------------------------------------------------------------------------------------

thanks

canal

Gennady_F_Intel · ‎01-16-2017

Interesting results, thanks a lot. Could you check the performance with the same input using the SMP version of Intel MKL PARDISO, without MPI? Could you also give us a reproducer to play with on our side?

lixin_c_ · ‎01-19-2017

Thank you. The file is not small. Is there a server I can upload it to?

I tried the SMP version (I hope I am doing it correctly); the performance still does not match MUMPS:

- nrhs = 1000

Summary: ( solution phase )
================

Times:
======
Time spent in direct solver at solve step (solve) : 2.597082 s
Time spent in additional calculations : 6.037061 s
Total time spent : 8.634143 s

- nrhs = 4806

Summary: ( solution phase )
================

Times:
======
Time spent in direct solver at solve step (solve) : 18.920489 s
Time spent in additional calculations : 41.908326 s
Total time spent : 60.828815 s

Here are the settings:

   iparm[0] = 1;
   iparm[1] = 2;
   iparm[3] = 0;
   iparm[4] = 0;
   iparm[5] = 0;
   iparm[6] = 0;
   iparm[7] = 2;
   iparm[8] = 0;
   iparm[9] = 13;
   iparm[10] = 1;
   iparm[11] = 0;
   iparm[12] = 0;
   iparm[13] = 0;
   iparm[14] = 0;
   iparm[15] = 0;
   iparm[16] = 0;
   iparm[17] = -1;
   iparm[18] = -1;
   iparm[19] = 0;
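For readers following along, the same settings are repeated below with a comment on each entry. The comments reflect my reading of the Intel MKL PARDISO `iparm` documentation (0-based C indices) and should be checked against the manual for the MKL version in use. The function name `set_iparm_from_post`, the use of plain `int` instead of `MKL_INT`, and the zeroing of all 64 entries are assumptions made for this sketch; the post does not show how the array is declared or initialized.

```c
#include <assert.h>
#include <stddef.h>

#define IPARM_SIZE 64  /* MKL PARDISO takes a 64-entry iparm array */

/* Hypothetical helper: fill iparm with the values listed in the post. */
void set_iparm_from_post(int *iparm)
{
    assert(iparm != NULL);
    for (size_t i = 0; i < IPARM_SIZE; ++i)
        iparm[i] = 0;    /* assumption: unlisted entries (including iparm[2]) start at 0 */

    iparm[0]  = 1;   /* do not use solver defaults; take the values below */
    iparm[1]  = 2;   /* fill-in reducing ordering: nested dissection (METIS) */
    iparm[3]  = 0;   /* no preconditioned CGS/CG iterations */
    iparm[4]  = 0;   /* no user-supplied fill-in reducing permutation */
    iparm[5]  = 0;   /* write the solution into x, leave b untouched */
    iparm[6]  = 0;   /* output: number of iterative refinement steps performed */
    iparm[7]  = 2;   /* maximum number of iterative refinement steps */
    iparm[8]  = 0;   /* reserved */
    iparm[9]  = 13;  /* pivot perturbation threshold of 1e-13 */
    iparm[10] = 1;   /* scaling enabled */
    iparm[11] = 0;   /* solve with A itself (no transpose or conjugate transpose) */
    iparm[12] = 0;   /* weighted matching disabled */
    iparm[13] = 0;   /* output: number of perturbed pivots */
    iparm[14] = 0;   /* output: peak memory, symbolic factorization */
    iparm[15] = 0;   /* output: permanent memory, symbolic factorization */
    iparm[16] = 0;   /* output: memory for numerical factorization and solve */
    iparm[17] = -1;  /* request report of non-zeros in the factors */
    iparm[18] = -1;  /* request report of floating-point operations for factorization */
    iparm[19] = 0;   /* output: CG/CGS diagnostics */
}
```

Keeping the settings in one function like this makes it easier to change a single entry between benchmark runs and compare the solve-phase timings.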

Gennady_F_Intel · ‎01-20-2017

Yes, we have Intel Premier Support specifically for such purposes. You may upload any private data, source code, etc. there. You may also send the data via private communication in this thread, or simply attach to this post an example showing how you call Pardiso together with the input data file. In that case, the data would be available to all forum visitors.