Slow Reordering in Pardiso

sagarmatha · 12-27-2012

Hello,
I am implementing PARDISO (direct and hybrid CGS) in a legacy code to speed it up. The setup is a 3D FDM discretization solved with Newton's method. The calling sequence is as follows: the first call in each time step uses phase=13, and every subsequent Newton iteration uses phase=23.

The hybrid CGS with phase=23 runs fast. However, reordering (during the call with phase=13) is very costly: for example, ~85% (more than expected?) of the solve time with PARDISO (see below), so the run is slow overall. Is the reordering phase parallelized in PARDISO? Could you please share your suggestions for reducing the reordering time in PARDISO?
Thank you,
Sagar

Here are the details:
Case: Non-symmetric, 118,800 unknowns, sparse, ~700,000 nnz, from a 9 (2*2) block band matrix.
Machine: Intel Xeon E5-2687, 3.1 GHz, 32 GB
Software: Intel Composer XE (Fortran) 2011 Update 11 (Package 344), MKL 10.3 Update 11, and 64-bit Windows 7 SP1
Compile: /O1 /Qparallel /Qopenmp /Qmkl:parallel
Link : mkl_blas95_lp64.lib mkl_lapack95_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib

Pardiso Parameters:

For the first call (phase=13):
      iparm(1) = 1 ! no solver default
      iparm(2) = 3 ! fill-in reordering: 0 = minimum degree, 2 = METIS, 3 = parallel (OpenMP) version
    ! iparm(3) = mkl_get_max_threads() ! numbers of processors, value of MKL_NUM_THREADS
      iparm(4) = 0 ! no iterative-direct algorithm
      iparm(5) = 0 ! no user fill-in reducing permutation, return the array
      iparm(6) = 0 ! =0 solution on the first n components of x
      iparm(7) = 0 ! not in use
      iparm(8) = 0 ! numbers of iterative refinement steps
      iparm(9) = 0 ! not in use
      iparm(10) = 13 ! perturb the pivot elements with 1E-13
      iparm(11) = 0 ! scaling disabled (1 would enable nonsymmetric permutation and scaling)
      iparm(12) = 0 ! not in use
      iparm(13) = 0 ! not in use
      iparm(14) = 0 ! Output: number of perturbed pivots
      iparm(15) = 0 ! not in use
      iparm(16) = 0 ! not in use
      iparm(17) = 0 ! not in use
      iparm(18) = -1 ! Output: number of nonzeros in the factor LU
      iparm(19) = 0 ! Output: Mflops for LU factorization
      iparm(20) = 0 ! Output: Numbers of CG Iterations
      iparm(27) = 0 ! no matrix check (default)
      msglvl = 1 ! print statistical information, 0=no 1=yes
      mtype = 11 ! real unsymmetric
For the second call (phase=23):
      iparm(1) = 1 ! no solver default
      iparm(2) = 3 ! fill-in reordering: 0 = minimum degree, 2 = METIS, 3 = parallel (OpenMP) version
    ! iparm(3) = mkl_get_max_threads() ! numbers of processors, value of MKL_NUM_THREADS
      iparm(4) = 61 ! CGS iteration preconditioned by the previous LU factors, tolerance 1E-6 (61 = 10*6 + 1)
      iparm(5) = 0 ! no user fill-in reducing permutation, use from the last one
      iparm(6) = 0 ! =0 solution on the first n components of x
      iparm(7) = 0 ! not in use
      iparm(8) = 0 ! numbers of iterative refinement steps
      iparm(9) = 0 ! not in use
      iparm(10) = 13 ! perturb the pivot elements with 1E-13
      iparm(11) = 0 ! scaling disabled (1 would enable nonsymmetric permutation and scaling, MPS)
      iparm(12) = 0 ! not in use
      iparm(13) = 0 ! not in use
      iparm(14) = 0 ! Output: number of perturbed pivots
      iparm(15) = 0 ! not in use
      iparm(16) = 0 ! not in use
      iparm(17) = 0 ! not in use
      iparm(18) = -1 ! Output: number of nonzeros in the factor LU
      iparm(19) = 0 ! Output: Mflops for LU factorization
!    iparm(20) = 0 ! Output: Numbers of CG Iterations
      iparm(27) = 0 ! no matrix check (default)
      msglvl = 1 ! print statistical information
      mtype = 11 ! real unsymmetric

Here are the results:
*******First call, phase=13********
Percentage of computed non-zeros for LL^T factorization
0%   1%   2%   3%   4%   5%   6%   7%   8%   9%   10%   11%   12%   13%   14%   15%   16%   17%   18%   19%   20%   21%   22%   23%   24%   25%   26%   27%   28%   29%   30%   31%   32%   33%   35%   37%   39%   42%   43%   44%   46%   48%   55%   56%   62%   73%   81%   88%   95%   99%   100%

=== PARDISO: solving a real nonsymmetric system ===
The local (internal) PARDISO version is                          : 103900117
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Single-level factorization algorithm is turned ON

Summary: ( starting phase is reordering, ending phase is solution )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.006631 s
Time spent in reordering of the initial matrix (reorder)         : 1.326915 s
Time spent in symbolic factorization (symbfct)                   : 0.025506 s
Time spent in data preparations for factorization (parlist)      : 0.001083 s
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 0.063961 s
Time spent in direct solver at solve step (solve)                : 0.005209 s
Time spent in allocation of internal data structures (malloc)    : 0.030586 s
Time spent in additional calculations                            : 0.029167 s
Total time spent                                                 : 1.489059 s

Statistics:
===========
< Parallel Direct Factorization with number of processors: > 8
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b >
             number of equations:           118800
             number of non-zeros in A:      634440
             number of non-zeros in A (%): 0.004495

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs: 0
             number of supernodes:                    55108
             size of largest supernode:               646
             number of non-zeros in L:                3906892
             number of non-zeros in U:                3322300
             number of non-zeros in L+U:              7229192
             gflop   for the numerical factorization: 2.709704

             gflop/s for the numerical factorization: 42.364605

*******Second call, phase=23********

Percentage of computed non-zeros for LL^T factorization
0%   1%   2%   3%   4%   5%   6%   7%   8%   9%   10%   11%   12%   13%   14%   15%   16%   17%   18%   19%   20%   21%   22%   23%   24%   25%   26%   27%   28%   29%   30%   31%   33%   34%   35%   39%   42%   43%   44%   48%   51%   53%   59%   68%   70%   77%   84%   93%   99%   100%

=== PARDISO: solving a real nonsymmetric system ===
Single-level factorization algorithm is turned ON

Summary: ( starting phase is factorization, ending phase is solution )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 0.070308 s
Time spent in iterative solver at solve step (cgs)               : 0.013775 s cgx iterations 1

Time spent in allocation of internal data structures (malloc)    : 0.001296 s
Time spent in additional calculations                            : 0.000001 s
Total time spent                                                 : 0.085381 s

Statistics:
===========
< Parallel Direct Factorization with number of processors: > 8
< Hybrid Solver PARDISO with CGS/CG Iteration >

< Linear system Ax = b >
             number of equations:           118800
             number of non-zeros in A:      634440
             number of non-zeros in A (%): 0.004495

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs: 0
             number of supernodes:                    55108
             size of largest supernode:               646
             number of non-zeros in L:                3906892
             number of non-zeros in U:                3322300
             number of non-zeros in L+U:              7229192
             gflop   for the numerical factorization: 2.709704

             gflop/s for the numerical factorization: 38.540249
iparm(20) :     1
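
As a rough check of the reordering share, the short program below reuses the three timings printed in the summaries above. Reordering alone is about 89% of the phase=13 call, and the share drops as more phase=23 calls follow in the same time step. The number of phase=23 calls per step (k) is a made-up range for illustration, and the model assumes every phase=23 call costs the same as the one logged here.

```fortran
program reorder_share
  implicit none
  ! Timings copied from the two PARDISO summaries (seconds)
  double precision, parameter :: t_reorder = 1.326915d0  ! reorder, phase=13 call
  double precision, parameter :: t_first   = 1.489059d0  ! total, phase=13 call
  double precision, parameter :: t_newton  = 0.085381d0  ! total, one phase=23 call
  integer :: k
  double precision :: share

  ! k = assumed number of phase=23 calls following each phase=13 call
  do k = 0, 5
     share = 100.0d0 * t_reorder / (t_first + k * t_newton)
     print '(a, i2, a, f6.1, a)', 'phase=23 calls per step:', k, &
           '   reorder share:', share, ' %'
  end do
end program reorder_share
```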