Hello,
I am implementing PARDISO (direct and hybrid CGS) in a legacy code to speed it up. The setup is a 3D FDM discretization solved with Newton's method. The calling sequence is as follows: the first call of each time step in the time marching uses phase=13, and the Newton iterations use phase=23.
The hybrid CGS solve with phase=23 runs fast. However, reordering (during the call with phase=13) is very costly: it takes about 85% of the solve time spent in PARDISO (more than expected?), see the timings below, so the overall run is slow. Is the reordering phase parallelized in PARDISO? Could you please share your suggestions for reducing the reordering time in PARDISO?
Thank you,
Sagar
Here are the details:
Case: Non-symmetric, 118,800 unknowns, sparse, ~700,000 nnz, from a 9 (2*2) block band matrix.
Machine: Intel Xeon E5-2687, 3.1 GHz, 32 GB
: Intel Composer XE(Fortran) 2011 Upgrade 11(Package 344), MKL 10.3 Update 11 and 64bit Windows 7 SP 1
Compile: /O1 /Qparallel /Qopenmp /Qmkl:parallel
Link : mkl_blas95_lp64.lib mkl_lapack95_lp64.lib mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib
Pardiso Parameters:
For First call(Phase=13):
iparm(1) = 1 ! no solver default
iparm(2) = 3 ! fill-in reordering from METIS, 0-MIN DEGREE, 2-METIS, 3-OPENMP VERSION
! iparm(3) = mkl_get_max_threads() ! numbers of processors, value of MKL_NUM_THREADS
iparm(4) = 0 ! no iterative-direct algorithm
iparm(5) = 0 ! no user fill-in reducing permutation, return the array
iparm(6) = 0 ! =0 solution on the first n components of x
iparm(7) = 0 ! not in use
iparm(8) = 0 ! numbers of iterative refinement steps
iparm(9) = 0 ! not in use
iparm(10) = 13 ! perturb the pivot elements with 1E-13
iparm(11) = 0 ! 0 = scaling disabled (1 = use nonsymmetric permutation and scaling MPS)
iparm(12) = 0 ! not in use
iparm(13) = 0 ! not in use
iparm(14) = 0 ! Output: number of perturbed pivots
iparm(15) = 0 ! not in use
iparm(16) = 0 ! not in use
iparm(17) = 0 ! not in use
iparm(18) = -1 ! Output: number of nonzeros in the factor LU
iparm(19) = 0 ! Output: Mflops for LU factorization
iparm(20) = 0 ! Output: Numbers of CG Iterations
iparm(27) = 0 ! Check for the matrix, default,
msglvl = 1 ! print statistical information, 0=no 1=yes
mtype = 11 ! real unsymmetric
For second call (Phase=23):
iparm(1) = 1 ! no solver default
iparm(2) = 3 ! fill-in reordering from METIS, 0-MIN DEGREE, 2-METIS, 3-OPENMP VERSION
! iparm(3) = mkl_get_max_threads() ! numbers of processors, value of MKL_NUM_THREADS
iparm(4) = 61 ! iterative-direct algorithm: CGS iteration replaces LU, stopping criterion 1E-6
iparm(5) = 0 ! no user fill-in reducing permutation, use from the last one
iparm(6) = 0 ! =0 solution on the first n components of x
iparm(7) = 0 ! not in use
iparm(8) = 0 ! numbers of iterative refinement steps
iparm(9) = 0 ! not in use
iparm(10) = 13 ! perturb the pivot elements with 1E-13
iparm(11) = 0 ! 0 = scaling disabled (1 = use nonsymmetric permutation and scaling MPS)
iparm(12) = 0 ! not in use
iparm(13) = 0 ! not in use
iparm(14) = 0 ! Output: number of perturbed pivots
iparm(15) = 0 ! not in use
iparm(16) = 0 ! not in use
iparm(17) = 0 ! not in use
iparm(18) = -1 ! Output: number of nonzeros in the factor LU
iparm(19) = 0 ! Output: Mflops for LU factorization
! iparm(20) = 0 ! Output: Numbers of CG Iterations
iparm(27) = 0 ! Check for the matrix, default
msglvl = 1 ! print statistical information
mtype = 11 ! real unsymmetric
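The two parameter sets differ mainly in iparm(4). As a small sketch of how that value is encoded, assuming the documented form iparm(4) = 10*L + K (K=1 selects CGS, K=2 selects CG for symmetric positive definite matrices, and the stopping criterion is 10^-L), the Python helper below decodes it. The function name `decode_iparm4` is made up for illustration and is not part of MKL.

```python
def decode_iparm4(value: int) -> dict:
    """Split a PARDISO iparm(4) value of the assumed form 10*L + K.

    K = 0: the factorization is always computed as the phase requests.
    K = 1: CGS iteration replaces the LU factorization.
    K = 2: CG iteration replaces the Cholesky factorization (SPD matrices).
    L    : the Krylov stopping criterion is 10**(-L).
    """
    level, k = divmod(value, 10)
    if k == 0:
        return {"mode": "direct", "krylov": None, "tolerance": None}
    krylov = {1: "CGS", 2: "CG"}.get(k)
    if krylov is None:
        raise ValueError(f"unsupported K={k} in iparm(4)={value}")
    return {"mode": "hybrid", "krylov": krylov, "tolerance": 10.0 ** (-level)}


# The two settings used in this post:
print(decode_iparm4(0))   # first call, phase=13
print(decode_iparm4(61))  # second call, phase=23
```

With the values above, the first call is a plain direct solve and the second call requests the CGS variant with a 1E-6 stopping criterion.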
Here are the results:
*******First call, phase=13********
Percentage of computed non-zeros for LL^T factorization
0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 35% 37% 39% 42% 43% 44% 46% 48% 55% 56% 62% 73% 81% 88% 95% 99% 100%
=== PARDISO: solving a real nonsymmetric system ===
The local (internal) PARDISO version is : 103900117
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Single-level factorization algorithm is turned ON
Summary: ( starting phase is reordering, ending phase is solution )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.006631 s
Time spent in reordering of the initial matrix (reorder) : 1.326915 s
Time spent in symbolic factorization (symbfct) : 0.025506 s
Time spent in data preparations for factorization (parlist) : 0.001083 s
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 0.063961 s
Time spent in direct solver at solve step (solve) : 0.005209 s
Time spent in allocation of internal data structures (malloc) : 0.030586 s
Time spent in additional calculations : 0.029167 s
Total time spent : 1.489059 s
Statistics:
===========
< Parallel Direct Factorization with number of processors: > 8
< Numerical Factorization with BLAS3 and O(n) synchronization >
< Linear system Ax = b >
number of equations: 118800
number of non-zeros in A: 634440
number of non-zeros in A (%): 0.004495
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 55108
size of largest supernode: 646
number of non-zeros in L: 3906892
number of non-zeros in U: 3322300
number of non-zeros in L+U: 7229192
gflop for the numerical factorization: 2.709704
gflop/s for the numerical factorization: 42.364605
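To make the breakdown of the first call easier to read, here is a short Python sketch that turns the timings printed above into shares of the reported total. The numbers are copied from the log; the dictionary keys are just the short names PARDISO prints in parentheses, with `additional` standing in for the "additional calculations" line.

```python
# Timings (seconds) copied from the phase=13 log above.
phase13_times = {
    "fulladj": 0.006631,
    "reorder": 1.326915,
    "symbfct": 0.025506,
    "parlist": 0.001083,
    "A to LU": 0.000000,
    "numfct": 0.063961,
    "solve": 0.005209,
    "malloc": 0.030586,
    "additional": 0.029167,
}
reported_total = 1.489059  # "Total time spent" from the same log

# Share of each step relative to the reported total.
shares = {name: t / reported_total for name, t in phase13_times.items()}

# Print the steps from most to least expensive.
for name, t in sorted(phase13_times.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} {t:9.6f} s  {shares[name]:6.1%}")
```

On these numbers the reordering step alone is roughly 89% of the phase=13 call, while the numerical factorization is only about 4%.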
*******Second call, phase=23********
Percentage of computed non-zeros for LL^T factorization
0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 33% 34% 35% 39% 42% 43% 44% 48% 51% 53% 59% 68% 70% 77% 84% 93% 99% 100%
=== PARDISO: solving a real nonsymmetric system ===
Single-level factorization algorithm is turned ON
Summary: ( starting phase is factorization, ending phase is solution )
================
Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 0.070308 s
Time spent in iterative solver at solve step (cgs) : 0.013775 s cgx iterations 1
Time spent in allocation of internal data structures (malloc) : 0.001296 s
Time spent in additional calculations : 0.000001 s
Total time spent : 0.085381 s
Statistics:
===========
< Parallel Direct Factorization with number of processors: > 8
< Hybrid Solver PARDISO with CGS/CG Iteration >
< Linear system Ax = b >
number of equations: 118800
number of non-zeros in A: 634440
number of non-zeros in A (%): 0.004495
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 55108
size of largest supernode: 646
number of non-zeros in L: 3906892
number of non-zeros in U: 3322300
number of non-zeros in L+U: 7229192
gflop for the numerical factorization: 2.709704
gflop/s for the numerical factorization: 38.540249
iparm(20) : 1
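As a rough estimate of how much the reordering weighs on a whole time step, the Python sketch below combines the two totals above. It assumes one phase=13 call per time step followed by some number of phase=23 calls, and that the two logged timings are representative of every call; both are assumptions for illustration, not measurements.

```python
# Values copied from the two logs above (seconds).
REORDER = 1.326915            # reorder step inside the phase=13 call
PHASE13_TOTAL = 1.489059      # total of the phase=13 call
PHASE23_TOTAL = 0.085381      # total of one phase=23 call


def reorder_share(phase23_calls: int) -> float:
    """Fraction of solver time per time step spent in reordering,
    assuming one phase=13 call plus `phase23_calls` phase=23 calls."""
    step_total = PHASE13_TOTAL + phase23_calls * PHASE23_TOTAL
    return REORDER / step_total


for k in (1, 3, 5, 10):
    print(f"{k:2d} phase=23 calls per step: reordering is {reorder_share(k):.1%}")
```

With a single phase=23 call per step this gives about 84%, which is in line with the ~85% quoted at the top of the post; the share only drops noticeably when many Newton iterations reuse the same reordering.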