- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear fellows,
I'm optimizing the CPU-time of a mixed finite element simulation for the Darcy's equation of a particular finite-element library coded in C++. The method results in a saddle-point problem where the solution x is found by solving a linear system Ax = b. A is a sparse symmetric indefinite matrix, and b is the source term. The library' default approach uses PARDISO CGS implementation, which delivers a consistent solution most of the time (in fact after thousands of simulations, I've encountered only 1 inconsistent result). The issue with the current implementation arises after comparing the CPU-time of the statically condensed system A'x=b with the original Ax=b. In fact, the condensed system is faster for small problems, but for larger problems, the time is roughly the same, even when the number of degrees of freedom of A' is one order of magnitude lower than A. Further investigation using PARDISO messaging options, showed that although the dimensions of A and A' are a magnitude apart, the number of non-zeros in the factor L is close. It is important to remark that CGS A is previously ordered using SLOAN [SLO86] algorithm to minimize matrix' profile and wavefront and the program is supplied with a permutation vector equivalent to the identity matrix, otherwise the method either breaks or does not converge.
The PARDISO's suggested approach for saddle point problems works for small/middle-sized problems but is unstable for larger ones and, therefore, is unsuited for the general-purpose library I'm working with. For large problems the factors have so many perturbed pivots that each iteration refinement step increases the numerical error. Nonetheless, in one case, the simulation time dropped from 230 s to 8 s, the difference is big enough that optimizing the solver seems extremely important. This approach uses Bunch-Kaufman 1x1 and 2x2 pivoting and either the Minimum Degree or Multiscale Nested Dissection algorithm for reducing fill-ins, scaling and matching are enabled, a range of iteration refinement number of steps are tested, matrix A/A' is previously ordered using SLOAN algorithm, but this time the program uses its own permutation procedure.
Iparm values and Pardiso messages
The implementation of the library for symmetric indefinite systems calls pardiso_64 twice, for decomposition (phase 12) and for forward substitution (phase 33). The non-default iParm values for and the user supplied permutation vector for phase 12 in the default implementation are:
fParam[34] = 1;
fParam[59] = 0;
fParam[4] = 1;
fParam[3 ] = 10*6+2;
\\permutation vector:
fPermutation.resize(n);
for (long long i=0; i<n; i++) {
fPermutation[i] = i;
}
This implementation has little fill reduction control since fParam[4] restrict Pardiso from ordering the matrix, but when fParam[4] = 0 and fParam[10]=fParam[12] = \alpha, \alpha being either 0 or 1, the following error appears: Pardiso:: Zero pivot, numerical fact. or iterative refinement problem. The same error occurs when A is not previously reordered with the sloan algorithm. Phase 33 is pretty standard and I report its configuration isn't required.
To illustrate the problem, the output of phase 12 is reported next, alongside with the value of certain iparm outputs.
// Phase 12 Pardiso messages
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===
Percentage of computed non-zeros for LL^T factorization
1 % 2 % 3 % 4 % 5 % 6 % 7 % 8 % 9 % 10 % 11 % 12 % 13 % 14 % 15 % 16 % 17 % 18 % 19 % 20 % 21 % 22 % 23 % 24 % 25 % 26 % 27 % 28 % 29 % 30 % 31 % 32 % 33 % 34 % 35 % 36 % 37 % 38 % 39 % 40 % 41 % 42 % 43 % 44 % 45 % 46 % 47 % 48 % 49 % 50 % 51 % 52 % 53 % 54 % 55 % 56 % 57 % 58 % 59 % 60 % 61 % 62 % 63 % 64 % 65 % 66 % 67 % 68 % 69 % 70 % 71 % 72 % 73 % 74 % 75 % 76 % 77 % 78 % 79 % 80 % 81 % 82 % 83 % 84 % 85 % 86 % 87 % 88 % 89 % 90 % 91 % 92 % 93 % 94 % 95 % 96 % 97 % 98 % 99 % 100 %
=== PARDISO: solving a symmetric indefinite system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
User provided fill-in reducing permutation is turned ON
Single-level factorization algorithm is turned ON
Summary: ( starting phase is reordering, ending phase is factorization )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.007723 s
Time spent in reordering of the initial matrix (reorder) : 0.000130 s
Time spent in symbolic factorization (symbfct) : 0.165558 s
Time spent in data preparations for factorization (parlist) : 0.022134 s
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 2.676305 s
Time spent in allocation of internal data structures (malloc) : 0.041895 s
Time spent in additional calculations : 0.040864 s
Total time spent : 2.954609 s
Statistics:
===========
Parallel Direct Factorization is running on 16 OpenMP
< Hybrid Solver PARDISO with CGS/CG Iteration >
< Linear system Ax = b >
number of equations: 82432
number of non-zeros in A: 639744
number of non-zeros in A (%): 0.009415
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 64
number of independent subgraphs: 0
< Preprocessing with input permutation >
number of supernodes: 16519
size of largest supernode: 11
number of non-zeros in L: 35771839
number of non-zeros in U: 1
number of non-zeros in L+U: 35771840
gflop for the numerical factorization: 17.535356
gflop/s for the numerical factorization: 6.552077
// Printing some iparm outputs:
Number of perturbed pivots(iParm[13]) = 0
Number of positive eingenvalues(iParm[21]) = 16384
Number of negative eingenvalues(iParm[22]) = 66048
// Phase 33 Pardiso messages
=== PARDISO: solving a symmetric indefinite system ===
Summary: ( solution phase )
================
Times:
======
Time spent in iterative solver at solve step (cgs) : 0.217816 s cgx iterations 2
Time spent in allocation of internal data structures (malloc) : 0.000049 s
Time spent in additional calculations : 0.000000 s
Total time spent : 0.217865 s
Statistics:
===========
Parallel Direct Factorization is running on 16 OpenMP
< Hybrid Solver PARDISO with CGS/CG Iteration >
< Linear system Ax = b >
number of equations: 82432
number of non-zeros in A: 639744
number of non-zeros in A (%): 0.009415
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 64
number of independent subgraphs: 0
< Preprocessing with input permutation >
number of supernodes: 16519
size of largest supernode: 11
number of non-zeros in L: 35771839
number of non-zeros in U: 1
number of non-zeros in L+U: 35771840
gflop for the numerical factorization: 17.535356
gflop/s for the numerical factorization: 6.552077
// User print
nDofs = 82432
Computing Error
Errors =(2.57042e-05, 0.0325949, 0.0325949, 0.0325949,
Remark the number of perturbed pivots, the number of non-zeros in L.
Next, the parameters using the default configuration for saddle-point problems:
fParam[34] = 1;
fParam[59] = 0;
fParam[10] = 1;
fParam[12] = 1;
A combination of parameters iParm[20], iParm[23] and iParam[7] were tested as well. The output for this configuration is:
// Phase 12
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===
Percentage of computed non-zeros for LL^T factorization
1 % 2 % 3 % 4 % 5 % 6 % 7 % 8 % 9 % 10 % 11 % 12 % 13 % 14 % 15 % 16 % 17 % 18 % 19 % 20 % 21 % 22 % 23 % 24 % 25 % 26 % 27 % 28 % 29 % 30 % 31 % 32 % 33 % 34 % 35 % 36 % 37 % 38 % 39 % 40 % 41 % 42 % 43 % 44 % 45 % 46 % 47 % 48 % 49 % 50 % 51 % 52 % 53 % 55 % 72 % 85 % 95 % 100 %
=== PARDISO: solving a symmetric indefinite system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON
Single-level factorization algorithm is turned ON
Summary: ( starting phase is reordering, ending phase is factorization )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.006325 s
Time spent in reordering of the initial matrix (reorder) : 0.000187 s
Time spent in symbolic factorization (symbfct) : 0.046494 s
Time spent in data preparations for factorization (parlist) : 0.001949 s
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 0.064802 s
Time spent in allocation of internal data structures (malloc) : 0.003441 s
Time spent in matching/scaling : 0.397148 s
Time spent in additional calculations : 0.038770 s
Total time spent : 0.559116 s
Statistics:
===========
Parallel Direct Factorization is running on 16 OpenMP
< Linear system Ax = b >
number of equations: 82432
number of non-zeros in A: 639744
number of non-zeros in A (%): 0.009415
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 64
number of independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
number of supernodes: 16931
size of largest supernode: 336
number of non-zeros in L: 5181893
number of non-zeros in U: 1
number of non-zeros in L+U: 5181894
gflop for the numerical factorization: 0.752081
gflop/s for the numerical factorization: 11.605837
// Iparm outputs
Number of perturbed pivots(iParm[13]) = 283
Number of positive eingenvalues(iParm[21]) = 16404
Number of negative eingenvalues(iParm[22]) = 66028
//Phase 33
=== PARDISO: solving a symmetric indefinite system ===
Summary: ( solution phase )
================
Times:
======
Time spent in direct solver at solve step (solve) : 0.006467 s
Time spent in additional calculations : 0.014259 s
Total time spent : 0.020726 s
Statistics:
===========
Parallel Direct Factorization is running on 16 OpenMP
< Linear system Ax = b >
number of equations: 82432
number of non-zeros in A: 639744
number of non-zeros in A (%): 0.009415
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 64
number of independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
number of supernodes: 16931
size of largest supernode: 336
number of non-zeros in L: 5181893
number of non-zeros in U: 1
number of non-zeros in L+U: 5181894
gflop for the numerical factorization: 0.752081
gflop/s for the numerical factorization: 11.605837
Pardiso:: perturbed pivots caused iterative refinement.
// User print
nDofs = 82432
Computing Error
Errors =(2.00013, 222.035, 222.044, 222.035,
DOF = 82432
The approximation error is not right, and the error gets worse for greater iteration refinement steps.
The problem reported here is the smallest problem in which the Bunch-Kaufman pivoting is unstable. Since this problem is relatively small, the decomposition time is not very expressive, but as the problem scales, the default implementation gets relatively slower.
QUESTION
I wonder if
- Is there fill-reduction options for CGS method?
- Is there pivoting alternatives from the Bunch-Kauffman or Diagonal 1x1 methods?
- Is there other solution in PARDISO kit to accelerate the saddle-point problem or in other MKL toolkit?
With regards,
Victor Oliari
[SLO86] SLOAN S. W.: An algorithm for profile and wavefront reduction of sparse matrices. International Journal for Numerical Methods in Engineering 23, 2 (1986), 239–251.
Link Copied
- « Previous
-
- 1
- 2
- Next »
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Vidya,
The outputs obtained with enhanced precision are attached here.
Is there a Pardiso configuration to make this problem run faster and reliable? Note that there is no fill-in reduction for the CGS solver, and generally if we don't provide the permutation vector (iparm[4]), the simulation crashes.
With regards,
Victor
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Victor,
We are working on this issue. we will get back to you soon.
Regards,
Vidya.
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page
- « Previous
-
- 1
- 2
- Next »