Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

PARDISO - different time in factorization step

Hora__David
Beginner

Hi all

We use the PARDISO solver (Parallel Studio 2018) in our FEM code for a symmetric indefinite system and discovered that the time spent in the factorization step can vary significantly. We use the following solver parameters; all other values are set to zero:

    iparm [0] =  1;   /* supply non-default iparm values                          */
    iparm [1] =  2;   /* fill-in reducing ordering: nested dissection (METIS)     */
    iparm [2] =  4;   /* reserved in Intel MKL PARDISO (CPU count in the original PARDISO) */
    iparm [9] =  8;   /* pivot perturbation threshold 1.0e-8                      */
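
For context, these values are fed into the usual three-phase PARDISO call sequence (phase 11 = reordering/symbolic factorization, phase 22 = numerical factorization, phase 33 = solve); the statistics shown below presumably come from msglvl = 1. A simplified sketch of such a call sequence (not our exact code; hypothetical array names, CSR upper triangle with 1-based indices):

    #include "mkl_pardiso.h"
    #include "mkl_types.h"

    /* Simplified sketch: solve one symmetric indefinite system A*x = b. */
    static void solve_with_pardiso(MKL_INT n, MKL_INT *ia, MKL_INT *ja,
                                   double *a, double *b, double *x)
    {
        void   *pt[64]    = {0};     /* internal solver memory pointers      */
        MKL_INT iparm[64] = {0};
        MKL_INT maxfct = 1, mnum = 1, nrhs = 1, msglvl = 1, error = 0;
        MKL_INT mtype  = -2;         /* real symmetric indefinite            */
        MKL_INT idum   = 0;          /* dummy permutation argument           */
        double  ddum   = 0.0;        /* dummy rhs/solution for phases 11/22  */
        MKL_INT phase;

        iparm[0] = 1;                /* non-default iparm values             */
        iparm[1] = 2;                /* METIS nested-dissection ordering     */
        iparm[2] = 4;
        iparm[9] = 8;                /* pivot perturbation 1e-8              */

        phase = 11;                  /* reordering + symbolic factorization  */
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, &idum,
                &nrhs, iparm, &msglvl, &ddum, &ddum, &error);

        phase = 22;                  /* numerical factorization (numfct)     */
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, &idum,
                &nrhs, iparm, &msglvl, &ddum, &ddum, &error);

        phase = 33;                  /* forward/backward solve               */
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, &idum,
                &nrhs, iparm, &msglvl, b, x, &error);

        phase = -1;                  /* release internal memory              */
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, &ddum, ia, ja, &idum,
                &nrhs, iparm, &msglvl, &ddum, &ddum, &error);
    }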

The solver output of a slow and a fast example can be found below. The total time differs by almost a factor of 2 despite the similar size of the equation systems. This makes PARDISO clearly less attractive than our iterative GPU solver in certain cases. Is this a performance issue/bug? Are we using bad settings? Or is this just normal behavior? Any help is highly appreciated.

Thank you very much and best regards

David

 

Slow example:

Summary: ( starting phase is reordering, ending phase is solution )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.703717 s
Time spent in reordering of the initial matrix (reorder)         : 11.242095 s
Time spent in symbolic factorization (symbfct)                   : 3.675149 s
Time spent in data preparations for factorization (parlist)      : 0.107626 s
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 133.596461 s
Time spent in direct solver at solve step (solve)                : 1.954850 s
Time spent in allocation of internal data structures (malloc)    : 0.250255 s
Time spent in additional calculations                            : 4.518999 s
Total time spent                                                 : 156.049152 s

Statistics:
===========
Parallel Direct Factorization is running on 6 OpenMP

< Linear system Ax = b >
             number of equations:           2956396
             number of non-zeros in A:      79437854
             number of non-zeros in A (%): 0.000909

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 96
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    338571
             size of largest supernode:               17752
             number of non-zeros in L:                3101919901
             number of non-zeros in U:                1
             number of non-zeros in L+U:              3101919902
             gflop   for the numerical factorization: 20373.427694

             gflop/s for the numerical factorization: 152.499756

 

Fast example:

Summary: ( starting phase is reordering, ending phase is solution )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.462895 s
Time spent in reordering of the initial matrix (reorder)         : 10.686642 s
Time spent in symbolic factorization (symbfct)                   : 3.656114 s
Time spent in data preparations for factorization (parlist)      : 0.108843 s
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 64.537849 s
Time spent in direct solver at solve step (solve)                : 1.397830 s
Time spent in allocation of internal data structures (malloc)    : 0.240506 s
Time spent in additional calculations                            : 4.692960 s
Total time spent                                                 : 85.783639 s

Statistics:
===========
Parallel Direct Factorization is running on 6 OpenMP

< Linear system Ax = b >
             number of equations:           2961080
             number of non-zeros in A:      83916044
             number of non-zeros in A (%): 0.000957

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 96
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    308944
             size of largest supernode:               6004
             number of non-zeros in L:                2958633165
             number of non-zeros in U:                1
             number of non-zeros in L+U:              2958633166
             gflop   for the numerical factorization: 8909.331438

             gflop/s for the numerical factorization: 138.048162

 

 

 

Spencer_P_Intel
Employee

Hi David,

Your question is a curious one.  Let me give some general thoughts about what might be going on here; if you are still interested afterwards, we can take a closer look at your specific cases.

Before we continue, I do want to give some thoughts about iterative (GPU) vs. direct sparse solvers.  If you know some things about your matrices and can come up with a decent preconditioner, then the iterative methods are almost always going to be faster and simpler.  However, when there is no known good preconditioner to help the iterative method converge, or you need a level of accuracy beyond what is typically available from iterative methods, the sparse direct solvers become your best option.  It is up to you to decide what is best for your situation.

Now, while the two matrices appear to be similar in terms of nrows/ncols and number of non-zeros (nnz), there are other aspects to consider for a sparse matrix factorization.  Some of them are reported in the statistics of your two cases; pay particular attention to the gflop count for the numerical factorization and the gflop/s rate at which it runs.

Slow matrix:
    gflops for num fact: 20373
    gflops/s           : 152.5
    
Fast matrix:
    gflops for num fact: 8909
    gflops/s           : 138
    
In other words, there is a lot more work being done in the slow case (roughly 2.3x the gflop count), and it is actually being done at a slightly higher rate than in the fast case.  Dividing work by rate reproduces the numfct times almost exactly: 20373/152.5 ≈ 134 s versus 8909/138 ≈ 65 s.

Now, this extra work does not come from a larger amount of fill-in, since the numbers of non-zeros in the output factor L are fairly comparable:

slow matrix: 3101919901
fast matrix: 2958633165

So where is this extra work coming from?  It is hard to answer definitively without being able to play with the matrices ourselves, but we have some ideas.

1.  You report that the matrices are symmetric indefinite.  This means that some sort of pivoting will happen in addition to the reordering of the matrix, to achieve stability in the solution.  If the matrices truly are symmetric indefinite, then this is necessary, but if you are not sure, you could rerun with the matrix type set to symmetric positive definite; if a matrix is not positive definite, the error code returned by PARDISO will tell you so.  Symmetric positive definite matrices do not require the extra pivoting for stability.  You have not turned on any special pivoting, so you are getting the standard diagonal pivoting, which is highly dependent on the values of the matrices.  It could be that the extra work in the slow case comes from the diagonal pivoting having to work harder to find the right pivots (a minimal sketch of this check follows after this list).

2.  The matrices have different sparsity patterns, which can be seen most clearly in the number of supernodes and the size of the largest supernode.

    slow matrix: 338571 supernodes with largest supernode containing 17752 rows
    fast matrix: 308944 supernodes with largest supernode containing 6004 rows

    To review a little, a supernode represents a set of rows (or columns) that have a similar sparsity pattern and can be condensed and treated as a single object in the factorization.  The parallelism in the numerical factorization comes from the elimination tree of supernodes, which is essentially a tree (forest) representing the order of dependence between supernodes.  We start at the leaves and work our way up through the tree, and nodes on separate branches can be processed independently of each other.  Different sparsity patterns lead to different elimination trees and different amounts of parallelism.  Running the same analysis with sequential threading would give a better comparison between the two matrices.  If they still differ with sequential threading, then it is likely the pivoting that is causing the excess work, but if they become similar, then it is more likely that the parallelism is better in one than in the other.  Sadly, both of these explanations help us understand the difference, but there isn't much we can do to change or improve them.

3.  I have read a number of case studies dealing with matrix comparison puzzles like this.  In one of those cases, the matrices had exactly the same profile and only the values differed; one was fast and the other slow, and the root cause turned out to be that the slow matrix produced values requiring denormals in the interior computations, which are much slower.  When a flush-to-zero compiler option was added, the speed improved but some accuracy was lost...  This is a completely different case and likely not the issue here, but I mention it because there are many things that can cause performance differences even for very similar initial matrices.
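
To make point 1 concrete, here is a minimal sketch of such a check, assuming the Intel MKL C interface with 0-based iparm indexing and hypothetical CSR arrays (upper triangle, 1-based indices).  Calling it with mtype = 2 probes whether the matrix really is positive definite (a zero-pivot error, -4, suggests it is not), while mtype = -2 keeps the symmetric indefinite path with Bunch-Kaufman pivoting enabled via iparm[20].  For point 2, the sequential comparison can be forced with MKL_NUM_THREADS=1 or mkl_set_num_threads(1).

    #include "mkl_pardiso.h"
    #include "mkl_types.h"

    /* Sketch only: run analysis + numerical factorization for a CSR matrix
       with the given matrix type and return the PARDISO error code.
       mtype =  2: real symmetric positive definite
       mtype = -2: real symmetric indefinite, Bunch-Kaufman pivoting enabled */
    static MKL_INT try_factorization(MKL_INT mtype, MKL_INT n,
                                     MKL_INT *ia, MKL_INT *ja, double *a)
    {
        void   *pt[64]    = {0};
        MKL_INT iparm[64] = {0};
        MKL_INT maxfct = 1, mnum = 1, nrhs = 1, msglvl = 1, error = 0;
        MKL_INT idum = 0, phase, fact_error;
        double  ddum = 0.0;

        iparm[0] = 1;                  /* non-default iparm values             */
        iparm[1] = 2;                  /* METIS nested-dissection ordering     */
        if (mtype == -2)
            iparm[20] = 1;             /* 1x1 and 2x2 Bunch-Kaufman pivoting   */

        for (phase = 11; phase <= 22 && error == 0; phase += 11)
            pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, &idum,
                    &nrhs, iparm, &msglvl, &ddum, &ddum, &error);

        fact_error = error;            /* e.g. -4 = zero pivot (not pos. def.) */

        phase = -1;                    /* release internal memory              */
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, &ddum, ia, ja, &idum,
                &nrhs, iparm, &msglvl, &ddum, &ddum, &error);
        return fact_error;
    }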


Now, I would recommend that you try the matrices with the variable BSR (VBSR) internal matrix format turned on (for example, iparm[37-1] = -80, meaning join block patterns that match at least 80% of the non-zeros into a dense block storage) and with the two-level factorization routine (for example, iparm[24-1] = 1); a sketch follows below.  These tend to give better performance for large symmetric indefinite matrices like these.  Both cases should improve, although I cannot guarantee it will even out the two matrices' times.  In summary, the variation appears (without further exploration) to be expected behaviour for the direct sparse algorithms.
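
In C-style 0-based indexing those two settings would look like this (a sketch of the settings only, set before the reordering/analysis call):

    iparm[36] = -80;   /* iparm(37): variable BSR internal format, joining block */
                       /* patterns that match at least 80% of the non-zeros      */
    iparm[23] =   1;   /* iparm(24): two-level scheduling for the parallel       */
                       /* numerical factorization                                */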

Best,
Spencer

Hora__David
Beginner

Hi Spencer

Thank you very much for your reply. We did additional tests with the following results:

1) The matrices are symmetric indefinite.

2) I did additional simulations with MKL_NUM_THREADS=1. The time spent in the factorization step now differs by a factor of about 3, compared to 2 in the previous setup. It also means that the slow case shows a better speed-up in the parallel run (743/133 = 5.5 vs. 243/64 = 3.8). Based on your explanations, I suppose that the pivoting is the main reason for the significant difference between the two cases.

Slow example:

Summary: ( starting phase is reordering, ending phase is solution )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 1.085540 s
Time spent in reordering of the initial matrix (reorder)         : 13.419811 s
Time spent in symbolic factorization (symbfct)                   : 11.839950 s
Time spent in data preparations for factorization (parlist)      : 0.109764 s
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 743.159449 s
Time spent in direct solver at solve step (solve)                : 12.795250 s
Time spent in allocation of internal data structures (malloc)    : 0.260194 s
Time spent in additional calculations                            : 5.149989 s
Total time spent                                                 : 787.819947 s

Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP

< Linear system Ax = b >
             number of equations:           2956396
             number of non-zeros in A:      79437854
             number of non-zeros in A (%): 0.000909

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 128
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    335787
             size of largest supernode:               17752
             number of non-zeros in L:                3113468317
             number of non-zeros in U:                1
             number of non-zeros in L+U:              3113468318
             gflop   for the numerical factorization: 20427.725412

             gflop/s for the numerical factorization: 27.487675

 


Fast example:

Summary: ( starting phase is reordering, ending phase is solution )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.778442 s
Time spent in reordering of the initial matrix (reorder)         : 12.954742 s
Time spent in symbolic factorization (symbfct)                   : 12.192561 s
Time spent in data preparations for factorization (parlist)      : 0.122582 s
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 243.043820 s
Time spent in direct solver at solve step (solve)                : 5.061005 s
Time spent in allocation of internal data structures (malloc)    : 0.226744 s
Time spent in additional calculations                            : 5.627243 s
Total time spent                                                 : 280.007139 s

Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP

< Linear system Ax = b >
             number of equations:           2961080
             number of non-zeros in A:      83916044
             number of non-zeros in A (%): 0.000957

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 128
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    306027
             size of largest supernode:               6004
             number of non-zeros in L:                2972758765
             number of non-zeros in U:                1
             number of non-zeros in L+U:              2972758766
             gflop   for the numerical factorization: 8963.768923

             gflop/s for the numerical factorization: 36.881287

 

3) I will test that. Update will follow...

4) The two-level factorization gives a speed-up of 7% for the slow case. There is no speed-up if it is used in combination with the VBSR format.

I will test the flush-to-zero compiler option and come back to you. If you want to take a closer look at the matrices, we could of course share the data.
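
(For reference, as far as I understand, flush-to-zero can be enabled either via the Intel compiler's -ftz option or at run time with the SSE control intrinsics; a minimal sketch of the run-time variant, not yet tested on our side:)

    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE     */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

    /* Call once early in the program (the MXCSR flags are per thread):
       treat denormal results and inputs as zero, trading a little
       accuracy for speed when denormals occur. */
    static void enable_flush_to_zero(void)
    {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }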

Best regards
David

Spencer_P_Intel
Employee

Hi David, 

I apologize for not getting back sooner; I have been on vacation for a while and am just getting back into the swing of things.  :)  So it does look like it is related to the pivoting.  If you would be willing to upload your two matrices and your reproducer code, we can take a look and see if there is something that can be done to help... This is a curious case and warrants some further investigation.  I am a little surprised to hear that VBSR + two-level factorization doesn't help at all, but there could be a lot of things going on.  If privacy is not an issue for you, you can just upload to this forum; otherwise we can send you a link for a place to submit it more privately...

Best Regards,

Spencer
