- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi all

We use the PARDISO solver (Parallel Studio 2018) in our FEM code for a symmetric indefinite system and discovered that the time spent in the factorization step may vary significantly. We use the following solver parameters, all other values are set to zero:

iparm [0] = 1;

iparm [1] = 2;

iparm [2] = 4;

iparm [9] = 8;

The solver output of a slow and fast example can be found below. The total time differs almost by a factor of 2 despite similar size of the equation system. This makes PARDISO clearly less attractive than our iterative GPU solver in certain cases. Is this a performance issue/bug? Do we use bad settings? Or is it just normal behavior? Any help is highly appreciated.

Thank you very much and best regards

David

**Slow example:**

Summary: ( starting phase is reordering, ending phase is solution ) ================ Times: ====== Time spent in calculations of symmetric matrix portrait (fulladj): 0.703717 s Time spent in reordering of the initial matrix (reorder) : 11.242095 s Time spent in symbolic factorization (symbfct) : 3.675149 s Time spent in data preparations for factorization (parlist) : 0.107626 s Time spent in copying matrix to internal data structure (A to LU): 0.000000 s Time spent in factorization step (numfct) : 133.596461 s Time spent in direct solver at solve step (solve) : 1.954850 s Time spent in allocation of internal data structures (malloc) : 0.250255 s Time spent in additional calculations : 4.518999 s Total time spent : 156.049152 s Statistics: =========== Parallel Direct Factorization is running on 6 OpenMP < Linear system Ax = b > number of equations: 2956396 number of non-zeros in A: 79437854 number of non-zeros in A (%): 0.000909 number of right-hand sides: 1 < Factors L and U > number of columns for each panel: 96 number of independent subgraphs: 0 < Preprocessing with state of the art partitioning metis> number of supernodes: 338571 size of largest supernode: 17752 number of non-zeros in L: 3101919901 number of non-zeros in U: 1 number of non-zeros in L+U: 3101919902 gflop for the numerical factorization: 20373.427694 gflop/s for the numerical factorization: 152.499756

**Fast example:**

Summary: ( starting phase is reordering, ending phase is solution ) ================ Times: ====== Time spent in calculations of symmetric matrix portrait (fulladj): 0.462895 s Time spent in reordering of the initial matrix (reorder) : 10.686642 s Time spent in symbolic factorization (symbfct) : 3.656114 s Time spent in data preparations for factorization (parlist) : 0.108843 s Time spent in copying matrix to internal data structure (A to LU): 0.000000 s Time spent in factorization step (numfct) : 64.537849 s Time spent in direct solver at solve step (solve) : 1.397830 s Time spent in allocation of internal data structures (malloc) : 0.240506 s Time spent in additional calculations : 4.692960 s Total time spent : 85.783639 s Statistics: =========== Parallel Direct Factorization is running on 6 OpenMP < Linear system Ax = b > number of equations: 2961080 number of non-zeros in A: 83916044 number of non-zeros in A (%): 0.000957 number of right-hand sides: 1 < Factors L and U > number of columns for each panel: 96 number of independent subgraphs: 0 < Preprocessing with state of the art partitioning metis> number of supernodes: 308944 size of largest supernode: 6004 number of non-zeros in L: 2958633165 number of non-zeros in U: 1 number of non-zeros in L+U: 2958633166 gflop for the numerical factorization: 8909.331438 gflop/s for the numerical factorization: 138.048162

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi David,

Your question is a curious one. Let me see if I can give some general thoughts about what might be going on here and then if you are further interested, maybe afterwards we can take a closer look at your specific cases.

Before we continue, I do want to give some thoughts about the iterative (GPU) vs direct sparse solvers. IF you know some things about your matrices and can come up with a decent preconditioner, then the iterative methods are almost always going to be faster and simpler. However, when there is no known good preconditioner to help the iterative method converge or you need a certain level of accuracy beyond what is typcially available in iterative methods, the sparse direct solvers become your best options. It is up to you to decide what is best for your situation.

Now, while the two matrices appear to be similar in terms of nrows/ncols and number of non zeros (nnz), there are other aspects to consider for a sparse matrix factorization. You can see that there are some other statistics reported in your two cases. Pay attention to the gflops for the numerical factorization and the gflops/s for the numerical factorization.

Slow matrix:

gflops for num fact: 20373

gflops/s : 152.5

Fast matrix:

gflops for num fact: 8909

gflops/s : 138

In other words, there is a lot more work being done in the slow matrix and it is actually being done faster than in the fast case.

Now, this extra work is not from a larger amount of fill in since the nnz in output matrix L are fairly comparable:

slow matrix: 3101919901

fast matrix: 2958633165

So where is this extra work coming from? It is hard to answer definitively without being able to play with the matrices ourselves, but we have some ideas.

1. You report the matrices are symmetric indefinite. This means that some sort of pivoting will be happening in addition to the reordering of the matrix, to achieve stability in the solution. If the matrix truly is symmetric indefinite, then this is necessary, but if your are not sure, you could check with matrix as type symmetric positive definite. If it is not pos def, then the error message in INFO will tell you so. Symmetric Positivedefinite matrices do not require the extra pivoting for stability. You have not turned on any special pivoting and so are getting the standard diagonal pivoting. This is highly dependent on the values of the matrices. It could be that the extra work being done in the slow case comes from the diagonal pivoting having to work harder to find the right pivot.

2. The matrices have different sparsity patterns which can be seen most obviously by looking at the number of supernodes and size of the largest supernode.

slow matrix: 338571 supernodes with largest supernode containing 17752 rows

fast matrix: 308944 supernodes with largest supernode containing 6004 rows

To review a little, a supernode represents a set of rows (or columns) that have a similar sparsity pattern and can be nicely condensed and treated as a single object in the factorizations. The parallelism in the numerical factorization comes from the elimination tree of supernodes, which is essentially a tree (forest) representing order of dependence of supernodes. We start at the leaves and work our way up through the tree and nodes on separate branches can be processed independent of each other. Different sparsity patterns lead to different elimination trees and different amounts of parallelism. Running the same analysis with sequential threading would give a better comparison between the two matrices. If they still differ with sequential threading, then likely it is the pivoting that is causing the excess work, but if they become similar, then it is more likely the parallelism is better in one than in the other. Sadly, both of these cases help us understand but there isn't much we can do to change or improve them.

3. I have read a number of case studies dealing with other matrix comparison puzzles like this. In one of those cases, the matrices had exactly the same matrix profile and only the values differed. In this case, one was fast and the other slow and the root cause turned out to be that the slow matrix had values that ended up requiring use of denormals in the interior computations which were much slower. When a flush to zero compiler command was added, things got better in speed but lost some accuracy... This is a completely different case and likely not the issue here, but I do mention it because there are many things that can cause differences in performance even for very similar initial matrices.

Now, I would recommend that you try the matrix with the Variable BSR (VBSR) internal matrix format turned on (for example, iparm[37-1] = -80, meaning join block patterns that match at least 80% of the nonzeros into a dense block storage) and using a two level factorization routine (for example, iparm[24-1]=1 ). These tend to give better performance for large symmetric indefinite matrices like these ones. Both should improve, although I cannot guarantee it will even out the two matrices's times. In summary, the variation appears (without further exploration) to be an expected behaviour for the direct sparse algorithms.

Best,

Spencer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi Spencer

Thank you very much for your reply. We did additional tests with the following results:

1) The matrices are symmetric indefinite.

2) I did additional simulations with "MKL_NUM_THREADS = 1". The time spent in the factorization step differs here by a factor of 3 compared to 2 in the previous setup. It means also that the slow case shows a better speed-up in the parallel run (743/133=5.5 vs 243/64=3.8). Based on your explanations, I suppose that the pivoting is the main reason for the significant difference between the two cases.

**Slow example:**

Summary: ( starting phase is reordering, ending phase is solution ) ================ Times: ====== Time spent in calculations of symmetric matrix portrait (fulladj): 1.085540 s Time spent in reordering of the initial matrix (reorder) : 13.419811 s Time spent in symbolic factorization (symbfct) : 11.839950 s Time spent in data preparations for factorization (parlist) : 0.109764 s Time spent in copying matrix to internal data structure (A to LU): 0.000000 s Time spent in factorization step (numfct) : 743.159449 s Time spent in direct solver at solve step (solve) : 12.795250 s Time spent in allocation of internal data structures (malloc) : 0.260194 s Time spent in additional calculations : 5.149989 s Total time spent : 787.819947 s Statistics: =========== Parallel Direct Factorization is running on 1 OpenMP < Linear system Ax = b > number of equations: 2956396 number of non-zeros in A: 79437854 number of non-zeros in A (%): 0.000909 number of right-hand sides: 1 < Factors L and U > number of columns for each panel: 128 number of independent subgraphs: 0 < Preprocessing with state of the art partitioning metis> number of supernodes: 335787 size of largest supernode: 17752 number of non-zeros in L: 3113468317 number of non-zeros in U: 1 number of non-zeros in L+U: 3113468318 gflop for the numerical factorization: 20427.725412 gflop/s for the numerical factorization: 27.487675

**Fast example:**

Summary: ( starting phase is reordering, ending phase is solution ) ================ Times: ====== Time spent in calculations of symmetric matrix portrait (fulladj): 0.778442 s Time spent in reordering of the initial matrix (reorder) : 12.954742 s Time spent in symbolic factorization (symbfct) : 12.192561 s Time spent in data preparations for factorization (parlist) : 0.122582 s Time spent in copying matrix to internal data structure (A to LU): 0.000000 s Time spent in factorization step (numfct) : 243.043820 s Time spent in direct solver at solve step (solve) : 5.061005 s Time spent in allocation of internal data structures (malloc) : 0.226744 s Time spent in additional calculations : 5.627243 s Total time spent : 280.007139 s Statistics: =========== Parallel Direct Factorization is running on 1 OpenMP < Linear system Ax = b > number of equations: 2961080 number of non-zeros in A: 83916044 number of non-zeros in A (%): 0.000957 number of right-hand sides: 1 < Factors L and U > number of columns for each panel: 128 number of independent subgraphs: 0 < Preprocessing with state of the art partitioning metis> number of supernodes: 306027 size of largest supernode: 6004 number of non-zeros in L: 2972758765 number of non-zeros in U: 1 number of non-zeros in L+U: 2972758766 gflop for the numerical factorization: 8963.768923 gflop/s for the numerical factorization: 36.881287

3) I will test that. Update will follow...

4) The two level factorization shows a speed-up of 7% for the slow case. There is no speed-up if it's used in combination with the VBSR format.

I will test the flush to zero compiler option and come back to you. If you want to take a closer look at the matrices, we could share the data of course.

Best regards

David

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi David,

I apologize for not getting back sooner, I have been on vacation for a while and am just getting back into the swing of things. :) So it does look like it is related to the pivoting. If you would be willing to upload your two matrices and your reproducer code, we can look at it and see if there is something that can be done to help... This is a curious case and warrants some further investigation. I am a little surprised to hear that the VBSR + two level factorization doesn't help at all, but there could be a lot of things going on. If you are willing to share, we can take a look. If privacy is not an issue you, you can just upload to this forum. Otherwise we can send you a link for a place to submit it more privately...

Best Regards,

Spencer

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page