I am trying to solve a large symmetric matrix problem (n = 647296, non-zeros = 343145604) with Cluster Sparse Solver and a distributed matrix (iparm(40) = 1). I built my test program with OpenMP threading and the ILP64 interface (icc 20.4). The workflow is very simple: each rank reads its part of the matrix from test files, then the program does reordering, factorization, and back substitution, releases memory, and reports the final error.
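For reference, the phase sequence described above can be sketched as below. This is a minimal sketch, not the attached program: it assumes ILP64 MKL (64-bit MKL_INT), a real symmetric indefinite matrix (mtype = -2), and local CSR arrays that each rank has already read; the array and function names are placeholders.

```c
/* Sketch of the cluster_sparse_solver phase sequence: reordering +
   symbolic factorization (11), numerical factorization (22), back
   substitution (33), memory release (-1). Hypothetical names; assumes
   ILP64 and distributed CSR input (iparm(40) = 1 in Fortran numbering,
   so b and x are expected on the master rank). */
#include <mpi.h>
#include <mkl_cluster_sparse_solver.h>

void solve_distributed(MKL_INT n, MKL_INT *ia_loc, MKL_INT *ja_loc,
                       double *a_loc, double *b, double *x)
{
    void   *pt[64] = {0};                /* internal solver handle       */
    MKL_INT iparm[64] = {0};
    MKL_INT maxfct = 1, mnum = 1, nrhs = 1, msglvl = 1, error = 0;
    MKL_INT mtype = -2;                  /* real symmetric indefinite    */
    MKL_INT idum = 0, phase;
    int     comm = MPI_Comm_c2f(MPI_COMM_WORLD);

    iparm[0]  = 1;    /* do not use all defaults; entries below are set */
    iparm[1]  = 10;   /* MPI-parallel nested dissection (iparm(2))      */
    iparm[34] = 0;    /* 1-based indexing (iparm(35))                   */
    iparm[39] = 1;    /* distributed CSR matrix input (iparm(40))       */
    /* iparm[40], iparm[41]: first/last row owned by this rank          */

    phase = 11;       /* reordering and symbolic factorization */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a_loc,
                          ia_loc, ja_loc, &idum, &nrhs, iparm, &msglvl,
                          NULL, NULL, &comm, &error);
    phase = 22;       /* numerical factorization */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a_loc,
                          ia_loc, ja_loc, &idum, &nrhs, iparm, &msglvl,
                          NULL, NULL, &comm, &error);
    phase = 33;       /* back substitution */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a_loc,
                          ia_loc, ja_loc, &idum, &nrhs, iparm, &msglvl,
                          b, x, &comm, &error);
    phase = -1;       /* release internal memory */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, NULL,
                          ia_loc, ja_loc, &idum, &nrhs, iparm, &msglvl,
                          NULL, NULL, &comm, &error);
}
```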
The solution looks correct, in the sense that the results can be interpreted visually. However, I have three observations:
=== CPARDISO: solving a symmetric indefinite system ===
1-based array indexing is turned ON
CPARDISO double precision computation is turned ON
Scaling is turned ON
Matching is turned ON
Summary: ( reordering phase )
================ Times: ======
Time spent in calculations of symmetric matrix portrait (fulladj): 5.879387 s
Time spent in reordering of the initial matrix (reorder)         : 0.004014 s
Time spent in symbolic factorization (symbfct)                   : 7.478648 s
Time spent in data preparations for factorization (parlist)      : 0.037476 s
Time spent in allocation of internal data structures (malloc)    : 263.670537 s
Time spent in additional calculations                            : 33.522760 s
Total time spent                                                 : 310.592822 s
The test case is run on 8 Linux (RHEL7) machines using Intel MPI 2019.9.
Please check the attached cpp file for other settings. Unfortunately, I cannot upload the matrix definitions, as the zipped files are too large.
This is one of the smallest cases I am working with. Another case with 610561374 non-zeros (the same matrix with n = 647296, but denser) requires 110 GB on rank 0, 5 GB on each of the other ranks, and 75 GB for DSS, so this time the cluster run consumes much more memory. The case with 1179580274 non-zeros crashes with an allocation problem on a 250 GB machine.
The question is: am I doing something wrong, or is there a bug in the libraries?
I also suggest that you share your matrix with us; it would help us be more specific in our answers.
I'll try to answer some of your questions or ask for more details below.
1. The message you see is a bit strange, and it may very well be an error in the indexing reported by the output message. I believe the functionality itself works fine with both 0- and 1-based indexing.
2. Unless you use iparm(2) = 10, a non-distributed version of the reordering is used, which means that only one MPI process performs it.
3. I am not sure if I read it correctly. When you say that most of the elements are on process 0, this is not for iparm(2) = 10, right? And your second question is why the nnz distribution is uneven for iparm(2) = 10?
Another option, which can potentially reduce the reordering time/memory consumption, is the VBSR format; see the corresponding iparm entry in the documentation.
4. Memory consumption of DSS (I guess you mean the DSS API of PARDISO) vs. Cluster Sparse Solver: also needs a check anyway, but are you numbers the top memory consumption for one of the phases or the overall peak over all phases?
One suggestion from me: could you temporarily turn off the matching (set iparm(13) = 0) and see how your observations change? This should make the situation different, assuming the rows do not intersect with respect to their distribution over MPI processes.
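For anyone following along in C, the toggles mentioned above live in the iparm array; this fragment assumes the usual mapping between MKL's 1-based Fortran numbering, used in the documentation, and 0-based C indexing.

```c
/* Fortran iparm(11) is iparm[10] in C, iparm(13) is iparm[12].
   Setting both to 0 disables scaling and matching, which the CPARDISO
   banner above reported as turned ON. */
iparm[10] = 0;   /* scaling off  */
iparm[12] = 0;   /* matching off */
```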
Last, but not least: I officially recommend that you stop using the DSS API for a non-distributed direct sparse solver. Please switch to the PARDISO API. It might not make a difference for many cases, but for many others we have made, and will continue to make, nice improvements available through the PARDISO API that are not, and likely will not be, available via the DSS API.
Sorry for my previous post; it was written ad hoc. Here is some additional info:
2. I thought about the uneven element distribution in the input matrix. This is due to my matrix build process (which is not part of the attached program). The non-uniform distribution of matrix rows over MPI processes usually gives a uniform non-zero distribution and workload for each process, but not in this case. However, this non-uniform input is run with iparm(2) = 10, so the reordering should also be distributed.
4. Sorry for not being specific; I meant, of course, the PARDISO API. Regarding the memory measurements, I would say these are peaks for phase 11.
I have also run the test with matching turned off (iparm(13) = 0). This time memory consumption in phase 11 is lower and redistributed: rank 7 - 11 GB, rank 3 - 8 GB, the other ranks 1.2-2.5 GB, for a sum of 30 GB. The sum increases to 42 GB in phase 22, but the distribution is still better.
All runs were done with OMP_NUM_THREADS=2 and 8 MPI processes.
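For completeness, a typical launch line for that configuration with Intel MPI might look like the following; the binary name is a placeholder, not the attached program's name.

```shell
# 8 MPI ranks, 2 OpenMP threads each (hypothetical binary name)
export OMP_NUM_THREADS=2
export MKL_NUM_THREADS=2
mpirun -np 8 ./cluster_solver_test
```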
We've received your data. I've quickly checked it and can confirm your findings 1 and 3, and partially 2 (I saw the imbalance, but I haven't carefully estimated it).