I am trying to solve a large symmetric matrix problem (n = 647296, non-zeros = 343145604) with Cluster Sparse Solver and a distributed matrix (iparm(40) = 1). I built my test program with OpenMP threading and the ILP64 interface (icc 20.4). The workflow is very simple: each rank reads its part of the matrix from test files, then the program does reordering, factorization, and back substitution, releases memory, and reports the final error.
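For reference, the phase sequence described above can be sketched as below. This is a minimal sketch, not the attached program: it assumes ILP64 MKL (64-bit MKL_INT), a real symmetric indefinite matrix (mtype = -2), and local CSR arrays that each rank has already read; the array and function names are placeholders.

```c
/* Sketch of the cluster_sparse_solver phase sequence: reordering +
   symbolic factorization (11), numerical factorization (22), back
   substitution (33), memory release (-1). Hypothetical names; assumes
   ILP64 and distributed CSR input (iparm(40) = 1 in Fortran numbering,
   so b and x are expected on the master rank). */
#include <mpi.h>
#include <mkl_cluster_sparse_solver.h>

void solve_distributed(MKL_INT n, MKL_INT *ia_loc, MKL_INT *ja_loc,
                       double *a_loc, double *b, double *x)
{
    void   *pt[64] = {0};                /* internal solver handle       */
    MKL_INT iparm[64] = {0};
    MKL_INT maxfct = 1, mnum = 1, nrhs = 1, msglvl = 1, error = 0;
    MKL_INT mtype = -2;                  /* real symmetric indefinite    */
    MKL_INT idum = 0, phase;
    int     comm = MPI_Comm_c2f(MPI_COMM_WORLD);

    iparm[0]  = 1;    /* do not use all defaults; entries below are set */
    iparm[1]  = 10;   /* MPI-parallel nested dissection (iparm(2))      */
    iparm[34] = 0;    /* 1-based indexing (iparm(35))                   */
    iparm[39] = 1;    /* distributed CSR matrix input (iparm(40))       */
    /* iparm[40], iparm[41]: first/last row owned by this rank          */

    phase = 11;       /* reordering and symbolic factorization */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a_loc,
                          ia_loc, ja_loc, &idum, &nrhs, iparm, &msglvl,
                          NULL, NULL, &comm, &error);
    phase = 22;       /* numerical factorization */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a_loc,
                          ia_loc, ja_loc, &idum, &nrhs, iparm, &msglvl,
                          NULL, NULL, &comm, &error);
    phase = 33;       /* back substitution */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a_loc,
                          ia_loc, ja_loc, &idum, &nrhs, iparm, &msglvl,
                          b, x, &comm, &error);
    phase = -1;       /* release internal memory */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, NULL,
                          ia_loc, ja_loc, &idum, &nrhs, iparm, &msglvl,
                          NULL, NULL, &comm, &error);
}
```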
The solution looks correct, in the sense that the results can be interpreted visually. However, I have three observations:
=== CPARDISO: solving a symmetric indefinite system ===
1-based array indexing is turned ON
CPARDISO double precision computation is turned ON
Scaling is turned ON
Matching is turned ON
Summary: ( reordering phase )
================ Times: ======
Time spent in calculations of symmetric matrix portrait (fulladj): 5.879387 s
Time spent in reordering of the initial matrix (reorder)         : 0.004014 s
Time spent in symbolic factorization (symbfct)                   : 7.478648 s
Time spent in data preparations for factorization (parlist)      : 0.037476 s
Time spent in allocation of internal data structures (malloc)    : 263.670537 s
Time spent in additional calculations                            : 33.522760 s
Total time spent                                                 : 310.592822 s
The test case is run on 8 Linux (RHEL7) machines using Intel MPI 2019.9.
Please check the attached cpp file for other settings. Unfortunately, I cannot upload the matrix definitions, as the zipped files are too large.
This is one of the smallest cases I am working with. Another case with 610561374 non-zeros (the same matrix with n = 647296, but denser) requires 110 GB on rank 0, 5 GB on each of the other ranks, and 75 GB for DSS, so this time the cluster run consumes much more memory. The case with 1179580274 non-zeros crashes with an allocation problem on a 250 GB machine.
The question is: am I doing something wrong, or is there a bug in the libraries?
I also suggest that you share your matrix with us; it would help us be more specific in our answers.
I'll try to answer some of your questions or ask for more details below.
1. The message you see is a bit strange, and it may very well be an error in the indexing reported by the output message. I believe the functionality itself works fine with both 0- and 1-based indexing.
2. Unless you use iparm(2) = 10, a non-distributed version of the reordering is used, which means that only one MPI process performs it.
3. I am not sure if I read it correctly. When you say that most of the elements are on process 0, this is not for iparm(2) = 10, right? And your second question is why the nnz distribution is uneven for iparm(2) = 10?
Another option, which can potentially reduce the reordering time/memory consumption, is the VBSR format; see the corresponding iparm entry in the documentation.
4. Memory consumption of DSS (I guess you mean the DSS API of PARDISO) vs. Cluster Sparse Solver: also needs a check anyway, but are you numbers the top memory consumption for one of the phases or the overall peak over all phases?
One suggestion from me: could you temporarily turn off the matching (set iparm(13) = 0) and see how your observations change? This should make the situation different, assuming the rows do not intersect with respect to their distribution over MPI processes.
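For anyone following along in C, the toggles mentioned above live in the iparm array; this fragment assumes the usual mapping between MKL's 1-based Fortran numbering, used in the documentation, and 0-based C indexing.

```c
/* Fortran iparm(11) is iparm[10] in C, iparm(13) is iparm[12].
   Setting both to 0 disables scaling and matching, which the CPARDISO
   banner above reported as turned ON. */
iparm[10] = 0;   /* scaling off  */
iparm[12] = 0;   /* matching off */
```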
Last, but not least: I officially recommend that you stop using the DSS API for a non-distributed direct sparse solver. Please switch to the PARDISO API. It might not make a difference for many cases, but for many others we have made, and will continue to make, nice improvements available through the PARDISO API that are not, and likely will not be, available via the DSS API.
Sorry for my previous post; it was written ad hoc. Here is some additional info:
2. I thought about the uneven element distribution in the input matrix. This is due to my matrix build process (which is not part of the attached program). The non-uniform distribution of matrix rows over MPI processes usually gives a uniform non-zero distribution and workload for each process, but not in this case. However, this non-uniform input is run with iparm(2) = 10, so the reordering should also be distributed.
4. Sorry for not being specific; I meant, of course, the PARDISO API. Regarding the memory measurements, I would say these are peaks for phase 11.
I have also run the test with matching turned off (iparm(13) = 0). This time memory consumption in phase 11 is lower and redistributed: rank 7 - 11 GB, rank 3 - 8 GB, the other ranks 1.2-2.5 GB, for a sum of 30 GB. The sum increases to 42 GB in phase 22, but the distribution is still better.
All runs were done with OMP_NUM_THREADS=2 and 8 MPI processes.
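For completeness, a typical launch line for that configuration with Intel MPI might look like the following; the binary name is a placeholder, not the attached program's name.

```shell
# 8 MPI ranks, 2 OpenMP threads each (hypothetical binary name)
export OMP_NUM_THREADS=2
export MKL_NUM_THREADS=2
mpirun -np 8 ./cluster_solver_test
```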
We've received your data. I've quickly checked it and can confirm your findings 1 and 3, and partially 2 (I saw the imbalance, but I haven't carefully estimated it).