
Milosz

Novice


01-25-2021
03:45 AM


Cluster Sparse Solver with Distributed Matrix - reordering/memory problem

I am trying to solve a large symmetric matrix problem (n = 647296, non-zeros = 343145604) with the Cluster Sparse Solver and a distributed matrix (iParam[39] = 1). I built my test program with OpenMP threading and ILP64 (icc 20.4). The workflow is very simple: each rank reads its part of the matrix from test files, does reordering, factorization, and back substitution, releases memory, and reports the final error.

The solution looks correct, as the results can be interpreted visually. However, I have three observations:

- Although I set iParam[34] = 1 (zero-based indexing), I still get this output:
```
=== CPARDISO: solving a symmetric indefinite system ===
1-based array indexing is turned ON
CPARDISO double precision computation is turned ON
Scaling is turned ON
Matching is turned ON
```

- Once the reordering starts, a huge amount of memory is allocated on rank 0: 32 GB (50 GB at peaks), whereas the other ranks use only 4-5 GB. This is strange, because DSS requires 40 GB of memory for this matrix. I would expect the distributed-matrix approach to use about 40 GB / 8 = 5 GB per rank (iParam[1] = 10) in the uniform matrix element distribution case. In my case there are 41806 elements on rank 0 and 228854 elements on rank 7 (the size rises with the rank number), so the first rank has the smallest portion of the matrix.
- The other strange thing is that the report shows about 85% of the total reordering time is spent on memory allocation. It looks similar for the cluster distributed matrix and DSS.
```
Summary: ( reordering phase )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 5.879387 s
Time spent in reordering of the initial matrix (reorder)         : 0.004014 s
Time spent in symbolic factorization (symbfct)                   : 7.478648 s
Time spent in data preparations for factorization (parlist)      : 0.037476 s
Time spent in allocation of internal data structures (malloc)    : 263.670537 s
Time spent in additional calculations                            : 33.522760 s
Total time spent                                                 : 310.592822 s
```
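For the record, the numbers in that summary are self-consistent: the component times sum exactly to the total, and the malloc line alone accounts for roughly 85% of it. A quick check with the values copied from the log:

```cpp
#include <cassert>
#include <cmath>

// Times in seconds, copied from the reordering-phase summary above.
constexpr double fulladj    = 5.879387;
constexpr double reorder    = 0.004014;
constexpr double symbfct    = 7.478648;
constexpr double parlist    = 0.037476;
constexpr double malloc_t   = 263.670537;
constexpr double additional = 33.522760;
constexpr double total      = 310.592822;

// Fraction of the total reordering time spent in internal allocation.
inline double malloc_share() { return malloc_t / total; }
```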

The test case is run on 8 Linux (RHEL7) machines using Intel MPI 2019.9.
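For context, with iParam[39] = 1 each rank passes a contiguous block of rows to the solver, with the first and last row of its domain in iParam[40]/iParam[41]. Below is a sketch of the uniform split I would expect; `row_block` is a hypothetical helper of my own (not an MKL routine), using one-based rows as in the log above:

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: split rows 1..n (one-based) into P contiguous blocks,
// one per MPI rank; the first n % P ranks get one extra row. Returns the
// inclusive [first, last] row pair that would go into iParam[40]/iParam[41]
// for the distributed-matrix input (iParam[39] = 1).
std::pair<long long, long long> row_block(long long rank, long long n, long long P) {
    long long base  = n / P, extra = n % P;
    long long first = rank * base + (rank < extra ? rank : extra) + 1;
    long long rows  = base + (rank < extra ? 1 : 0);
    return {first, first + rows - 1};
}
```

A uniform row split like this still gives uneven non-zeros per rank when the matrix is denser toward high row indices, which would match the 41806 vs. 228854 element counts I reported above.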

Please check the attached cpp for other settings. Unfortunately I cannot upload the matrix definitions, as the zipped files are too large.

This is one of the smallest cases I am working with. Another case with 610561374 non-zeros (still n = 647296, the same matrix but denser) requires 110 GB on rank 0, 5 GB on the other ranks, and 75 GB for DSS, so this time the cluster run is much more memory-consuming. The case with 1179580274 non-zeros crashes with an allocation problem on a 250 GB machine.

The question is: am I doing something wrong, or is there a bug in the libraries?


6 Replies

Gennady_F_Intel

Moderator


01-25-2021
10:23 AM


Milosz,

We need to have these inputs. You may create a private thread and share them with us.

Kirill_V_Intel

Employee


01-25-2021
06:27 PM


Hello Milosz,

I also suggest that you share your matrix with us; it would help us be more specific in our answers.

I'll try to answer some of your questions or ask for more details below.

1. The message you see is a bit strange; it may well be an error in the indexing reported in the output message. I believe the functionality itself works fine with both 0- and 1-based indexing.

2. Unless you use iparm[1] = 10, a non-distributed version of reordering is used, which means that only one MPI process is doing it.

I am not sure if I read it correctly. When you say that most of the elements are on process 0, this is not for iparm[1] = 10, right? And your second question is why the nnz distribution is uneven for iparm[1] = 10?

Another option, which can potentially reduce the reordering time/memory consumption, is the VBSR format; see iparm

4. Memory consumption of DSS (I guess you mean the DSS API of PARDISO) vs. Cluster Sparse Solver: this also needs a check, but are your numbers the top memory consumption for one of the phases, or the overall peak over all phases?

One suggestion from me: could you temporarily turn off the matching (set iparm[12] = 0) and see how your observations change? This should make the situation different, assuming you don't have intersections between the rows w.r.t. the distribution over MPI processes.
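For clarity, here is how the settings discussed in this thread would look together. This is a sketch only; the index meanings follow my reading of the cluster_sparse_solver documentation and should be checked against your actual code:

```cpp
#include <cassert>

typedef long long MKL_INT;  // ILP64 build, as in the original post

// Sketch of the iparm settings mentioned in this thread; double-check the
// index meanings against the oneMKL cluster_sparse_solver documentation.
void set_iparm(MKL_INT iparm[64]) {
    for (int i = 0; i < 64; ++i) iparm[i] = 0;
    iparm[0]  = 1;   // do not use the solver's default values
    iparm[1]  = 10;  // MPI-distributed nested-dissection reordering
    iparm[12] = 0;   // matching off (the experiment suggested here)
    iparm[34] = 1;   // zero-based indexing (the setting the log seems to ignore)
    iparm[39] = 1;   // distributed CSR input
}
```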

Last, but not least: I officially recommend **stopping use of the DSS API** for a non-distributed direct sparse solver. Please switch to the PARDISO API. It might not make a difference for many cases, but for many other cases we have done, and will do, nice improvements available through the PARDISO API which are not, and likely will not be, available via the DSS API.

Best,

Kirill

Milosz

Novice


01-26-2021
03:44 AM


Sorry for my previous post, it was written ad hoc. Here is some additional info:

2. I thought about the uneven element distribution in the input matrix. This is due to my matrix build process (which is not part of the attached program). The non-uniform distribution of matrix rows over MPI processes usually gives a uniform non-zeros distribution and workload for each process, but not in this case. However, this non-uniform input is run with iparm[1] = 10, so reordering should also be distributed.

4. Sorry for not being specific, I meant of course the PARDISO API. Regarding memory measurements, I would say these are peaks for phase 11.

I have also run the test with matching turned off (iparm[12] = 0). This time memory consumption in phase 11 is lower and redistributed: rank 7 - 11 GB, rank 3 - 8 GB, the others 1.2-2.5 GB. The sum is 30 GB. It increases to a 42 GB sum in phase 22, but the distribution is still better.
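Since the imbalance in point 2 comes from my matrix build process, one thing I could try is picking the row boundaries from the non-zero counts instead of splitting rows blindly. A sketch of what I have in mind (a hypothetical helper of mine, assuming per-row non-zero counts are available at build time):

```cpp
#include <cassert>
#include <vector>

// Hypothetical rebalancing sketch: given per-row non-zero counts, choose the
// first row of each rank so that every contiguous block carries roughly
// total_nnz / P non-zeros. Returns P first-row indices (zero-based).
std::vector<long long> nnz_balanced_splits(const std::vector<long long>& row_nnz,
                                           long long P) {
    long long total = 0;
    for (long long c : row_nnz) total += c;
    std::vector<long long> first_row(P, 0);
    long long seen = 0, next_rank = 1;
    for (long long r = 0; r < (long long)row_nnz.size() && next_rank < P; ++r) {
        seen += row_nnz[r];
        // Start the next block once the current one reaches its fair share.
        if (seen * P >= total * next_rank) first_row[next_rank++] = r + 1;
    }
    return first_row;
}
```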

All runs are done with OMP_NUM_THREADS=2 and 8 MPI processes.

Gennady_F_Intel

Moderator


01-26-2021
11:05 PM


Milosz,

We took the inputs you shared with us and will check the behavior. Which version of MKL do you use?

-Gennady

Milosz

Novice


01-26-2021
11:45 PM


I use MKL 2020 u4.

Kirill_V_Intel

Employee


01-27-2021
09:29 PM


Hi Milosz,

We've received your data. I've quickly checked and can confirm your findings 1. and 3., and partially 2. (I saw the imbalance but I haven't carefully estimated it).

Best,

Kirill
