How does reordering affect locality in the cluster sparse solver?

Peschke__Hans · ‎05-05-2019

Hi, I'm using the Intel MKL cluster sparse solver, version 2019.3, with Intel MPI Library 2018.5 to solve the sparse symmetric systems that occur in a Levenberg-Marquardt algorithm for large non-linear least squares problems. I use the distributed matrix input format (iparm[39]=3) together with the distributed parallel nested dissection and symbolic factorization (iparm[1]=10).

Because the matrix to be factorized is too large to fit on any single computer, each computer computes and allocates only the part that it passes to the cluster sparse solver. Currently, the order of the rows is not controlled in any particular way. Since the distributed input to the cluster sparse solver must be given as a subrange of the rows per computer, the row ranges are chosen to balance the number of nonzeros (nnz) of the matrix on each computer.

The question is: how does the reordering affect the locality of the rows on the different computers? In other words, does an unlucky initial row order, and hence an unlucky local part of the distributed matrix on each computer, increase the communication and memory consumption? And if so, is it possible to use the permutation from the analysis phase, or any other method, to permute the rows and columns so that locality improves (that is, so that communication and memory consumption are reduced)?
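For readers who want to experiment with the setup described above, here is a minimal sketch of one way to choose nnz-balanced row ranges from a zero-based CSR row pointer array, plus a crude locality proxy. It is not the poster's actual code, and it does not model what the solver does internally. The function names `balanced_row_ranges` and `off_range_nnz` are hypothetical, and counting entries whose column index falls outside the owner's row range is only an assumed stand-in for "coupling to other computers".

```python
from bisect import bisect_left


def balanced_row_ranges(row_ptr, num_procs):
    """Split rows into contiguous, non-empty ranges with roughly equal nnz.

    row_ptr is a zero-based CSR row pointer array of length n_rows + 1.
    Returns one inclusive (first_row, last_row) pair per process.
    """
    n_rows = len(row_ptr) - 1
    if num_procs < 1 or num_procs > n_rows:
        raise ValueError("need 1 <= num_procs <= number of rows")
    total_nnz = row_ptr[-1]
    bounds = [0]
    for p in range(1, num_procs):
        # Ideal cumulative nnz at the end of range p - 1.
        target = total_nnz * p / num_procs
        # First row boundary whose cumulative nnz reaches the target.
        cut = bisect_left(row_ptr, target)
        # Keep this range and all remaining ranges non-empty.
        cut = max(cut, bounds[-1] + 1)
        cut = min(cut, n_rows - (num_procs - p))
        bounds.append(cut)
    bounds.append(n_rows)
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_procs)]


def off_range_nnz(row_ptr, col_idx, ranges):
    """Per range, count entries whose column lies outside that row range.

    This is an assumed, simplified proxy for coupling between processes,
    not a measurement of the solver's real communication volume.
    """
    counts = []
    for first, last in ranges:
        entries = col_idx[row_ptr[first]:row_ptr[last + 1]]
        counts.append(sum(1 for c in entries if c < first or c > last))
    return counts


# Full pattern of a 4x4 tridiagonal matrix (illustrative data only).
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
ranges = balanced_row_ranges(row_ptr, 2)
print(ranges, off_range_nnz(row_ptr, col_idx, ranges))
```

For this small example the two ranges are rows 0 to 1 and rows 2 to 3, each holding 5 nonzeros, and each range has one entry that refers to a column owned by the other range. Comparing this count before and after applying a candidate permutation gives a rough idea of whether the permutation clusters coupled rows together.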

Kirill_V_Intel · ‎05-10-2019

Hello,

Your question is quite broad but I will try to address it.

First, internal memory transfers do depend on how the initial matrix is distributed among processes, though it is hard to predict whether a given change will decrease or increase communication. The goal of reordering in general is to reduce fill-in, so the algorithm tries to decrease the overall memory consumption.
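To illustrate why the elimination order matters for fill-in, here is a small self-contained sketch that counts fill edges by symbolic elimination on the graph of a symmetric sparsity pattern. It is a textbook-style toy, not the nested dissection algorithm used inside the solver. The function name `fill_in` and the "arrow" example pattern are assumptions chosen for illustration.

```python
def fill_in(n, edges, order):
    """Count fill edges created by eliminating vertices in the given order.

    edges lists the off-diagonal nonzeros (i, j) of a symmetric n x n
    pattern. Eliminating a vertex connects all of its remaining neighbours
    pairwise; each newly created connection is one fill entry.
    """
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        if i != j:
            adj[i].add(j)
            adj[j].add(i)
    fill = 0
    for v in order:
        nbrs = sorted(adj.pop(v))
        for u in nbrs:
            adj[u].discard(v)
        # Turn the remaining neighbours of v into a clique.
        for k, a in enumerate(nbrs):
            for b in nbrs[k + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
    return fill


# "Arrow" pattern: vertex 0 is coupled to every other vertex.
arrow = [(0, j) for j in range(1, 5)]
print(fill_in(5, arrow, [0, 1, 2, 3, 4]))  # hub eliminated first
print(fill_in(5, arrow, [1, 2, 3, 4, 0]))  # hub eliminated last
```

Eliminating the hub first creates 6 fill entries, because the four remaining vertices become fully connected, while eliminating it last creates none. The same matrix therefore needs very different amounts of factor storage depending only on the ordering.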

Second, I don't think that using the permutation returned from the analysis phase would noticeably affect the overall performance, since presumably most of the time will still be spent in the factorization and solve phases.

Third, your questions are quite theoretical, and these aspects are handled by complicated internal algorithms, so it is hard to give any definitive answers. Is there a practical reason behind your questions? When you run our Cluster Sparse Solver, do you see any problems with memory consumption, performance, or scaling?

Best,
Kirill