topic Laura, in Intel® oneAPI Math Kernel Library

cluster_sparse_solver Schur complement - how to distribute the Schur matrix?

Laura_S_3 — Wed, 25 Apr 2018 15:48:55 GMT

We need to find the Schur complement matrix of a sparse matrix A of form

A₁₁ A₁₂

A₂₁ A₂₂

I.e., we want the Schur block defined by S = A₂₂ - A₂₁ A₁₁^-1 A₁₂, which can be done by the new sparse solver update. In cluster_sparse_solver, S is stored as a dense matrix.

Our problem is, S is too large to store on a single compute node, so we would like it to be distributed across all compute nodes. We can distribute the input matrix A using the current interface, but the Schur matrix S is always returned to MPI process 0. Is there some option to make it return distributed in MKL 2018 update 2?

If there is no option, we know we can work around the problem by partitioning A₂₂ further and finding the Schur complement matrix of each section. However, this would appear to require a lot of repeated calculations. Is there some way to save intermediate calculations involving A₁₁ so that these calculations don't need repeating for subsequent Schur calculations?

Thank you,

Laura

Laura,

Spencer_P_Intel — Mon, 07 May 2018 21:05:40 GMT

Laura,

As implemented currently in MKL 2018 update 2, the Schur complement, S, is always assembled on the master MPI process. There is no option to have it returned distributed. The Schur complement is computed as a by product of the general LU factorization which is done in parallel, so it is possible it could be parallelized in the future, but at the moment it is not supported. We will look into it.

I don't really see any viable ways to avoid the recomputation of data. One idea was to break the matrix into the 4 CSR matrices by blocks and then you could construct the schur complement yourself. That is, use cluster sparse solver to solve

A_11 X = A_12

but this requires A_12 to be stored in dense format. This is probably undesirable. But if that is acceptable, then you would have to implement your own distributed SpM*M multiplication and sparse + dense addition. Intel MKL Sparse BLAS IE has a (OpenMP or TBB) parallel implementation of SpM * M (https://software.intel.com/en-us/mkl-developer-reference-c-mkl-sparse-mm) but not a distributed MPI version. This could be used internal to each MPI processor, but the communication would have to be done yourself. Likewise the sparse + dense add could be done fairly simply if you save into the dense matrix but the MPI communication would have to be implemented yourself.

I realize this is inconvenient, but it is what is available at the moment.

Best,

Spencer

Laura,

Spencer_P_Intel — Mon, 07 May 2018 22:31:26 GMT

Laura,

I thought about your issue a little more and I wanted to point out another solution that again probably doesn't really work for your case but is worth mentioning... :) In mkl 2019 beta (now generally available), we have implemented a way to output the Schur complement in sparse (csr) format instead of full dense format for the SMP version of sparse solvers (Intel MKL Pardiso) but not yet the cluster (MPI) version (Cluster Sparse Solvers). This would help if the schur complement is relatively sparse; however this is not always guaranteed to be the case. The documentation can be found in https://software.intel.com/en-us/mkl-developer-reference-c-2019-beta-pardiso-export. See the usage example at the bottom of the link. The key differences are the following:

Use iparm[36-1] = -1 or -2 for sparse instead of 1 or 2 for dense, depending on whether you just want to solve for schur or also want the full LU to be available.

If sparse output format (< 0) is chosen, then the nnz in the schur complement is filled into iparm[36-1] after the reordering step. You can then use this to create the proper sized csr schur complement arrays. You pass these arrays to pardiso using the pardiso_export() function, then after calling the factorization stage, the schur complement arrays are filled.

Like I said, this probably doesn't help you since you are using the fully distributed cluster version and so your problem is likely much larger than can be handled on a single node. But if your problem could fit onto a single node (for instance with 56 cores, or 44 cores, etc) then this is a much better solution.

Best,

Spencer