EDIT: I re-uploaded the attached archive with only the relevant files.
Hello,
I am testing the cluster version of Pardiso (cluster_sparse_solver) on a simple benchmark, and I cannot obtain good scaling in the symbolic factorization (phase 11) and solve (phase 33) steps.
The benchmark solves Ax = b, where A is a "quasi-tridiagonal" matrix of size N with additional nonzero entries at (1,3) and (N,N-2), so that every row has exactly three nonzeros, and the right-hand side is set to 1. The system is evenly distributed between the MPI processes (following PETSc's formula: nRowsOnProc = N/size + ((N % size) > rank)), and the rows do not overlap. This is essentially a generalization of the example file cl_solver_unsym_distr_c.c to an arbitrary matrix size and number of MPI processes.
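For reference, here is a minimal sketch of how each rank could assemble its local block of that matrix in 1-based distributed CSR format (this is not necessarily identical to the attached reproducer; the 2/-1 values and the variable names are only illustrative):

#include <stdlib.h>
#include "mkl.h"   /* MKL_INT */

/* Assemble the local rows of the quasi-tridiagonal matrix on this rank.
   Row 1 gets columns (1,2,3), row N gets (N-2,N-1,N), every other row i
   gets (i-1,i,i+1), so each row has exactly three nonzeros; the RHS is 1. */
static MKL_INT assemble_local(MKL_INT N, int rank, int size,
                              MKL_INT **ia, MKL_INT **ja, double **a, double **b,
                              MKL_INT *firstRow)
{
    MKL_INT nLocal = N / size + ((N % size) > rank);                      /* PETSc-style split */
    *firstRow = rank * (N / size) + (rank < N % size ? rank : N % size);  /* 0-based offset */

    *ia = malloc((nLocal + 1) * sizeof(MKL_INT));
    *ja = malloc(3 * nLocal * sizeof(MKL_INT));
    *a  = malloc(3 * nLocal * sizeof(double));
    *b  = malloc(nLocal * sizeof(double));

    MKL_INT nnz = 0;
    (*ia)[0] = 1;                                     /* 1-based CSR */
    for (MKL_INT i = 0; i < nLocal; i++) {
        MKL_INT row = *firstRow + i;                  /* global 0-based row index */
        MKL_INT c0  = (row == 0) ? 0 : (row == N - 1) ? N - 3 : row - 1;
        for (int k = 0; k < 3; k++) {
            (*ja)[nnz] = c0 + k + 1;                  /* 1-based column index */
            (*a)[nnz]  = (c0 + k == row) ? 2.0 : -1.0;
            nnz++;
        }
        (*ia)[i + 1] = nnz + 1;
        (*b)[i] = 1.0;                                /* RHS set to 1 */
    }
    return nLocal;
}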
The system is solved on a cluster with 32 nodes available, each with 128 GB of memory. For now I am only assessing scaling with respect to the number of MPI processes, so cluster Pardiso is run with 1 MPI process per node and threading disabled. The option iparm(2) = 10 (MPI reordering) is used, along with iparm(11) = 0 (disable scaling) and iparm(13) = 0 (disable matching), as required by the documentation for iparm(2); in the C file these are iparm[1], iparm[10] and iparm[12].
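In the C source this boils down to something like the following (a sketch only, assuming the 1-based, fully distributed input format used in cl_solver_unsym_distr_c.c; nLocal and firstRow are the caller-side local row bounds from the sketch above):

MKL_INT iparm[64] = { 0 };
iparm[0]  = 1;                  /* do not use the solver defaults                            */
iparm[1]  = 10;                 /* iparm(2)  = 10: MPI version of the nested dissection      */
iparm[10] = 0;                  /* iparm(11) = 0 : scaling disabled                          */
iparm[12] = 0;                  /* iparm(13) = 0 : matching disabled                         */
iparm[34] = 0;                  /* 1-based ia/ja indexing, matching the CSR arrays above     */
iparm[39] = 2;                  /* iparm(40) = 2 : A, b and x are distributed                */
iparm[40] = firstRow + 1;       /* iparm(41): first global row of the local block (1-based)  */
iparm[41] = firstRow + nLocal;  /* iparm(42): last  global row of the local block (1-based)  */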
I use oneAPI 2022.1.2 (although the mpi directory is 2021.5.1) and link against MKL with the 32-bit interface, threading disabled, and CPARDISO, ScaLAPACK and BLACS enabled. I tried the combinations gcc + Open MPI and icx + Intel MPI, and both yield similar results.
For two tests with systems of size n = 1e7 and n = 5e7 (see the attached figures), the factorization (phase 22) time scales with the number of MPI processes, but the symbolic factorization (reordering) time and the solve time both increase, so there is clearly something I am doing wrong, unless the benchmark itself is not ideal.
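For reference, the per-phase timings above are measured roughly as follows (a sketch only; pt, maxfct, mnum, mtype, n, nrhs, idum, msglvl, err and comm = MPI_Comm_c2f(MPI_COMM_WORLD) are set up as in cl_solver_unsym_distr_c.c):

MKL_INT phases[3] = { 11, 22, 33 };   /* reordering, factorization, solve */
for (int p = 0; p < 3; p++) {
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phases[p], &n,
                          a, ia, ja, &idum, &nrhs, iparm, &msglvl,
                          b, x, &comm, &err);
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("phase %d: %.3f s (error = %d)\n",
               (int)phases[p], MPI_Wtime() - t0, (int)err);
}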
I attached a working example, which I set up as follows (with the icx compiler and Intel MPI):
source <path/to/oneAPI>/setvars.sh
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=<path/to>/icx -DMKL_DIR=<path/to/mkl>/lib/cmake/mkl -DWITH_IMPI=1 -DMKL_32=1 -DENABLE_BLACS=1 -DENABLE_CPARDISO=1 -DENABLE_SCALAPACK=1 ..
(this may need to be run twice, as MKL's ENABLE_XXX options are the ones from the MKLConfig.cmake)
make
The main C executable (distributedMatrixPardiso) is run by specifying the size of the matrix, for instance with Intel’s mpirun and a hostfile:
mpirun -n 32 -ppn 1 -f hostfile ./bin/distributedMatrixPardiso 10000000
Any advice is welcome!
Thank you for your time,
Arthur
I do not have experience with cluster MKL PARDISO, but this benchmark looks quite strange to me. I doubt it is supposed to scale well with general sparse solvers. Why have you chosen this one, and did you try other tests with sparse matrices of more complicated structure? There are more efficient parallel methods for solving tridiagonal problems. For testing and benchmarking MKL PARDISO, I would recommend trying some matrices from the SuiteSparse collection that are large enough (say 500,000 unknowns and 5,000,000 nonzeros).
Hi,
Could you specify what the hostfile is in the context of mpirun? In the command provided, there is a reference to a hostfile, but the actual path to the hostfile is not specified. The command only includes the path to the executable program. (For more information on controlling process placement, see Controlling Process Placement with the Intel® MPI Library)
Regards,
Aleksandra
Aleksandra,
In a multi-node setup, one common way to specify which nodes an MPI program should run on is to create a hostfile (a text file with one IP address or DNS host name per line) and then, for Intel MPI, pass "-f <hostfile>" or "-hostfile <hostfile>" along with -ppn and -n to specify the number of ranks. This assumes the same application and folder structure (and source code, if any) is present on each of those hosts (often maintained with rsync), so that the MPI runtime can launch the program on each machine and connect them all.
Note: when you are working on a shared cluster and using a job manager such as SLURM or LSF, a hostfile is often created for you by that system based on how many nodes have been allocated to you, but sometimes you need to extract it from its built-in environment variables.
Best Regards,
Spencer
Actually, the link you provided describes an alternative method that should work as well.
Hello,
Yes, as Spencer_P said, the hostfile is simply a text file with the names of the nodes of the cluster, as described in the slightly older documentation here: https://www.intel.com/content/www/us/en/docs/mpi-library/developer-guide-linux/2021-10/running-an-mpi-program.html
Following morskaya_svinka_1's comment, here is some more context. I want to use cluster Pardiso to accelerate the solution of the linear systems in a finite element code, in which the matrices are real and, in general, not symmetric. I initially ran some tests with large-ish matrices (1M to 25M rows) arising from the discretization of a simple Laplace equation on unstructured meshes, but also obtained poor scaling with an MPI-only configuration, so I tried an even simpler problem (the tridiagonal case) with perfect load balancing between the processes. Solving a tridiagonal-like problem is not my end goal; otherwise I would turn to better-suited solvers.
I ran more tests with matrices from the SuiteSparse collection as suggested, and I managed to obtain good scaling on up to 32 nodes with a hybrid MPI/OpenMP configuration for a CFD matrix (380k rows, 37M nnz, https://sparse.tamu.edu/Fluorem/RM07R). I am still not sure why the tridiagonal matrix is not a good test case, though; any explanation is welcome.
What is also surprising, however, is that I obtain good scaling by setting (in C) iparm[1] = 3 (multithreaded reordering), and much worse scaling with iparm[1] = 10 (MPI version of the reordering algorithm), all other parameters remaining the same, despite running on a cluster:
- First, setting iparm[1] = 10 gives worse symbolic factorization times: with a single OpenMP thread, the time increases when going from 1 to 2 nodes, and it is also worse on 4 nodes than on 1.
- Moreover, as I understand it, iparm[1] should only affect the symbolic factorization (reordering); however, in my tests it has an important impact on the actual (non-symbolic) factorization time, which is dominant for the CFD matrix (see the attached figures with iparm[1] = 3 or 10, matching and scaling disabled, i.e. iparm[10] = iparm[12] = 0).
I will run more tests with larger matrices, but I'm quite surprised by those results.
Thank you to everyone for your time,
Arthur
"I am not sure why the tridiagonal matrix is not a good test case however, any explanation is welcome."
When you solve a sparse linear system, the bottleneck is performing BLAS3 operations in the supernode updates. There are two levels of parallelism: at the level of the elimination tree (distributing the BLAS3 operations between processes) and within the BLAS3 operations themselves. But when you create too many processes, the elimination tree may have too few independent branches, and the BLAS3 operations themselves can become too small for that number of processes to achieve a speed-up.
"as I understand it, iparm[1] only affects the symbolic factorization (reordering), however in my tests it has an important impact on the actual (non-symbolic) factorization time"
The results of the analysis phase determine the factorization performance, and the METIS reordering produces different results when you use iparm[1] = 2, 3 or 10.
In the case of only one node, setting iparm[1] = 10 actually runs the OpenMP version (iparm[1] = 3). For 2 and 4 nodes with iparm[1] = 10 you indeed run the MPI version, which is not as efficient as the OpenMP version in this case.
As for the second question, that is why iparm[1] impacts not only the symbolic factorization but also the non-symbolic factorization time. Both implementations give mathematically the same results; however, they may use different data structures and ways of partitioning the work, which can later affect the performance of the non-symbolic factorization.
