I am trying to determine whether the Intel Direct Sparse Solver for Clusters is a good parallel solver for our application. I have implemented the sparse solver in Fortran to solve a linear FEA problem. For the call to the sparse solver, I am seeing speed-up with increasing number of MPI processes, but not seeing good speed-up with increasing number of threads per MPI process.
In this case, the A matrix is generated from a finite-difference-type grid in distributed format (DCSR). Node ordering is such that distributing the matrix results in gaps in the sparse matrix storage of each part, similar to the example given here: https://software.intel.com/en-us/articles/intel-math-kernel-library-parallel-direct-sparse-solver-for-clusters
This case is a linear time-domain problem so we factorize the matrix once and then solve it many times with evolving boundary conditions. Should I expect to see good scaling over MPI processes and OpenMP threads in the solve phase of the direct sparse solver for clusters?
I have benchmarked a 10 million DOF model on a Linux cluster with the number of MPI processes ranging from 2 to 128, with 1 process per hardware node, and the number of OpenMP threads per node ranging from 2 to 16. I see speed-up when increasing the number of MPI processes up to about 32 processes, but very little improvement from using more OpenMP threads: the speed-up on 8 or 16 threads is about the same as on 2 threads.
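For reference, a hybrid MPI/OpenMP run with 1 rank per node can be launched along these lines with Intel MPI (the rank count, thread count, and executable name `fea_solver` here are just illustrative examples, not the exact benchmark configuration):

```shell
# Hybrid MPI/OpenMP launch: 32 ranks, 1 rank per node, 8 threads per rank.
# -ppn is Intel MPI's processes-per-node flag; values are examples only.
export OMP_NUM_THREADS=8
mpirun -np 32 -ppn 1 ./fea_solver
```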
I am using the following iparm variables:
iparm(1) = 1       ! no solver defaults; use the settings below
iparm(2) = 2       ! fill-in reducing reordering: nested dissection (METIS)
iparm(10) = 8      ! pivoting perturbation eps = 1.0E-8
iparm(18) = -1     ! report number of non-zeros in the factors
iparm(27) = 0      ! no matrix consistency check
iparm(28) = 1      ! single-precision factorization and solve
iparm(40) = 2      ! distributed matrix, RHS and solution (DCSR format)
iparm(41) = ibegin ! first row of this rank's domain
iparm(42) = iend   ! last row of this rank's domain
iparm(60) = 1      ! in-core mode, with out-of-core fallback if memory is insufficient
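For context, the factorize-once / solve-many pattern mentioned above is driven through the phase argument of cluster_sparse_solver. A minimal sketch (declarations and error handling abbreviated; the array names ia, ja, a, b, x and loop variables are illustrative, not my actual code):

```fortran
! Sketch of the factorize-once / solve-many calling pattern.
! Assumes pt(64) and iparm(64) are set up as above, and ia, ja, a
! hold this rank's DCSR domain, with b and x sized for the local rows.
phase = 12   ! analysis + numerical factorization, done once
call cluster_sparse_solver(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
                           perm, nrhs, iparm, msglvl, b, x, MPI_COMM_WORLD, error)

do istep = 1, nsteps
   ! ... update b with the boundary conditions for this time step ...
   phase = 33   ! forward/backward solve only, reusing the stored factors
   call cluster_sparse_solver(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
                              perm, nrhs, iparm, msglvl, b, x, MPI_COMM_WORLD, error)
end do

phase = -1   ! release the solver's internal memory
call cluster_sparse_solver(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
                           perm, nrhs, iparm, msglvl, b, x, MPI_COMM_WORLD, error)
```

With mtype = 2 (real symmetric positive definite) the solve phase is mostly memory-bandwidth bound, which is relevant to the thread-scaling question below.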
Is the Direct Sparse Solver for Clusters suitable here, and what issues should I look at to try to improve scaling with the number of OpenMP threads? Thanks for your help.
Thank you for reporting the performance results here. Could you please tell us how you compile your code, and share your benchmark results? You can also post your results privately, along with some details such as what kind of cluster you are running on, at https://supporttickets.intel.com/servicecenter
As I mentioned, thread scalability mainly depends on the data itself; for sparse problems in particular, memory access and computation may not scale with the number of threads. In any case, please share your results.
Thanks for your response and help with understanding the Direct Sparse Solver for Clusters. I am compiling with Intel Fortran 2018.3 and Intel MPI 2018 on RHEL6. I will provide the benchmark results and details of the cluster in a support ticket.
When you say that thread scalability mainly depends on the data itself, are you referring to the structure of the matrix? It is symmetric positive definite and very sparse, with typically fewer than 28 non-zeros per row, but the initial bandwidth can be large depending on the boundary conditions.
I have looked at some of the scaling plots in the papers and documentation on the cluster solver, but I wasn't always certain whether the reported timing covered the full reordering, factorization and solve, or just one of the phases. In a 2013 report by Kalinkin, an example "3Dspectralwave" problem shows compute time decreasing with the number of nodes for both factorization and solve, with the statement "Factorization and solving steps scale well in terms of memory and performance", but it doesn't show how compute time scales with the number of threads per node.

I'm trying to determine whether I should expect good scaling with increasing OpenMP threads. With the non-cluster Intel PARDISO solver we do see speed-up on this problem as OpenMP threads increase, but not with the cluster solver. Are particular KMP_AFFINITY settings recommended with the cluster solver?
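On the affinity question, settings along these lines are commonly tried for hybrid MPI/OpenMP runs with Intel MPI and the Intel OpenMP runtime. The specific values below are assumptions to experiment with, not verified recommendations for this solver:

```shell
# Pin each MPI rank to its own set of cores and keep its OpenMP threads
# within that set. Values are starting points to experiment with.
export I_MPI_PIN_DOMAIN=omp                  # one pinning domain per rank, sized by OMP_NUM_THREADS
export OMP_NUM_THREADS=8
export KMP_AFFINITY=granularity=fine,compact # bind threads to adjacent cores within the domain
```

Comparing `compact` against `scatter` can help distinguish a bandwidth bottleneck from a pinning problem.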
Thanks for your help.