I am using the parallel PARDISO solver from Intel MKL (cluster_sparse_solver) and am having trouble achieving good parallel efficiency. The computations run on Intel Xeon E5-2650 v2 CPUs (8 cores, 2.6 GHz); co-processors are not used. The sparse system has 888 195 DOF. I do not go to larger systems because of the nature of the problem: solving the sparse system is a small part of the whole algorithm, which involves both sparse and dense matrices. Because the algorithm solves several sparse systems with cluster_sparse_solver, its overall scalability ends up poor, and the MKL solver is the cause.
Does anyone know whether all of cluster_sparse_solver is implemented for parallel runs? From the solver's output I see that the CPU time of the first phase of the algorithm does not change as the number of processes increases:
Strong scalability of cluster_sparse_solver (mesh with 296 065 nodes, system with 888 195 DOF)

Number of processes                           1         2         4         8        16        32        64
Phase 11: Analysis                       5.9624    5.8961    5.8318    5.8405    5.8652    5.9099    6.0160
Phase 22: Numerical factorization       14.5922    7.7329    4.6724    2.9844    1.9513    1.5424    1.2885
Phase 33: Solve, iterative refinement    1.3819    0.8207    0.5122    0.3700    0.3089    0.2710    0.5799
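
To make the effect of the constant analysis time concrete, here is a minimal sketch that sums the three phases from the table above and computes speedup and parallel efficiency relative to the single-process run. The numbers are copied from the table; the variable and function names are only illustrative and are not part of MKL.

```python
# Timings copied from the 888 195 DOF table above, one entry per process count.
procs = [1, 2, 4, 8, 16, 32, 64]
analysis = [5.9624, 5.8961, 5.8318, 5.8405, 5.8652, 5.9099, 6.0160]
factorization = [14.5922, 7.7329, 4.6724, 2.9844, 1.9513, 1.5424, 1.2885]
solve = [1.3819, 0.8207, 0.5122, 0.3700, 0.3089, 0.2710, 0.5799]


def speedup(times):
    """Speedup of each run relative to the first (single-process) run."""
    return [times[0] / t for t in times]


# Total solver time per process count is the sum of the three phases.
total = [a + f + s for a, f, s in zip(analysis, factorization, solve)]
total_speedup = speedup(total)
fact_speedup = speedup(factorization)

for p, t, s_tot, s_fact in zip(procs, total, total_speedup, fact_speedup):
    print(f"{p:3d} procs: total {t:8.4f}  speedup {s_tot:5.2f}  "
          f"efficiency {s_tot / p:5.2f}  factorization speedup {s_fact:5.2f}")
```

With these numbers the factorization phase alone keeps speeding up as processes are added, while the total speedup stays below 3, because the analysis phase accounts for most of the remaining time.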
I have contacted the developers at Intel, but unfortunately I did not get an answer. I would be grateful if someone could share their experience with the scalability of cluster_sparse_solver.
Stanislav, fully distributed reordering was added to CPardiso in MKL 2017 Update 1 (released Nov 1, 2016). You can take the evaluation version and check how it works on your side with these workloads.
Thank you for your reply. My colleagues installed the new version of MKL (MKL 2017 Update 1) and I checked the scalability of cluster_sparse_solver again. Unfortunately, the results are similar to those with MKL 2016. I generated a sparse matrix of higher dimension; it now has 3.5 million DOF. Here are the strong-scalability results for cluster_sparse_solver (1 OpenMP thread and several MPI processes).

Number of processes                            1         2         4         8        16
Phase 11: Analysis                       26.4461   26.2817   25.8782   26.5270   26.0985
Phase 22: Numerical factorization        123.822   65.1132   38.7532   23.6410   16.1666
Phase 33: Solve, iterative refinement     4.7206    3.7750    2.5918    1.9842    1.9570

Again, it seems that the first phase of the algorithm is not implemented for parallel computation.
Could you provide an example in which the first phase scales? I could then modify my code to obtain scalability in my algorithm as well.
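
As a rough illustration of why the constant analysis time matters, the sketch below applies Amdahl's law to the 3.5 million DOF timings above. It assumes, purely for the estimate, that the analysis phase is entirely serial and that the other two phases could scale perfectly; neither assumption is a statement about how MKL is implemented, and the names are illustrative.

```python
# Timings copied from the 3.5 million DOF table above.
procs = [1, 2, 4, 8, 16]
analysis = [26.4461, 26.2817, 25.8782, 26.5270, 26.0985]
factorization = [123.822, 65.1132, 38.7532, 23.6410, 16.1666]
solve = [4.7206, 3.7750, 2.5918, 1.9842, 1.9570]

total = [a + f + s for a, f, s in zip(analysis, factorization, solve)]

# Assumption: the analysis phase is the serial part of the single-process run.
serial_fraction = analysis[0] / total[0]

# Amdahl's law: upper bound on speedup as the process count grows without limit.
max_speedup = 1.0 / serial_fraction


def amdahl(p, s):
    """Predicted speedup on p processes for serial fraction s."""
    return 1.0 / (s + (1.0 - s) / p)


measured = [total[0] / t for t in total]

print(f"serial fraction: {serial_fraction:.3f}, speedup bound: {max_speedup:.2f}")
for p, m in zip(procs, measured):
    print(f"{p:2d} procs: measured {m:5.2f}, Amdahl estimate {amdahl(p, serial_fraction):5.2f}")
```

Under these assumptions the analysis phase is about 17% of the single-process time, which caps the overall speedup at a little under 6 no matter how well the factorization scales.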