Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Cluster sparse solver slower when using three machines instead of two

segmentation_fault
New Contributor I

My machines each have 40 physical cores and are connected over an InfiniBand network. I get an excellent speedup going from one machine to two (~30%), but going from two machines to three there is no speedup; in fact, the factorization takes slightly longer. Yet if I use four machines, I get a decent speedup over two machines (~20%).

 

See my factorization times below:

 

Larger matrix (4.7 million equations):

1 machine: 31 s
2 machines: 18 s
3 machines: 19 s
4 machines: 12 s

 

Small matrix (1.2 million equations):

1 machine: 3.5 s
2 machines: 2.6 s
3 machines: 2.9 s
4 machines: 2.4 s

 

I see there have been other threads about this (linked below). Is this expected behavior, or am I doing something wrong?

 

https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/No-speedup-of-cluster-sparse-solver-beyond-32-cpus/m-p/1082933#M22872

 

https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Direct-Sparse-Solver-for-Clusters-poor-scaling/m-p/1147400#M26817
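
For reference, a minimal harness of the kind used for the timings above might look like the sketch below: it times only the numerical factorization (phase 22) of cluster_sparse_solver under MPI. The tiny 5x5 SPD test matrix and the default iparm settings are illustrative assumptions, not my actual 4.7-million-equation system.

/* Minimal sketch: time the factorization phase of cluster_sparse_solver.
 * Assumptions: tiny SPD test matrix, default iparm, matrix data on rank 0. */
#include <stdio.h>
#include <mpi.h>
#include "mkl_cluster_sparse_solver.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 5x5 SPD test matrix, upper triangle, CSR, 1-based indexing. */
    MKL_INT n = 5;
    MKL_INT ia[6]  = {1, 4, 6, 8, 10, 11};
    MKL_INT ja[10] = {1, 2, 4, 2, 3, 3, 5, 4, 5, 5};
    double  a[10]  = {9., 1., 1., 8., 1., 7., 1., 6., 1., 5.};
    double  b[5]   = {1., 1., 1., 1., 1.}, x[5];

    void   *pt[64]    = {0};  /* solver handle, must start zeroed */
    MKL_INT iparm[64] = {0};  /* iparm[0] = 0: use default settings */
    MKL_INT maxfct = 1, mnum = 1, mtype = 2 /* real SPD */, nrhs = 1;
    MKL_INT msglvl = 0, error = 0, perm[5];
    int comm = MPI_Comm_c2f(MPI_COMM_WORLD);

    MKL_INT phase = 11;  /* reordering and symbolic factorization */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                          perm, &nrhs, iparm, &msglvl, b, x, &comm, &error);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    phase = 22;          /* numerical factorization: the timed step */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                          perm, &nrhs, iparm, &msglvl, b, x, &comm, &error);
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("factorization: %.3f s (error = %d)\n",
               MPI_Wtime() - t0, (int)error);

    phase = -1;          /* release internal solver memory */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                          perm, &nrhs, iparm, &msglvl, b, x, &comm, &error);
    MPI_Finalize();
    return 0;
}

The runs above launch one MPI rank per machine (e.g. mpirun -n 4 across four hosts), with MKL_NUM_THREADS controlling the threads per rank.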

 

1 Solution
Kirill_V_Intel
Employee

Hello!

To be honest, your finding doesn't strike me as unexpected. Parallelism inside direct sparse solvers is very complex: the solver distributes an elimination tree among its parallel workers (threads and processes), with MPI processes as the workers at the outermost level. Since this is essentially a binary tree, it is much easier to balance when the number of workers is a power of two.
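
To make the power-of-two point concrete, here is a toy model (purely illustrative, not how the solver actually schedules work): four equal-work subtrees at depth 2 of the elimination tree, assigned greedily to MPI processes.

/* Toy model: makespan when 4 equal-work elimination-tree subtrees are
 * split among 1..4 processes. Equal subtree work is an assumption. */
#include <stdio.h>

int main(void) {
    const int subtrees = 4;  /* subtrees at depth 2 of a binary tree */
    for (int procs = 1; procs <= 4; ++procs) {
        /* The busiest process gets ceil(subtrees / procs) subtrees,
         * and it determines the elapsed time (makespan). */
        int makespan = (subtrees + procs - 1) / procs;
        printf("%d process(es): makespan = %d subtree(s)\n", procs, makespan);
    }
    return 0;
}

It prints a makespan of 4, 2, 2, 1 subtrees for 1 through 4 processes: a third worker buys nothing over two, which qualitatively mirrors the 18 s vs. 19 s timings above.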

So I suspect that load imbalance is hindering performance in the 3-process case. This is partly supported by the fact that going from 2 to 4 nodes did not scale perfectly either, which hints that the available parallelism is becoming limited, so any imbalance shows up more visibly as a side effect.

As for how to evaluate the imbalance quantitatively, there are no special features for that in the cluster sparse solver. I believe some sort of general-purpose profiler would show it.
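
As a rough first check before reaching for a profiler, one could gather per-rank wall times around the timed region (sketch below; report_imbalance is a hypothetical helper, not an MKL feature). Caveat: ranks tend to leave a collective solver call at nearly the same moment, so a profiler that attributes time to MPI waits gives a truer picture.

#include <stdio.h>
#include <mpi.h>

/* Print min/max/avg of a per-rank timing; imbalance shows up as a
 * max/avg ratio well above 1. */
static void report_imbalance(double local_s, MPI_Comm comm) {
    int rank, size;
    double mn, mx, sum;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    MPI_Reduce(&local_s, &mn, 1, MPI_DOUBLE, MPI_MIN, 0, comm);
    MPI_Reduce(&local_s, &mx, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
    MPI_Reduce(&local_s, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    if (rank == 0)
        printf("min %.3f s  max %.3f s  avg %.3f s  (max/avg = %.2f)\n",
               mn, mx, sum / size, mx * size / sum);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    double t0 = MPI_Wtime();
    /* ... the region of interest, e.g. the phase-22 call from the
     * harness in the question, goes here ... */
    report_imbalance(MPI_Wtime() - t0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}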

With the information you provided and without diving deep into the actual case, it's hard to say more.

Best,
Kirill

