Intel® oneAPI Math Kernel Library

Cluster sparse solver slower when using three machines instead of two

segmentation_fault

My machines each have 40 physical cores and are connected over an InfiniBand network. I get an excellent speedup going from one machine to two (~30%), but going from two machines to three there is no speedup; in fact, the factorization takes slightly longer. With four machines I again get a decent speedup over two machines (~20%).
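For completeness, this is the kind of quick, generic MPI/OpenMP sanity check (nothing MKL-specific, just a sketch) that prints which host each rank runs on and how many OpenMP threads it sees:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Sanity check of the run topology: one line per MPI rank with its host
   name and OpenMP thread count. Generic sketch, not part of the solver. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &namelen);
    printf("rank %d of %d on %s, OpenMP max threads = %d\n",
           rank, size, host, omp_get_max_threads());
    MPI_Finalize();
    return 0;
}

This confirms that each rank sits on its own machine and sees the expected number of threads.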

 

See my factorization times below:

 

Larger matrix (4.7 million equations)

1 machine: 31 s
2 machines: 18 s
3 machines: 19 s
4 machines: 12 s

 

Small matrix (1.2 million equations)

1 machine: 3.5 s
2 machines: 2.6 s
3 machines: 2.9 s
4 machines: 2.4 s

 

I see there have been some other threads about this, linked below. Is this expected behavior, or am I doing something wrong?

 

https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/No-speedup-of-cluster-sparse-solver-...

 

https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Direct-Sparse-Solver-for-Clusters-po...

 

1 Solution
Kirill_V_Intel
Employee

Hello!

To be honest, your finding doesn't strike me as unexpected. Parallelism inside the direct sparse solvers is very complex: the factorization is organized around an elimination tree, which is distributed among the parallel workers (threads and processes). As with any binary tree structure, it is easier to balance when the number of workers is a power of two (with MPI processes as the workers at the outermost level).
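As a rough illustration (a toy model of my own, not the actual scheduling inside MKL): suppose the factorization work W splits into two roughly equal subtrees at the top of the elimination tree and whole subtrees are handed to MPI processes. The wall time is then bounded by the side with the fewest processes:

#include <stdio.h>

/* Toy model, not MKL's real scheduler: work W splits into two equal
   subtrees; the p processes are divided between them; work inside a
   subtree scales ideally. Wall time = the less staffed (slower) side. */
int main(void) {
    const double W = 1.0;                    /* total work, arbitrary units */
    for (int p = 1; p <= 4; ++p) {
        double t;
        if (p == 1) {
            t = W;                           /* one process does everything */
        } else {
            int left  = p / 2;               /* processes on subtree A */
            int right = p - left;            /* processes on subtree B */
            int slow  = left < right ? left : right;
            t = (W / 2.0) / slow;            /* critical path = slower side */
        }
        printf("%d process(es): estimated time %.2f * W\n", p, t);
    }
    return 0;
}

In this toy model, 3 processes give the same critical path as 2 (one process is left alone on one half of the tree), while 4 processes halve it again, which is roughly the pattern in your measurements.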

So I suspect that load imbalance is hindering performance in the 3-process case. This is partly supported by the fact that going from 2 to 4 nodes did not scale perfectly either, which hints that the available parallelism is becoming limited, so any imbalance shows up more clearly as a side effect.

As for evaluating the imbalance quantitatively, there are no special features for that in the cluster sparse solver; some sort of general-purpose profiler would show it, I believe.
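If you just want a rough number without a full profiler, you could time the factorization on every rank from a common starting point and compare the spread (a sketch only; factorize() is a hypothetical wrapper around your phase = 22 call, and since the solver synchronizes internally this shows only part of the picture):

#include <mpi.h>
#include <stdio.h>

/* Per-rank timing of one code region, e.g. the phase = 22 factorization
   call of cluster_sparse_solver. factorize() is just a placeholder. */
static void factorize(void) { /* your cluster_sparse_solver phase = 22 call */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);              /* common start line */
    double t0 = MPI_Wtime();
    factorize();
    double t_local = MPI_Wtime() - t0;        /* this rank's time */

    double t_min, t_max;
    MPI_Reduce(&t_local, &t_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("factorization: min %.3f s, max %.3f s over %d ranks\n",
               t_min, t_max, size);

    MPI_Finalize();
    return 0;
}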

With the information you provided and without diving deep into the actual case, it's hard to say more.

Best,
Kirill

