
My machines have 40 physical cores each and are on an InfiniBand network. I get an excellent speedup going from one to two machines (~30%), but going from two to three machines there is no speedup; in fact, the factorization takes slightly longer. With four machines, however, I get a decent speedup compared to two machines (~20%).

See my factorization times below:

Larger matrix (4.7 million equations):

- 1 machine: 31 s
- 2 machines: 18 s
- 3 machines: 19 s
- 4 machines: 12 s

Small matrix (1.2 million equations):

- 1 machine: 3.5 s
- 2 machines: 2.6 s
- 3 machines: 2.9 s
- 4 machines: 2.4 s
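For context, here is the speedup and parallel efficiency implied by these timings (a minimal sketch; the `times` values are the large-matrix numbers from the list above):

```python
# Speedup and parallel efficiency from the measured factorization times.
times = {1: 31.0, 2: 18.0, 3: 19.0, 4: 12.0}  # machines -> seconds (large matrix)

for n, t in sorted(times.items()):
    speedup = times[1] / t       # relative to one machine
    efficiency = speedup / n     # 1.0 would be ideal scaling
    print(f"{n} machines: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

Efficiency drops from about 0.86 at two machines to about 0.54 at three, then recovers to about 0.65 at four.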

I see there have been some other threads about this. Is this expected behavior, or am I doing something wrong?


Hello!

To be honest, your finding does not strike me as unexpected. Parallelism inside the direct sparse solvers is complex: they work on an elimination tree that is distributed among the parallel workers (threads and processes). As with any binary tree structure, the work is easier to balance when the number of workers is a power of two (with the outermost level of the tree distributed across MPI processes).
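To make the power-of-two point concrete, here is a toy model (my own illustration, not the solver's actual scheduling): suppose the top of the elimination tree splits into four equally sized subtrees that are dealt out to the MPI processes. The busiest process then bounds the runtime:

```python
import math

# Toy model: distribute equal top-level subtrees of a binary elimination
# tree among p workers; the busiest worker determines the runtime.
def imbalance(subtrees, workers):
    max_load = math.ceil(subtrees / workers)  # busiest worker's share
    ideal = subtrees / workers                # perfectly balanced share
    return max_load / ideal

for p in (2, 3, 4):
    print(f"{p} workers: imbalance factor {imbalance(4, p):.2f}")
```

With 2 or 4 workers the factor is 1.00, but with 3 workers the busiest process carries 50% more work than the ideal share, which is consistent with 3 machines being no faster than 2.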

So I suspect that some imbalance is hindering performance in the 3-process case. This is partly supported by the fact that going from 2 to 4 nodes did not scale perfectly either, which hints that the available parallelism is becoming limited, so any imbalance shows up more visibly as a side effect.

As for evaluating the imbalance quantitatively: the cluster sparse solver has no special features for this, but a general-purpose profiler should show it, I believe.

With the information you provided and without diving deep into the actual case, it's hard to say more.

Best,

Kirill

