My machines have 40 physical cores each and are connected over an InfiniBand network. I get an excellent speedup going from one machine to two (~30%), but going from two machines to three there is no speedup; in fact the factorization takes a tiny bit longer. Then, if I use four machines, I get a decent speedup compared to two machines (~20%).
See my factorization times below:
Large matrix (4.7 million equations):
1 machine: 31 s
2 machines: 18 s
3 machines: 19 s
4 machines: 12 s

Small matrix (1.2 million equations):
1 machine: 3.5 s
2 machines: 2.6 s
3 machines: 2.9 s
4 machines: 2.4 s
I see there have been some other threads about this. Is this expected behavior, or am I doing something wrong?
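For reference, a stripped-down sketch of the kind of measurement behind these numbers, assuming Intel MKL's cluster_sparse_solver (a toy SPD matrix stands in for the real one, and the iparm values are illustrative, not my exact settings):

```c
/* Minimal timing harness for the factorization phase (phase 22) of MKL's
 * cluster_sparse_solver. The 8x8 SPD tridiagonal system is a stand-in
 * for the real matrix; iparm choices are illustrative only. */
#include <stdio.h>
#include <mpi.h>
#include "mkl_cluster_sparse_solver.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int comm = MPI_Comm_c2f(MPI_COMM_WORLD);   /* solver takes a Fortran handle */

    /* Toy SPD tridiagonal matrix: upper triangle in zero-based CSR. */
    const MKL_INT n = 8;
    MKL_INT ia[9], ja[15], nnz = 0;
    double a[15], b[8], x[8];
    for (MKL_INT i = 0; i < n; i++) {
        ia[i] = nnz;
        ja[nnz] = i;     a[nnz++] = 2.0;                       /* diagonal */
        if (i + 1 < n) { ja[nnz] = i + 1; a[nnz++] = -1.0; }   /* super-diagonal */
        b[i] = 1.0;
    }
    ia[n] = nnz;

    void *pt[64] = {0};             /* internal solver handle; must start zeroed */
    MKL_INT iparm[64] = {0};
    iparm[0]  = 1;                  /* supply iparm explicitly, no defaults */
    iparm[1]  = 2;                  /* nested-dissection reordering */
    iparm[7]  = 2;                  /* up to 2 iterative refinement steps */
    iparm[9]  = 8;                  /* pivot perturbation 1e-8 */
    iparm[34] = 1;                  /* zero-based indexing */
    iparm[39] = 0;                  /* whole matrix provided on rank 0 */
    MKL_INT maxfct = 1, mnum = 1, mtype = 2;   /* real symmetric positive definite */
    MKL_INT nrhs = 1, msglvl = 0, error = 0, idum = 0, phase;

    phase = 11;                     /* reordering + symbolic factorization */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                          &idum, &nrhs, iparm, &msglvl, b, x, &comm, &error);

    MPI_Barrier(MPI_COMM_WORLD);    /* start all ranks' clocks together */
    double t0 = MPI_Wtime();
    phase = 22;                     /* numerical factorization: the timed step */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                          &idum, &nrhs, iparm, &msglvl, b, x, &comm, &error);
    if (rank == 0)
        printf("factorization: %.3f s (error = %lld)\n",
               MPI_Wtime() - t0, (long long)error);

    phase = -1;                     /* release internal memory */
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                          &idum, &nrhs, iparm, &msglvl, b, x, &comm, &error);
    MPI_Finalize();
    return 0;
}
```

The tables above are the printed factorization time with one MPI rank per machine.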
Hello!
To be honest, your finding doesn't strike me as unexpected. Parallelism inside the direct sparse solvers is quite complex: the solver works over an elimination tree, which is distributed among the parallel workers (threads and processes, with MPI processes as the workers at the outermost level). Like any binary tree structure, it is easiest to balance when the number of workers is a power of two.
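As a toy illustration (this is not the solver's actual scheduler, just the balancing argument in miniature), splitting the equal-work leaf subtrees of a balanced binary tree across 2, 3, and 4 workers shows where the imbalance comes from:

```c
/* Toy model: distribute 2^k equal-work leaf subtrees of a balanced
 * binary elimination tree across p workers and report the imbalance. */
#include <stdio.h>

int main(void) {
    const int leaves = 16;                       /* 2^4 equal-work subtrees */
    for (int p = 2; p <= 4; p++) {
        int max_load = 0, min_load = leaves;
        for (int r = 0; r < p; r++) {
            /* worker r's share of the subtrees */
            int load = leaves / p + (r < leaves % p ? 1 : 0);
            if (load > max_load) max_load = load;
            if (load < min_load) min_load = load;
        }
        printf("p=%d: %d..%d subtrees per worker, imbalance %.0f%%\n",
               p, min_load, max_load,
               100.0 * (max_load - min_load) / max_load);
    }
    return 0;
}
```

With 2 or 4 workers every worker gets the same share; with 3 workers one of them gets an extra subtree, and since the factorization finishes only when the slowest worker does, that extra share sets the critical path.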
So I suspect that load imbalance is hindering performance in the 3-process case. This is partly supported by the fact that going from 2 to 4 nodes did not scale perfectly either, which hints that the available parallelism is becoming limited, so any imbalance shows up more visibly as a side effect.
As for evaluating the imbalance quantitatively, there are no special features for this in the cluster sparse solver; some sort of general-purpose profiler should show it, I believe.
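For example, a lightweight check with plain MPI timers would already expose it (just a sketch; do_work is a hypothetical placeholder for the rank's factorization call, here simulating uneven load):

```c
/* Sketch: compare each rank's busy time to expose load imbalance
 * without a dedicated profiler. */
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

static void do_work(int rank) {
    /* placeholder: pretend higher ranks own more elimination-tree work;
     * replace with the real factorization (phase 22) call */
    usleep(100000 * (rank + 1));
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);            /* common starting line */
    double t0 = MPI_Wtime();
    do_work(rank);
    double t_local = MPI_Wtime() - t0;      /* this rank's busy time */

    double t_max, t_min;
    MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_local, &t_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("busy time: max %.3f s, min %.3f s, spread %.0f%%\n",
               t_max, t_min, 100.0 * (t_max - t_min) / t_max);

    MPI_Finalize();
    return 0;
}
```

A large spread means the fastest ranks spend that fraction of the factorization idle. A sampling profiler (e.g. Intel VTune Profiler) would give the same picture at the thread level.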
With the information you provided and without diving deep into the actual case, it's hard to say more.
Best,
Kirill