No speedup of cluster_sparse_solver beyond 32 cpus

Ferris_H_ · 11-10-2016

My cluster has 16 cpus/node. My matrix is symmetric positive definite, about 2 million by 2 million, with ~4 million nonzero entries. My factorization times are:

16 cpus - 84 seconds

32 cpus - 44 seconds

48 cpus - 48 seconds ?!

The factorization takes longer with 48 cpus compared to 32 cpus.

I have tried with a smaller matrix and get the same results: there is no speedup beyond 32 cpus. Is this a known limitation of cluster_sparse_solver, or a problem with my cluster? If it is a cluster problem, any suggestions on how I can narrow down the bottleneck?
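To put the three timings on a common scale, here is a minimal Python sketch that computes speedup and parallel efficiency relative to the 16-cpu run. The timings are the ones quoted above; the function name `scaling_table` and the choice of the smallest run as the baseline are assumptions made for illustration, not anything provided by cluster_sparse_solver.

```python
# Factorization times from the post above: MPI cpu count -> seconds.
timings = {16: 84.0, 32: 44.0, 48: 48.0}


def scaling_table(timings):
    """Return (cpus, speedup, efficiency) rows relative to the smallest run.

    Hypothetical helper: the baseline is assumed to be the run with the
    fewest cpus, and ideal scaling is assumed to be linear in cpu count.
    """
    base_cpus = min(timings)
    base_time = timings[base_cpus]
    rows = []
    for cpus in sorted(timings):
        speedup = base_time / timings[cpus]
        ideal = cpus / base_cpus  # linear-scaling reference
        rows.append((cpus, speedup, speedup / ideal))
    return rows


for cpus, speedup, efficiency in scaling_table(timings):
    print(f"{cpus:3d} cpus: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With these numbers the 32-cpu run is about 1.91x faster than the 16-cpu run (roughly 95% efficiency), while the 48-cpu run is only 1.75x faster (roughly 58% efficiency).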

Gennady_F_Intel · 11-11-2016

Ferris, could you check the scalability with a larger problem size?

Ferris_H_ · 11-11-2016

Unfortunately, I do not have any larger matrices to test. The size I am testing is around the largest I would see in my area. Are there any public benchmark matrices I could download to test? If not, I can create an example code that reads in my matrix for you to test on your cluster.

Ferris_H_ · 11-21-2016

I created an example file that can reproduce the issue. Download cl_solver_sym_sp_0_based_c.c from here:

https://www.dropbox.com/s/ndkzi9zojxuh1xo/cl_solver_sym_sp_0_based_c.c?dl=0

Edit all occurrences of *.txt to point to the path where the files are on your system.

The ia, ja, a, and b data are all here as text files:

https://www.dropbox.com/s/3dkhbillyso03kc/ia_ja_a_b_data.tar.gz?dl=0

Curious what kind of performance improvement you get when running with MPI on 16, 32, 48, and 72 cpus!

Gennady_F_Intel · 11-22-2016

Ferris, do you have access to a 64-core system? I currently do not. If you do, could you please try it and give us the results? The scalability may be different if the number of nodes is a power of 2.

Ferris_H_ · 11-28-2016

Hi Gennady,

As requested, I solved my model on a larger 4-node, 60-core cluster with 15 cores per node. Below are the factorization times:

15 cores - 70 seconds

30 cores - 41 seconds

45 cores - 42 seconds

60 cores - 36 seconds

So it seems there is some improvement when the number of nodes is 4, but 3 nodes gives about the same time as 2 nodes. Does the number of nodes always have to be a power of 2? Or could there be some problem with my cluster?
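One rough way to look at these numbers is the Karp-Flatt metric, which estimates the non-parallel fraction of the work from a measured speedup S on p units: e = (1/S - 1/p) / (1 - 1/p). Below is a minimal Python sketch, assuming the 1-node (15-core) run is the baseline and counting nodes as the units p. The function names are hypothetical, and the script only post-processes the timings above; it does not call the solver.

```python
# Factorization times from the post above: node count -> seconds
# (15 cores per node, so 1 node = 15 cores and 4 nodes = 60 cores).
node_times = {1: 70.0, 2: 41.0, 3: 42.0, 4: 36.0}


def karp_flatt(speedup, units):
    """Karp-Flatt experimentally determined serial fraction (units > 1)."""
    return (1.0 / speedup - 1.0 / units) / (1.0 - 1.0 / units)


def serial_fractions(node_times):
    """Map node count -> Karp-Flatt fraction, using the 1-node run as baseline.

    Assumption: the single-node run is treated as the 'serial' reference,
    so the result describes node-to-node scaling only.
    """
    base = node_times[1]
    return {
        nodes: karp_flatt(base / seconds, nodes)
        for nodes, seconds in sorted(node_times.items())
        if nodes > 1
    }


for nodes, fraction in serial_fractions(node_times).items():
    speedup = node_times[1] / node_times[nodes]
    print(f"{nodes} nodes: speedup {speedup:.2f}x, Karp-Flatt fraction {fraction:.2f}")
```

For the timings above this gives roughly 0.17 at 2 nodes, 0.40 at 3 nodes, and 0.35 at 4 nodes. A fraction that grows as nodes are added usually points at parallel overhead (communication, load imbalance) rather than a fixed serial part, but with only four data points this is a hint, not a diagnosis.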