topic No speedup of cluster_sparse_solver beyond 32 cpus in Intel® oneAPI Math Kernel Library

Ferris_H_ — Fri, 11 Nov 2016 03:58:04 GMT

My cluster has 16 CPUs per node. My matrix is symmetric positive definite, about 2 million by 2 million, with about 4 million nonzero entries. My factorization times are:

16 cpus - 84 seconds

32 cpus - 44 seconds

48 cpus - 48 seconds ?!

The factorization takes longer with 48 cpus compared to 32 cpus.

I have tried with a smaller matrix and get the same results. There is no speedup beyond 32 CPUs. Is this a known limitation of cluster_sparse_solver or a problem with my cluster? If it is a cluster problem, any suggestions on how I can narrow down the bottleneck?

Gennady_F_Intel — Fri, 11 Nov 2016 09:17:32 GMT

Ferris, could you check the scalability with larger problem size?

Ferris_H_ — Fri, 11 Nov 2016 15:31:04 GMT

Unfortunately, I do not have any larger matrices to test. The size I am testing is around the largest I would see in my area. Are there any public benchmark matrices I could download to test? If not, I can create example code that reads in my matrix for you to test on your cluster.

Ferris_H_ — Mon, 21 Nov 2016 16:21:42 GMT

I created an example file that can reproduce the issue. Download cl_solver_sym_sp_0_based_c.c from here:

https://www.dropbox.com/s/ndkzi9zojxuh1xo/cl_solver_sym_sp_0_based_c.c?dl=0

Edit all the occurrences of *.txt to the path where the files are on your system.

ia, ja, a and b data in text files are all here:

https://www.dropbox.com/s/3dkhbillyso03kc/ia_ja_a_b_data.tar.gz?dl=0

Curious what kind of performance improvement you get when running with MPI on 16, 32, 48, and 72 cpus!

Gennady_F_Intel — Tue, 22 Nov 2016 11:05:01 GMT

Ferris, do you have access to a 64-core system? I currently do not. If you do, could you please try it and give us the results? The scalability may be different if the number of nodes is a power of 2.

Ferris_H_ — Tue, 29 Nov 2016 03:34:22 GMT

Gennady F. (Intel) wrote:

Ferris, do you have access to a 64-core system? I currently do not. If you do, could you please try it and give us the results? The scalability may be different if the number of nodes is a power of 2.

Hi Gennady,

As requested, I solved my model on a larger 4-node, 60-core cluster with 15 cores per node. Below are the factorization times:

15 cores - 70 seconds

30 cores - 41 seconds

45 cores - 42 seconds

60 cores - 36 seconds

So it seems there is some improvement when the number of nodes is 4. But with 3 nodes the times are about the same as with 2 nodes. Does the number of nodes always have to be a power of 2? Or could there be some problem with my cluster?