My cluster has 16 cpus per node. My matrix is symmetric positive definite, roughly 2 million by 2 million, with about 4 million non-zero entries. My factorization times with cluster_sparse_solver are:
16 cpus - 84 seconds
32 cpus - 44 seconds
48 cpus - 48 seconds ?!
The factorization takes longer with 48 cpus than with 32 cpus.
I have tried a smaller matrix and get the same result: there is no speedup beyond 32 cpus. Is this a known limitation of cluster_sparse_solver or a problem with my cluster? If it is a cluster problem, any suggestions on how to narrow it down?
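To put numbers on the scaling, here is a minimal sketch in C that turns the timings above into speedup and parallel efficiency relative to the 16-cpu run. The function names (speedup, efficiency, print_scaling) are made up for illustration and are not part of MKL or MPI.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup of a run relative to a baseline run: t_base / t_run. */
double speedup(double t_base, double t_run)
{
    assert(t_base > 0.0 && t_run > 0.0);
    return t_base / t_run;
}

/* Parallel efficiency relative to the baseline: the speedup divided by
   the ratio of cpu counts. 1.0 means perfect scaling from the baseline. */
double efficiency(int n_base, double t_base, int n_run, double t_run)
{
    assert(n_base > 0 && n_run > 0);
    return speedup(t_base, t_run) * (double)n_base / (double)n_run;
}

/* Print one line per run, treating the first entry as the baseline. */
void print_scaling(const int *cpus, const double *secs, int n)
{
    for (int i = 0; i < n; ++i) {
        printf("%3d cpus: %6.1f s  speedup %.2f  efficiency %.2f\n",
               cpus[i], secs[i],
               speedup(secs[0], secs[i]),
               efficiency(cpus[0], secs[0], cpus[i], secs[i]));
    }
}
```

Calling print_scaling with cpus {16, 32, 48} and times {84.0, 44.0, 48.0} gives a speedup of about 1.91 at 32 cpus (efficiency about 0.95) and 1.75 at 48 cpus (efficiency about 0.58), so scaling is close to ideal across two nodes and falls off once the third node is added.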
I created an example file that can reproduce the issue. Download cl_solver_sym_sp_0_based_c.c from here:
Edit every occurrence of *.txt to point to the path where the files are on your system.
The ia, ja, a, and b data are in text files, all available here:
Curious what kind of performance improvement you get when running with MPI on 16, 32, 48, and 72 cpus!