Ferris_H_

Beginner

11-10-2016
07:58 PM

No speedup of cluster_sparse_solver beyond 32 cpus

My cluster has 16 cpus/node. My matrix is symmetric positive definite, about 2 million by 2 million in size, with ~4 million non-zero entries. My factorization times are:

16 cpus - 84 seconds

32 cpus - 44 seconds

48 cpus - 48 seconds ?!

The factorization takes longer with 48 cpus compared to 32 cpus.

I have tried with a smaller matrix and get the same results. There is no speedup beyond 32 cpus. Is this a known limitation of cluster_sparse_solver or a problem with my cluster? If it is a cluster problem, any suggestions on how I can narrow down the bottleneck?

5 Replies

Gennady_F_Intel

Moderator

11-11-2016
01:17 AM

Ferris, could you check the scalability with larger problem size?

Ferris_H_

Beginner

11-11-2016
07:31 AM

Ferris_H_

Beginner

11-21-2016
08:21 AM

I created an example file that can reproduce the issue. Download cl_solver_sym_sp_0_based_c.c from here:

https://www.dropbox.com/s/ndkzi9zojxuh1xo/cl_solver_sym_sp_0_based_c.c?dl=0

Edit all the occurrences of *.txt to the path where the files are on your system.

The ia, ja, a and b data, as text files, are all here:

https://www.dropbox.com/s/3dkhbillyso03kc/ia_ja_a_b_data.tar.gz?dl=0

Curious what kind of performance improvement you get when running with MPI on 16, 32, 48, and 72 cpus!

Gennady_F_Intel

Moderator

11-22-2016
03:05 AM

Ferris_H_

Beginner

11-28-2016
07:34 PM

Gennady F. (Intel) wrote:

Hi Gennady,

As requested, I solved my model on a larger 4-node, 60-core cluster with 15 cores per node. Below are the factorization times:

15 cores - 70 seconds

30 cores - 41 seconds

45 cores - 42 seconds

60 cores - 36 seconds

So it seems there is some improvement when the number of nodes is 4. But with 3 nodes it shows the same solve times as with 2 nodes. Does the number of nodes always have to be a power of 2? Or could there be some problem with my cluster?
