Intel Community › Software Development SDKs and Libraries › Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library
## woshiwuxin (Beginner) · 11-30-2011 11:14 PM

Hi, everyone!

Why do some parallel functions scale well with the number of cores, while others do not?

I noticed that some parallel functions in MKL scale well with the number of cores. For example, ?gemm runs about 2x faster with two threads on two cores, and about 6x faster with six threads on six cores. But some others, e.g. ?gemv and ?syev, scale less well with the number of cores. Why is that?

Thanks in advance!

2 Replies

## Konstantin_A_Intel (Employee) · 11-30-2011 11:31 PM

Hi, thank you for the good question!

You're right: scalability can vary depending on the algorithm a function implements.

For example, let's compare ?gemm and ?gemv for square N×N matrices and length-N vectors:

?gemm loads 3*N*N elements (matrices A, B, and C) and performs ~2*N*N*N add/multiply operations.

?gemv, on the other hand, loads N*N + 2*N elements (matrix A and vectors X, Y) but performs only ~2*N*N operations.

So ?gemv is a memory-bound operation (not all of its data fits in cache), and its performance depends more on the throughput of the memory subsystem than on the number of cores.
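A back-of-envelope roofline estimate makes this concrete. The machine numbers below are illustrative assumptions, not measurements of any particular MKL system:

```python
# Roofline-style sketch for ?gemv (double precision).
# All machine numbers are hypothetical, chosen only to illustrate the argument.
PEAK_FLOPS_PER_CORE = 8e9   # assumed: 8 GFLOP/s of compute per core
MEM_BANDWIDTH = 20e9        # assumed: 20 GB/s of shared memory bandwidth
BYTES_PER_ELEM = 8          # double precision

# gemv performs ~2 flops per matrix element, and each element is loaded once.
gemv_intensity = 2 / BYTES_PER_ELEM                    # 0.25 flops per byte
bandwidth_bound_rate = MEM_BANDWIDTH * gemv_intensity  # total flop rate memory can feed
cores_worth = bandwidth_bound_rate / PEAK_FLOPS_PER_CORE

print(cores_worth)  # 0.625: less than one core's compute already saturates memory
```

Under these assumed numbers, memory bandwidth caps ?gemv at roughly 5 GFLOP/s no matter how many cores run it, which is why adding threads helps so little.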

Regards,

Konstantin

## woshiwuxin (Beginner) · 12-01-2011 12:30 AM

Thank you very much!

To summarize: the compute-to-memory ratio of ?gemm is 2*N/3, which grows without bound as N increases, while the ratio of ?gemv approaches 2 as N goes to infinity. Thus ?gemm is compute bound and ?gemv is bandwidth bound.
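This arithmetic can be checked with a short script, a sketch based only on the operation and load counts quoted in the reply above:

```python
# Arithmetic intensity (flops per element loaded) for gemm vs. gemv,
# using the counts from the discussion: no BLAS calls, just the ratios.
def gemm_intensity(n):
    flops = 2 * n**3       # ~2*N^3 add/multiply operations
    loads = 3 * n**2       # matrices A, B and C
    return flops / loads   # = 2*N/3, grows without bound

def gemv_intensity(n):
    flops = 2 * n**2       # ~2*N^2 operations
    loads = n**2 + 2 * n   # matrix A plus vectors X and Y
    return flops / loads   # -> 2 as N grows

for n in (100, 1000, 10000):
    print(n, gemm_intensity(n), round(gemv_intensity(n), 4))
```

The gemm ratio keeps growing with N (more cores keep helping), while the gemv ratio is pinned near 2 (memory traffic dominates at any size).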
