Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Intel Community
- Software
- Software Development SDKs and Libraries
- Intel® oneAPI Math Kernel Library
- Batched dgemm performance plateaus?

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Palmer__David

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-20-2019
01:24 PM

117 Views

Batched dgemm performance plateaus?

I have a problem where I need to compute many (1e4 - 1e6) small matrix-matrix and matrix-vector products (matrix dimensions around ~15 - 35). This problem seems "embarrassingly parallel" to me, and so I am confused as to why I am seeing the following performance issue: on a Google Cloud compute server with 48 physical cores (96 logical cores), performance plateaus at 10-16 threads. Adding additional threads does not reduce computation time. I have tried several different approaches: (1) cblas_dgemm_batch; (2) calling cblas_dgemm within a tbb::parallel_for, with both sequential and TBB-threaded MKL; (3) JIT-compiled problem-specific dgemm kernel (created with mkl_jit_create_dgemm) within a parallel_for; (4) mkl_dgemm_compact (along with mkl_dgepack and mkl_dgeunpack).

All of these yield roughly comparable performance (except for the compact functions--there, packing and unpacking time completely dominates computation time), but none of them seems to yield performance that scales linearly with the number of threads I specify, as I would expect. The maximum performance I see is around 50 GFLOPS on a system capable of around 1-2 TFLOPS. (Indeed, multiplying two large matrices achieves performance in the teraflop range.) Is this the best I can expect? Why do I not see performance scaling linearly with thread count on this embarrassingly parallel problem?

Link Copied

3 Replies

Hinds__David

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-23-2019
11:29 PM

117 Views

Palmer__David

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-24-2019
10:13 PM

117 Views

Hinds__David

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

05-26-2019
10:43 AM

117 Views

The matrices will fit in cache but it is expensive to get them there.

Is there any re-use of the small matrices and vectors, or re-use of the results, that you can take advantage of? After one of these is used , it will be in cache and subsequent uses on that core will be much faster. You might need to reorganize your algorithm to exploit this -- i.e. when you compute a result, use it immediately, don't compute a bunch of results and then move to the next step where you consume them.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

For more complete information about compiler optimizations, see our Optimization Notice.