Intel Community › Software Development SDKs and Libraries › Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library › Optimal usage of MKL for small matrices


erling_andersen

Beginner


11-01-2013 12:59 AM

Optimal usage of MKL for small matrices

I am doing a lot of dgemm calls to compute

C = A*B

The matrices are fairly small, i.e. C might have 1000 rows and fewer than 200 columns. Typically I do the following:

- Store A in row-major form.
- Store B in column-major form.
- Choose C to have 56 columns.
- Make sure everything is aligned.

I believe this leads to good performance. In fact, I can control how many columns C has, so I could make it 64 or 128, for instance. So now my questions are:

- What is the optimal blocking, i.e. how many columns in C is optimal? Can the blocking be determined algorithmically?
- How should I formulate the matrix multiplication so that MKL can work directly with the data, avoiding the overhead of buffer management etc.?

Your documentation does not seem to answer such questions. Well, I might have overlooked something.

4 Replies


Zhang_Z_Intel

Employee


11-01-2013 03:00 PM

GEMM in MKL already employs data blocking, prefetching, and many other advanced performance optimization techniques. In most cases, you can just call DGEMM with the original matrices and you should get far better performance than rolling your own matrix multiplication. One problem with rolling your own solution is that it is not likely to be portable. For example, when you move your code to another architecture with different cache sizes, or when you change from single precision to double precision, or change the number of threads, etc., you will have to block the data differently.

If you do observe that calling GEMM on your matrices does not work as fast as you want, then please share with us your measured performance, your expectation, and your system configuration info. We will investigate to see if this is a performance bug.


erling_andersen

Beginner


11-04-2013 11:56 AM

I just wanted some insight into how to get the best performance out of MKL if my matrices are smallish and I can determine the sizes myself.

My conclusion from what you say is that it is hard to say in general. In particular, it cannot be determined algorithmically, so I cannot adapt my code at runtime to the inner workings of MKL. I guess I have to do some experiments with MKL if I want to know.


Zhang_Z_Intel

Employee


11-04-2013 01:10 PM

Please do let us know the results of your experiments. We believe MKL is optimized for a wide range of matrix sizes, across all supported systems. If your results indicate the performance isn't as good as we'd expect for a particular range of sizes, then we will get it fixed.

erling_andersen wrote:

I just wanted some insight into how to get the best performance out of MKL if my matrices are smallish and I can determine the sizes myself.

My conclusion from what you say is that it is hard to say in general. In particular, it cannot be determined algorithmically, so I cannot adapt my code at runtime to the inner workings of MKL. I guess I have to do some experiments with MKL if I want to know.


erling_andersen

Beginner


11-04-2013 11:00 PM

Will do.

For more complete information about compiler optimizations, see our Optimization Notice.