
- Intel Community
- Software Development SDKs and Libraries
- Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library
- chain matrix vector multiplications


utab

Beginner


03-14-2012
11:36 AM


chain matrix vector multiplications

I frequently perform chained block matrix / sparse matrix operations; in the mathematical sense, these are projections.

Say R is a dense m-by-n matrix and A is a sparse or dense m-by-m matrix. I was wondering about the most efficient way to compute products of the form

R^T A R

Currently I create a temporary matrix for the product A R and then multiply that result by R^T from the left.

Any ideas for making this more efficient, ideally without the temporary?

Best regards,

Umut


3 Replies

mecej4

Black Belt


03-15-2012
02:42 AM


For example, you can produce one column of the product A R at a time and multiply that column by R^T. The temporary then shrinks to a single length-m vector, and the columns of R^T A R are produced one at a time.
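A minimal NumPy/SciPy sketch of this column-at-a-time idea (the matrix sizes and sparsity density here are illustrative assumptions, not anything from the thread):

```python
import numpy as np
from scipy.sparse import random as sparse_random

m, n = 200, 10                                   # illustrative sizes
rng = np.random.default_rng(0)
A = sparse_random(m, m, density=0.05, format="csr", random_state=0)
R = rng.standard_normal((m, n))

# Column at a time: the only temporary is one length-m vector.
P = np.empty((n, n))
for j in range(n):
    col = A @ R[:, j]        # temporary: one column of A R
    P[:, j] = R.T @ col      # one column of R^T A R

# Check against the explicit-temporary version (full m-by-n product A R).
assert np.allclose(P, R.T @ (A @ R))
```

In an MKL/BLAS setting the same loop would be one sparse matrix-vector product (e.g. a CSR matvec) followed by a dense GEMV per column.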

utab

Beginner


03-15-2012
08:38 AM


styc

Beginner


03-15-2012
06:48 PM


One blocked approach: tile the computation over B-by-m blocks of R^{T} and m-by-B blocks of R. For all such combinations, take the corresponding B-by-m block from A to form part of the product. (If you use the CSR format to store A, this B-by-m block, i.e. B consecutive rows of A, is contiguous in memory.)

The tricky part is producing a good choice of B. You probably want to choose B such that (3 * B * B + m * B) * sizeof(float or double) ≈ L2 cache capacity. You may also want to use NTA (non-temporal) prefetching to avoid letting A pollute the cache hierarchy. Finally, you can utilize the L2 cache of multiple cores by threading the algorithm so that each core handles a separate m-by-B block of R. Threading inside BLAS is likely not worthwhile here because B is expected to be quite small.
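A rough NumPy/SciPy sketch of this blocking scheme (the sizes, density, and block width B are illustrative assumptions; the cache-sizing and prefetching details above are hardware-level concerns that do not translate into NumPy code):

```python
import numpy as np
from scipy.sparse import random as sparse_random

m, n, B = 300, 12, 4          # illustrative; B would really be tuned to L2 size
rng = np.random.default_rng(1)
A = sparse_random(m, m, density=0.05, format="csr", random_state=1)
R = rng.standard_normal((m, n))

P = np.zeros((n, n))
for j in range(0, n, B):
    # m-by-B temporary: A times a block of B columns of R.  This is the
    # working set meant to stay cache-resident while A is streamed through.
    AR_block = A @ R[:, j:j + B]
    for i in range(0, n, B):
        # B-by-m block of R^T times the m-by-B temporary -> B-by-B result tile.
        P[i:i + B, j:j + B] = R[:, i:i + B].T @ AR_block

assert np.allclose(P, R.T @ (A @ R))
```

The B-by-B result tiles, the m-by-B temporary, and the streamed rows of A correspond to the working-set terms in the sizing formula above.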
