Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

chain matrix vector multiplications

utab
Beginner
108 Views
Dear all,

I extensively do some kind of chained block matrix sparse matrix operations a lot, this is a projection in mathematical sense.

Say R is a dense matrix m by n and A is a sparse or dense matrix m by m I was wondering the most efficient way to compute multiplications like

R^T A R

Now I create a temporary matrix for the multiplication result AR and then multiplying the result by R^T from left.

Any ideas to make it more efficient without a temporary?

Best regards,
Umut
0 Kudos
3 Replies
mecej4
Black Belt
108 Views
You can arrange the computation in several mathematically equivalent ways. The alternatives and their implications for computational efficiency are covered in, for example, Matrix Computations by Golub and van Loan, ISBN-13: 978-0801854149.

For example, you can produce one column of the product A R at a time, and multiply that column by RT to obtain the corresponding column of RTA R.
utab
Beginner
108 Views
Well ok, but Block BLAS operations are in general much faster and better optimized for architectures, doing one vector at a time might not buy you much. However, I am not sure completely, correct me if I am wrong.
styc
Beginner
108 Views

If you insist in using GEMM from BLAS3, one possibly strategy is to block for B-by-B blocks of RT and m-by-B blocks of R. For all such combinations, take the corresponding B-by-m block from A to form part of the product. (If you use the CSR format to store A, this B-by-m block would be contiguous in memory.) The tricky thing is to produce a good choice of B. Probably you want to choose B such that (3 * B * B + m * B) * sizeof(float or double) = L2$ capacity. You may also want to use NTA prefetching to avoid letting A pollute the cache hierarchy. Finally, you can utilize L2$ from multiple cores by threading the algorithm so that each core handles a separate m-by-B block of R. Threading in BLAS is likely to be not worthwhile because B is expected to be quite small.

Reply