topic chain matrix vector multiplications in Intel® oneAPI Math Kernel Library

chain matrix vector multiplications

utab — Wed, 14 Mar 2012 18:36:18 GMT

Dear all,

I extensively do some kind of chained block matrix sparse matrix operations a lot, this is a projection in mathematical sense.

Say R is a dense matrix m by n and A is a sparse or dense matrix m by m I was wondering the most efficient way to compute multiplications like

R^T A R

Now I create a temporary matrix for the multiplication result AR and then multiplying the result by R^T from left.

Any ideas to make it more efficient without a temporary?

Best regards,
Umut

chain matrix vector multiplications

mecej4 — Thu, 15 Mar 2012 09:42:48 GMT

You can arrange the computation in several mathematically equivalent ways. The alternatives and their implications for computational efficiency are covered in, for example, Matrix Computations by Golub and van Loan, ISBN-13: 978-0801854149.

For example, you can produce one column of the product A R at a time, and multiply that column by R^T to obtain the corresponding column of R^TA R.

chain matrix vector multiplications

utab — Thu, 15 Mar 2012 15:38:37 GMT

Well ok, but Block BLAS operations are in general much faster and better optimized for architectures, doing one vector at a time might not buy you much. However, I am not sure completely, correct me if I am wrong.

chain matrix vector multiplications

styc — Fri, 16 Mar 2012 01:48:05 GMT

If you insist in using GEMM from BLAS3, one possibly strategy is to block for B-by-B blocks of R^T and m-by-B blocks of R. For all such combinations, take the corresponding B-by-m block from A to form part of the product. (If you use the CSR format to store A, this B-by-m block would be contiguous in memory.) The tricky thing is to produce a good choice of B. Probably you want to choose B such that (3 * B * B + m * B) * sizeof(float or double) = L2$ capacity. You may also want to use NTA prefetching to avoid letting A pollute the cache hierarchy. Finally, you can utilize L2$ from multiple cores by threading the algorithm so that each core handles a separate m-by-B block of R. Threading in BLAS is likely to be not worthwhile because B is expected to be quite small.