utab
Beginner
19 Views

chain matrix vector multiplications

Dear all,

I frequently perform chained block-matrix and sparse-matrix operations; mathematically, these are projections.

Say R is a dense m-by-n matrix and A is a sparse or dense m-by-m matrix. I was wondering about the most efficient way to compute products like

R^T A R

Currently I create a temporary matrix for the product A R and then multiply that result by R^T from the left.

Any ideas on how to make this more efficient, ideally without the temporary?
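For concreteness, the two-step approach described above can be sketched as follows, using NumPy/SciPy as a stand-in for the dense/sparse BLAS routines (the sizes and matrices here are made up purely for illustration):

```python
import numpy as np
from scipy import sparse

# Illustrative sizes only.
m, n = 6, 3
rng = np.random.default_rng(0)
R = rng.standard_normal((m, n))                                      # dense m-by-n
A = sparse.random(m, m, density=0.3, random_state=0, format="csr")   # sparse m-by-m

# Two-step computation with an explicit m-by-n temporary:
T = A @ R        # temporary holding A R
P = R.T @ T      # n-by-n result R^T A R
```

The temporary T costs m*n extra storage, which is what the question is about avoiding.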

Best regards,
Umut
3 Replies
mecej4
Black Belt

You can arrange the computation in several mathematically equivalent ways. The alternatives and their implications for computational efficiency are covered in, for example, Matrix Computations by Golub and van Loan, ISBN-13: 978-0801854149.

For example, you can produce one column of the product A R at a time, and multiply that column by R^T to obtain the corresponding column of R^T A R.
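A minimal sketch of this column-at-a-time scheme, in NumPy for illustration (the dense matrix-vector product here stands in for a BLAS2 gemv or a sparse matvec; only a single m-vector temporary is ever live):

```python
import numpy as np

# Illustrative sizes only.
m, n = 6, 3
rng = np.random.default_rng(1)
R = rng.standard_normal((m, n))
A = rng.standard_normal((m, m))

# Build R^T A R one column at a time.
P = np.empty((n, n))
for j in range(n):
    t = A @ R[:, j]      # column j of A R (one m-vector temporary)
    P[:, j] = R.T @ t    # column j of R^T A R
```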
utab
Beginner

Well, OK, but blocked BLAS3 operations are in general much faster and better optimized for current architectures; doing one vector at a time might not buy you much. However, I am not completely sure, so correct me if I am wrong.
styc
Beginner

If you insist on using GEMM from BLAS3, one possible strategy is to block for B-by-B blocks of R^T and m-by-B blocks of R. For each such combination, take the corresponding B-by-m block of A to form part of the product. (If you use the CSR format to store A, this B-by-m block would be contiguous in memory.) The tricky part is making a good choice of B: you probably want B such that (3 * B * B + m * B) * sizeof(float or double) is roughly the L2 cache capacity. You may also want to use NTA (non-temporal) prefetching so that streaming through A does not pollute the cache hierarchy. Finally, you can exploit the L2 caches of multiple cores by threading the algorithm so that each core handles a separate m-by-B block of R. Threading inside BLAS itself is unlikely to be worthwhile because B is expected to be quite small.
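A simplified sketch of the panel idea, in NumPy for illustration: it blocks only the columns of R into m-by-B panels and uses a dense A, so it shows the GEMM-per-panel structure but not the cache-sized blocking of R^T and A described above. B and the sizes are made-up values.

```python
import numpy as np

# Illustrative sizes; in practice choose B from the L2-capacity estimate above.
m, n, B = 8, 6, 2
rng = np.random.default_rng(2)
R = rng.standard_normal((m, n))
A = rng.standard_normal((m, m))

P = np.empty((n, n))
for j0 in range(0, n, B):
    Rj = R[:, j0:j0 + B]          # m-by-B panel of R
    Tj = A @ Rj                   # m-by-B temporary A R_j (a GEMM call)
    P[:, j0:j0 + B] = R.T @ Tj    # B columns of R^T A R (another GEMM)
```

Only an m-by-B temporary is needed, and each panel's pair of GEMMs could be handed to a separate thread.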