I extensively do some kind of chained block matrix sparse matrix operations a lot, this is a projection in mathematical sense.
Say R is a dense matrix m by n and A is a sparse or dense matrix m by m I was wondering the most efficient way to compute multiplications like
R^T A R
Now I create a temporary matrix for the multiplication result AR and then multiplying the result by R^T from left.
Any ideas to make it more efficient without a temporary?
For example, you can produce one column of the product A R at a time, and multiply that column by RT to obtain the corresponding column of RTA R.
If you insist in using GEMM from BLAS3, one possibly strategy is to block for B-by-B blocks of RT and m-by-B blocks of R. For all such combinations, take the corresponding B-by-m block from A to form part of the product. (If you use the CSR format to store A, this B-by-m block would be contiguous in memory.) The tricky thing is to produce a good choice of B. Probably you want to choose B such that (3 * B * B + m * B) * sizeof(float or double) = L2$ capacity. You may also want to use NTA prefetching to avoid letting A pollute the cache hierarchy. Finally, you can utilize L2$ from multiple cores by threading the algorithm so that each core handles a separate m-by-B block of R. Threading in BLAS is likely to be not worthwhile because B is expected to be quite small.