I have a problem where I need to compute many (1e4 to 1e6) small matrix-matrix and matrix-vector products (matrix dimensions of roughly 15 to 35). This problem seems "embarrassingly parallel" to me, so I am confused as to why I am seeing the following performance issue: on a Google Cloud compute server with 48 physical cores (96 logical cores), performance plateaus at 10-16 threads. Adding more threads does not reduce computation time. I have tried several different approaches: (1) cblas_dgemm_batch; (2) calling cblas_dgemm within a tbb::parallel_for, with both sequential and TBB-threaded MKL; (3) a JIT-compiled, problem-specific dgemm kernel (created with mkl_jit_create_dgemm) within a parallel_for; (4) mkl_dgemm_compact (along with mkl_dgepack and mkl_dgeunpack).
All of these yield roughly comparable performance (except for the compact functions--there, packing and unpacking time completely dominates computation time), but none of them seems to yield performance that scales linearly with the number of threads I specify, as I would expect. The maximum performance I see is around 50 GFLOPS on a system capable of around 1-2 TFLOPS. (Indeed, multiplying two large matrices achieves performance in the teraflop range.) Is this the best I can expect? Why do I not see performance scaling linearly with thread count on this embarrassingly parallel problem?
You are likely limited by memory bandwidth. This is common for small matrix operations, where the ratio of floating-point operations to bytes transferred is much lower than for large matrix multiplication.
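To put rough numbers on that ratio, here is a minimal sketch that estimates flops per byte for a square double-precision product. The function names are made up for illustration, and the estimate assumes each operand crosses the memory bus exactly once (read A, read B, write C), which is the cold-cache best case and not a measurement of the system in question.

```cpp
#include <cassert>
#include <cstddef>

// Flops per byte for C = A * B with n x n double-precision matrices.
// Assumption: A and B are each read once and C is written once.
double gemm_intensity(std::size_t n) {
    assert(n > 0);
    const double flops = 2.0 * n * n * n;               // n^3 multiply-add pairs
    const double bytes = 3.0 * n * n * sizeof(double);  // A, B and C
    return flops / bytes;                               // simplifies to n / 12
}

// Flops per byte for y = A * x with an n x n matrix.
// Assumption: A and x are read once and y is written once.
double gemv_intensity(std::size_t n) {
    assert(n > 0);
    const double flops = 2.0 * n * n;
    const double bytes = (static_cast<double>(n) * n + 2.0 * n) * sizeof(double);
    return flops / bytes;                               // always below 0.25
}
```

Under these assumptions the matrix-matrix ratio grows as n / 12, so n = 24 gives about 2 flops per byte while a large n gives hundreds, and the matrix-vector ratio stays below 0.25 for any n. Multiplying the ratio by a measured memory bandwidth gives a rough ceiling on the achievable flop rate when the operands are not already in cache.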
Interesting. Shouldn't the small matrices fit in the per-core L1 caches (though I suppose that is more related to latency)? Why would the memory bandwidth be insufficient to supply all the cores? Are there any tricks I can use (e.g., alignment, prefetching, thread affinity) to improve the achieved bandwidth?
The matrices will fit in cache, but it is expensive to get them there.
Is there any re-use of the small matrices and vectors, or re-use of the results, that you can take advantage of? After one of these is used, it will be in cache, and subsequent uses on that core will be much faster. You might need to reorganize your algorithm to exploit this: when you compute a result, use it immediately, rather than computing a bunch of results and then moving on to the next step where you consume them.
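As an illustration of that reorganization, here is a minimal sketch contrasting a two-pass loop with a fused one. Everything in it is hypothetical: the "consume" step is assumed to be a simple sum of squared norms, the data layout is assumed to be contiguous row-major, and the plain-loop matvec stands in for whatever BLAS or JIT kernel is actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = A * x for a row-major n x n matrix (stand-in for a BLAS call).
void matvec(const double* A, const double* x, double* y, std::size_t n) {
    for (std::size_t r = 0; r < n; ++r) {
        double acc = 0.0;
        for (std::size_t c = 0; c < n; ++c) acc += A[r * n + c] * x[c];
        y[r] = acc;
    }
}

// Hypothetical consumer: squared Euclidean norm of a length-n vector.
double squared_norm(const double* v, std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) acc += v[i] * v[i];
    return acc;
}

// Two-pass: store every result, then sweep over all of them again.
// For a large batch, early results may have left cache before they are consumed.
double two_pass(const std::vector<double>& As, const std::vector<double>& xs,
                std::size_t n, std::size_t count) {
    assert(As.size() == count * n * n && xs.size() == count * n);
    std::vector<double> ys(count * n);
    for (std::size_t i = 0; i < count; ++i)
        matvec(&As[i * n * n], &xs[i * n], &ys[i * n], n);
    double total = 0.0;
    for (std::size_t i = 0; i < count; ++i)
        total += squared_norm(&ys[i * n], n);
    return total;
}

// Fused: consume each result right after producing it, reusing one small buffer.
double fused(const std::vector<double>& As, const std::vector<double>& xs,
             std::size_t n, std::size_t count) {
    assert(As.size() == count * n * n && xs.size() == count * n);
    std::vector<double> y(n);
    double total = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        matvec(&As[i * n * n], &xs[i * n], y.data(), n);
        total += squared_norm(y.data(), n);
    }
    return total;
}
```

Both functions return the same value; the fused version avoids writing and re-reading the intermediate results. In a threaded setting, the body of the fused loop is what would go inside each parallel_for iteration, with one scratch buffer per thread.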