I'm having trouble with matrix multiplication using Sparse BLAS. I am trying to multiply two huge matrices and compare multithreaded and single-threaded performance on a quad-core AMD Phenom II 940 with 4 GB of DDR3 RAM.
I am using mkl_scsrmm. In the benchmark, I repeat the call to mkl_scsrmm a hundred times and measure the total time in seconds. The matrices have 700,000 (dense) and 20,000 (sparse) elements.
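A minimal sketch of the kind of timing loop described above, not the actual benchmark code. `csr_mm` is a hypothetical plain-C stand-in for the `mkl_scsrmm` call, and `bench_csr_mm`, the 0-based CSR arrays, and the row-major dense layout are all assumptions made for illustration. It measures wall-clock time with C11 `timespec_get`, because `clock()` reports processor time, which on POSIX systems is summed over all threads and can hide a multithreaded speedup.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the MKL call: C = A * B, where
   A is m x k in 0-based CSR form (val, col_idx, row_ptr),
   B is k x n and C is m x n, both dense and row-major. */
static void csr_mm(int m, int n,
                   const float *val, const int *col_idx, const int *row_ptr,
                   const float *b, float *c)
{
    assert(m >= 0 && n >= 0);
    for (int i = 0; i < m; ++i) {
        float *crow = c + (size_t)i * n;
        for (int j = 0; j < n; ++j)
            crow[j] = 0.0f;
        /* Accumulate each stored entry of row i of A times a row of B. */
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p) {
            const float a = val[p];
            const float *brow = b + (size_t)col_idx[p] * n;
            for (int j = 0; j < n; ++j)
                crow[j] += a * brow[j];
        }
    }
}

/* Repeat the multiply `reps` times and return the total
   elapsed wall-clock time in seconds. */
static double bench_csr_mm(int reps, int m, int n,
                           const float *val, const int *col_idx,
                           const int *row_ptr, const float *b, float *c)
{
    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    for (int r = 0; r < reps; ++r)
        csr_mm(m, n, val, col_idx, row_ptr, b, c);
    timespec_get(&t1, TIME_UTC);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling `bench_csr_mm(100, m, n, val, col_idx, row_ptr, b, c)` would correspond to the hundred repetitions mentioned above; with MKL, the inner call would be replaced by the real `mkl_scsrmm` invocation.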
The problem is that the multithreaded run is only 5% faster than the single-threaded (sequential) run.
What is happening?