dgemm: slow performance

vahid_s_ · ‎12-04-2013

I am using the dgemm function from the Intel Math Kernel Library (MKL) in my Fortran source code to do the following matrix calculation:

X = A - transpose(B) * C

A is a 400*400 dense matrix, B is a 10000*400 sparse matrix, and C is a 10000*400 matrix.

CALL dgemm('T', 'N', 400, 400, 10000, -1.d0, B, 10000, C, 10000, 1.d0, A, 400)

This operation takes about 3.5 seconds, which is a lot in my program. 1. Is 3.5 seconds a reasonable amount of time for this operation? 2. Is there any way to speed up the process?

I am using a Dell computer with an Intel Core 2 Duo CPU, and my MKL version is 10.0.1, which is relatively old. My last question: if I switch to the most recent MKL, can I expect a significant improvement in matrix multiplication performance?
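
A rough sanity check on question 1: a dense product with the dimensions in the dgemm call above costs about 2*m*n*k floating-point operations. The short Python sketch below only does that arithmetic (Python is used for convenience; the variable names are illustrative and not part of MKL) and works out the throughput implied by the reported 3.5 seconds.

```python
# Dimensions from the dgemm call: transpose(B) is m x k, C is k x n, A is m x n.
m, n, k = 400, 400, 10000

# A dense matrix product performs one multiply and one add per (i, j, l) triple.
flops = 2 * m * n * k

elapsed_seconds = 3.5  # timing reported in the post
gflops = flops / elapsed_seconds / 1e9

print(f"{flops:.2e} flops, {gflops:.2f} GFLOP/s")
```

This gives about 3.2e9 operations and roughly 0.9 GFLOP/s. A tuned dgemm on a CPU of that generation is generally expected to reach several GFLOP/s per core, so the measured rate looks low. The exact peak depends on the clock speed, which is not given here.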

Zhang_Z_Intel · ‎12-04-2013

A few suggestions:

MKL 10.0.1 is 5 years old. Please upgrade to the latest version (MKL 11.1.1); you should see a significant performance improvement across the board.
If B is sparse, why do you use DGEMM to compute transpose(B)*C? Do you actually store B in full-storage format? Can you try a sparse matrix storage format (e.g. CSR, BSR, COO) for B and call one of the sparse matrix-matrix multiply functions? Find more details about these functions here: http://software.intel.com/en-us/node/468534
Are you using sequential MKL or parallel MKL? How many threads do you use? Is the memory for the input data (A, B and C) aligned? Follow this link for performance tips: http://software.intel.com/en-us/node/438624
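
To illustrate why a sparse format for B can help, here is a minimal pure-Python sketch of X = A - transpose(B)*C with B held in CSR form. It is not MKL's sparse API: the helper names `dense_to_csr` and `subtract_bt_times_c` are hypothetical, and the tiny matrices are made up for the example. The point is only that the work is proportional to the number of nonzeros in B times the number of columns of C, instead of m*n*k.

```python
def dense_to_csr(B):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in B:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row's nonzeros
    return values, col_idx, row_ptr


def subtract_bt_times_c(A, csr_B, C):
    """Return X = A - transpose(B) * C, with B given as CSR arrays."""
    values, col_idx, row_ptr = csr_B
    X = [row[:] for row in A]  # copy so A is left untouched
    n = len(C[0])
    for i in range(len(row_ptr) - 1):
        c_row = C[i]
        # Each nonzero B[i][j] contributes B[i][j] * C[i][:] to row j of B^T * C.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            j, b = col_idx[p], values[p]
            x_row = X[j]
            for q in range(n):
                x_row[q] -= b * c_row[q]
    return X


# Made-up example: B is 3 x 2 with two nonzeros, C is 3 x 2, A is 2 x 2.
B = [[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
C = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
A = [[10.0, 10.0], [10.0, 10.0]]

csr_B = dense_to_csr(B)
X = subtract_bt_times_c(A, csr_B, C)
print(X)
```

In a real program the same idea would be handled by one of the MKL sparse routines linked above rather than hand-written loops.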

vahid_s_ · ‎12-05-2013

Hi Zhang Z,

Thanks for your reply and good suggestions.

I tried both the sequential and parallel versions, but I could not see a significant improvement in performance.

I also changed the number of threads from the default (the maximum number of threads possible) down to 1 thread, but nothing significant happened.

I store B in full-storage format. I am definitely going to try the sparse multiply function.

I found out that the speed depends directly on the number of columns of the C matrix (I changed the number of rows too, but it was not really a major factor). With the number of columns equal to 11000, I get the following results:

Number of rows = 363, Time = 3.34 seconds

Number of rows = 120, Time = 0.3 seconds

Number of rows = 30, Time = 0.016 seconds

Could you please explain why the timing differs so much with the number of rows? Does it make sense to you? Has this part been improved in the new version of MKL?
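
One possible reading of these numbers, sketched below in Python, rests on an assumption the post does not confirm: that the "number of rows" sets both output dimensions (m = n) and that 11000 is the inner dimension k. Under that assumption the operation count is 2*m*n*k, so it grows with the square of the row count. The variable names are illustrative.

```python
# ASSUMPTION: "rows" is both m and n of the result, and k = 11000 is the
# inner (summed-over) dimension. The post does not state this explicitly.
k = 11000
measurements = [(363, 3.34), (120, 0.3), (30, 0.016)]  # (rows, seconds) from the post

rates = []
for size, seconds in measurements:
    flops = 2 * size * size * k          # multiply-add count for a dense product
    rates.append(flops / seconds / 1e9)  # implied GFLOP/s

for (size, seconds), rate in zip(measurements, rates):
    print(f"rows={size:4d}  time={seconds:6.3f} s  rate={rate:.2f} GFLOP/s")
```

If the assumption holds, all three timings correspond to roughly 0.9 to 1.2 GFLOP/s, which would mean the throughput is about constant and the large differences in time simply follow the quadratic growth in work.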

Zhang_Z_Intel · ‎12-06-2013

First, please try the latest MKL 11.1.1. Second, if you still see slow performance then post your test code here to facilitate further investigation and discussion.