I have been experimenting trying to replace my own matrix-vector multiplications in the code I am working on with MKL gemv calls. My code is in Fortran and matrices are in CSR/BSR (one-based indexing but C-style row-major order within blocks). I use mkl_dcsrgemv for CSR and mkl_cspblas_dbsrgemv after subtracting 1 from ia and ja for BSR. There are about 6 non-zero elements per matrix row - generated from a Cartesian grid with a non-zero if the elements share a boundary.
The results of my experiments are somewhat surprising. I ran on two machines - with E5-2680v2 and E5-2667v3 and compiled with –O3 –xCORE-AVX-I to rule out any AVX2 effects on E5-2667v3. Using MKL_NUM_THREADS 1, square 12750x12750 matrix with 71950 non-zero elements I got the following:
Can you please take a look at the attached test code and help me understand my results? Why such a difference between the CPUs? More importantly, why MKL seems to be slower in most cases than a simple multiplication? Please share any suggestions regarding speeding up that code.
I am using ifort 16.0.0 and MKL 11.3.
Thank you very much for your help.
! own mat-vec multiplication module mlt contains subroutine matvec( n, blksize, x, y, a, ja, ia ) real*8 :: x(:), y(:), a(:) integer :: n, blksize, ja(:), ia(:) real*8 :: t, tn(blksize) integer :: i, j, k, iv, jv, ks, iblk, jblk if ( blksize==1 ) then ! CSR do i = 1, n t = 0.0d0 do k = ia(i), ia(i+1)-1 t = t + a(k)*x(ja(k)) enddo y(i) = t enddo else ! BSR do i = 1, n iv = (i-1)*blksize tn(:) = 0.0d0 do k = ia(i), ia(i+1)-1 j = ja(k) ks = (k-1)*blksize*blksize jv = (j-1)*blksize do iblk = 1, blksize do jblk = 1, blksize tn(iblk) = tn(iblk) + a(ks+(iblk-1)*blksize+jblk)*x(jv+jblk) enddo enddo enddo do iblk = 1, blksize y(iv+iblk) = tn(iblk) enddo enddo endif return end subroutine matvec