BSR gemv slower than simple Fortran matrix-vector multiplication

Marcin_R_ · ‎04-14-2016

Hello,

I have been experimenting trying to replace my own matrix-vector multiplications in the code I am working on with MKL gemv calls. My code is in Fortran and matrices are in CSR/BSR (one-based indexing but C-style row-major order within blocks). I use mkl_dcsrgemv for CSR and mkl_cspblas_dbsrgemv after subtracting 1 from ia and ja for BSR. There are about 6 non-zero elements per matrix row - generated from a Cartesian grid with a non-zero if the elements share a boundary.

The results of my experiments are somewhat surprising. I ran on two machines - with E5-2680v2 and E5-2667v3 and compiled with –O3 –xCORE-AVX-I to rule out any AVX2 effects on E5-2667v3. Using MKL_NUM_THREADS 1, square 12750x12750 matrix with 71950 non-zero elements I got the following:

with block size 1 (that is, CSR), using mkl_dcsrgemv:
- on E5-2680v2, MKL consistently outperforms my own code by 9-10%
- on E5-2667v3 MKL is slower than the same code by 4-5%

with block size 3:
- on E5-2680v2 MKL's mkl_cspblas_dbsrgemv is 4-5% slower than simple code
- on E5-2667v3 MKL's mkl_cspblas_dbsrgemv is over 20% slower than simple code
- I also tried making all the values inside the blocks equal (so that row-major and column-major are the same), so that I could use one-based indexing and mkl_dbsrgemv. MKL was still slower than own code, but on E5-2667v3 it was significantly faster than mkl_cspblas_dbsrgemv.

Can you please take a look at the attached test code and help me understand my results? Why such a difference between the CPUs? More importantly, why MKL seems to be slower in most cases than a simple multiplication? Please share any suggestions regarding speeding up that code.

I am using ifort 16.0.0 and MKL 11.3.

Thank you very much for your help.

      ! own mat-vec multiplication
      module mlt
      contains
      subroutine matvec( n, blksize, x, y, a, ja, ia )
      
      real*8  :: x(:), y(:), a(:)
      integer :: n, blksize, ja(:), ia(:)
      real*8  :: t, tn(blksize)
      integer :: i, j, k, iv, jv, ks, iblk, jblk

      if ( blksize==1 ) then
         ! CSR
         do i = 1, n
            t = 0.0d0
            do k = ia(i), ia(i+1)-1
               t = t + a(k)*x(ja(k))
            enddo
            y(i) = t
         enddo

      else
         ! BSR
         do i = 1, n
            iv = (i-1)*blksize
            tn(:) = 0.0d0
            do k = ia(i), ia(i+1)-1
               j = ja(k)
               ks = (k-1)*blksize*blksize
               jv = (j-1)*blksize
               do iblk = 1, blksize
                  do jblk = 1, blksize
                     tn(iblk) = tn(iblk) + a(ks+(iblk-1)*blksize+jblk)*x(jv+jblk)
                  enddo
               enddo
            enddo
            do iblk = 1, blksize
               y(iv+iblk) = tn(iblk)
            enddo
         enddo

      endif

      return
      end subroutine matvec

Regards,

Marcin

501613