Question about DGEMM

Xuan_Z_ · ‎12-14-2015

I am using DGEMM from MKL to do multiplication between matrix and vectors.

I found when I test in simple program, just calling DGEMM 200000 times to compute 256*256 matrix times 256*1 vector, it takes only about 7 seconds (nthreads=8).

I my real poisson solver, which need this multiplication 200000 times, still 256*256 matrix times 256*1 vector, it takes 2 min, which is much much slower than in simple test.

Could anyone suggest any reason about this low performance? My poisson solver is openmp code.

Thanks in advance!

Sincerely,

Xuan

mecej4 · ‎12-14-2015

Your poisson solver probably does something more than just calling DGEMM, so it is not clear what part of the 2 minutes was actually used up by DGEMM. If you are concerned about the time used up in DGEMM, you should take steps to meter the time spent in DGEMM separately from the time spent elsewhere.

Ying_H_Intel · ‎12-14-2015

Hi Xuan,

You may try export or set MKL_VERBOSE=1 in two cases as https://software.intel.com/en-us/articles/verbose-mode-supported-in-intel-mkl-112 and let us know the output result.

In addition, could you please let us know some basic informations, like CPU/OS, MKL version, Compiler version, C or fortran, 32bit or 64bit etc?

You mentioned My poisson solver is openmp code, so are you using GNU omp or libiomp5.so(Intel OpenMP)? and the 200000 times to compute 256*256 matrix times 256*1 vector is in the Openmp loop (so loop nested)? it would be helpful if you provide a reproduced sample code

Best Regards,
Ying H.
Intel MKL Support

Xuan_Z_ · ‎12-16-2015

Hi mecej4,

I did performance profile, and found:

Most of the time is spent by '_kmp_launch_thread' and '_kmp_execute_tasks'. Do you know what are these subroutine used for?

Thank you!

mecej4 wrote:

Your poisson solver probably does something more than just calling DGEMM, so it is not clear what part of the 2 minutes was actually used up by DGEMM. If you are concerned about the time used up in DGEMM, you should take steps to meter the time spent in DGEMM separately from the time spent elsewhere.

Xuan_Z_ · ‎12-16-2015

Hi Ying,

Thank you for the comments!

My Compiler is ifort of version 11.1, fortran, 64 bit.

I use libiopm5.so, the 200000 operations of matrix-vector multiplication is not within the Openmp loop.

Here is the simple test code for multiplication:

program matrixmul

implicit none

integer*4 :: i,j,k,n,nthreads

real*8, allocatable, dimension(:,:) :: a
real*8, allocatable, dimension(:) :: v,v2

n=256

nthreads=8

allocate( a(n,n) )
allocate( v(n))
allocate( v2(n))

call OMP_SET_NUM_THREADS(nthreads)

do j=1,n
v(j)=j*2.0
do i=1,n
a(i,j) = j*1.0
end do
end do

do i=1,200000
call DGEMM("N","N",n,1,n,1.d0,a,n,v,n,0.d0,v2,n)
end do

print *,'done'

It is faster than my openmp version of multiplication.

program matrixmul
implicit none

integer*4 :: i,j,k,n,nthreads, chunk,m

real*8, pointer :: a(:,:)
real*8, pointer :: v(:),v2(:)

n=256

nthreads=8
chunk=10

call OMP_SET_NUM_THREADS(nthreads)

allocate( a(n,n) )
allocate( v(n) )
allocate( v2(n) )

do j=1,n
v(j)=j*2.0
do i=1,n
a(i,j) = j*1.0

end do
end do

do m=1,200000

!$OMP PARALLEL SHARED(A,v,v2,CHUNK) PRIVATE(I,K)

!$OMP DO SCHEDULE(STATIC, CHUNK)

do i = 1, n
do k = 1, n
v2(i) = v2(i) + a(i,k) * v(k)
end do
end do

!$OMP END PARALLEL

enddo

print *,'done'

Then I tested in the real solver.

Here is part of my solver in which multiplications are done:
! initialize x

x(0:,1,0:)=0.0

! call mul2(dpinv1(0:,0:,m),cp(0:,1,m),n,x(0:,1,m),nthread)

call DGEMM("N","N",n+1,1,n+1,1.d0,dpinv1(0:,0:,m),n+1,cp(0:,1,m),n+1,0.d0,x(0:,1,m),n+1)

! solve for x from the p
do k = m-1, 0, -1

! call mul2(p1(0:,0:,k),x(0:,1,k+1),n,tmp(0:,1,k),nthread)

call DGEMM("N","N",n+1,1,n+1,1.d0,p1(0:,0:,k),n+1,x(0:,1,k+1),n+1,0.d0,tmp(0:,1,k),n+1)

! call mul2(dpinv1(0:,0:,k),cp(0:,1,k),n,tmp2(0:,1,k),nthread)

call DGEMM("N","N",n+1,1,n+1,1.d0,dpinv1(0:,0:,k),n+1,cp(0:,1,k),n+1,0.d0,tmp2(0:,1,k),n+1)

x(0:,1,k) = tmp2(0:,1,k)-tmp(0:,1,k)

tmp=0.0
tmp2=0.0

end do

psol(0:n,0:m)= x(0:n,1,0:m)

And the other version is using 'call mul2(...)' to do multiplication and I compared the performance between them.

In the real solver, it takes 1m53s in mkl version, while in openmp it takes 51s. Performance profile shows that, most of the time is spent on two subroutines: "_kmp_launch_thread" and "_kml_execute_tests". Do you know why these subroutines come up? Thank you!

Sincerely,

Xuan

Ying H (Intel) wrote:

Hi Xuan,

You may try export or set MKL_VERBOSE=1 in two cases as https://software.intel.com/en-us/articles/verbose-mode-supported-in-intel-mkl-112  and let us know the output result.

In addition, could you please let us know some basic informations, like CPU/OS,  MKL version, Compiler version, C or fortran, 32bit or 64bit etc?

You mentioned My poisson solver is openmp code, so are you using GNU omp or libiomp5.so(Intel OpenMP)? and the  200000 times to compute 256*256 matrix times 256*1 vector is in the Openmp loop (so loop nested)? it would be helpful if you provide a reproduced sample code

Best Regards,
Ying H.
Intel MKL Support

Ying_H_Intel · ‎12-22-2015

Hi Xuan,

The "_kmp_launch_thread" and "_kml_execute_tests" is from Intel OpenMP thread library, which used by the OpenMP directive

and MKL internally.

I can't see the profile image you attached (maybe because of internal network issue), do you know how many OpenMP in the application? you can try the command:

export OMP_AFFINITY=verbose to see if there is over or (nested) threads?

compute 256*256 matrix times 256*1 vector, it should be dgemv operation, right? I recalled we should improve the

multi-thread performance of for such operations in later version,

So I may suggest you to try the latest MKL version, MKL 11.3.1, you can apply and get it from

http://software.intel.com/sites/campaigns/nest/ freely.

Best Regards,
Ying

McCalpinJohn · ‎12-28-2015

It is always a good idea to have some idea of how long a computation *should* take.... Part of that requires that we know what hardware is being used, which still has not been specified.

As noted above, the 256x256 matrix multiplying a 256x1 vector is a DGEMV operation. Although it is a valid DGEMM operation, the DGEMM code is heavily optimized for the data re-use that is available with multiple columns in both arguments, so it is probably not the most efficient way of handling this case -- perhaps by a fairly large ratio.

It is possible to estimate how long a DGEMV of this size should run....

The 256x256 array of "real*8" variables occupies 512 KiB, which means that it is larger than L2, but smaller than L3.
But in the OpenMP version above, each of the 8 threads will only access 1/8 of the array in order to update their portion of the output.
- This means that each thread will only access 512/8=64 KiB of data from the "a" array, which should fit easily into the 256 KiB private L2 cache of most recent Intel processors.
The "v" and "v2" arrays are only 4KiB each, so they should stay in the L1 Data Cache.

On Sandy Bridge and Ivy Bridge cores, I typically measure L2 bandwidth at about 14 Bytes/cycle per core. With 8 threads, each thread will need to read 64 KiB (1/8 of the rows of array "a") each iteration, so 200,000 iterations means that each thread must read 13.1e9 bytes. At 14 Bytes/cycle, this is just under 1 billion cycles, which is less than 1/2 second on a system running at 2 GHz or faster. Haswell cores have about twice the L2 bandwidth, so should take about half as long.

TimP · ‎12-28-2015

As you're using ifort, if you don't care to think about which BLAS function fits your matrix multiplication requirements, you can use the Fortran intrinsic MATMUL with the compile option -opt-matmul (implied by -O3) to have the choice of library entry points made by the compiler.

I guess the corresponding option in gfortran will never call dgemv, so there may be a case for choosing it explicitly.

When I saw this thread, I assumed a choice of dgemm on account of having multiple vectors (stored as a matrix) to process at one time.