Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Xuan_Z_
Beginner
49 Views

Question about DGEMM

I am using DGEMM from MKL to do multiplication between matrix and vectors.

I found when I test in simple program, just calling DGEMM 200000 times to compute 256*256 matrix times 256*1 vector, it takes only about 7 seconds (nthreads=8).

I my real poisson solver, which need this multiplication 200000 times, still 256*256 matrix times 256*1 vector, it takes 2 min, which is much much slower than in simple test. 

Could anyone suggest any reason about this low performance? My poisson solver is openmp code.

Thanks in advance!

Sincerely,

Xuan

 

0 Kudos
7 Replies
mecej4
Black Belt
49 Views

Your poisson solver probably does something more than just calling DGEMM, so it is not clear what part of the 2 minutes was actually used up by DGEMM. If you are concerned about the time used up in DGEMM, you should take steps to meter the time spent in DGEMM separately from the time spent elsewhere. 

Ying_H_Intel
Employee
49 Views

Hi Xuan, 

You may try export or set  MKL_VERBOSE=1 in two cases as https://software.intel.com/en-us/articles/verbose-mode-supported-in-intel-mkl-112  and let us know the output result.  

In addition, could you please let us know some basic informations, like  CPU/OS,  MKL version, Compiler version, C or fortran, 32bit or 64bit etc? 

You mentioned My poisson solver is openmp code, so are you using GNU omp or libiomp5.so(Intel OpenMP)?  and the  200000 times to compute 256*256 matrix times 256*1 vector is in the Openmp loop (so loop nested)?  it would be helpful if you provide  a reproduced sample code   

Best Regards,
Ying H.
Intel MKL Support 

 

Xuan_Z_
Beginner
49 Views

 

 

Hi  mecej4,

I did performance profile, and found:

Most of the time is spent by '_kmp_launch_thread' and '_kmp_execute_tasks'. Do you know what are these subroutine used for?

Thank you!

 

 

mecej4 wrote:

Your poisson solver probably does something more than just calling DGEMM, so it is not clear what part of the 2 minutes was actually used up by DGEMM. If you are concerned about the time used up in DGEMM, you should take steps to meter the time spent in DGEMM separately from the time spent elsewhere. 

 

 

 

Xuan_Z_
Beginner
49 Views

Hi Ying,

 

Thank you for the comments!

My Compiler is ifort of version 11.1, fortran, 64 bit.

I use libiopm5.so, the 200000 operations of matrix-vector multiplication is not within the Openmp loop.  

Here is the simple test code for multiplication:

      program matrixmul

       implicit none

       integer*4 :: i,j,k,n,nthreads

       real*8, allocatable, dimension(:,:) :: a
       real*8, allocatable, dimension(:) :: v,v2

       

      n=256

      nthreads=8


       allocate( a(n,n) )
       allocate( v(n))
       allocate( v2(n))

        call OMP_SET_NUM_THREADS(nthreads)


        do j=1,n
         v(j)=j*2.0
         do i=1,n
            a(i,j) = j*1.0
           end do
       end do 

do i=1,200000
      call DGEMM("N","N",n,1,n,1.d0,a,n,v,n,0.d0,v2,n)
end do

print *,'done'

It is faster than my openmp version of multiplication.

       program matrixmul
       implicit none

       integer*4 :: i,j,k,n,nthreads, chunk,m


         real*8, pointer :: a(:,:)
         real*8, pointer :: v(:),v2(:) 

  n=256

  nthreads=8
          chunk=10

        call OMP_SET_NUM_THREADS(nthreads)


       allocate( a(n,n) )
       allocate( v(n) )
       allocate( v2(n) )


        do j=1,n
                 v(j)=j*2.0
         do i=1,n
            a(i,j) = j*1.0

           end do
       end do 

 

do m=1,200000


!$OMP PARALLEL SHARED(A,v,v2,CHUNK) PRIVATE(I,K) 
 


!$OMP DO SCHEDULE(STATIC, CHUNK)


    do i = 1, n
      do k = 1, n
        v2(i) = v2(i) + a(i,k) * v(k)
      end do
    end do

!$OMP END PARALLEL
      
enddo


      print *,'done'

Then I tested in the real solver.

Here is part of my solver in which multiplications are done: 
! initialize x

         x(0:,1,0:)=0.0

!         call mul2(dpinv1(0:,0:,m),cp(0:,1,m),n,x(0:,1,m),nthread)
          

         call DGEMM("N","N",n+1,1,n+1,1.d0,dpinv1(0:,0:,m),n+1,cp(0:,1,m),n+1,0.d0,x(0:,1,m),n+1)

! solve for x from the p
         do k = m-1, 0, -1

!         call mul2(p1(0:,0:,k),x(0:,1,k+1),n,tmp(0:,1,k),nthread)

         call DGEMM("N","N",n+1,1,n+1,1.d0,p1(0:,0:,k),n+1,x(0:,1,k+1),n+1,0.d0,tmp(0:,1,k),n+1)

!         call mul2(dpinv1(0:,0:,k),cp(0:,1,k),n,tmp2(0:,1,k),nthread)

         call DGEMM("N","N",n+1,1,n+1,1.d0,dpinv1(0:,0:,k),n+1,cp(0:,1,k),n+1,0.d0,tmp2(0:,1,k),n+1)

         x(0:,1,k) = tmp2(0:,1,k)-tmp(0:,1,k)

         tmp=0.0
         tmp2=0.0  
     
         end do

        psol(0:n,0:m)= x(0:n,1,0:m)

And the other version is using 'call mul2(...)' to do multiplication and I compared the performance between them.

In the real solver, it takes 1m53s in mkl version, while in openmp it takes 51s. Performance profile shows that, most of the time is spent on two subroutines: "_kmp_launch_thread" and "_kml_execute_tests". Do you know why these subroutines come up? Thank you!

Sincerely,

Xuan

 

 

 

 

Ying H (Intel) wrote:

Hi Xuan, 

You may try export or set  MKL_VERBOSE=1 in two cases as https://software.intel.com/en-us/articles/verbose-mode-supported-in-intel-mkl-112  and let us know the output result.  

In addition, could you please let us know some basic informations, like  CPU/OS,  MKL version, Compiler version, C or fortran, 32bit or 64bit etc? 

You mentioned My poisson solver is openmp code, so are you using GNU omp or libiomp5.so(Intel OpenMP)?  and the  200000 times to compute 256*256 matrix times 256*1 vector is in the Openmp loop (so loop nested)?  it would be helpful if you provide  a reproduced sample code   

Best Regards,
Ying H.
Intel MKL Support 

 

Ying_H_Intel
Employee
49 Views

Hi Xuan,

The "_kmp_launch_thread" and "_kml_execute_tests" is from Intel OpenMP thread library, which used by the OpenMP directive

and MKL internally.

I can't see the profile image you attached (maybe because of internal network issue), do you know how many OpenMP in the application? you can try the command:

 export OMP_AFFINITY=verbose to see if there is over or (nested) threads?

compute 256*256 matrix times 256*1 vector, it should be dgemv  operation, right?   I  recalled we should improve the

multi-thread performance of for such operations in later version,

So I may suggest you to try  the latest MKL version, MKL 11.3.1, you can apply and get it from

http://software.intel.com/sites/campaigns/nest/ freely.

Best Regards,
Ying

McCalpinJohn
Black Belt
49 Views

It is always a good idea to have some idea of how long a computation *should* take....  Part of that requires that we know what hardware is being used, which still has not been specified.

As noted above, the 256x256 matrix multiplying a 256x1 vector is a DGEMV operation.   Although it is a valid DGEMM operation, the DGEMM code is heavily optimized for the data re-use that is available with multiple columns in both arguments, so it is probably not the most efficient way of handling this case -- perhaps by a fairly large ratio. 

It is possible to estimate how long a DGEMV of this size should run....

  • The 256x256 array of "real*8" variables occupies 512 KiB, which means that it is larger than L2, but smaller than L3.  
  • But in the OpenMP version above, each of the 8 threads will only access 1/8 of the array in order to update their portion of the output. 
    • This means that each thread will only access 512/8=64 KiB of data from the "a" array, which should fit easily into the 256 KiB private L2 cache of most recent Intel processors. 
  • The "v" and "v2" arrays are only 4KiB each, so they should stay in the L1 Data Cache.

On Sandy Bridge and Ivy Bridge cores, I typically measure L2 bandwidth at about 14 Bytes/cycle per core.  With 8 threads, each thread will need to read 64 KiB (1/8 of the rows of array "a") each iteration, so 200,000 iterations means that each thread must read 13.1e9 bytes.  At 14 Bytes/cycle, this is just under 1 billion cycles, which is less than 1/2 second on a system running at 2 GHz or faster.    Haswell cores have about twice the L2 bandwidth, so should take about half as long.

 

TimP
Black Belt
49 Views

As you're using ifort, if you don't care to think about which BLAS function fits your matrix multiplication requirements, you can use the Fortran intrinsic MATMUL with the compile option -opt-matmul (implied by -O3) to have the choice of library entry points made by the compiler.

I guess the corresponding option in gfortran will never call dgemv, so there may be a case for choosing it explicitly.

When I saw this thread, I assumed a choice of dgemm on account of having multiple vectors (stored as a matrix) to process at one time.