Intel MKL performance question

Wich · ‎05-14-2011

Hello!

I need to solve a very big linear system and I was thinking about using the MKL library.

So just to get started I decided to make practice with a very simple problem just to get the feel. This is the

source code:

[fortran]program Main

! Equazione di Laplace su un dominio quadrato di lato L
! L*L equazioni lineari in altrettante incognite
! Matrice simmetrica sparsa

	implicit none
	integer,parameter:: L = 100
	integer,parameter	:: N = L*L
	
	real(8)				:: A(N,N),Mat(N,N)
	real(8)				:: b(N),x(N)
	integer				:: i,j
	integer				:: IPIV(N)
	integer				:: INFO
	real				:: start,end
	
	integer :: k, row, col, ind
	real(8),parameter :: H = 3
	real(8)	:: bndrow0(L), bndrowL1(L), bndcol0(L), bndcolL1(L)
	real(8) :: val,y(N),ALPHA,BETA,maxerr,minerr
	
	! Condizioni al contorno
	do k=1,L
		val = H * (k - 1) / (L - 1)
		bndrow0(k) = val
		bndrowL1(L-k+1) = val
		bndcol0(k) = val
		bndcolL1(L-k+1) =	val
	enddo
	
	! Inizializzo il termine noto
	b(indx(1, 1)) = - bndrow0(1) - bndcol0(1)
	do col=2,L-1
		b(indx(1,col)) = - bndrow0(col)
	enddo
	b(indx(1,L)) = -bndrow0(L) - bndcolL1(1)
	do row=2,L-1
		b(indx(row,1)) = - bndcol0(row)
		do col=2,L-1
			b(indx(row,col)) = 0
		enddo
		b(indx(row,L)) = - bndcolL1(row)
	enddo
	b(indx(L,1)) = - bndrowL1(1) - bndcol0(L)
	do col=2,L-1
		b(indx(L,col)) = - bndrowL1(col)
	enddo
	b(indx(L,L)) = - bndrowL1(L) - bndcolL1(L)
	
	! Inizializzo la matrice
	row = 1
	col = 1
	ind = indx(row, col)
	A(ind, ind)= -4
	A(ind, indx(row+1,col)) = 1
	A(ind, indx(row,col+1)) = 1
	
	do col=2,L-1
		ind = indx(row,col)
		A(ind,ind) = -4
		A(ind,indx(row,col-1)) = 1
		A(ind,indx(row+1,col)) = 1
		A(ind,indx(row,col+1)) = 1
	enddo
	
	col = L
	ind = indx(row,col)
	A(ind,ind) = -4
	A(ind,indx(row,col-1)) = 1
	A(ind,indx(row+1,col)) = 1
	
	do row=2,L-1
		col = 1
		ind = indx(row,col)
		A(ind,ind) = -4
		A(ind,indx(row-1,col)) = 1
		A(ind,indx(row+1,col)) = 1
		A(ind,indx(row,col+1)) = 1
		do col=2,L-1
			ind = indx(row,col)
			A(ind,ind) = -4
			A(ind,indx(row-1,col)) = 1
			A(ind,indx(row,col-1)) = 1
			A(ind,indx(row+1,col)) = 1
			A(ind,indx(row,col+1)) = 1
		enddo
		col = L
		ind = indx(row,col)
		A(ind,ind) = -4
		A(ind,indx(row-1,col)) = 1
		A(ind,indx(row,col-1)) = 1
		A(ind,indx(row+1,col)) = 1
	enddo
	
	row = L
	col = 1
	ind = indx(row,col)
	A(ind,ind) = -4
	A(ind,indx(row,col+1)) = 1
	A(ind,indx(row-1,col)) = 1
	
	do col=2,L-1
		ind = indx(row,col)
		A(ind,ind) = -4
		A(ind,indx(row,col+1)) = 1
		A(ind,indx(row-1,col)) = 1
		A(ind,indx(row,col-1)) = 1
	enddo
	
	col = L
	ind = indx(row,col)
	A(ind,ind) = -4
	A(ind, indx(row-1,col)) = 1
	A(ind,indx(row,col-1)) = 1
	
	Mat = A
	x = b
	

	write (*,*) 'Chiamo DGESV'
	call CPU_TIME(start)
	call DGESV(N,1,Mat,N,IPIV,x,N,INFO)
	call CPU_TIME(end)
	write(*,*) INFO			! Verifica correttezza
	write(*,*) end-start	! Tempo passato in secondi
	
	y=b
	ALPHA=1
	BETA=-1
	call DGEMV('n',N,N,ALPHA,A,N,x,1,BETA,y,1)
	maxerr=maxval(y)
	minerr=minval(y)
	write (*,*) 'Errori:',minerr,maxerr
	
!	do row=1,L
!		do col=1,L
!		 write (*,*) row,col,x(indx(row,col))
!		enddo
!	enddo
	
contains
integer function indx(r, c)
! Fornisce l'indice della riga di A corrispondente all'equazione nel punto
! r, c del dominio, che  anche l'indice della colonna di A corrispondente
! al valore della funzione in quel punto.
integer::r, c
indx = (r - 1) * L + c
end function indx

end program Main[/fortran]

So if I compile using the MKL library(I used the online tool to generate the options) i get this result:

[bash]MacBook-Pro:Prova Luca$ ifort Main.f90 /opt/intel/composerxe-201libmkl_intel_lp64.a /opt/intel/composerxe-2011.3.167/mkl/lib/libmkl_sequential.a /opt/intel/composerxe-2011.3.167/mkl/lib/libmkl_core.a -mmacosx-version-min=10.6
MacBook-Pro:Prova Luca$ ./a.out 
 Chiamo DGESV
           0
   62.53065    
 Errori: -1.110223024625157E-014  1.154631945610163E-014[/bash]

while if I compile using the BLAS/LAPACK routines taken from the netlib site(I put them in the BLAS.f and LAPACK.f files) I get this:

[bash]MacBook-Pro:Prova Luca$ ifort Main.f90 BLAS.f LAPACK.f -mmacosx-version-min=10.6
MacBook-Pro:Prova Luca$ ./a.out 
 Chiamo DGESV
           0
   5.780609    
 Errori: -1.376676550535194E-014  1.554312234475219E-014[/bash]

which seems to me a huge difference in execution speed(62 vs 6 seconds).

So I am doing something wrong here?

If it helps I am using Mac OS X 10.6.7 with the compiler/MKL update 3.

Cheers!

Luca

Gennady_F_Intel · ‎05-15-2011

Luca, all are ok at the first glance. can give these files (I mean blas.f & lapack.f) to reproduce the problem?

Wich · ‎05-15-2011

Hello!

Just noticed that there is a blunder in the command I posted when linking against the MKL library, the correct command that I'm using is:

[bash]ifort Main.f90 /opt/intel/composerxe-2011.3.167/mkl/lib/libmkl_intel_lp64.a /opt/intel/composerxe-2011.3.167/mkl/lib/libmkl_sequential.a /opt/intel/composerxe-2011.3.167/mkl/lib/libmkl_core.a -mmacosx-version-min=10.6[/bash]

otherwise it returns an error.

I'm attaching to this post the two requested files(as I said I've simply taken the necessary routines/functions from the netlib site).

Cheers!

Luca

Tamara_K_Intel · ‎05-17-2011

Luca,

Comparison of MKL and netlib versions of codes is not fair and below Ill try to explain it. Shortly, despite the codes coincide by functionality and interface but they have different features.

You use DGESV (solver for system of linear algebraic equations) for dense matrices although your matrix is sparse.

Number of FP operations (num_op) of DGESV is equal 2/3*N^3 + N^2 ( N dimension of matrix) if number of right hand sidesis equal to 1.

Formula to compute Performance in Gflops:

Gflops = num_op/time(sec)/10^9;

It is easy to compute MKL Performance in Gflops for your matrix ( N= 10000, time - 62.53065 ) - 10.66 Gflops. It is nice Performance.

Netlib spends less time for solving your system because Netlib skipsmultiplication operations with ZEROs and as a result Netlib executes less FP operations. But MKL executes all multiplication operations including ZEROs.

If you add perturbation instead of ZEROs in your matrix you will able to get real Netlib Performance for dense matrices.

For sparse matrices you can use MKL PARDISO. If you need to solve Laplace Equation with boundary conditions you can use also Poisson Solver from MKLwhich is designed exactly to such problems.

P.S.

Code below from your BLAS.f skips multiplication operations with ZEROs:

DO 160 L = 1,K

IF (B(J,L).NE.ZERO) THEN

TEMP = ALPHA*B(J,L)

DO 150 I = 1,M

C(I,J) = C(I,J) + TEMP*A(I,L)

150 CONTINUE

END IF

mecej4 · ‎05-17-2011

It is worth noting that the lines of code that are shown in #3 occur in the heavily used DGEMM subroutine.

Wich · ‎05-30-2011

Hello!
Ah nice catch, indeed if I "perturb" the matrix the performance of the netlib versions drops significantly(while the the performance of the MKL versions remains basically the same):

[bash]MacBook-Pro:Desktop Luca$ ifort Main.f90 /opt/intel/composerxe-2011.4.184/mkl/lib/libmkl_intel_lp64.a /opt/intel/composerxe-2011.4.184/mkl/lib/libmkl_sequential.a /opt/intel/composerxe-2011.4.184/mkl/lib/libmkl_core.a
MacBook-Pro:Desktop Luca$ ./a.out
 Chiamo DGESV
           0
   61.69716   
 Errori: -1.209111109845562E-011  1.323935752695071E-012
MacBook-Pro:Desktop Luca$ rm a.out
MacBook-Pro:Desktop Luca$ ifort Main.f90 BLAS.f LAPACK.f
MacBook-Pro:Desktop Luca$ ./a.out
 Chiamo DGESV
           0
   286.2718   
 Errori: -9.265057471230520E-012  7.393842482716906E-013[/bash]

Cheers!

Luca