Visual Fortran 2011 and OpenMP are pretty new to me; I've been using C++ and C# for parallel programming on my system: a Dell Studio XPS with an Intel i7 860 quad core running 64-bit Windows 7.
For a 1024x1024 (or larger) matrix multiplication test, I'm finding that a Fortran OpenMP routine runs considerably slower than the same sequential routine. The OpenMP routine appears to be properly using 8 threads when I look at Windows Task Manager during execution, and the results of the multiplication look OK.
Could someone please take a look at the following code and tell me if I'm using OpenMP correctly?
!****************************************************************************
!
! PROGRAM: openMP_speed_test
!
! PURPOSE: Test use of OpenMP
!
!****************************************************************************
subroutine mat_mult_trans(A,B,C,N)
  ! Use the transpose of A in the multiplication to make better use of cache
  integer N
  real*8 A(N,N),B(N,N),C(N,N),temp
  ! transpose A in place
  do i=1,N-1
    do j=i+1,N
      temp=A(i,j)
      A(i,j)=A(j,i)
      A(j,i)=temp
    end do
  end do
  ! do the multiplication
  do i=1,N
    do j=1,N
      temp=0.0D0
      do k=1,N
        temp=temp+A(k,i)*B(k,j)
      end do
      C(i,j)=temp
    end do
  end do
  ! restore A
  do i=1,N-1
    do j=i+1,N
      temp=A(i,j)
      A(i,j)=A(j,i)
      A(j,i)=temp
    end do
  end do
end subroutine
subroutine par_mat_mult_trans(A,B,C,N)
  ! Same as mat_mult_trans, but using OpenMP
  integer N,nthreads,TID,chunk
  integer OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
  real*8 A(N,N),B(N,N),C(N,N),temp
  chunk=32
!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK)
  TID = OMP_GET_THREAD_NUM()
  IF (TID .EQ. 0) THEN
    NTHREADS = OMP_GET_NUM_THREADS()
    PRINT *, 'Starting matrix multiply with ', NTHREADS,' threads'
  END IF
  ! transpose A in place
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
  do i=1,N-1
    do j=i+1,N
      temp=A(i,j)
      A(i,j)=A(j,i)
      A(j,i)=temp
    end do
  end do
  ! do the multiplication
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
  do i=1,N
    do j=1,N
      temp=0.0D0
      do k=1,N
        temp=temp+A(k,i)*B(k,j)
      end do
      C(i,j)=temp
    end do
  end do
  ! restore A
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
  do i=1,N-1
    do j=i+1,N
      temp=A(i,j)
      A(i,j)=A(j,i)
      A(j,i)=temp
    end do
  end do
!$OMP END PARALLEL
end subroutine
program openMP_speed_test
  integer N, max_threads
  real(kind=8), allocatable :: A(:,:),B(:,:),C(:,:),D(:)
  real(kind=8), allocatable :: c2(:,:)
  real*8 t1,t2,num,diff
  character(len=1) ans
  ans='y'
  do while((ans.eq.'y').or.(ans.eq.'Y'))
    print *,"Matrix size?"
    read *,N
    allocate(A(N,N),B(N,N),C(N,N),c2(N,N),D(N))
    ! Fill A and B with random values in (-1,1); zero the result arrays
    do i=1,N
      do j=1,N
        call random_number(num)
        if (num<0.5D0) then
          num=-1.0D0
        else
          num=1.0D0
        end if
        call random_number(A(i,j))
        A(i,j)=num*A(i,j)
        call random_number(num)
        if (num<0.5D0) then
          num=+1.0D0
        else
          num=-1.0D0
        end if
        call random_number(B(i,j))
        B(i,j)=num*B(i,j)
        C(i,j)=0.0D0
        c2(i,j)=0.0D0
      end do
    end do
    ! Run standard matrix multiplication using transpose(A) for better cache use
    print *,"Starting mat_mult_trans."
    call cpu_time(t1)
    call mat_mult_trans(A,B,C,N)
    call cpu_time(t2)
    t2=t2-t1
    print 120, t2
120 format("mat_mult_trans execution time=",f6.3," secs")
    ! Run the Fortran matmul intrinsic
    print *,"Starting Fortran matmul."
    call cpu_time(t1)
    c2=matmul(A,B)
    call cpu_time(t2)
    t2=t2-t1
    print 140, t2
140 format("canned matmul execution time=",f6.3," secs")
    ! Check results vs. 'standard'
    diff=0.0D0
    do i=1,N
      do j=1,N
        if (abs(C(i,j)-c2(i,j))>diff) then
          diff=abs(C(i,j)-c2(i,j))
        end if
      end do
    end do
    print 121, diff
121 format("Maximum error=",e14.6)
    ! Run OpenMP version of 'standard' with transpose(A)
    print *,"Starting par_mat_mult_trans"
    call cpu_time(t1)
    call par_mat_mult_trans(A,B,c2,N)
    call cpu_time(t2)
    t2=t2-t1
    print 131, t2
131 format("par_mat_mult_trans execution time=",f6.3," secs")
    ! Check results vs. 'standard'
    diff=0.0D0
    do i=1,N
      do j=1,N
        if (abs(C(i,j)-c2(i,j))>diff) then
          diff=abs(C(i,j)-c2(i,j))
        end if
      end do
    end do
    print 121, diff
    ! Check for more work
    print *,"More?"
    read *,ans
    deallocate(A,B,C,c2,D)
  end do
end program
!********************************************************************************
In addition to the speed question, the times appear very variable. For three successive runs on 1024x1024 matrices, I get the following times (in secs.):
mat_mult_trans: 0.546 0.671 0.686
Fortran matmul: 0.234 0.234 0.234
par_mat_mult_trans: 0.983 1.310 1.420
TIA for any help,
Don Ritchie
1. Variables are shared by default, so the inner loop variables j and temp need to be declared private; the SHARED clause you have is not needed (see the sketch after this list).
2. Dynamic scheduling has an overhead in time, so don't use it unless you expect loops to vary in the amount of work.
3. Are you sure transposing A twice is faster than leaving it alone and swapping the index order in the multiplication loop?
There will be many more suggestions, I am sure.
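To illustrate point 1, here is a sketch of how the multiplication loop's directive might look with the private declarations supplied (variable names are taken from the original post; SCHEDULE(STATIC) reflects point 2 and is an assumption, not a measured recommendation):
!$OMP PARALLEL DO PRIVATE(j,k,temp) SCHEDULE(STATIC)
do i=1,N
  do j=1,N
    temp=0.0D0
    do k=1,N
      temp=temp+A(k,i)*B(k,j)
    end do
    C(i,j)=temp
  end do
end do
!$OMP END PARALLEL DO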
However, your code is a waste of time except as a learning exercise, since Intel provide the highly optimised MKL routines to do this job, and they are parallelized. Even the intrinsic MATMUL could be used, and a compiler switch will make it call the MKL library.
Dynamic scheduling may be OK on a single CPU with a shared last-level cache, but there's no reason to try it here.
The OpenMP directive stops the compiler from optimizing the loop nesting, so you must do that yourself, arranging the outer iterations so that threads work on distinct segments and don't touch the same cache lines. As it stands, you should time each parallel region separately; it looks like, even after you supply the private declarations and deal with the remaining race conditions, your transposes will not work nearly as well as the plain single-threaded Fortran transpose intrinsic.
It would be interesting to see the compiler duplicate the months of work that go into MKL for each new architecture.
As cpu_time adds up the time spent by all the threads, this total will almost certainly increase with the number of threads. You should use system_clock or omp_get_wtime to test the effectiveness of threaded parallelism.
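For example, a minimal wall-clock timing sketch (assuming the program is compiled with OpenMP support so that omp_lib is available; the routine timed is the one from the original post):
use omp_lib        ! provides omp_get_wtime; goes at the top of the program unit
real*8 t1, t2
t1 = omp_get_wtime()
call par_mat_mult_trans(A, B, c2, N)
t2 = omp_get_wtime()
print *, 'par_mat_mult_trans wall time =', t2 - t1, ' secs'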
Thanks very much for your comments.
>>1. Variables are shared by default, so the inner loop variables j and temp need to be declared private;
Tried that. Makes no difference in speed or accuracy of results. I understand that it SHOULD, and will remember, henceforth, to put such things in 'private'.
>>2. Dynamic scheduling has an overhead in time, so don't use it unless you expect loops to vary in the amount of work.
In repeated timings (correct ones; see Tim's post and my reply to it), use of STATIC is slightly (probably not significantly) slower than DYNAMIC.
>>3. Are you sure transposing A twice is faster than leaving it alone and swapping the index order in the multiplication loop?
Yes, but the differences are small in Visual Fortran. In C++ and C# the transpose is nearly twice as fast (noting, of course, that it's B that is transposed there).
This program was written for the sole purpose of learning to use OpenMP. The MKL/LAPACK dgemm is, indeed, nearly twice as fast as Fortran's matmul, which is, as shown in my original post, 2-3 times faster than my mat_mult_trans. Those LAPACK guys are real*8 professionals.
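For reference, a single dgemm call computes C = A*B directly, with no explicit transpose; a minimal sketch using the standard BLAS interface (linked against MKL):
! C := alpha*A*B + beta*C, with alpha=1 and beta=0; all leading dimensions are N here
call dgemm('N', 'N', N, N, N, 1.0D0, A, N, B, N, 0.0D0, C, N)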
Oooops ... that explains a LOT. I modified the code to call the MKL second() for timings, and things look much more sensible. Three consecutive runs on 1024x1024 give times (in secs.) of:
mat_mult_trans: 0.520 0.703 0.707
Fortran matmul: 0.246 0.246 0.246
par_mat_mult_trans: 0.180 0.215 0.223
This program was written for the sole purpose of learning OpenMP. I don't think I'm ready (yet) to put MKL/LAPACK out of business :)
Many thanks!!!
Don,
I tested your approach on another Fortran 95 compiler, which does not use OpenMP or vectorisation.
Your approach works!!
I rewrote your mat_mult_trans as summarised below:
subroutine mat_mul_trans (a, b, c, n)
  call transpose_a (a, n)
  call matmul_AtB (a, b, c, n)
  call transpose_a (a, n)
end subroutine mat_mul_trans

subroutine matmul_AtB (a, b, c, n)
  do i = 1, n
    do j = 1, n
      c(i,j) = vecsum (a(1,i), b(1,j), n)
    end do
  end do
end subroutine matmul_AtB
I found that for n=200 the run times (using system_clock) are about the same, but there is a significant improvement for n=1000.
I used vecsum to replace the inner loop addressing calculation. You could replace "vecsum" with the intrinsic dot_product, but I find that array sections can have an overhead. The effectiveness of this change might also be affected by the level of optimisation selected.
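vecsum itself is not shown in the summary above; a minimal sketch consistent with the call c(i,j) = vecsum(a(1,i), b(1,j), n) (not necessarily the exact routine) would be:
real*8 function vecsum (x, y, n)
  ! dot product of two vectors passed by their first elements
  integer n, k
  real*8 x(n), y(n)
  vecsum = 0.0D0
  do k = 1, n
    vecsum = vecsum + x(k)*y(k)
  end do
end function vecsum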
You could also test differing levels of optimisation to determine if the cache effect that you have identified can apply to ifort, as it does on some other fortran compilers.
Thanks for the idea.
John
>>Thanks for the idea.
It's not really mine; if you haven't already seen it, you might be interested in taking a look at:
http://lwn.net/Archives/GuestIndex/
Look under Drepper. There are nine parts dealing with "What Every Programmer Should Know About Memory"; Part 5 deals specifically with matrix multiplication.
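The cache-blocking idea from that series translates to Fortran along these lines; a minimal sketch, with the block size bs=64 an arbitrary illustrative choice that would need tuning:
! blocked (tiled) C = A*B, keeping the innermost loop over a column (column-major friendly)
integer, parameter :: bs = 64
C = 0.0D0
do jj = 1, N, bs
  do kk = 1, N, bs
    do j = jj, min(jj+bs-1, N)
      do k = kk, min(kk+bs-1, N)
        do i = 1, N
          C(i,j) = C(i,j) + A(i,k)*B(k,j)
        end do
      end do
    end do
  end do
end do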
By the way, just out of curiosity I looked at the MKL parallel dgemm. It does the 1024x1024 multiplication in about 0.05 secs!!
Regards,
Don Ritchie
