Visual Fortran 2011 and OpenMP are pretty new to me; I've been using C++ and C# for parallel programming on my system: a Dell Studio XPS with an Intel i7 860 quad core running 64-bit Windows 7.
For a 1024x1024 (or larger) matrix multiplication test, I'm finding that a Fortran OpenMP routine runs considerably slower than the same sequential routine. The OpenMP routine appears to be properly using 8 threads when I look at Windows Task Manager during execution, and the results of the multiplication look OK.
Could someone please take a look at the following code and tell me if I'm using OpenMP correctly?
!****************************************************************************
!
! PROGRAM: openMP_speed_test
!
! PURPOSE: Test use of OpenMP
!
!****************************************************************************
subroutine mat_mult_trans(A,B,C,N)
  ! Use the transpose of A in the multiplication to make better use of cache
  integer N
  real*8 A(N,N),B(N,N),C(N,N),temp
  ! transpose A in place
  do i=1,N-1
    do j=i+1,N
      temp=A(i,j)
      A(i,j)=A(j,i)
      A(j,i)=temp
    end do
  end do
  ! do the multiplication
  do i=1,N
    do j=1,N
      temp=0.0D0
      do k=1,N
        temp=temp+A(k,i)*B(k,j)
      end do
      C(i,j)=temp
    end do
  end do
  ! restore A
  do i=1,N-1
    do j=i+1,N
      temp=A(i,j)
      A(i,j)=A(j,i)
      A(j,i)=temp
    end do
  end do
end subroutine
subroutine par_mat_mult_trans(A,B,C,N)
  ! Same as mat_mult_trans, but using OpenMP
  integer N,nthreads,TID,chunk
  integer OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
  real*8 A(N,N),B(N,N),C(N,N),temp
  chunk=32
!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK)
  TID = OMP_GET_THREAD_NUM()
  IF (TID .EQ. 0) THEN
    NTHREADS = OMP_GET_NUM_THREADS()
    PRINT *, 'Starting matrix multiply with ', NTHREADS,' threads'
  END IF
  ! transpose A in place
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
  do i=1,N-1
    do j=i+1,N
      temp=A(i,j)
      A(i,j)=A(j,i)
      A(j,i)=temp
    end do
  end do
  ! do the multiplication
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
  do i=1,N
    do j=1,N
      temp=0.0D0
      do k=1,N
        temp=temp+A(k,i)*B(k,j)
      end do
      C(i,j)=temp
    end do
  end do
  ! restore A
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
  do i=1,N-1
    do j=i+1,N
      temp=A(i,j)
      A(i,j)=A(j,i)
      A(j,i)=temp
    end do
  end do
!$OMP END PARALLEL
end subroutine
program openMP_speed_test
  integer N, max_threads
  real(kind=8), allocatable :: A(:,:),B(:,:),C(:,:),D(:)
  real(kind=8), allocatable :: c2(:,:)
  real*8 t1,t2,num,diff
  character(len=1) ans
  ans='y'
  do while((ans.eq.'y').or.(ans.eq.'Y'))
    print *,"Matrix size?"
    read *,N
    allocate(A(N,N),B(N,N),C(N,N),c2(N,N),D(N))
    ! Fill A and B with random values in (-1,1); zero the result arrays
    do i=1,N
      do j=1,N
        call random_number(num)
        if (num<0.5D0) then
          num=-1.0D0
        else
          num=1.0D0
        end if
        call random_number(A(i,j))
        A(i,j)=num*A(i,j)
        call random_number(num)
        if (num<0.5D0) then
          num=+1.0D0
        else
          num=-1.0D0
        end if
        call random_number(B(i,j))
        B(i,j)=num*B(i,j)
        C(i,j)=0.0D0
        c2(i,j)=0.0D0
      end do
    end do
    ! Run standard matrix multiplication using transpose(A) for better cache use
    print *,"Starting mat_mult_trans."
    call cpu_time(t1)
    call mat_mult_trans(A,B,C,N)
    call cpu_time(t2)
    t2=t2-t1
    print 120, t2
120 format("mat_mult_trans execution time=",f6.3," secs")
    ! Run the Fortran matmul intrinsic
    print *,"Starting Fortran matmul."
    call cpu_time(t1)
    c2=matmul(A,B)
    call cpu_time(t2)
    t2=t2-t1
    print 140, t2
140 format("canned matmul execution time=",f6.3," secs")
    ! Check results vs. 'standard'
    diff=0.0D0
    do i=1,N
      do j=1,N
        if (abs(C(i,j)-c2(i,j))>diff) then
          diff=abs(C(i,j)-c2(i,j))
        end if
      end do
    end do
    print 121, diff
121 format("Maximum error=",e14.6)
    ! Run OpenMP version of 'standard' with transpose(A)
    print *,"Starting par_mat_mult_trans"
    call cpu_time(t1)
    call par_mat_mult_trans(A,B,c2,N)
    call cpu_time(t2)
    t2=t2-t1
    print 131, t2
131 format("par_mat_mult_trans execution time=",f6.3," secs")
    ! Check results vs. 'standard'
    diff=0.0D0
    do i=1,N
      do j=1,N
        if (abs(C(i,j)-c2(i,j))>diff) then
          diff=abs(C(i,j)-c2(i,j))
        end if
      end do
    end do
    print 121, diff
    ! Check for more work
    print *,"More?"
    read *,ans
    deallocate(A,B,C,c2,D)
  end do
end program
!********************************************************************************
In addition to the speed question, the times appear very variable. For three successive runs on 1024x1024 matrices, I get the following times (in secs.):
mat_mult_trans: 0.546 0.671 0.686
Fortran matmul: 0.234 0.234 0.234
par_mat_mult_trans: 0.983 1.310 1.420
TIA for any help,
Don Ritchie
1. Variables are shared by default, so the inner loop variables j and temp need to be declared private; the SHARED clause you have is not needed (see the sketch after this list).
2. Dynamic scheduling has an overhead in time, so don't use it unless you expect loops to vary in the amount of work.
3. Are you sure transposing A twice is faster than leaving it alone and swapping the index order in the multiplication loop?
There will be many more suggestions, I am sure.
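To illustrate point 1, here is a sketch of how the multiplication loop's directive might look with the private declarations supplied (variable names are taken from the original post; SCHEDULE(STATIC) reflects point 2 and is an assumption, not a measured recommendation):
!$OMP PARALLEL DO PRIVATE(j,k,temp) SCHEDULE(STATIC)
do i=1,N
  do j=1,N
    temp=0.0D0
    do k=1,N
      temp=temp+A(k,i)*B(k,j)
    end do
    C(i,j)=temp
  end do
end do
!$OMP END PARALLEL DO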
However, your code is a waste of time except as a learning exercise, since Intel provide the highly optimised MKL routines to do this job, and they are parallelized. Even the intrinsic MATMUL could be used, and a compiler switch will make it call the MKL library.
Dynamic scheduling may be OK on a single CPU with a shared last-level cache, but there's no reason to try it here.
The OpenMP directive stops the compiler from optimizing the loop nesting, so you must do that yourself, arranging the outer iterations so that threads work on distinct segments and don't touch the same cache lines. As it stands, you should time each parallel region separately; it looks like, even after you supply the private declarations and deal with the remaining race conditions, your transposes will not work nearly as well as the plain single-threaded Fortran transpose intrinsic.
It would be interesting to see the compiler duplicate the months of work that go into MKL for each new architecture.
As cpu_time adds up the time spent by all the threads, this total will almost certainly increase with the number of threads. You should use system_clock or omp_get_wtime to test the effectiveness of threaded parallelism.
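For example, a minimal wall-clock timing sketch (assuming the program is compiled with OpenMP support so that omp_lib is available; the routine timed is the one from the original post):
use omp_lib        ! provides omp_get_wtime; goes at the top of the program unit
real*8 t1, t2
t1 = omp_get_wtime()
call par_mat_mult_trans(A, B, c2, N)
t2 = omp_get_wtime()
print *, 'par_mat_mult_trans wall time =', t2 - t1, ' secs'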
Thanks very much for your comments.
>>1. Variables are shared by default, so the inner loop variables j and temp need to be declared private;
Tried that. Makes no difference in speed or accuracy of results. I understand that it SHOULD, and will remember, henceforth, to put such things in 'private'.
>>2. Dynamic scheduling has an overhead in time, so don't use it unless you expect loops to vary in the amount of work.
In repeated timings (correct ones; see Tim's post and my reply to it), use of STATIC is slightly (probably not significantly) slower than DYNAMIC.
>>3. Are you sure transposing A twice is faster than leaving it alone and swapping the index order in the multiplication loop?
Yes, but the differences are small in Visual Fortran. In C++ and C# the transpose is nearly twice as fast (noting, of course, that it's B that is transposed there).
This program was written for the sole purpose of learning to use OpenMP. The MKL/LAPACK dgemm is, indeed, nearly twice as fast as Fortran's matmul, which is, as shown in my original post, 2-3 times faster than my mat_mult_trans. Those LAPACK guys are real*8 professionals.
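For reference, a single dgemm call computes C = A*B directly, with no explicit transpose; a minimal sketch using the standard BLAS interface (linked against MKL):
! C := alpha*A*B + beta*C, with alpha=1 and beta=0; all leading dimensions are N here
call dgemm('N', 'N', N, N, N, 1.0D0, A, N, B, N, 0.0D0, C, N)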
Oooops ... that explains a LOT. I modified the code to call the MKL second() for timings, and things look much more sensible. Three consecutive runs on 1024x1024 give times (in secs.) of:
mat_mult_trans: 0.520 0.703 0.707
Fortran matmul: 0.246 0.246 0.246
par_mat_mult_trans: 0.180 0.215 0.223
This program was written for the sole purpose of learning OpenMP. I don't think I'm ready (yet) to put MKL/LAPACK out of business :)
Many thanks!!!
Don,
I tested your approach on another Fortran 95 compiler, which does not use OpenMP or vectorisation.
Your approach works!!
I rewrote your mat_mult_trans as summarised below:
subroutine mat_mul_trans (a, b, c, n)
  call transpose_a (a, n)
  call matmul_AtB (a, b, c, n)
  call transpose_a (a, n)
end subroutine mat_mul_trans

subroutine matmul_AtB (a, b, c, n)
  do i = 1, n
    do j = 1, n
      c(i,j) = vecsum (a(1,i), b(1,j), n)
    end do
  end do
end subroutine matmul_AtB
I found that for n=200 the run times (using system_clock) are about the same, but there is a significant improvement for n=1000.
I used vecsum to replace the inner loop addressing calculation. You could replace "vecsum" with the intrinsic dot_product, but I find that array sections can have an overhead. The effectiveness of this change might also be affected by the level of optimisation selected.
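vecsum itself is not shown in the summary above; a minimal sketch consistent with the call c(i,j) = vecsum(a(1,i), b(1,j), n) (not necessarily the exact routine) would be:
real*8 function vecsum (x, y, n)
  ! dot product of two vectors passed by their first elements
  integer n, k
  real*8 x(n), y(n)
  vecsum = 0.0D0
  do k = 1, n
    vecsum = vecsum + x(k)*y(k)
  end do
end function vecsum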
You could also test differing levels of optimisation to determine if the cache effect that you have identified can apply to ifort, as it does on some other fortran compilers.
Thanks for the idea.
John
>>Thanks for the idea.
It's not really mine; if you haven't already seen it, you might be interested in taking a look at:
http://lwn.net/Archives/GuestIndex/
Look under Drepper. There are nine parts dealing with "What Every Programmer Should Know About Memory"; Part 5 deals specifically with matrix multiplication.
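The cache-blocking idea from that series translates to Fortran along these lines; a minimal sketch, with the block size bs=64 an arbitrary illustrative choice that would need tuning:
! blocked (tiled) C = A*B, keeping the innermost loop over a column (column-major friendly)
integer, parameter :: bs = 64
C = 0.0D0
do jj = 1, N, bs
  do kk = 1, N, bs
    do j = jj, min(jj+bs-1, N)
      do k = kk, min(kk+bs-1, N)
        do i = 1, N
          C(i,j) = C(i,j) + A(i,k)*B(k,j)
        end do
      end do
    end do
  end do
end do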
By the way, just out of curiosity I looked at the MKL parallel dgemm. It does the 1024x1024 multiplication in about 0.05 secs!!
Regards,
Don Ritchie
