Re: Visual Fortran and Open MP

ekeom · ‎03-31-2008

Dear All,

I'm trying to use parallel computation with Visaul Intel Fortran +Open MP on matrices multiplication on a 2 core computer. Time computation does not change (not divide by about 2)!!! See souce code below.

Best regards,

Didace

program prod

use dfport

implicit none

integer :: n, i, j
parameter (n = 1000)

double precision :: time_end, time_begin

complex*16, dimension(:,:), allocatable :: a, b

allocate(a(n,n))
allocate(b(n,n))

a = dcmplx(0.d+00,0.d+00)
b = dcmplx(0.d+00,0.d+00)

do j=1,n

do i=1,n

a(i,j) = cmplx(rand(),rand())
b(i,j) = cmplx(rand(),rand())

enddo

enddo

call cpu_time(time_begin)

call prod_mat(a,b,n)

call cpu_time(time_end)

write(*,*)
write(*,*) ' CPU time : ', time_end -time_begin
write(*,*)

deallocate(a)
deallocate(b)

end program prod

!
! ---------------------------------------------------- !
!

subroutine prod_mat (a,b,n)

implicit none

integer, intent(in) :: n

integer :: i, j

complex*16, dimension(n,n) :: a, b

complex*16, dimension( : ), allocatable :: v

allocate(v(n))

do i=1,n

do j=1,n

v(j) = a(i,j)

enddo

! OMP PARALLEL DO SCHEDULE(DYNAMIC,2)

do j=1,n

a(i,j) = sum(v(:)*b(:,j))

enddo

! OMP PARALLEL END DO

enddo

deallocate(v)

end subroutine prod_mat

!
! ---------------------------------------------------- !
!

TimP · ‎03-31-2008

The difficulty of coming up with efficient new ways of performing matrix multiplication is among the motivations for use of a pre-built library, such as Intel MKL (included in ifort Professional). It should be improved, if you were to push the i loop inside the parallized j loop. Static scheduling, with large chunk, is probably more appropriate to the situation you quote. With OpenMP, the compiler is not entitled to perform such optimizations.
As you have written it, the operation of setting up v(:) is likely to be time consuming. Then, you have both threads writing to the same cache line (false sharing).
If you want the compiler to perform loop nesting optimizations for parallelism, the -Qparallel option may work better than OpenMP.
Questions about efficient OpenMP programming might be more appropriate to the Threading forum.

jimdempseyatthecove · ‎03-31-2008

Didace,

Consider the following change:

subroutine prod_mat (a,b,n)
 implicit none
 integer, intent(in) :: n
!$OMP PARALLEL
 call prod_mat_parallel(a,b,n)
!$OMP END PARALLEL
end subroutine prod_mat

subroutine prod_mat_parallel (a,b,n)
 use omp_lib
 implicit none
 integer, intent(in) :: n
 integer :: i, j
 integer :: iNumThreads, iThreadNum

 complex*16, dimension(n,n) :: a, b

 complex*16, dimension( : ), allocatable :: v

 allocate(v(n)) ! seperate instance of v per thread

 iNumThreads = OMP_GET_NUM_THREADS()
 if(iNumThreads .eq. 0) then
 iNumThreads = 1
 iThreadNum = 0
 else
 iThreadNum = OMP_GET_THREAD_NUM()
 endif

 do i=1+iThreadNum, n, iNumThreads
 do j=1,n
 v(j) = a(i,j)
 enddo
 do j=1,n
 a(i,j) = sum(v(:)*b(:,j))
 enddo
 enddo
 deallocate(v)
end subroutine prod_mat_parallel

Note,

The above code does not optimize the cache line usage (or avoid false sharing) but should give better performance. There are examples of tileing available on the internet.

Also, as a furtheroptimization, I wouild suggest that you move the allocation of array v to outside prod_mat_parallel (pass in as arg) and make arrays of array v static (one for each thread) and then only when v for thread is not allocated or size of v less than n would you perform the allocation or reallocation.

Something along the lines of the following untested code

module mod_prod_mat
 type type_mod_prod_mat_v
 complex*16, dimension( : ), allocatable :: v
 end type type_mod_prod_mat_v

 type(type_mod_prod_mat_v), dimension( : ), allocatable :: av
end module mod_prod_mat

subroutine prod_mat (a,b,n)
 use mod_prod_mat
 implicit none
 integer, intent(in) :: n
 integer :: iNumThreads, iThreadNum

 iNumThreads = max(OMP_GET_NUM_THREADS(),1)

 if(allocated(av)) then
 if(size(av) .lt. iNumThreads) then
 do iThreadNum=0, size(av)-1
 if(allocated(av(iThreadNum)%v)) deallocate(av(iThreadNum)%v)
 enddo
 deallocate av
 allocate(av(0:iNumThreads-1))
 do iThreadNum=0, iNumThreads-1

 allocate(av(iThreadNum)%v(n))
 enddo
 else
 if(size(av%v(0)) .lt. n) then
 do iThreadNum=0, iNumThreads-1
 deallocate(av(iThreadNum)%v)
 allocate(av(iThreadNum)%v(n))
 enddo
 endif
 else
 allocate(av(0:iNumThreads-1))
 do iThreadNum=0, iNumThreads-1
 allocate(av(iThreadNum)%v(n))
 enddo
 endif
 
!$OMP PARALLEL default(shared) private(iThreadNum)
 iThreadNum = OMP_GET_THREAD_NUM()
 call prod_mat_parallel(a,b,n,av(iThreadNum)%v,iNumThreads, iThreadNum)
!$OMP END PARALLEL
end subroutine prod_mat

subroutine prod_mat_parallel (a,b,n,v,iNumThreads, iThreadNum)
 use omp_lib
 implicit none
 integer, intent(in) :: n
 complex*16, dimension(n,n) :: a, b
 complex*16, dimension( : ), allocatable :: v
 integer :: iNumThreads, iThreadNum
 integer :: i, j

 do i=1+iThreadNum, n, iNumThreads
 do j=1,n
 v(j) = a(i,j)
 enddo
 do j=1,n
 a(i,j) = sum(v(:)*b(:,j))
 enddo
 enddo
end subroutine prod_mat_parallel

Jim Dempsey

jimdempseyatthecove · ‎03-31-2008

Oops, minor bug. use

 if(size(av%v(0)) .lt. n) then
 do iThreadNum=0, size(av)-1
 deallocate(av(iThreadNum)%v)
 allocate(av(iThreadNum)%v(n))
 enddo
 endif

That covers the case where you reduced the numbers of threads.

Also note, the example code is not suitable for nested parallel coding because OMP_GET_THREAD_NUM() returns thread team member number and not a global thread number.

Jim Dempsey

ekeom · ‎03-31-2008

Thank You very much.

Didace

izryu · ‎04-03-2008

You are using the wrong function to measure the efficiency of parallel code. cpu_time measures cumulative time for all processors. For example if your program runs for 1 min on a double CPU system with 100% load of both processors cpu_time will return 2 min (1 min on CPU1 + 1 min on CPU2). In fact the documentation about the behavior of cpu_time on multyprocessor system is vague and you are not the only one who was bitten by it.