Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Visual Fortran and Open MP

ekeom
Novice
432 Views
Dear All,

I'm trying to use parallel computation with Visaul Intel Fortran +Open MP on matrices multiplication on a 2 core computer. Time computation does not change (not divide by about 2)!!! See souce code below.

Best regards,

Didace



program prod

use dfport

implicit none

integer :: n, i, j
parameter (n = 1000)

double precision :: time_end, time_begin

complex*16, dimension(:,:), allocatable :: a, b

allocate(a(n,n))
allocate(b(n,n))

a = dcmplx(0.d+00,0.d+00)
b = dcmplx(0.d+00,0.d+00)

do j=1,n

do i=1,n

a(i,j) = cmplx(rand(),rand())
b(i,j) = cmplx(rand(),rand())

enddo

enddo

call cpu_time(time_begin)

call prod_mat(a,b,n)

call cpu_time(time_end)

write(*,*)
write(*,*) ' CPU time : ', time_end -time_begin
write(*,*)

deallocate(a)
deallocate(b)

end program prod

!
! ---------------------------------------------------- !
!

subroutine prod_mat (a,b,n)

implicit none

integer, intent(in) :: n

integer :: i, j

complex*16, dimension(n,n) :: a, b

complex*16, dimension( : ), allocatable :: v

allocate(v(n))

do i=1,n

do j=1,n

v(j) = a(i,j)

enddo

! OMP PARALLEL DO SCHEDULE(DYNAMIC,2)

do j=1,n

a(i,j) = sum(v(:)*b(:,j))

enddo

! OMP PARALLEL END DO

enddo

deallocate(v)

end subroutine prod_mat

!
! ---------------------------------------------------- !
!

0 Kudos
5 Replies
TimP
Honored Contributor III
432 Views
The difficulty of coming up with efficient new ways of performing matrix multiplication is among the motivations for use of a pre-built library, such as Intel MKL (included in ifort Professional). It should be improved, if you were to push the i loop inside the parallized j loop. Static scheduling, with large chunk, is probably more appropriate to the situation you quote. With OpenMP, the compiler is not entitled to perform such optimizations.
As you have written it, the operation of setting up v(:) is likely to be time consuming. Then, you have both threads writing to the same cache line (false sharing).
If you want the compiler to perform loop nesting optimizations for parallelism, the -Qparallel option may work better than OpenMP.
Questions about efficient OpenMP programming might be more appropriate to the Threading forum.
0 Kudos
jimdempseyatthecove
Honored Contributor III
432 Views

Didace,

Consider the following change:

subroutine prod_mat (a,b,n)
implicit none
integer, intent(in) :: n
!$OMP PARALLEL
call prod_mat_parallel(a,b,n)
!$OMP END PARALLEL
end subroutine prod_mat
subroutine prod_mat_parallel (a,b,n)
use omp_lib
implicit none
integer, intent(in) :: n
integer :: i, j
integer :: iNumThreads, iThreadNum
 complex*16, dimension(n,n) :: a, b
 complex*16, dimension( : ), allocatable :: v
 allocate(v(n)) ! seperate instance of v per thread
 iNumThreads = OMP_GET_NUM_THREADS()
if(iNumThreads .eq. 0) then
iNumThreads = 1
iThreadNum = 0
else
iThreadNum = OMP_GET_THREAD_NUM()
endif
 do i=1+iThreadNum, n, iNumThreads
do j=1,n
v(j) = a(i,j)
enddo
do j=1,n
a(i,j) = sum(v(:)*b(:,j))
enddo
enddo
deallocate(v)
end subroutine prod_mat_parallel

Note,

The above code does not optimize the cache line usage (or avoid false sharing) but should give better performance. There are examples of tileing available on the internet.

Also, as a furtheroptimization, I wouild suggest that you move the allocation of array v to outside prod_mat_parallel (pass in as arg) and make arrays of array v static (one for each thread) and then only when v for thread is not allocated or size of v less than n would you perform the allocation or reallocation.

Something along the lines of the following untested code

module mod_prod_mat
type type_mod_prod_mat_v
complex*16, dimension( : ), allocatable :: v
end type type_mod_prod_mat_v
 type(type_mod_prod_mat_v), dimension( : ), allocatable :: av
end module mod_prod_mat
subroutine prod_mat (a,b,n)
use mod_prod_mat
implicit none
integer, intent(in) :: n
integer :: iNumThreads, iThreadNum
 iNumThreads = max(OMP_GET_NUM_THREADS(),1)
 if(allocated(av)) then
if(size(av) .lt. iNumThreads) then
do iThreadNum=0, size(av)-1
if(allocated(av(iThreadNum)%v)) deallocate(av(iThreadNum)%v)
enddo
deallocate av
allocate(av(0:iNumThreads-1))
do iThreadNum=0, iNumThreads-1
allocate(av(iThreadNum)%v(n))
enddo
else
if(size(av%v(0)) .lt. n) then
do iThreadNum=0, iNumThreads-1
deallocate(av(iThreadNum)%v)
allocate(av(iThreadNum)%v(n))
enddo
endif
else
allocate(av(0:iNumThreads-1))
do iThreadNum=0, iNumThreads-1
allocate(av(iThreadNum)%v(n))
enddo
endif

!$OMP PARALLEL default(shared) private(iThreadNum)
iThreadNum = OMP_GET_THREAD_NUM()
call prod_mat_parallel(a,b,n,av(iThreadNum)%v,iNumThreads, iThreadNum)
!$OMP END PARALLEL
end subroutine prod_mat
subroutine prod_mat_parallel (a,b,n,v,iNumThreads, iThreadNum)
use omp_lib
implicit none
integer, intent(in) :: n
complex*16, dimension(n,n) :: a, b
complex*16, dimension( : ), allocatable :: v
integer :: iNumThreads, iThreadNum
integer :: i, j
 do i=1+iThreadNum, n, iNumThreads
do j=1,n
v(j) = a(i,j)
enddo
do j=1,n
a(i,j) = sum(v(:)*b(:,j))
enddo
enddo
end subroutine prod_mat_parallel

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
432 Views

Oops, minor bug. use

 if(size(av%v(0)) .lt. n) then
do iThreadNum=0, size(av)-1
deallocate(av(iThreadNum)%v)
allocate(av(iThreadNum)%v(n))
enddo
endif

That covers the case where you reduced the numbers of threads.

Also note, the example code is not suitable for nested parallel coding because OMP_GET_THREAD_NUM() returns thread team member number and not a global thread number.

Jim Dempsey

0 Kudos
ekeom
Novice
432 Views
Thank You very much.

Didace
0 Kudos
izryu
Beginner
432 Views
You are using the wrong function to measure the efficiency of parallel code. cpu_time measures cumulative time for all processors. For example if your program runs for 1 min on a double CPU system with 100% load of both processors cpu_time will return 2 min (1 min on CPU1 + 1 min on CPU2). In fact the documentation about the behavior of cpu_time on multyprocessor system is vague and you are not the only one who was bitten by it.
0 Kudos
Reply