- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'm trying to use parallel computation with Visaul Intel Fortran +Open MP on matrices multiplication on a 2 core computer. Time computation does not change (not divide by about 2)!!! See souce code below.
Best regards,
Didace
program prod
use dfport
implicit none
integer :: n, i, j
parameter (n = 1000)
double precision :: time_end, time_begin
complex*16, dimension(:,:), allocatable :: a, b
allocate(a(n,n))
allocate(b(n,n))
a = dcmplx(0.d+00,0.d+00)
b = dcmplx(0.d+00,0.d+00)
do j=1,n
do i=1,n
a(i,j) = cmplx(rand(),rand())
b(i,j) = cmplx(rand(),rand())
enddo
enddo
call cpu_time(time_begin)
call prod_mat(a,b,n)
call cpu_time(time_end)
write(*,*)
write(*,*) ' CPU time : ', time_end -time_begin
write(*,*)
deallocate(a)
deallocate(b)
end program prod
!
! ---------------------------------------------------- !
!
subroutine prod_mat (a,b,n)
implicit none
integer, intent(in) :: n
integer :: i, j
complex*16, dimension(n,n) :: a, b
complex*16, dimension( : ), allocatable :: v
allocate(v(n))
do i=1,n
do j=1,n
v(j) = a(i,j)
enddo
! OMP PARALLEL DO SCHEDULE(DYNAMIC,2)
do j=1,n
a(i,j) = sum(v(:)*b(:,j))
enddo
! OMP PARALLEL END DO
enddo
deallocate(v)
end subroutine prod_mat
!
! ---------------------------------------------------- !
!
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
As you have written it, the operation of setting up v(:) is likely to be time consuming. Then, you have both threads writing to the same cache line (false sharing).
If you want the compiler to perform loop nesting optimizations for parallelism, the -Qparallel option may work better than OpenMP.
Questions about efficient OpenMP programming might be more appropriate to the Threading forum.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Didace,
Consider the following change:
subroutine prod_mat (a,b,n)
implicit none
integer, intent(in) :: n
!$OMP PARALLEL
call prod_mat_parallel(a,b,n)
!$OMP END PARALLEL
end subroutine prod_mat
subroutine prod_mat_parallel (a,b,n)
use omp_lib
implicit none
integer, intent(in) :: n
integer :: i, j
integer :: iNumThreads, iThreadNum
complex*16, dimension(n,n) :: a, b
complex*16, dimension( : ), allocatable :: v
allocate(v(n)) ! seperate instance of v per thread
iNumThreads = OMP_GET_NUM_THREADS()
if(iNumThreads .eq. 0) then
iNumThreads = 1
iThreadNum = 0
else
iThreadNum = OMP_GET_THREAD_NUM()
endif
do i=1+iThreadNum, n, iNumThreads
do j=1,n
v(j) = a(i,j)
enddo
do j=1,n
a(i,j) = sum(v(:)*b(:,j))
enddo
enddo
deallocate(v)
end subroutine prod_mat_parallel
Note,
The above code does not optimize the cache line usage (or avoid false sharing) but should give better performance. There are examples of tileing available on the internet.
Also, as a furtheroptimization, I wouild suggest that you move the allocation of array v to outside prod_mat_parallel (pass in as arg) and make arrays of array v static (one for each thread) and then only when v for thread is not allocated or size of v less than n would you perform the allocation or reallocation.
Something along the lines of the following untested code
module mod_prod_mat
type type_mod_prod_mat_v
complex*16, dimension( : ), allocatable :: v
end type type_mod_prod_mat_v
type(type_mod_prod_mat_v), dimension( : ), allocatable :: av
end module mod_prod_mat
subroutine prod_mat (a,b,n)
use mod_prod_mat
implicit none
integer, intent(in) :: n
integer :: iNumThreads, iThreadNum
iNumThreads = max(OMP_GET_NUM_THREADS(),1)
if(allocated(av)) then
if(size(av) .lt. iNumThreads) then
do iThreadNum=0, size(av)-1
if(allocated(av(iThreadNum)%v)) deallocate(av(iThreadNum)%v)
enddo
deallocate av
allocate(av(0:iNumThreads-1))
do iThreadNum=0, iNumThreads-1
allocate(av(iThreadNum)%v(n))
enddo
else
if(size(av%v(0)) .lt. n) then
do iThreadNum=0, iNumThreads-1
deallocate(av(iThreadNum)%v)
allocate(av(iThreadNum)%v(n))
enddo
endif
else
allocate(av(0:iNumThreads-1))
do iThreadNum=0, iNumThreads-1
allocate(av(iThreadNum)%v(n))
enddo
endif
!$OMP PARALLEL default(shared) private(iThreadNum)
iThreadNum = OMP_GET_THREAD_NUM()
call prod_mat_parallel(a,b,n,av(iThreadNum)%v,iNumThreads, iThreadNum)
!$OMP END PARALLEL
end subroutine prod_mat
subroutine prod_mat_parallel (a,b,n,v,iNumThreads, iThreadNum)
use omp_lib
implicit none
integer, intent(in) :: n
complex*16, dimension(n,n) :: a, b
complex*16, dimension( : ), allocatable :: v
integer :: iNumThreads, iThreadNum
integer :: i, j
do i=1+iThreadNum, n, iNumThreads
do j=1,n
v(j) = a(i,j)
enddo
do j=1,n
a(i,j) = sum(v(:)*b(:,j))
enddo
enddo
end subroutine prod_mat_parallel
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Oops, minor bug. use
if(size(av%v(0)) .lt. n) then
do iThreadNum=0, size(av)-1
deallocate(av(iThreadNum)%v)
allocate(av(iThreadNum)%v(n))
enddo
endif
That covers the case where you reduced the numbers of threads.
Also note, the example code is not suitable for nested parallel coding because OMP_GET_THREAD_NUM() returns thread team member number and not a global thread number.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Didace
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page