Hi,
I would like to parallelize an inner loop with OpenMP.
The code is similar to the one below: a do loop that runs in parallel and contains a long subroutine, here replaced by dummy calculations (the goal of the parallelization is to speed up the time-consuming subroutine).
program main
   use omp_lib
   implicit none
   double precision :: wtime, a, b, c
   double precision, allocatable :: res(:)
   integer :: i, j, i_max, j_max
   i_max = 10000
   allocate(res(i_max))
   call omp_set_num_threads (4)
   wtime = omp_get_wtime ( )
!$omp parallel default(private) shared(res, i_max)
!$omp do
   do i = 1, i_max
      ! long subroutine, here replaced by dummy calculations
      a = real(i)
      b = real(i)**2
      c = b**2+a**2
      res(i) = real(c)
   end do
!$omp end do
!$omp end parallel
   wtime = omp_get_wtime ( ) - wtime
   print*, 'Elapsed time parallel simulation : ', wtime
   wtime = omp_get_wtime ( )
   do i = 1, i_max
      ! long subroutine, here replaced by dummy calculations
      a = real(i)
      b = real(i)**2
      c = b**2+a**2
      res(i) = real(c)
   end do
   wtime = omp_get_wtime ( ) - wtime
   print*, 'Elapsed time serial simulation : ', wtime
end program
What the program prints out is:
Elapsed time parallel simulation : 2.046999987214804E-003
Elapsed time serial simulation : 2.710008993744850E-005
That is the opposite of what I expected.
For the dummy code the compiler may have taken a shortcut that makes the sequential version faster, but a similar result occurs when I run the code with the real, long subroutine instead of the dummy 'res(i) = c': in that case the parallelization doesn't slow the code down as much, but it is still slower or, at best, shows no gain.
I compiled it with the Intel Fortran compiler in Microsoft Visual Studio.
Why is parallelization inefficient in my code? I could not find any documentation (or rules of thumb) on when code can be expected to gain from parallelization. Could it be a Microsoft Visual Studio project option instead?
Any help would be greatly appreciated.
Thanks a lot.
Matteo
You have a number of problems with this example:
1) omp_get_wtime may not be accurate enough. I use System_Clock with integer*8 arguments to get better precision.
2) Your loop does not have enough calculation to offset the OMP region overheads (~5E-6 seconds) in comparison to the serial loop. It is a common problem of trivial !$OMP examples that they don't demonstrate the basic requirement for !$OMP to improve elapsed-time performance: enough work per parallel region to hide the thread start-up cost.
3) Don't use so much optimisation that it trivialises your test loop.
4) I introduced count_id to confirm the expected thread usage.
I modified your example to vary the loop workload and, hopefully, demonstrate improved performance in some cases. Results can vary with the processor and with compiler optimisation.
program main
   use omp_lib
   implicit none
   double precision :: wtime, a, b, c
   double precision, allocatable :: res(:)
   integer :: i, j, i_max, j_max, N, id, count_id(0:11)
   double precision, external :: delta_seconds
   !
   call omp_set_num_threads (4)
   !
   do j = 1,2
      if (j==1) j_max = 1       ! not enough work
      if (j==2) j_max = 10000   ! more work
      DO N = 1,3
         i_max = 10000 * 10**(n-1)
         allocate(res(i_max))
         write (*,*) 'i_max, j_max =',i_max, j_max
         !
         count_id = 0
         wtime = delta_seconds ( )
!$omp parallel do default(private) shared(res, i_max, count_id)
         do i = 1, i_max
            ! long subroutine, here replaced by dummy calculations
            id = omp_get_thread_num()
            count_id(id) = count_id(id) + 1
            call do_more_work (j_max, a)
            a = real(i)
            b = real(i)**2
            c = b**2+a**2
            res(i) = real(c)
         end do
!$omp end parallel do
         wtime = delta_seconds ( )
         print*, 'Elapsed time parallel simulation : ', wtime, count_id(0)
         count_id = 0
         wtime = delta_seconds ( )
         do i = 1, i_max
            ! long subroutine, here replaced by dummy calculations
            id = omp_get_thread_num()
            count_id(id) = count_id(id) + 1
            call do_more_work (j_max, a)
            a = real(i)
            b = real(i)**2
            c = b**2+a**2
            res(i) = real(c)
         end do
         wtime = delta_seconds ( )
         print*, 'Elapsed time serial simulation : ', wtime, count_id(0)
         deallocate(res)
         !
      END DO ! N
   end do ! j
end program

double precision function Delta_Seconds ()
   integer*8 :: clock1 = 0
   integer*8 :: clock2, rate
   double precision :: dt
   CALL SYSTEM_CLOCK (clock2,rate)
   dt = (clock2-clock1)/DBLE(rate)
   clock1 = clock2
   Delta_Seconds = dt
end function Delta_Seconds

subroutine do_more_work (j_max, a)
   double precision :: a, b
   integer :: i, j_max
   a = 1.0
   do i = 1,j_max
      b = log(a)
      a = a + b
   end do
end subroutine do_more_work
My apologies: I did not declare j_max as shared. (I don't normally use "default(private)", preferring to declare every variable explicitly as private or shared.)
I made some other minor changes to compare the serial and parallel results.
Now, for small i_max and j_max, parallel is slower than serial, but for an increased workload with larger i_max and j_max, parallel is faster.
The revised example is:
program main
   use omp_lib
   implicit none
   double precision :: wtime, a, b, c, so, ss
   double precision, allocatable :: res(:)
   integer :: i, j, i_max, j_max, N, id, count_id(0:11)
   double precision, external :: delta_seconds
   !
   call omp_set_num_threads (4)
   !
   do j = 1,2
      if (j==1) j_max = 1      ! not enough work
      if (j==2) j_max = 1000   ! more work, but not too much
      DO N = 1,3
         i_max = 1000 * 10**(n-1)
         allocate(res(i_max))
         write (*,10) 'i_max, j_max = ',i_max, j_max
         !
         count_id = 0
         so = 0
         wtime = delta_seconds ( )
!$omp parallel do default(private) shared(res, i_max, j_max, count_id) REDUCTION(+ : so)
         do i = 1, i_max
            ! long subroutine, here replaced by dummy calculations
            id = omp_get_thread_num()
            count_id(id) = count_id(id) + 1
            call do_more_work (j_max, a)
            b = real(i)**2
            c = sqrt(b)+a**2
            res(i) = real(c)
            so = so + c
         end do
!$omp end parallel do
         wtime = delta_seconds ( )
         write (*,11) 'Elapsed time parallel simulation : ', wtime, count_id(0), so ! report sum
         count_id = 0
         ss = 0
         wtime = delta_seconds ( )
         do i = 1, i_max
            ! long subroutine, here replaced by dummy calculations
            id = omp_get_thread_num()
            count_id(id) = count_id(id) + 1
            call do_more_work (j_max, a)
            b = real(i)**2
            c = sqrt(b)+a**2
            res(i) = real(c)
            ss = ss + c
         end do
         wtime = delta_seconds ( )
         write (*,11) 'Elapsed time serial simulation : ', wtime, count_id(0), ss, so-ss ! compare sums
         deallocate(res)
         !
      END DO ! N
   end do ! j
10 format (/a,i0,1x,i0)
11 format (a,f10.6,1x,i8,2es12.4)
end program

double precision function Delta_Seconds ()
   integer*8 :: clock1 = 0
   integer*8 :: clock2, rate
   double precision :: dt
   CALL SYSTEM_CLOCK (clock2,rate)
   dt = (clock2-clock1)/DBLE(rate)
   clock1 = clock2
   Delta_Seconds = dt
end function Delta_Seconds

subroutine do_more_work (j_max, a)
   double precision :: a, b
   integer :: i, j_max
   a = 1.1 ! b now varies through loop
   do i = 1,j_max
      b = log(a)
      a = a + b
   end do
   a = b
end subroutine do_more_work
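For reference, the explicit-clause style mentioned above (no default(private); every variable listed as private or shared) looks like the minimal sketch below. This is a hypothetical stripped-down program for illustration, not the full test above; the variable names follow the example. Note that the loop index i is made private automatically by the do directive, so only id and a need to be listed.

program explicit_clauses
   use omp_lib
   implicit none
   double precision :: a, so
   double precision, allocatable :: res(:)
   integer :: i, id, i_max, count_id(0:11)
   i_max = 1000
   allocate(res(i_max))
   count_id = 0
   so = 0
   call omp_set_num_threads (4)
   ! every variable named explicitly: id and a get private copies,
   ! res, i_max and count_id are shared, so is accumulated by reduction
!$omp parallel do private(id, a) shared(res, i_max, count_id) reduction(+ : so)
   do i = 1, i_max
      id = omp_get_thread_num()
      count_id(id) = count_id(id) + 1 ! each thread writes only its own element
      a = real(i)
      res(i) = a**2
      so = so + res(i)
   end do
!$omp end parallel do
   print *, 'sum =', so
end program explicit_clauses

If the compiler reports an undeclared variable inside the parallel region under this style, that is exactly the kind of omission (like the missing shared j_max above) that default(private) silently hides.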