Hi,
I would like to parallelize an inner loop with OpenMP.
The code is similar to the one below: a do loop that runs in parallel and contains a long subroutine, here replaced by dummy calculations (the goal of the parallelization is to speed up the time-consuming subroutine).
```fortran
program main
  use omp_lib
  implicit none
  double precision :: wtime, a, b, c
  double precision, allocatable :: res(:)
  integer :: i, j, i_max, j_max

  i_max = 10000
  allocate(res(i_max))
  call omp_set_num_threads (4)

  wtime = omp_get_wtime ( )
  !$omp parallel default(private) shared(res, i_max)
  !$omp do
  do i = 1, i_max
     ! long subroutine, here replaced by dummy calculations
     a = real(i)
     b = real(i)**2
     c = b**2 + a**2
     res(i) = real(c)
  end do
  !$omp end do
  !$omp end parallel
  wtime = omp_get_wtime ( ) - wtime
  print *, 'Elapsed time parallel simulation : ', wtime

  wtime = omp_get_wtime ( )
  do i = 1, i_max
     ! long subroutine, here replaced by dummy calculations
     a = real(i)
     b = real(i)**2
     c = b**2 + a**2
     res(i) = real(c)
  end do
  wtime = omp_get_wtime ( ) - wtime
  print *, 'Elapsed time serial simulation : ', wtime
end program
```
What the program prints out is:
Elapsed time parallel simulation : 2.046999987214804E-003
Elapsed time serial simulation : 2.710008993744850E-005
That is the opposite of what I'd expect.
For the dummy code the compiler could have taken some shortcut that makes the sequential loop faster, but I see a similar result when I run the code with the real long subroutine instead of the dummy 'res(i) = c': in that case the parallelization doesn't slow the code down as much, but it is still slower or, at best, shows no gain.
I compiled it with the Intel Fortran compiler in Microsoft Visual Studio.
Why is parallelization inefficient in my code? I could not find any documentation (or rules of thumb) on whether a code can gain from parallelization. Could it be a Microsoft Visual Studio project option instead?
I would greatly appreciate your help.
Thanks a lot.
Matteo
You have a number of problems with this example:
1) omp_get_wtime may not be accurate enough. I use System_Clock with integer*8 arguments to get better precision.
2) Your loop does not have enough calculation to offset the OMP region overhead (~5E-6 seconds) in comparison to the serial loop. It is a common problem of trivial !$OMP examples that they don't demonstrate the basic requirement for !$OMP to improve wall-clock performance: enough work per parallel region.
3) Don't use so much optimisation that it trivialises your test loop.
4) I introduced count_id to confirm the expected thread usage.
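As a minimal illustration of point 1, here is a self-contained sketch of timing with System_Clock and integer*8 arguments (the loop body is just placeholder work; variable names are mine, not from the example below):

```fortran
program timing_sketch
  implicit none
  integer*8 :: clock1, clock2, rate   ! 64-bit counts give a finer tick resolution
  double precision :: dt, s
  integer :: i

  call system_clock (clock1, rate)    ! rate = counts per second for this kind
  s = 0.d0
  do i = 1, 1000000                   ! placeholder work to be timed
     s = s + sqrt(dble(i))
  end do
  call system_clock (clock2)
  dt = dble(clock2 - clock1) / dble(rate)
  print *, 'elapsed seconds =', dt, '  checksum =', s
end program timing_sketch
```

With integer*8 arguments, SYSTEM_CLOCK typically reports a much higher count rate than with default integers, which is what makes it usable for timing regions of only a few microseconds.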
I modified your example to vary the loop workload and, hopefully, demonstrate improved performance in some cases. Results can vary with the processor and with compiler optimisation.
```fortran
program main
  use omp_lib
  implicit none
  double precision :: wtime, a, b, c
  double precision, allocatable :: res(:)
  integer :: i, j, i_max, j_max, N, id, count_id(0:11)
  double precision, external :: delta_seconds
!
  call omp_set_num_threads (4)
!
  do j = 1, 2
     if (j==1) j_max = 1      ! not enough work
     if (j==2) j_max = 10000  ! more work
     DO N = 1, 3
        i_max = 10000 * 10**(n-1)
        allocate(res(i_max))
        write (*,*) 'i_max, j_max =', i_max, j_max
!
        count_id = 0
        wtime = delta_seconds ( )
  !$omp parallel do default(private) shared(res, i_max, count_id)
        do i = 1, i_max
           ! long subroutine, here replaced by dummy calculations
           id = omp_get_thread_num()
           count_id(id) = count_id(id) + 1
           call do_more_work (j_max, a)
           a = real(i)
           b = real(i)**2
           c = b**2 + a**2
           res(i) = real(c)
        end do
  !$omp end parallel do
        wtime = delta_seconds ( )
        print *, 'Elapsed time parallel simulation : ', wtime, count_id(0)

        count_id = 0
        wtime = delta_seconds ( )
        do i = 1, i_max
           ! long subroutine, here replaced by dummy calculations
           id = omp_get_thread_num()
           count_id(id) = count_id(id) + 1
           call do_more_work (j_max, a)
           a = real(i)
           b = real(i)**2
           c = b**2 + a**2
           res(i) = real(c)
        end do
        wtime = delta_seconds ( )
        print *, 'Elapsed time serial simulation : ', wtime, count_id(0)
        deallocate(res)
!
     END DO ! N
  end do ! j
end program

double precision function Delta_Seconds ()
  integer*8 :: clock1 = 0
  integer*8 :: clock2, rate
  double precision :: dt
  CALL SYSTEM_CLOCK (clock2, rate)
  dt = (clock2 - clock1) / DBLE(rate)
  clock1 = clock2
  Delta_Seconds = dt
end function Delta_Seconds

subroutine do_more_work (j_max, a)
  double precision :: a, b
  integer :: i, j_max
  a = 1.0
  do i = 1, j_max
     b = log(a)
     a = a + b
  end do
end subroutine do_more_work
```
My apologies: I did not declare j_max as shared. (I don't normally use "default(private)"; I prefer to declare all variables explicitly as private or shared.)
I made some other minor changes to compare the serial and parallel results.
Now, for small i_max and j_max, parallel is slower than serial, but for the increased workload with larger i_max and j_max, parallel is faster.
The revised example is:
```fortran
program main
  use omp_lib
  implicit none
  double precision :: wtime, a, b, c, so, ss
  double precision, allocatable :: res(:)
  integer :: i, j, i_max, j_max, N, id, count_id(0:11)
  double precision, external :: delta_seconds
!
  call omp_set_num_threads (4)
!
  do j = 1, 2
     if (j==1) j_max = 1     ! not enough work
     if (j==2) j_max = 1000  ! more work, but not too much
     DO N = 1, 3
        i_max = 1000 * 10**(n-1)
        allocate(res(i_max))
        write (*,10) 'i_max, j_max = ', i_max, j_max
!
        count_id = 0
        so = 0
        wtime = delta_seconds ( )
  !$omp parallel do default(private) shared(res, i_max, j_max, count_id) REDUCTION(+ : so)
        do i = 1, i_max
           ! long subroutine, here replaced by dummy calculations
           id = omp_get_thread_num()
           count_id(id) = count_id(id) + 1
           call do_more_work (j_max, a)
           b = real(i)**2
           c = sqrt(b) + a**2
           res(i) = real(c)
           so = so + c
        end do
  !$omp end parallel do
        wtime = delta_seconds ( )
        write (*,11) 'Elapsed time parallel simulation : ', wtime, count_id(0), so     ! report sum

        count_id = 0
        ss = 0
        wtime = delta_seconds ( )
        do i = 1, i_max
           ! long subroutine, here replaced by dummy calculations
           id = omp_get_thread_num()
           count_id(id) = count_id(id) + 1
           call do_more_work (j_max, a)
           b = real(i)**2
           c = sqrt(b) + a**2
           res(i) = real(c)
           ss = ss + c
        end do
        wtime = delta_seconds ( )
        write (*,11) 'Elapsed time serial simulation : ', wtime, count_id(0), ss, so-ss  ! compare sums
        deallocate(res)
!
     END DO ! N
  end do ! j

10 format (/a,i0,1x,i0)
11 format (a,f10.6,1x,i8,2es12.4)
end program

double precision function Delta_Seconds ()
  integer*8 :: clock1 = 0
  integer*8 :: clock2, rate
  double precision :: dt
  CALL SYSTEM_CLOCK (clock2, rate)
  dt = (clock2 - clock1) / DBLE(rate)
  clock1 = clock2
  Delta_Seconds = dt
end function Delta_Seconds

subroutine do_more_work (j_max, a)
  double precision :: a, b
  integer :: i, j_max
  a = 1.1       ! b now varies through loop
  do i = 1, j_max
     b = log(a)
     a = a + b
  end do
  a = b
end subroutine do_more_work
```
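For reference, the explicit clause style mentioned above can be sketched like this; with default(none) the compiler rejects any variable that is not listed, so nothing is shared or privatised by accident. This is only a fragment showing the directive form (the variable names follow the example above):

```fortran
  !$omp parallel do default(none) &
  !$omp shared(res, i_max, j_max, count_id) &
  !$omp private(id, a, b, c) reduction(+ : so)
  do i = 1, i_max
     ! the loop index i is predetermined private
     id = omp_get_thread_num()
     count_id(id) = count_id(id) + 1
     call do_more_work (j_max, a)
     b = real(i)**2
     c = sqrt(b) + a**2
     res(i) = real(c)
     so = so + c
  end do
  !$omp end parallel do
```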
