Intel® Fortran Compiler

Do loop takes no advantage of OpenMP parallelization

tamborski__matteo

Hi, 

I would like to parallelize an inner loop with OpenMP.

The code is similar to the one below: a do loop that runs in parallel and contains a long subroutine, here replaced by dummy calculations (the goal of the parallelization is to speed up that time-consuming subroutine).

  program main
      use omp_lib
      implicit none
        
        double precision :: wtime, a, b, c
        double precision, allocatable :: res(:)
        integer :: i, j, i_max, j_max
        
        i_max = 10000
        allocate(res(i_max))
        
        call omp_set_num_threads (4)
        wtime = omp_get_wtime ( )
        
        !$omp parallel default(private) shared(res, i_max)
        !$omp do
        do i = 1, i_max
            ! long subroutine, here replaced by dummy calculations
            a = real(i)
            b = real(i)**2
            c = b**2+a**2
            res(i) = real(c)
        end do
        !$omp end do
        !$omp end parallel
        
        wtime = omp_get_wtime ( ) - wtime
        print*, 'Elapsed time parallel simulation : ', wtime
        
        wtime = omp_get_wtime ( )
        do i = 1, i_max
            ! long subroutine, here replaced by dummy calculations
            a = real(i)
            b = real(i)**2
            c = b**2+a**2
            res(i) = real(c)
        end do
        
        wtime = omp_get_wtime ( ) - wtime
        print*, 'Elapsed time serial simulation   : ', wtime
        
    end program

What the program prints out is:

Elapsed time parallel simulation :   2.046999987214804E-003
 Elapsed time serial simulation   :   2.710008993744850E-005

That is the opposite of what I'd expect.

For the dummy code the compiler may well have taken shortcuts that make the sequential version this fast, but I see a similar result when I run the code with the real, long subroutine instead of the dummy calculations: in that case the parallel version is not slowed down as much, but it is still slower than the serial one or, at best, shows no gain.

I compiled it with the Intel Fortran compiler in Microsoft Visual Studio.

Why is parallelization inefficient in my code? I could not find any documentation (or rules of thumb) on when a code gains from parallelization. Could it instead be a Microsoft Visual Studio project option?

 

I would greatly appreciate any help.

Thanks a lot.

Matteo

 

 

 

John_Campbell
New Contributor II

You have a number of problems with this example:

1) omp_get_wtime may not be accurate enough. I use SYSTEM_CLOCK with integer*8 arguments to get better precision (the Delta_Seconds function below).

2) Your loop does not have enough calculation to offset the OpenMP region overhead (roughly 5E-6 seconds) compared with the serial loop. It is a common problem with trivial !$OMP examples that they don't contain enough work per iteration to demonstrate the basic requirement for !$OMP to improve elapsed time (see the small sketch after this list).

3) Don't use so much optimisation that the compiler trivialises your test loop.

4) I introduced count_id to confirm the expected thread usage.
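
As an aside, a rough way to estimate that per-region overhead on your own machine is to time a large number of empty parallel regions and take the average; summing over many repetitions also makes the total long enough for omp_get_wtime to resolve. A minimal sketch (the program name and repetition count are my own choice, and the value you measure depends on the runtime and core count):

  program region_overhead
      use omp_lib
      implicit none
        integer, parameter :: reps = 10000
        integer :: k
        double precision :: t0

        call omp_set_num_threads (4)
        t0 = omp_get_wtime ( )
        do k = 1, reps
        !$omp parallel
           continue       ! empty region: we only pay the fork/join cost
        !$omp end parallel
        end do
        t0 = omp_get_wtime ( ) - t0
        print*, 'Average cost of one empty parallel region (s): ', t0/reps
    end program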

I modified your example to vary the loop workload and, hopefully, demonstrate improved performance in some cases. The outcome can vary with the processor and with compiler optimisation.

  program main
      use omp_lib
      implicit none
        
        double precision :: wtime, a, b, c
        double precision, allocatable :: res(:)
        integer :: i, j, i_max, j_max, N, id, count_id(0:11)
        double precision, external :: delta_seconds
!        
        call omp_set_num_threads (4)
!
        do j = 1,2
          if (j==1) j_max = 1         ! not enough work
          if (j==2) j_max = 10000     ! more work
          
          DO N = 1,3
            i_max = 10000 * 10**(n-1)
            allocate(res(i_max))
              write (*,*) 'i_max, j_max =',i_max, j_max
!
            count_id = 0
            wtime = delta_seconds ( )
         !$omp parallel do default(private) shared(res, i_max, count_id)
            do i = 1, i_max
                ! long subroutine, here replaced by dummy calculations
                id = omp_get_thread_num()
                count_id(id) = count_id(id) + 1
                call do_more_work (j_max, a)
                a = real(i)
                b = real(i)**2
                c = b**2+a**2
                res(i) = real(c)
            end do
         !$omp end parallel do
          
            wtime = delta_seconds ( )
            print*, 'Elapsed time parallel simulation : ', wtime, count_id(0)
          
            count_id = 0
            wtime = delta_seconds ( )
            do i = 1, i_max
                ! long subroutine, here replaced by dummy calculations
                id = omp_get_thread_num()
                count_id(id) = count_id(id) + 1
                call do_more_work (j_max, a)
                a = real(i)
                b = real(i)**2
                c = b**2+a**2
                res(i) = real(c)
            end do
          
            wtime = delta_seconds ( )
            print*, 'Elapsed time serial simulation   : ', wtime, count_id(0)
            deallocate(res)
!
          END DO ! N
        end do ! j
        
    end program

   double precision function Delta_Seconds ()
      integer*8        :: clock1 = 0
      integer*8        :: clock2, rate
      double precision :: dt
 
       CALL SYSTEM_CLOCK (clock2,rate)
       dt = (clock2-clock1)/DBLE(rate) 
       clock1 = clock2
       Delta_Seconds = dt
   end function Delta_Seconds

   subroutine do_more_work (j_max, a)
      double precision :: a, b
      integer :: i, j_max
      a = 1.0
      do i = 1,j_max
        b = log(a)
        a = a + b
      end do
   end subroutine do_more_work

 

John_Campbell
New Contributor II

My apologies: I did not declare j_max as shared. (I don't normally use "default(private)"; I prefer to declare all variables explicitly as private or shared, as sketched below.)
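
For reference, the !$omp parallel do directive from the revised example, written with every variable listed explicitly instead of relying on default(private), would look roughly like this (the loop index i is private by rule, so it need not be listed):

         !$omp parallel do private(id, a, b, c) &
         !$omp&            shared(res, i_max, j_max, count_id) &
         !$omp&            reduction(+ : so)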

I made some other minor changes so that the serial and parallel results can be compared.

Now, for small i_max and j_max, parallel is slower than serial, but with the increased workload from larger i_max and j_max, parallel is faster.

The revised example is:

  program main
      use omp_lib
      implicit none
        
        double precision :: wtime, a, b, c, so, ss
        double precision, allocatable :: res(:)
        integer :: i, j, i_max, j_max, N, id, count_id(0:11)
        double precision, external :: delta_seconds
!        
        call omp_set_num_threads (4)
!
        do j = 1,2
          if (j==1) j_max = 1         ! not enough work
          if (j==2) j_max = 1000      ! more work, but not too much
          
          DO N = 1,3
            i_max = 1000 * 10**(n-1)
            allocate(res(i_max))
              write (*,10) 'i_max, j_max = ',i_max, j_max
!
            count_id = 0
            so = 0
            wtime = delta_seconds ( )
         !$omp parallel do default(private) shared(res, i_max, j_max, count_id) REDUCTION(+ : so) 
            do i = 1, i_max
                ! long subroutine, here replaced by dummy calculations
                id = omp_get_thread_num()
                count_id(id) = count_id(id) + 1
                call do_more_work (j_max, a)
                b = real(i)**2
                c = sqrt(b)+a**2
                res(i) = real(c)
                so = so + c
            end do
         !$omp end parallel do
          
            wtime = delta_seconds ( )
            write (*,11) 'Elapsed time parallel simulation : ', wtime, count_id(0), so  ! report sum
          
            count_id = 0
            ss = 0
            wtime = delta_seconds ( )
            do i = 1, i_max
                ! long subroutine, here replaced by dummy calculations
                id = omp_get_thread_num()
                count_id(id) = count_id(id) + 1
                call do_more_work (j_max, a)
                b = real(i)**2
                c = sqrt(b)+a**2
                res(i) = real(c)
                ss = ss + c
            end do
          
            wtime = delta_seconds ( )
            write (*,11) 'Elapsed time serial simulation   : ', wtime, count_id(0), ss, so-ss  ! compare sums
            deallocate(res)
!
          END DO ! N
        end do ! j
    10 format (/a,i0,1x,i0)
    11 format (a,f10.6,1x,i8,2es12.4)
    end program

   double precision function Delta_Seconds ()
      integer*8        :: clock1 = 0
      integer*8        :: clock2, rate
      double precision :: dt
 
       CALL SYSTEM_CLOCK (clock2,rate)
       dt = (clock2-clock1)/DBLE(rate) 
       clock1 = clock2
       Delta_Seconds = dt
   end function Delta_Seconds

   subroutine do_more_work (j_max, a)
      double precision :: a, b
      integer :: i, j_max
      a = 1.1            ! b now varies through loop
      do i = 1,j_max
        b = log(a)
        a = a + b
      end do
      a = b
   end subroutine do_more_work

 
