Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.
Announcements
Welcome to the Intel Community. If you get an answer you like, please mark it as an Accepted Solution to help others. Thank you!
26730 Discussions

Do loop takes no advantage of openMP parallelization

tamborski__matteo
95 Views

Hi, 

I am willing to parallelize an inner loop with openMP. 

The code is similar to the one below: a do loop that runs in parallel and contains a long subroutine here replaced by dummy calculations (the scope of the parallelization would be to parallelize the time-consuming subroutine).

  program main
      use omp_lib
      implicit none
        
        double precision :: wtime, a, b, c
        double precision, allocatable :: res(:)
        integer ::i, j, i_max, j_max
        
        i_max = 10000
        allocate(res(i_max))
        
        call omp_set_num_threads (4)
        wtime = omp_get_wtime ( )
        
        !$omp parallel default(private) shared(res, i_max)
        !$omp do
        do i = 1, i_max
            ! long subroutine, here replaced by dummy calculations
            a = real(i)
            b = real(i)**2
            c = b**2+a**2
            res(i) = real(c)
        end do
        !$omp end do
        !$omp end parallel
        
        wtime = omp_get_wtime ( ) - wtime
        print*, 'Elapsed time parallel simulation : ', wtime
        
        wtime = omp_get_wtime ( )
        do i = 1, i_max
            ! long subroutine, here replaced by dummy calculations
            a = real(i)
            b = real(i)**2
            c = b**2+a**2
            res(i) = real(c)
        end do
        
        wtime = omp_get_wtime ( ) - wtime
        print*, 'Elapsed time serial simulation   : ', wtime
        
    end program

What the program prints out is:

Elapsed time parallel simulation :   2.046999987214804E-003
 Elapsed time serial simulation   :   2.710008993744850E-005

That is the opposite of what I'd expect.

For the dummy code the compiler could have taken some shortcut to make sequential code faster, but similar result happens when I run the code with the long subroutine instead of the dummy 'res(i) = c': in this case the parallelization doesn't slow the code that much, but it is still slower or, at best, no gain.

I compiled it with intel fortran compiler using Microsoft VS.

Why is parallelization inefficient in my code? I could not find any documentation (or rules of thumb) on whether a code could gain or not from parallelization. Could be a Microsoft VS project option instead?

 

I will greatly appreciate your contribution.

Thanks a lot.

Matteo

 

 

 

0 Kudos
2 Replies
John_Campbell
New Contributor II
95 Views

You have a number of problems with this example;

1) omp_get_wtime may not be accurate enough. I use System_Clock with integer*8 arguments to get better precision.

2) your loop does not have enough calculation to offset the OMP region overheads ( ~ 5E-6 seconds ) in comparison to serial loop. It is a common problem of !$OMP trivial examples that they don't demonstrate the basic requirement of !$OMP to improve clock time performance.

3) don't use too much optimisation that trivialises your test loop.

4) I introduced count_id to confirm the expected thread usage.

I modified your example to hopefully vary the loop workload and demonstrate improved performance in some cases. This can vary with processor or compiler optimisation.

  program main
      use omp_lib
      implicit none
        
        double precision :: wtime, a, b, c
        double precision, allocatable :: res(:)
        integer :: i, j, i_max, j_max, N, id, count_id(0:11)
        double precision, external :: delta_seconds
!        
        call omp_set_num_threads (4)
!
        do j = 1,2
          if (j==1) j_max = 1         ! not enough work
          if (j==2) j_max = 10000     ! more work
          
          DO N = 1,3
            i_max = 10000 * 10**(n-1)
            allocate(res(i_max))
              write (*,*) 'i_max, j_max =',i_max, j_max
!
            count_id = 0
            wtime = delta_seconds ( )
         !$omp parallel do default(private) shared(res, i_max, count_id)
            do i = 1, i_max
                ! long subroutine, here replaced by dummy calculations
                id = omp_get_thread_num()
                count_id(id) = count_id(id) + 1
                call do_more_work (j_max, a)
                a = real(i)
                b = real(i)**2
                c = b**2+a**2
                res(i) = real(c)
            end do
         !$omp end parallel do
          
            wtime = delta_seconds ( )
            print*, 'Elapsed time parallel simulation : ', wtime, count_id(0)
          
            count_id = 0
            wtime = delta_seconds ( )
            do i = 1, i_max
                ! long subroutine, here replaced by dummy calculations
                id = omp_get_thread_num()
                count_id(id) = count_id(id) + 1
                call do_more_work (j_max, a)
                a = real(i)
                b = real(i)**2
                c = b**2+a**2
                res(i) = real(c)
            end do
          
            wtime = delta_seconds ( )
            print*, 'Elapsed time serial simulation   : ', wtime, count_id(0)
            deallocate(res)
!
          END DO ! N
        end do ! j
        
    end program

   double precision function Delta_Seconds ()
      integer*8        :: clock1 = 0
      integer*8        :: clock2, rate
      double precision :: dt
 
       CALL SYSTEM_CLOCK (clock2,rate)
       dt = (clock2-clock1)/DBLE(rate) 
       clock1 = clock2
       Delta_Seconds = dt
   end function Delta_Seconds

   subroutine do_more_work (j_max, a)
      double precision :: a, b
      integer :: i, j_max
      a = 1.0
      do i = 1,j_max
        b = log(a)
        a = a + b
      end do
   end subroutine do_more_work

 

John_Campbell
New Contributor II
95 Views

My apologies, as I did not declare j_max as shared. ( I don't normally use "default(private)" but prefer to declare all variables explicitly as private or shared )

I did some other minor changes to compare the serial and parallel results.

Now, for small i_max,j_max, parallel is slower than serial, but for increased workload with larger i_max, j_max, parallel is faster.

The revised example is:

  program main
      use omp_lib
      implicit none
        
        double precision :: wtime, a, b, c, so,ss
        double precision, allocatable :: res(:)
        integer :: i, j, i_max, j_max, N, id, count_id(0:11)
        double precision, external :: delta_seconds
!        
        call omp_set_num_threads (4)
!
        do j = 1,2
          if (j==1) j_max = 1         ! not enough work
          if (j==2) j_max = 1000      ! more work, but not too much
          
          DO N = 1,3
            i_max = 1000 * 10**(n-1)
            allocate(res(i_max))
              write (*,10) 'i_max, j_max = ',i_max, j_max
!
            count_id = 0
            so = 0
            wtime = delta_seconds ( )
         !$omp parallel do default(private) shared(res, i_max, j_max, count_id) REDUCTION(+ : so) 
            do i = 1, i_max
                ! long subroutine, here replaced by dummy calculations
                id = omp_get_thread_num()
                count_id(id) = count_id(id) + 1
                call do_more_work (j_max, a)
                b = real(i)**2
                c = sqrt(b)+a**2
                res(i) = real(c)
                so = so + c
            end do
         !$omp end parallel do
          
            wtime = delta_seconds ( )
            write (*,11) 'Elapsed time parallel simulation : ', wtime, count_id(0), so  ! report sum
          
            count_id = 0
            ss = 0
            wtime = delta_seconds ( )
            do i = 1, i_max
                ! long subroutine, here replaced by dummy calculations
                id = omp_get_thread_num()
                count_id(id) = count_id(id) + 1
                call do_more_work (j_max, a)
                b = real(i)**2
                c = sqrt(b)+a**2
                res(i) = real(c)
                ss = ss + c
            end do
          
            wtime = delta_seconds ( )
            write (*,11) 'Elapsed time serial simulation   : ', wtime, count_id(0), ss, so-ss  ! compare sums
            deallocate(res)
!
          END DO ! N
        end do ! j
    10 format (/a,i0,1x,i0)
    11 format (a,f10.6,1x,i8,2es12.4)
    end program

   double precision function Delta_Seconds ()
      integer*8        :: clock1 = 0
      integer*8        :: clock2, rate
      double precision :: dt
 
       CALL SYSTEM_CLOCK (clock2,rate)
       dt = (clock2-clock1)/DBLE(rate) 
       clock1 = clock2
       Delta_Seconds = dt
   end function Delta_Seconds

   subroutine do_more_work (j_max, a)
      double precision :: a, b
      integer :: i, j_max
      a = 1.1            ! b now varies through loop
      do i = 1,j_max
        b = log(a)
        a = a + b
      end do
      a = b
   end subroutine do_more_work

 

Reply