Intel® Fortran Compiler

OMP: clock time increases with number of CPUs: OMP_NESTED issue?

hallevison
Beginner
Hi:

This is a continuation of a thread I originally started on the VTune
forum (http://software.intel.com/en-us/forums/showthread.php?t=81601),
but the Intel folks suggested that I move it here. Let me start
by saying that I am new to parallelization and OpenMP, but I am very
familiar with Fortran and large-scale simulations.

I have a code where the clock time increases when the number of
CPUs gets larger than 4. In particular, when I isolate the part
of the code I am having problems with, I get:

N threads   Clock time (s)
    1           136
    2            90
    4            65
    6           113
    8           165
   10           202
   12           237

With 8 threads, VTune shows:

[VTune hotspots screenshot: coord_vb2h and kickvh at the top of the profile]
coord_vb2h is taking most of the time. This is very surprising,
since it is a trivial routine that should not take any time at
all - and it doesn't when the number of CPUs is small. The same
is true for kickvh. Looking at the source code shows that the OMP
directives themselves are taking most of the time:

[VTune source view screenshot: most of the time is attributed to the OMP directive lines]

Now this is where I get confused. While coord_vb2h and the like
have OMP directives in them to parallelize some of their loops,
they are not the outermost parallelized loop. It is not my
intention to run these in parallel, so I set OMP_NESTED=false.
Shouldn't this stop the OMP directives in these routines from
taking effect? I can't just remove the directives from these
routines, because I need them in other parts of the code.
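
Schematically, the structure is something like this (the names below are placeholders, not the actual routines):

[fortran]! Placeholder sketch only -- routine and variable names are made up.
program structure_sketch
   implicit none
   real :: bodies(1000)
   integer :: i

   bodies = 0.0

   ! The outermost parallelized loop: the only level I intend to be parallel.
   !$omp parallel do default(none) shared(bodies) private(i)
   do i = 1, 100
      call coord_like(bodies)      ! stands in for coord_vb2h, kickvh, ...
   end do
   !$omp end parallel do

   write(*,*) bodies(1)

contains

   ! The inner routines carry their own OMP directives because they are
   ! also called from serial parts of the code.
   subroutine coord_like(x)
      real, dimension(:) :: x
      integer :: j

      !$omp parallel do shared(x) private(j)
      do j = 1, size(x)
         x(j) = x(j) + 1.0
      end do
      !$omp end parallel do

   end subroutine coord_like

end program structure_sketch
[/fortran]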

Thanks for your help.

-Hal.


fah10
New Contributor I
In my experience, dealing with nested parallelism always causes problems, so it is best to avoid it.

If you want parallel subroutines that can be called from both inside and outside parallel regions, the only working way I have found so far is to duplicate the loop, as in mysub in the following example.

If I reproduced your program's behavior with yoursub correctly, you can see that each thread executes _all_ iterations when the routine is called from an existing parallel region. This causes a data race, which may lead to wrong results. It would also explain the slowdown of your program at large thread counts, since the processors get very busy keeping the caches consistent when you have such races.

[fortran]program omptest
   use omp_lib
   implicit none
   real, dimension(4) :: a
  
   !$omp parallel default(none) shared(a)
   call mysub(a)
   call yoursub(a)
   !$omp end parallel
   write(*,*) '-----------'
   call mysub(a)
   call yoursub(a)

contains

   ! yoursub keeps a single "parallel do": when it is called from inside
   ! an existing parallel region with nesting disabled, every calling
   ! thread gets its own one-thread team and runs ALL iterations.
   subroutine yoursub(a)
      real, dimension(:) :: a
      integer :: i

      call omp_set_nested(.false.)

      !$omp parallel do shared(a) private(i)
      do i=1,4
         a(i) = 1.0
         write(*,*) 'yoursub: ',omp_get_thread_num(), i
      end do
      !$omp end parallel do

   end subroutine yoursub

   ! mysub duplicates the loop: a full "parallel do" when called from
   ! serial code, and a plain worksharing "do" (reusing the caller's
   ! team) when called from inside a parallel region.
   subroutine mysub(a)
      real, dimension(:) :: a
      integer :: i

      if (.not.omp_in_parallel()) then
         !$omp parallel do shared(a) private(i)
         do i=1,4
            a(i) = 1.0
            write(*,*) 'mysub: ',omp_get_thread_num(), i
         end do
         !$omp end parallel do
      else
         !$omp do
         do i=1,4
            a(i) = 1.0
            write(*,*) 'mysub: ',omp_get_thread_num(), i
         end do
         !$omp end do
      end if

   end subroutine mysub

end program omptest
[/fortran]

> ifort -openmp omp_in_parallel.F90 && OMP_NUM_THREADS=2 ./a.out
mysub: 0 1
mysub: 0 2
mysub: 1 3
mysub: 1 4
yoursub: 0 1
yoursub: 0 2
yoursub: 0 3
yoursub: 0 4
yoursub: 0 1
yoursub: 0 2
yoursub: 0 3
yoursub: 0 4
-----------
mysub: 0 1
mysub: 0 2
mysub: 1 3
mysub: 1 4
yoursub: 0 1
yoursub: 0 2
yoursub: 1 3
yoursub: 1 4

hallevison
Beginner
OK, I think I know what to try. I have one question: why do you need the
[fortran]!$omp do  
!$omp end do[/fortran]
at lines 43 and 48 in your example (the else branch of mysub)?

Thanks!

-Hal
fah10
New Contributor I
If you call the routine from an existing parallel region, you need to distribute the loop iterations across all threads. Otherwise every thread would compute all iterations, leading to data races again.

If you don't want to run the loop in parallel at all, you can replace the !$omp do ... !$omp end do
by !$omp single ... !$omp end single.
Then only one thread executes the loop (all iterations). However, the other threads are idle while the loop runs, so one might as well use them to compute parts of the loop via !$omp do ... !$omp end do.
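
For illustration, here is a sketch of what that single variant could look like (mysub_single is just an illustrative name; the loop is kept serial in both branches):

[fortran]   ! Sketch: like mysub, but when called from inside a parallel region
   ! the loop is NOT divided among the threads -- one thread of the
   ! existing team runs all iterations, the others wait at end single.
   subroutine mysub_single(a)
      use omp_lib
      real, dimension(:) :: a
      integer :: i

      if (.not.omp_in_parallel()) then
         do i=1,4
            a(i) = 1.0
         end do
      else
         !$omp single
         do i=1,4
            a(i) = 1.0
         end do
         !$omp end single
      end if

   end subroutine mysub_single
[/fortran]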
hallevison
Beginner
Hum... I was playing with your code and comparing it to what I am doing. I seem to get a different answer if I do the following (sorry for the F77):

[fortran]      real a(4)

c...  OMP stuff
!$	logical OMP_get_dynamic,OMP_get_nested
!$	integer nthreads,OMP_get_max_threads


c...  OMP stuff
!$    write(*,'(a)')      ' OpenMP parameters:'
!$    write(*,'(a)')      ' ------------------'
!$    write(*,*) '   Dynamic thread allocation = ',OMP_get_dynamic()
!$    call OMP_set_nested(.false.)
!$    write(*,*) '   Nested parallel loops = ',OMP_get_nested()
!$    nthreads = OMP_get_max_threads() ! In the *parallel* case
!$    write(*,'(a,i3,/)') '   Number of threads  = ', nthreads 


      do i=1,4
         call sub(i,a)
      enddo

      write(*,*) '-----------'  

!$omp parallel do default(none) shared(a)
      do i=1,4
         call sub(i,a)
      enddo
!$omp end parallel do  

      stop
      end

c---------------------
      subroutine sub(i,a)  
      real a(4)
      integer omp_get_thread_num

!$omp parallel do shared(a) private(j)  
      do j=1,4  
         a(j) = 1.0  
         write(*,*) 'subo: ',omp_get_thread_num(), i,j  
      end do  
!$omp end parallel do  

      return
      end
[/fortran]
The output of this code is:

OpenMP parameters:
------------------
Dynamic thread allocation = F
Nested parallel loops = F
Number of threads = 4

subo: 0 1 1
subo: 2 1 3
subo: 3 1 4
subo: 1 1 2
subo: 0 2 1
subo: 3 2 4
subo: 2 2 3
subo: 1 2 2
subo: 3 3 4
subo: 0 3 1
subo: 2 3 3
subo: 1 3 2
subo: 0 4 1
subo: 2 4 3
subo: 1 4 2
subo: 3 4 4
-----------
subo: 0 1 1
subo: 0 1 2
subo: 0 1 3
subo: 0 2 1
subo: 0 3 1
subo: 0 2 2
subo: 0 2 3
subo: 0 2 4
subo: 0 3 2
subo: 0 3 3
subo: 0 3 4
subo: 0 1 4
subo: 0 4 1
subo: 0 4 2
subo: 0 4 3
subo: 0 4 4

So, I am getting the correct number of write statements. It seems to me that the only significant difference between what we are doing is that I put a parallel do around the outer loop, while you just create a parallel region. Does this make sense to you?

Note, however, that omp_get_thread_num() always returns 0 when the nested loops run. Does anyone understand that?
fah10
New Contributor I
Yes, the number of write statements is correct. But be aware that you have a race condition in the code, since all threads (from the first parallel region) write to the variable a at the same time.

!$omp parallel always opens a new parallel region, whether or not nested parallelism is enabled.
Even !$omp parallel if(condition) opens a new parallel region when condition evaluates to .false.
If nesting is deactivated, or the condition is false, the number of threads in the new region is one.
Since omp_get_thread_num() always refers to the innermost enclosing parallel region, you always get 0 in your example.
But the threads from the outer parallel region are still there and still write to a at the same time.
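
A tiny example of what I mean (just a sketch):

[fortran]program ifclause
   use omp_lib
   implicit none
   integer :: outer

   !$omp parallel num_threads(4) private(outer)
   outer = omp_get_thread_num()     ! member number in the OUTER team

   ! if(.false.) still creates a new nested region, but its team has
   ! exactly one thread, so omp_get_thread_num() returns 0 inside it.
   !$omp parallel if(.false.)
   write(*,*) 'outer member', outer, &
              ': inner thread num =', omp_get_thread_num(), &
              ', inner team size =', omp_get_num_threads()
   !$omp end parallel

   !$omp end parallel

end program ifclause
[/fortran]

Every outer thread reports an inner thread num of 0 and an inner team size of 1.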

It's very instructive to read Section 2.4 of the current OpenMP standard (http://openmp.org/wp/openmp-specifications/).
It describes the behavior of the parallel construct and how the number of threads is determined (Algorithm 2.1).
hallevison
Beginner
Yes, I understand that I have a race condition. But it seems to me that setting OMP_NESTED to false should prevent this, don't you agree, since only the outermost loop should be parallelized? In particular, in my case it looks like the compiler was putting barriers in the nested loops where they are not needed.

Your solution does fix my problem. The code now runs about a factor of 6 faster on 8 cores, which is what I had been expecting. So, thanks!
jimdempseyatthecove
Honored Contributor III
The outer loop in your program starts a parallel region with a thread team size of n (4, 6, ... whatever).

Each thread from that team (team member number 0:3, 0:5, ... 0:whatever-1) calls your subroutine.

Each thread from your outer-loop-level team (attempts to) start a parallel region, but NESTED is disabled. The result is that a new team is established (one per thread from the outer level), each with one team member (since nested = false), whose team member number is 0 (it is a new parallel region with a new team, but with one thread). Each team member then executes the do loop (as team member 0 of its own new team) over the complete iteration space (one team member in the team == no slices).

What you are observing is that the same (complete) iteration space of the called subroutine is being executed (in parallel) n times (n being the number of team members in the outer loop).

Not only is this redundant work, it is also not thread safe (the operations are not atomic).

Using

!$omp do

(note the lack of "parallel") would be a correct way to slice up this do loop using the currently established thread team (this was mentioned by one of the other posters).

By omitting "parallel" on the above statement, you declare that you wish to use the currently established thread team for slicing the iteration space. Caution: programming in this style is fine, but it also requires that all members of the team pass through the "!$omp do" (the implied barrier at !$omp end do requires participation of all team members).
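
A short sketch of an orphaned worksharing loop (the routine name is just illustrative):

[fortran]! Illustrative sketch of an orphaned worksharing loop. There is no
! "parallel" here: the loop is sliced across the team of whatever
! parallel region the CALLER is in, and every member of that team must
! call this routine, because the implied barrier at end do needs all of them.
subroutine kick_like(a)
   implicit none
   real, dimension(:) :: a
   integer :: j

   !$omp do
   do j = 1, size(a)
      a(j) = 2.0*a(j)
   end do
   !$omp end do

end subroutine kick_like
[/fortran]

Called from serial code, such a loop simply runs on the single initial thread; the problem case is calling it from only some threads of a team, which is non-conforming and will typically hang at the implied barrier.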

Enabling nested would not work for the code as written either. The subroutine would still be called by each of the n outer threads, each thread would establish a new thread team of n' threads (n' may equal n or some other number), and each new team would slice up the iteration space - meaning each team still completes the entire iteration space (n times the work in total), just in n' slices.

N.B.

It is unfortunate that the OpenMP standard is saddled with the historical artifact that omp_get_thread_num() is now used to obtain the 0-based thread team member number, as opposed to an application-wide 0-based cardinal thread number. In hindsight, this function should probably have been named something like omp_get_thread_team_member_num(). That would precondition new programmers to expect this number to vary with the instantiation of thread teams.

Jim Dempsey