Intel® Fortran Compiler

restrictive use of OpenMP BARRIER

John_Campbell
New Contributor II

I have been trying to use BARRIER to synchronise the computation across the active threads, so that their memory use coincides more closely and cache availability improves.

I have been surprised by the apparent restrictions on the use of BARRIER in an !$OMP PARALLEL DO region and was wondering if I was expecting too much.

The OpenMP standard requires that "each barrier must be encountered by all threads in a team". However, in the example below, if "a_few_tests" is not a multiple of the team size, the program will hang.

!$OMP PARALLEL DO
   do test = 1,a_few_tests
     do step = 1,many_steps
       call use_very_large_array (test,step)
     end do ! step
   !$OMP CRITICAL
     call report_results
   !$OMP END CRITICAL
   end do ! test
!$OMP END PARALLEL DO
...

   subroutine use_very_large_array (test, step)
   use large_array_info

 !$OMP BARRIER      ! to align threads for memory usage
   ...
   ! calculate using array
   end subroutine use_very_large_array

For a PARALLEL DO construct, I would prefer the barrier to apply to all active threads in the team rather than to all threads. Near the end of the DO loop, the number of active threads is managed by the SCHEDULE clause, so it should be available to the OpenMP runtime.

There appear to be a number of restrictions on the use of BARRIER. I am not sure that BARRIER is even allowed directly inside a DO loop, as the compiler I am using only accepts BARRIER in the called subroutine.

For this code to work, I must adjust the team size using call omp_set_num_threads (team_size) and possibly add phantom tests so that "a_few_tests" is a multiple of "team_size".

( The approach of including BARRIER has significantly improved performance in a multi-threaded calculation with a large memory demand, resulting in more uniform run times between threads. Previously thread run times varied between 4h:20m and 5h:20m; with BARRIER they are now only 3h:40m. There are 10 threads/tests on an i7-8700K for 5,000 steps of a 16 GB array. Each thread reads 32 GB of memory per step (now 2.6 sec), so concurrent use of the same memory by different threads appears to be occurring more often. )

Questions:

Is what I am describing common to all implementations of BARRIER in OpenMP?

Have others been able to overcome these apparent restrictions in another way?

I am attaching a working example that appears to demonstrate the problem I am identifying.

jimdempseyatthecove
Honored Contributor III

You will have to structure your loop differently:

!$OMP PARALLEL
nThreads = omp_get_num_threads()
if (mod(a_few_tests, nThreads) == 0) then
  iterCount = a_few_tests
else
  ! round the iteration count up to the next multiple of the team size
  iterCount = a_few_tests + nThreads - mod(a_few_tests, nThreads)
endif
!$OMP DO
do test = 1, iterCount
  do step = 1, many_steps
    call use_very_large_array (a_few_tests, test, step)
  end do ! step
!$OMP CRITICAL
  call report_results
!$OMP END CRITICAL
end do ! test
!$OMP END DO
!$OMP END PARALLEL


subroutine use_very_large_array (testlimit, test, step)
  use large_array_info
  integer :: testlimit, test, step   ! note: a_few_tests/testlimit may be located in large_array_info
!$OMP BARRIER      ! to align threads for memory usage
  if (test <= testlimit) then
    ...
    ! calculate using array
  endif
end subroutine use_very_large_array

Jim Dempsey

John_Campbell
New Contributor II

Jim,

Thanks very much for your advice.

The approach I am taking is: given num_tests and max_threads, calculate num_pass (the number of passes through the team); then min_threads (the smallest team size that solves all tests in num_pass passes); then adjust num_tests = min_threads * num_pass. I also must use "call omp_set_num_threads (min_threads)" so the team size is correct. For the extra phantom tests (at most num_pass - 1 of them), I can skip the calculations but make sure they proceed through all barriers.
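In code form the adjustment looks something like this (a minimal sketch, placed before the parallel region, assuming num_tests is already set and taking max_threads from omp_get_max_threads):

use omp_lib
integer :: num_tests, max_threads, num_pass, min_threads

max_threads = omp_get_max_threads ()
num_pass    = (num_tests + max_threads - 1) / max_threads   ! ceiling division: passes needed at the full team size
min_threads = (num_tests + num_pass - 1) / num_pass         ! smallest team that still finishes in num_pass passes
num_tests   = min_threads * num_pass                        ! pad with phantom tests
call omp_set_num_threads (min_threads)                      ! set the team size before the parallel region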

The use of BARRIER in this way to "synchronise" all threads, where the memory transfer rate is the bottleneck, appears to have a significant effect on run-time performance. For each solution step the 16 GB array is read twice, so I am now placing two barriers before each pass. For the i7-8700 this has reduced elapsed time by over 30%, and for the i7-4790 by over 50%, a significant improvement when trying to address the memory transfer bottleneck. Having min_threads < max_threads also appears to be beneficial in this case. In initial runs I was surprised how much slower the 4790 was in comparison to the 8700, but the memory transfer rate is a performance limiter. I am assuming there must be some sharing of L3, and possibly L2, cache between threads. The initial symptom was the large variation in elapsed time between threads, when reporting their results, for what was an identical calculation load.
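The placement in the step routine is now roughly as follows (a sketch only; I am assuming one barrier ahead of each of the two sweeps of the array):

!$OMP BARRIER      ! align threads before the first read of the 16 GB array
  ! ... first sweep of the array ...
!$OMP BARRIER      ! re-align before the second read
  ! ... second sweep of the array ...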

I was expecting that BARRIER would be easier to manage, as the thread scheduler should already know which threads are active. However, at least the solution to the problem is manageable, so thanks again for identifying it.

John

jimdempseyatthecove
Honored Contributor III

The 4790 has 2 memory channels and a max memory bandwidth of 25.6 GB/s (8 MB SmartCache).
The 8700 has 2 memory channels and a max memory bandwidth of 41.6 GB/s (12 MB SmartCache).

On memory-I/O-intensive applications, the limiting factors are the number of memory channels and the memory bandwidth per channel. While on these CPUs a single thread performing block memory copy operations could potentially saturate the memory bandwidth, memory-intensive applications tend to do some computation as well. The ideal number of threads therefore depends on the ratio of non-memory I/O (including reads from cache) versus memory I/O. In your application, one of three scenarios will best describe the activity:

1) Memory bandwidth limitation
2) Cache utilization by a single thread
3) Cache utilization by multiple threads

For situations 1 and 2, you may want to use:

    KMP_AFFINITY=scatter
or
    OMP_PROC_BIND=spread

For situation 3:

    KMP_AFFINITY=compact
or
    OMP_PROC_BIND=close

And for all three, use a subset of the available threads.

Note that OMP_PLACES is also available; it provides additional control over thread placement.

What you asked of BARRIER (excluding from the barrier count the threads that have exited the region) is contrary to the design requirements of the barrier. Consider the difficulty of specifying the behavior should your parallel region contain multiple !$OMP DO loops ending with !$OMP END DO NOWAIT. How would you want the barrier to behave?
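For concreteness, here is a sketch of the kind of case I mean (my illustration; work_a and work_b are placeholder routines, and work_a is assumed to call BARRIER internally, as in your code):

!$OMP PARALLEL
!$OMP DO
do i = 1, n1
  call work_a (i)      ! calls BARRIER internally
end do
!$OMP END DO NOWAIT    ! threads leave this loop as soon as their iterations finish
!$OMP DO
do j = 1, n2
  call work_b (j)      ! early finishers may already be working here
end do
!$OMP END DO
!$OMP END PARALLEL

With the NOWAIT, some threads can still be inside the first loop while others have moved on to the second. A barrier that counted only the threads "active" in one loop would have no well-defined membership, and a barrier that counts the whole team deadlocks, because the threads in the second loop never reach it.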

Jim Dempsey

John_Campbell
New Contributor II

What you asked of BARRIER (excluding from the barrier count the threads that have exited the region) is contrary to the design requirements of the barrier. Consider the difficulty of specifying the behavior should your parallel region contain multiple !$OMP DO loops ending with !$OMP END DO NOWAIT. How would you want the barrier to behave?

Jim, I don't understand what you are suggesting. By placing a BARRIER in a !$OMP DO ... !$OMP END DO, I wanted the barrier to make all (active) threads wait and then continue together from that point, so NOWAIT is not relevant.

However, I can only place the BARRIER inside the DO ... END DO by hiding it in a called routine. I suspect I am doing something wrong by choosing BARRIER, but I don't know why.

So I would like to better understand "is contrary to the design requirements of the barrier". If what I have done (to ensure all threads proceed from this point together) is "contrary to the design ...", is there a more appropriate way? After all, if BARRIER were appropriate in this case, I would have expected that only "active" threads would be considered. (My approach has not considered the complexity of nested OpenMP.)

jimdempseyatthecove
Honored Contributor III

John,

I was showing an example of why a BARRIER cannot conclusively determine the current number of threads participating in (each) loop, as opposed to the total (or even remaining) threads within the parallel region.

A parallel region (which may be nested) knows how many threads it has, and how many of its threads have entered a (the same) barrier. The (active) barrier gate opens and resets when the number of entries equals the number of threads in the current parallel region. While conceivably the code could be written to discount threads that have exited (or are waiting to exit) the parallel region, the standards committee chose not to do so. This choice may be an oversight .OR. it may have adverse implications not considered here.
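To illustrate the counting logic, a toy version of such a barrier might look like the following (a simplified sketch only, not how the OpenMP runtime actually implements it, and valid only for a non-nested region with a fixed team size):

subroutine counting_barrier (nthreads)
  integer, intent(in) :: nthreads
  integer, save :: count = 0, generation = 0   ! SAVE variables are shared between threads
  integer :: my_gen
!$OMP CRITICAL (toy_barrier)
  my_gen = generation
  count  = count + 1
  if (count == nthreads) then     ! the last thread to arrive...
    count = 0                     ! ...resets the entry count
    generation = generation + 1   ! ...and opens the gate
  end if
!$OMP END CRITICAL (toy_barrier)
  do                              ! everyone spins until the gate opens
!$OMP FLUSH (generation)
    if (generation /= my_gen) exit
  end do
end subroutine counting_barrier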

Jim Dempsey
