$omp do reduction does not synchronize INSIDE the parallel region

bmg_88 · 02-19-2024

Hi,

I am working with code that looks roughly like this:

!$omp parallel
...
    !$omp do
      do i=5, N
         ... !heavy computation
         P_mtx(i) = ...
         R_mtx(i) = ... 
      end do
    !$omp end do
    !$omp single
      do i=1, 4
         ... !heavy computation
         P_mtx(i) = ...
         R_mtx(i) = ... 
      end do
    !$omp end single
    !$omp barrier

    !$omp single
      coef_CG_1 = 0.0d+0
    !$omp end single

    !$omp do reduction(+:coef_CG_1) schedule(runtime)
      do i = 1, tot_cells
         coef_CG_1 = coef_CG_1 + R_mtx(i)**2  
      end do
    !$omp end do

    print *, coef_CG_1 !--> only 1 thread (sometimes) gives the correct value

    !in the end I got the wrong result because coef_CG_1 is not correct 
...
!$omp end parallel

I can confirm that there is no problem before the !$omp barrier directive, so the problem is in the !$omp do reduction. I printed coef_CG_1 from all threads and found that only one thread (sometimes) gave the correct result.

My question: is it true that the compiler cannot guarantee synchronization (after the reduction) inside the parallel region?

I found a workaround, which looks like this:

!$omp parallel
...
  thread_id = omp_get_thread_num()
  num_threads = omp_get_num_threads()
    !$omp do
      do i=5, N
         ... !heavy computation
         P_mtx(i) = ...
         R_mtx(i) = ... 
      end do
    !$omp end do
    !$omp single
      do i=1, 4
         ... !heavy computation
         P_mtx(i) = ...
         R_mtx(i) = ... 
      end do
    !$omp end single
    !$omp barrier

    !$omp do
      do i = 1, num_threads
         coef_CG_1(i) = 0.0d+0
      end do
    !$omp end do

    !$omp do schedule(runtime)
      do i = 1, tot_cells
         coef_CG_1(thread_id+1) = coef_CG_1(thread_id+1) + R_mtx(i)**2  
      end do
    !$omp end do

    print *, sum(coef_CG_1(1:num_threads)) !--> this is correct

    !The calculations are correct 
...
!$omp end parallel

However, it is still unclear to me why !$omp do reduction does not synchronize automatically in the first code. FYI, I cannot split the code into several parallel regions.

Any advice will be appreciated. Thanks.
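
For readers who want to try the per-thread accumulation workaround above, here is a minimal self-contained sketch. It is an illustration under assumptions, not the original application: the program name, the array names partial and seen, the test data in R_mtx, and the problem size are all hypothetical. Each thread adds into its own slot of a shared array, and the barrier implied by !$omp end do ensures every slot is final before any thread sums the array.

```fortran
program per_thread_partial_sums
  use omp_lib
  implicit none
  integer, parameter :: tot_cells = 1000          ! hypothetical problem size
  double precision :: R_mtx(tot_cells)
  double precision, allocatable :: partial(:), seen(:)
  double precision :: expected
  integer :: i, thread_id, nthreads

  ! Hypothetical test data standing in for the real residual vector.
  do i = 1, tot_cells
     R_mtx(i) = dble(i)
  end do
  expected = sum(R_mtx**2)

  ! One slot per possible thread, zeroed before the parallel region.
  allocate(partial(omp_get_max_threads()), seen(omp_get_max_threads()))
  partial = 0.0d0
  seen = 0.0d0
  nthreads = 1

  !$omp parallel default(shared) private(i, thread_id)
  thread_id = omp_get_thread_num()

  !$omp single
  nthreads = omp_get_num_threads()
  !$omp end single

  !$omp do schedule(runtime)
  do i = 1, tot_cells
     partial(thread_id + 1) = partial(thread_id + 1) + R_mtx(i)**2
  end do
  !$omp end do
  ! The implicit barrier at "end do" means all slots are complete here.

  ! Every thread records the total it observes inside the region.
  seen(thread_id + 1) = sum(partial)
  !$omp end parallel

  if (any(abs(seen(1:nthreads) - expected) > 1.0d-9 * expected)) then
     error stop 'a thread saw an incomplete sum'
  end if
  print *, 'all threads saw', seen(1), ' expected', expected
end program per_thread_partial_sums
```

Build with OpenMP enabled (for example -fopenmp with gfortran or -qopenmp with ifx). Adjacent slots of partial share cache lines, so for a hot loop it may be cheaper to accumulate into a private scalar and write the slot once after the loop.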

TobiasK · 02-20-2024

@bmg_88

I moved this to the Fortran forum.

We have some very active and skilled people here that might guide you.

Just a short comment: your assumption about !$omp do reduction is wrong. The standard leaves the synchronization point unspecified, so in practice the synchronization/reduction happens only at the !$omp end parallel point.

https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf

Page 133:

The location in the OpenMP program at which values are combined and the order in which values are combined are unspecified. Thus, when comparing sequential and parallel executions, or when comparing one parallel execution to another (even if the number of threads used is the same), ...

I would probably rewrite the code with multiple parallel regions; I doubt you would notice the overhead of that. Also, for coef_CG_1, why not include the reduction directly in the first loop, where you calculate R_mtx()?
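
A minimal sketch of that suggestion, assuming the sum can be fused into the loop that fills R_mtx and that a separate parallel region is acceptable. The program name, the problem size, and the formula standing in for the heavy computation are hypothetical.

```fortran
program fused_reduction
  implicit none
  integer, parameter :: tot_cells = 1000   ! hypothetical problem size
  double precision :: R_mtx(tot_cells)
  double precision :: coef_CG_1
  integer :: i

  coef_CG_1 = 0.0d0

  ! The reduction clause sits on the combined construct, so the combined
  ! value is available as soon as the parallel region ends.
  !$omp parallel do reduction(+:coef_CG_1)
  do i = 1, tot_cells
     R_mtx(i) = 0.5d0 * dble(i)            ! placeholder for the heavy computation
     coef_CG_1 = coef_CG_1 + R_mtx(i)**2   ! accumulate while R_mtx(i) is at hand
  end do
  !$omp end parallel do

  ! Serial check against the same data.
  print *, 'reduction:', coef_CG_1
  print *, 'serial   :', sum(R_mtx**2)
end program fused_reduction
```

Fusing the two loops also avoids a second pass over R_mtx. Because the order of combination is unspecified, the two printed values may differ in the last bits for general floating-point data.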

jimdempseyatthecove · 02-20-2024

As @TobiasK indicates, the reduction occurs at the end of the parallel region (reduction clause belongs on the "!$omp parallel ...").

Your last code sample is one way of handling this.

An alternative is to have private(coef_CG_1_private), and shared(coef_CG_1)

Have each thread tally into its coef_CG_1_private

Then after the loop one of:

!$omp atomic
coef_CG_1 = coef_CG_1 + coef_CG_1_private
!$omp end atomic
! *** caution at this point in code, coef_CG_1 may not be fully updated

.OR.

Use a critical section if the value of coef_CG_1 is to be used before the next synchronizing point of the parallel region.

Jim Dempsey
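
The private-tally-plus-atomic alternative can be sketched as follows. This is an illustration under assumptions, not code from the thread: the program name, test data, and problem size are hypothetical, and the explicit !$omp barrier before the read is one possible way to address the caution about coef_CG_1 not yet being fully updated.

```fortran
program atomic_accumulate
  implicit none
  integer, parameter :: tot_cells = 1000   ! hypothetical problem size
  double precision :: R_mtx(tot_cells)
  double precision :: coef_CG_1, coef_CG_1_private
  integer :: i

  ! Hypothetical test data standing in for the real residual vector.
  do i = 1, tot_cells
     R_mtx(i) = dble(i)
  end do

  !$omp parallel default(shared) private(i, coef_CG_1_private)

  ! One thread zeroes the shared total; "end single" implies a barrier.
  !$omp single
  coef_CG_1 = 0.0d0
  !$omp end single

  ! Each thread tallies its share of the iterations privately.
  coef_CG_1_private = 0.0d0
  !$omp do schedule(runtime)
  do i = 1, tot_cells
     coef_CG_1_private = coef_CG_1_private + R_mtx(i)**2
  end do
  !$omp end do nowait

  ! Each thread adds its tally to the shared total exactly once.
  !$omp atomic
  coef_CG_1 = coef_CG_1 + coef_CG_1_private
  !$omp end atomic

  ! Without this barrier a thread could read a partially updated total.
  !$omp barrier

  !$omp single
  print *, 'atomic total:', coef_CG_1
  print *, 'serial check:', sum(R_mtx**2)
  !$omp end single

  !$omp end parallel
end program atomic_accumulate
```

The nowait on the loop is safe here only because the barrier after the atomic update orders all contributions before any read of coef_CG_1.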