Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.
29001 Discussions

Repeatability of sum reductions with OpenMP

mriedman
Novice
1,475 Views
With some special test cases my OpenMP code does not yield identical results when running the same job twice with the same number of threads. I finally identified sum reductions being the root cause. Apparently OpenMP at the end of the loop adds the partial sums in order of thread execution which is not deterministic.

So I'm wondering what the OpenMP standard says on this matter, especially as I found it possible to create repeatable reductions explicitly with no performance penalty, see below. Any comments are welcome.

Michael

[fortran]!       --------------------------------
!       computing alpha 
!       --------------------------------

        alpha_denom_thr(:) = 0.
!$omp parallel private(mt, alpha_denom)
        alpha_denom = 0.
        mt = omp_get_thread_num()

!$omp do private(i)
        do i=1,nchan
          Mpeta(i) = Mpeta(i) - omega * v_temp(i)

          alpha_denom = alpha_denom + ( r_s (i) * Mpeta (i) ) 
        end do
!$omp end do
        alpha_denom_thr(mt) = alpha_denom
!$omp end parallel
        alpha_denom = sum(alpha_denom_thr)  ! Repeatable reduction

[/fortran]
0 Kudos
5 Replies
jimdempseyatthecove
Honored Contributor III
1,475 Views
I haven't tried this but experiment with

!$omp parallel do private(i), ordered, reduction(+:alpha_denom)
...
!$omp end parallel do

What I do not know for sure is if the loop runs in parallel with the reduction ordered
Similar effect to:

alpha_denom = 0.
!$omp parallel private(mt, my_alpha_denom)
my_alpha_denom = 0.
!$omp do private(i)
... (accumulate my_alpha_denom)
!$omp end do
!$omp ordered
alpha_denom = alpha_denom + my_alpha_denom
!$omp end ordered
!$omp end parallel

Note, your original method should not have any more overhead than the above.
The same work is done regardless of being hidden by the directives.

Jim Dempsey
0 Kudos
TimP
Honored Contributor III
1,475 Views
Traditionally (even before there was an Intel implementation), reduction is expected not to give the identical results when the number of threads is changed. Thus the option in certain production applications to bypass OpenMP reduction.
If the variations with order of thread reduction are significant, you probably need to perform this part of the calculation in double precision.
0 Kudos
mriedman
Novice
1,475 Views
thanks, Jim,
I will play with ordered to see if that does the same thing.

thanks, Tim,
I agree that the datamust berather ill conditioned to exposesuch abehaviour. However this lack of repeatability is something that makes users quite nervous, so I'm looking for ways tohide some ofthe symptoms.

Normally one would need to do the reduction after sorting by magnitudes. But this will introduce enormous overhead.
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,475 Views
>>Normally one would need to do the reduction after sorting by magnitudes. But this will introduce enormous overhead

The problem (reducing effect of round-off error) will not come by sorting the partial results produced by each thread and then performing the summation (reduction operator). e.g. add the partial results from smallest to largest. The crux of the matter is reducing effect of round-off error in producing the partial results by each thread. Best results may come from each thread producing a table of results (e.g. partial sums), organized by size, then the reduction(outside, or at end of,the parallel region)is performed by summing each thread's table from small to large to produce the final reduction. Note, part of this reduction can be performed in parallel.

Jim Dempsey
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,475 Views
An efficient technique for bin selection is to use the exponent or some number of msb portion of the exponent and use that as a bin index.

REAL(4) uses 8-bit exponent (256 bins)
REAL(8) uses 11-bit exponent (2048 bins)

If you do not wish to use all the bits of the exponent for bin selection, then your bin number generator may have to take into consideration the distribution of the binable data. Either this can be a computational function or table driven bin number generator.

Jim Dempsey
0 Kudos
Reply