With some special test cases, my OpenMP code does not yield identical results when running the same job twice with the same number of threads. I finally identified sum reductions as the root cause: apparently, at the end of the loop, OpenMP adds the partial sums in the order in which the threads happen to finish, which is not deterministic.
So I'm wondering what the OpenMP standard says on this matter, especially as I found it possible to create a repeatable reduction explicitly with no performance penalty; see below. Any comments are welcome.
Michael
[fortran]
! --------------------------------
! computing alpha
! --------------------------------
alpha_denom_thr(:) = 0.
!$omp parallel private(mt, alpha_denom)
alpha_denom = 0.
mt = omp_get_thread_num()
!$omp do private(i)
do i = 1, nchan
   Mpeta(i) = Mpeta(i) - omega * v_temp(i)
   alpha_denom = alpha_denom + ( r_s(i) * Mpeta(i) )
end do
!$omp end do
alpha_denom_thr(mt) = alpha_denom
!$omp end parallel
alpha_denom = sum(alpha_denom_thr)   ! Repeatable reduction
[/fortran]
5 Replies
I haven't tried this, but experiment with:
!$omp parallel do private(i), ordered, reduction(+:alpha_denom)
...
!$omp end parallel do
What I do not know for sure is whether the loop still runs in parallel when the reduction is ordered.
Similar effect to:
alpha_denom = 0.
!$omp parallel private(mt, my_alpha_denom)
my_alpha_denom = 0.
!$omp do private(i)
... (accumulate my_alpha_denom)
!$omp end do
!$omp ordered
alpha_denom = alpha_denom + my_alpha_denom
!$omp end ordered
!$omp end parallel
Note that your original method should not have any more overhead than the above; the same work is done whether or not it is hidden behind the directives.
Jim Dempsey
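For reference, a minimal compilable sketch of the ordered idea (untested; variable names such as nchan, Mpeta, r_s, v_temp and omega are taken from the original post): a standalone ORDERED region must sit inside a loop whose DO directive carries the ORDERED clause. The accumulation then happens in iteration order, which is repeatable, at the price of serializing the adds.
[fortran]
! Sketch only: ORDERED region inside a loop whose directive has the ORDERED clause.
! The adds execute in iteration order (repeatable), but they are serialized.
alpha_denom = 0.
!$omp parallel do private(i) ordered
do i = 1, nchan
   Mpeta(i) = Mpeta(i) - omega * v_temp(i)         ! still runs in parallel
   !$omp ordered
   alpha_denom = alpha_denom + r_s(i) * Mpeta(i)   ! executed in iteration order
   !$omp end ordered
end do
!$omp end parallel do
[/fortran]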
Traditionally (even before there was an Intel implementation), reduction has not been expected to give identical results when the number of threads is changed; hence the option in certain production applications to bypass the OpenMP reduction.
If the variations with the order of the thread reduction are significant, you probably need to perform this part of the calculation in double precision.
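A minimal sketch of that suggestion (untested; names taken from the original post): accumulate the single-precision terms in a double-precision variable, which makes the result far less sensitive to the order in which the partial sums are combined.
[fortran]
! Sketch: double-precision accumulator for single-precision data.
! The combination order is still unspecified, but its effect is much smaller.
real(8) :: alpha_denom_dp
alpha_denom_dp = 0.d0
!$omp parallel do private(i) reduction(+:alpha_denom_dp)
do i = 1, nchan
   Mpeta(i) = Mpeta(i) - omega * v_temp(i)
   alpha_denom_dp = alpha_denom_dp + dble(r_s(i)) * dble(Mpeta(i))
end do
!$omp end parallel do
alpha_denom = real(alpha_denom_dp)
[/fortran]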
Thanks, Jim,
I will play with ordered to see if that does the same thing.
Thanks, Tim,
I agree that the data must be rather ill conditioned to expose such a behaviour. However, this lack of repeatability is something that makes users quite nervous, so I'm looking for ways to hide some of the symptoms.
Normally one would need to do the reduction after sorting by magnitudes, but this would introduce enormous overhead.
>>Normally one would need to do the reduction after sorting by magnitudes. But this will introduce enormous overhead
The problem (reducing the effect of round-off error) will not be solved by sorting the partial results produced by each thread and then performing the summation (reduction operator), e.g. adding the partial results from smallest to largest. The crux of the matter is reducing the effect of round-off error while each thread produces its partial result. Best results may come from each thread producing a table of results (e.g. partial sums) organized by size; the reduction (outside, or at the end of, the parallel region) is then performed by summing each thread's table from small to large to produce the final result. Note, part of this reduction can be performed in parallel.
Jim Dempsey
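A rough sketch of that per-thread table idea (illustrative only; NBINS, nthreads and the bin_index function are hypothetical names, not from the original code): each thread accumulates its terms into bins chosen by magnitude, and the final, repeatable reduction sums all bins from smallest to largest.
[fortran]
! Sketch: per-thread tables of partial sums, binned by magnitude.
! bin_index() is a hypothetical function returning 1..NBINS, with small
! magnitudes mapped to low bin numbers (see the next reply for one way to build it).
integer, parameter :: NBINS = 256
real    :: bins(NBINS, 0:nthreads-1)
real    :: term
integer :: b, t
bins = 0.
!$omp parallel private(mt, i, term, b)
mt = omp_get_thread_num()
!$omp do
do i = 1, nchan
   term = r_s(i) * Mpeta(i)
   b    = bin_index(term)
   bins(b, mt) = bins(b, mt) + term
end do
!$omp end do
!$omp end parallel
alpha_denom = 0.
do b = 1, NBINS                  ! sum small to large: serial but repeatable
   do t = 0, nthreads-1
      alpha_denom = alpha_denom + bins(b, t)
   end do
end do
[/fortran]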
An efficient technique for bin selection is to use the exponent, or some number of its most significant bits, as the bin index.
REAL(4) uses 8-bit exponent (256 bins)
REAL(8) uses 11-bit exponent (2048 bins)
If you do not wish to use all the bits of the exponent for bin selection, then your bin-number generator may have to take into consideration the distribution of the data being binned. This can be either a computational function or a table-driven bin-number generator.
Jim Dempsey
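One possible way to write such a bin selector (an assumption; the post does not give code) is with the standard EXPONENT() intrinsic rather than explicit bit manipulation:
[fortran]
! Sketch: bin number derived from the floating-point exponent.
! For default REAL, EXPONENT() of a normal number lies roughly in -125..128,
! so an offset of 128 maps it into 1..256; the clamp catches zero and denormals.
integer function bin_index(x)
   real, intent(in) :: x
   integer, parameter :: NBINS = 256
   bin_index = min(max(exponent(x) + 128, 1), NBINS)
end function bin_index
[/fortran]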