Intel® Fortran Compiler

OpenMP bug?

Brian1
Beginner

The attached code occasionally produces incorrect results due to what I suspect is a data race in the OpenMP implementation.  I believe the code is standards-compliant and should produce deterministic output.  The expected output is:

[plain]

user@host $ ./a.out

8. 1. 8. 1. 8. 1. 8. 1. 8. 1. 

[/plain]

Occasionally, the program will produce erroneous output such as a blank line or numbers other than 8 and 1.  I have observed the following:

[plain]

user@host $ ./a.out

  
user@host $ ./a.out
Array "whole" contains zeros!
8. 0. 0. 0. 0. 0. 0. 0. 0. 0.

[/plain]

It appears there is a race condition during the copy-in or copy-out of the non-contiguous array section pointed to by "slice", which is passed to the subroutine "sub".  The C pointer "ptr" is made private by the OpenMP parallel directive and is used to allocate memory that is private to each thread via a call to malloc.  This memory is associated with the pointer "whole", which is threadprivate via the declaration in "data_mod", and subsequently becomes associated with the shared pointer variable "slice", which is accessed by only one thread because of the OpenMP single directive.  Memory allocation via "malloc" on Mac OS X is documented to be thread-safe, so that shouldn't be the source of the race.
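
For reference, here is a rough sketch of the structure of the attached program (this is not the attachment itself; the names nx and sub and the exact section bounds are placeholders, but it shows how "ptr", "whole", and "slice" are related):

[plain]

module data_mod
  use iso_c_binding
  implicit none
  real(c_double), pointer :: whole(:) => null()
  !$omp threadprivate(whole)
end module data_mod

subroutine sub(a, n)
  use iso_c_binding
  implicit none
  integer, intent(in) :: n
  real(c_double) :: a(n)   ! explicit-shape dummy, so a strided actual is copied in/out
  a = 8
end subroutine sub

program bug
  use iso_c_binding
  use data_mod
  implicit none
  interface
    type(c_ptr) function malloc(size) bind(C, name="malloc")
      import :: c_ptr, c_size_t
      integer(c_size_t), value :: size
    end function malloc
    subroutine free(p) bind(C, name="free")
      import :: c_ptr
      type(c_ptr), value :: p
    end subroutine free
  end interface
  integer, parameter :: nx = 10
  type(c_ptr) :: ptr
  real(c_double), pointer :: slice(:)    ! shared; touched only inside the single block

  !$omp parallel private(ptr)
  ptr = malloc(int(nx*sizeof(1._c_double), c_size_t))   ! per-thread buffer (sizeof() is an Intel extension)
  call c_f_pointer(ptr, whole, (/ nx /))                ! associate threadprivate "whole"
  whole = 1
  !$omp single
  slice => whole(1::2)                   ! non-contiguous section of one thread's buffer
  call sub(slice, size(slice))
  write (*,'(999(F3.0, 1X))') whole      ! expected: 8. 1. 8. 1. 8. 1. 8. 1. 8. 1.
  !$omp end single
  call free(ptr)
  !$omp end parallel
end program bug

[/plain]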

Compiler and host (Mac OS X 10.8) information:

[plain]

user@host $ ifort --version
ifort (IFORT) 12.1.6 20120928
Copyright (C) 1985-2012 Intel Corporation.  All rights reserved.


user@host $ uname -a
Darwin host.local 12.3.0 Darwin Kernel Version 12.3.0: Sun Jan  6
22:37:10 PST 2013; root:xnu-2050.22.13~1/RELEASE_X86_64 x86_64

[/plain]

Casey
Beginner

put an !$omp barrier before the single block.  

The single block has an implied barrier at the end but not at the beginning.  The single construct just means that only one of your worker threads will execute the block (rather than all of them), but it will start as soon as the first thread gets to it.  The race condition occurs when the thread that runs the single block gets there before the others have finished the code prior to the single block.  Putting a barrier before the single ensures that the block only runs after all of the threads have finished their work.  In short, it is a race, but it isn't OpenMP's fault.
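
Roughly like this (a self-contained toy, not your attached code; the variable names here are made up):

[plain]

program barrier_before_single
  use omp_lib
  implicit none
  integer :: buf(8)
  buf = 0
  !$omp parallel
  ! each thread fills its own strided part of buf
  buf(omp_get_thread_num() + 1 : : omp_get_num_threads()) = 1
  !$omp barrier        ! without this, the single below may read buf before every thread has written
  !$omp single
  print *, 'sum =', sum(buf)    ! always 8 once the barrier guarantees all writes are done
  !$omp end single              ! implied barrier at the end, but not at the start
  !$omp end parallel
end program barrier_before_single

[/plain]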

Brian1
Beginner

Casey W. wrote:

put an !$omp barrier before the single block.  

I suppose adding a barrier to the single block could have an effect (and perhaps prevent the data race), but why would this need to be done?  There should only be one thread executing the single block, so what would be causing a race within it?  Can you point me to the line(s) of code that are actually producing the race?

TimP
Honored Contributor III

A single block that starts without a barrier is valid only when the single block doesn't access anything that is modified in the preceding code (by threads which may still be busy when the first thread reaches the single).  It took me some months to get my posted example correct (as far as I know, with hints from expert colleagues).

Casey
Beginner

Brian wrote:

Casey W. wrote:

put an !$omp barrier before the single block.  

I suppose adding a barrier to the single block could have an effect (and perhaps prevent the data race), but why would this need to be done?  There should only be one thread executing the single block, so what would be causing a race within it?  Can you point me to the line(s) of code that are actually producing the race?

The race occurs when line 55 is reached before all threads have finished executing line 52.  

The barrier makes sure that all threads finish the c_f_pointer() call before the single block uses the results of that call.  The single directive doesn't do any thread synchronization at its start; it just limits execution to one thread.  When the first thread gets to line 54, it moves on to line 55 (no matter where the other threads are), and every subsequent thread that gets to line 54 jumps to line 64 and waits at the implied barrier for all other threads.

Brian1
Beginner

Thank you for your help tracking down the cause of the data race.  If I understand this correctly, you are saying that the single directive could cause an individual thread to somehow have an inconsistent view of memory that it modified in the immediately preceding parallel section.  Clearly, this could lead to a data race, and it could cause the bug I found.

However, the single directive appears to be a red herring.  I removed it and made "slice" a private variable in the parallel directive.  Now there is output from each thread and one or two of them will occasionally be erroneous.  The modified program is attached.

Isn't each thread guaranteed to have a consistent view of memory that only it has accessed within the parallel region?  Each thread is only working with threadprivate ("whole") or private ("ptr", "slice") variables, yet somehow there is still a data race.

Casey
Beginner

As far as I can tell, the race is somewhere in the C bindings.  Is c_f_pointer thread-safe?  I'm not familiar with the iso_c_binding stuff, so I can't really help there.  I did rip out all of the C bindings (allocate/deallocate instead of malloc + c_f_pointer / free), and that code runs with no problems; I am unable to reproduce the bug from bug2.f90.  Given that, I am inclined to point my finger at c_f_pointer, but that's about all the help I can give at this point.
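
In other words, something like this (a complete toy program, not bug2.f90 itself):

[plain]

program no_c_bindings
  use omp_lib
  use iso_c_binding, only: c_double
  implicit none
  integer, parameter :: nx = 10
  real(c_double), pointer :: whole(:) => null()
  !$omp threadprivate(whole)

  !$omp parallel
  allocate (whole(nx))                    ! replaces malloc + c_f_pointer
  whole = 1 + omp_get_thread_num()
  if (any(whole .eq. 0._c_double)) write (*,'("Array ""whole"" contains zeros!")')
  write (*,'(999(F3.0, 1X))') whole
  deallocate (whole)                      ! replaces free
  !$omp end parallel
end program no_c_bindings

[/plain]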

jimdempseyatthecove
Honored Contributor III

I cannot get this to fail on my system. Are you linking in the multi-threaded CRTL?

Just before you "call free(ptr)", add "write(*,*) ptr" (do not place it just after the allocation, as this may alter the symptom).

See if the threads are allocating the same block.

Jim Dempsey

Brian1
Beginner

Okay, another iteration and some new info.  The program now assigns one plus the OpenMP thread number to "whole" and prints out the thread number and the value of "ptr" (via transfer into a pointer-sized integer, since a c_ptr can't be printed directly).  Here's the result with OMP_NUM_THREADS=17 in a case where the output is incorrect:

[plain]

Array "whole" contains zeros!
17. 17. 17. 17. 17. 17. 17. 17. 17. 17.
9. 9. 9. 9. 9. 9. 9. 9. 9. 9.
13. 13. 13. 13. 13. 13. 13. 13. 13. 13.
4. 4. 4. 4. 4. 4. 4. 4. 4. 4.
15. 15. 15. 15. 15. 15. 15. 15. 15. 15.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
14. 14. 14. 14. 14. 14. 14. 14. 14. 14.
16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
12. 12. 12. 12. 12. 12. 12. 12. 12. 12.
5. 5. 5. 5. 5. 5. 5. 5. 5. 5.
7. 7. 7. 7. 7. 7. 7. 7. 7. 7.
8. 8. 8. 8. 8. 8. 8. 8. 8. 8.
2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
6. 6. 6. 6. 6. 6. 6. 6. 6. 6.
10. 10. 10. 10. 10. 10. 10. 10. 10. 10.
11. 11. 11. 11. 11. 11. 11. 11. 11. 11.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
0: 140410170114048
16: 140410182697040
12: 140410181648464
15: 140410169081696
13: 140410170114128
14: 140410184794272
8: 140410184794192
10: 140410183745616
9: 140410171162704
11: 140410171162784
4: 140410171162624
3: 140410184794112
1: 140410183745536
5: 140410181648384
7: 140410185842688
6: 140410182696960
2: 140410169081072

[/plain]

In this case, the thread that prints incorrect values is thread two (value in "whole" should be 3).  However, thread two's ptr does not alias any other thread's memory.

Another result:

[plain]

5. 5. 5. 5. 5. 5. 5. 5. 5. 5.
9. 9. 9. 9. 9. 9. 9. 9. 9. 9.
11. 11. 11. 11. 11. 11. 11. 11. 11. 11.
8. 8. 8. 8. 8. 8. 8. 8. 8. 8.

6. 6. 6. 6. 6. 6. 6. 6. 6. 6.
10. 10. 10. 10. 10. 10. 10. 10. 10. 10.
12. 12. 12. 12. 12. 12. 12. 12. 12. 12.
15. 15. 15. 15. 15. 15. 15. 15. 15. 15.
13. 13. 13. 13. 13. 13. 13. 13. 13. 13.
14. 14. 14. 14. 14. 14. 14. 14. 14. 14.
16. 16. 16. 16. 16. 16. 16. 16. 16. 16.
17. 17. 17. 17. 17. 17. 17. 17. 17. 17.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1.

8: 140408171528192
2: 140408156848128
10: 140408172576768
4: 140408156848208
0: 140408155815152
6: 140408169431040
5: 140408170479616
12: 140408171528272
13: 140408156848288
9: 140408170479696
3: 140408157896704
7: 140408155815232
11: 140408168382544
1: 140408168382464
16: 140408156848368
14: 140408170479776
15: 140408155815312

[/plain]
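
For reference, the "thread: address" lines above come from the transfer trick mentioned earlier; stripped down to a self-contained toy, it looks roughly like this (not the attached code):

[plain]

program print_cptr
  use iso_c_binding
  use omp_lib
  implicit none
  interface
    type(c_ptr) function malloc(size) bind(C, name="malloc")
      import :: c_ptr, c_size_t
      integer(c_size_t), value :: size
    end function malloc
    subroutine free(p) bind(C, name="free")
      import :: c_ptr
      type(c_ptr), value :: p
    end subroutine free
  end interface
  type(c_ptr) :: ptr
  integer(c_intptr_t) :: addr

  !$omp parallel private(ptr, addr)
  ptr = malloc(80_c_size_t)
  addr = transfer(ptr, addr)     ! copy the c_ptr's bits into a pointer-sized integer
  write (*,'(I0, ": ", I0)') omp_get_thread_num(), addr
  call free(ptr)
  !$omp end parallel
end program print_cptr

[/plain]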

I tried enclosing the section of code from before the call to malloc to after the assignment to "whole" in a critical section:

[plain]

...

!$omp parallel private(ptr)
!$omp critical

! per-thread allocation; "whole" is the threadprivate pointer from data_mod
ptr = malloc(int(nx*sizeof(1._c_double), c_size_t))
if (.not. c_associated(ptr)) then
  write (*,'("Thread ", I0, ": malloc failed")') omp_get_thread_num()
  stop
end if
call c_f_pointer(ptr, whole, (/ nx /))

! fill the buffer with a value unique to this thread
whole = 1 + omp_get_thread_num()

!$omp end critical

! read the buffer back outside the critical section
if (any(whole .eq. 0._c_double)) then
  write (*,'("Array ""whole"" contains zeros!")')
end if
write (*,'(999(F3.0, X))') whole

...

[/plain]

but this **still** produces incorrect results!  It's as though the assignment to "whole" isn't seen by any() or by write.

Is anyone able to reproduce this behavior?

As Casey mentioned, if I replace malloc + c_f_pointer + free with Fortran allocate + deallocate, things seem to work as expected.  Unfortunately, this won't work for my application because I need memory that is aligned on 16-byte boundaries.  As far as I know, there is no way to do this directly with Fortran allocate.
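
To be concrete, what I mean by getting aligned memory from an external C function is something along these lines (a sketch only; posix_memalign is the POSIX call, and the surrounding code is not my actual program):

[plain]

program aligned_alloc_sketch
  use iso_c_binding
  implicit none
  interface
    ! POSIX aligned allocator; the first argument is void ** on the C side
    integer(c_int) function posix_memalign(memptr, alignment, size) &
        bind(C, name="posix_memalign")
      import :: c_int, c_size_t, c_ptr
      type(c_ptr), intent(out) :: memptr
      integer(c_size_t), value :: alignment, size
    end function posix_memalign
    subroutine free(p) bind(C, name="free")
      import :: c_ptr
      type(c_ptr), value :: p
    end subroutine free
  end interface
  integer, parameter :: nx = 10
  type(c_ptr) :: ptr
  real(c_double), pointer :: whole(:)

  if (posix_memalign(ptr, 16_c_size_t, int(nx*sizeof(1._c_double), c_size_t)) /= 0) &
    stop 'posix_memalign failed'
  call c_f_pointer(ptr, whole, (/ nx /))   ! "whole" now points to 16-byte-aligned memory
  whole = 1
  write (*,'(999(F3.0, 1X))') whole
  call free(ptr)
end program aligned_alloc_sketch

[/plain]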

TimP
Honored Contributor III

I'd be surprised if you could avoid allocating with 16-byte alignment, unless possibly in the 32-bit compiler with option -mia32.

Brian1
Beginner

Tim: Thanks, that's very useful to know.  The code that started all of this mess uses FFTW.  The FFTW manual has this to say about Fortran and aligned memory (see http://fftw.org/fftw3_doc/Allocating-aligned-memory-in-Fortran.html#Allocating-aligned-memory-in-Fortran):

In order to obtain maximum performance in FFTW, you should store your data in arrays that have been specially aligned in memory (see SIMD alignment and fftw_malloc). Enforcing alignment also permits you to safely use the new-array execute functions (see New-array Execute Functions) to apply a given plan to more than one pair of in/out arrays. Unfortunately, standard Fortran arrays do not provide any alignment guarantees. The only way to allocate aligned memory in standard Fortran is to allocate it with an external C function, like the fftw_alloc_real and fftw_alloc_complex functions.
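
In FFTW's own Fortran 2003 interface, that boils down to something like the following (a sketch, assuming FFTW 3.3's fftw3.f03 header is available):

[plain]

program fftw_aligned
  use, intrinsic :: iso_c_binding
  implicit none
  include 'fftw3.f03'
  integer, parameter :: nx = 1024
  type(C_PTR) :: p
  real(C_DOUBLE), pointer :: arr(:)

  p = fftw_alloc_real(int(nx, C_SIZE_T))   ! SIMD-aligned allocation done by FFTW
  call c_f_pointer(p, arr, (/ nx /))
  arr = 0
  ! ... create a plan and transform arr ...
  call fftw_free(p)
end program fftw_aligned

[/plain]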

Is it safe to assume that all Fortran compilers (not just ifort) will allocate memory on 16-byte boundaries?

TimP
Honored Contributor III

The x86_64 ABI dictates 16-byte alignments, so a lesser alignment would be a bug.

Basic alignment on linux-i386 is 8-byte, and efforts have been made over the years to make it play better with SSE2.

It may be worthwhile to try 32-byte alignment. Ron Green discusses details of Intel compiler support for alignment here:

http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization
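
The directive form discussed there looks roughly like this (a sketch; see the article and the compiler documentation for the full set of directives and -align options, and for what applies to allocatable arrays):

[plain]

program aligned_decl
  implicit none
  real(8) :: a(1024)
  !dir$ attributes align : 32 :: a     ! Intel directive: request 32-byte alignment for "a"
  integer(8) :: addr
  a = 0.0d0
  addr = loc(a)                        ! loc() is an extension; returns the address of "a"
  write (*,*) mod(addr, 32_8)          ! prints 0 if the alignment request was honored
end program aligned_decl

[/plain]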
