Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Question about specific guarantees made by OpenMP for simultaneous read/write of individual memory segments

Eugeny_Sosnovsky
Beginner

Consider the following Fortran code segment.

!$omp parallel do &
!$omp shared(a, N) & ! N is default integer, a is integer(kind=8)
!$omp private(i, j)
do i = 1, N
   if (i .gt. 10) then
      a = 5_8 ! Note the write to a shared variable
   else
      j = a ! Note the read from a shared variable
      !...
      ! Do something with j
      !...
      a = 10_8 ! Note the write to a shared variable
   end if
end do
!$omp end parallel do

My question is this: since the variable "a" is shared, one obviously can't guarantee what its value will be at any particular point. But can we (e.g., by the OpenMP standard, or by the compiler spec) guarantee that a read and a write of the exact same memory location will never overlap in time, such that half of the bytes (of this 8-byte variable) come from one value (e.g., "5_8") and the other half from another value (e.g., "10_8"), because the variable is being written and read at exactly the same instant?

Would the answer change if "a" were a multi-element array, allocatable or not? (Assuming we were writing to specific elements.)

Even if this is possible, the probability is of course exceedingly low. Still, maybe due to paranoia, I tried my best to study both OpenMP 2.5 and OpenMP 4.0 standards, but unfortunately I was unable to get a definite confirmation. There are many texts online, including in Intel's materials for developers, that warn against unprotected reads and writes of shared space, because this introduces race conditions... But, assuming I am OK with a race condition (i.e., being uncertain about how up-to-date a given thread's version of a shared memory element is), is there still this word-sharing issue to worry about, as described above?

I am primarily interested in the answer as ICC15 specifies it, but if OpenMP itself actually does say something definitive about this, this would be even better.

Thank you very much in advance!

edit: Just to clarify, I recognize that in many cases threads will own personal cached copies of "a", and so the writes will not actually occur to the same memory as the reads. However, by the OpenMP standard this caching of shared variables is not required, so I wanted to make sure that in the event it doesn't happen, the issue described above still does not occur.

13 Replies
jimdempseyatthecove
Honored Contributor III

The answer to this is "depends".

In order for the read or write to occur atomically, the data read or written must be transferred in a single low-level operation. For a REAL(8) variable this generally means the address must be on a natural boundary (a multiple of 8 bytes in this case). There is an exception to this, which will probably not occur in normal programming, and that is when the memory is atypical for ordinary programs: for example, if the location were in the I/O address space of a device that has an 8-bit wide data port. In that case there would be 8 low-level operations to get the data out of or into the 8-byte location. *** This assumes you are on a processor, such as Intel's, that supports an 8-byte ordered write and read. If your code is to be ported to an ARM or another processor type, it may not support an 8-byte ordered write and read.

This said, assuming that is not the case, an additional caveat is that "a" may be cached in a register (as you pointed out in your edit). Therefore, if "a" is located at an address that is a multiple of 8, then whenever "a" actually is written or read, it will be written or read as an 8-byte unit. Note that if "a" is attributed VOLATILE, all references should poke or peek at RAM. You can also use the OpenMP FLUSH directive and/or !$OMP ATOMIC if it becomes important for other threads to be notified of the change (e.g., you have a "Done" flag in a parallel convergence routine); a sketch of that case follows below.
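
To make the "Done" flag case concrete, here is a minimal sketch (my own illustration, using the ATOMIC READ/WRITE forms available since OpenMP 3.1): the atomic accesses both rule out torn values and imply a flush of the flag location, so the change becomes visible to the polling threads.

program done_flag_sketch
   use omp_lib
   implicit none
   logical :: done, d

   done = .false.
   !$omp parallel shared(done) private(d)
   if (omp_get_thread_num() == 0) then
      ! ... decide that the computation has converged ...
      !$omp atomic write
      done = .true.
   else
      do
         !$omp atomic read
         d = done             ! untorn, and flushed by the atomic
         if (d) exit
         ! ... keep working until the flag is observed ...
      end do
   end if
   !$omp end parallel
end program done_flag_sketch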

Jim Dempsey

TimP
Honored Contributor III

In this specific case, it doesn't appear that enough bits are changed for a partial write to make a difference.

OpenMP has nothing to say about what may happen once you create a situation that doesn't have predictable results according to OpenMP.

The hardware cache protocol (including fill buffering) will stall updates from a thread until cache evictions associated with other threads have completed, so a non-atomic write would most likely occur when you have misaligned data straddling a cache line boundary -- a situation you would normally try to avoid, and one that should not happen by accident under the usual OSes other than 32-bit Windows.

Eugeny_Sosnovsky
Beginner

Thank you both. This raised a couple additional questions though. For the purposes below, assume that the variable in question is not cached by the thread (let's say it's VOLATILE).

Question 1.

jimdempseyatthecove wrote:

In order for the read or write to occur atomically the data read or written must be performed in a single low-level operation. For a REAL(8) variable this generally means the address must be on a natural boundary (multiple of 8 bytes in this case).

Is this what the compiler documentation refers to as "data alignment"? I.e., is it correct to say that if the data port is at least 8 bytes wide, then "if the data is aligned, all reads and writes will be atomic"? (I do not mean !$OMP ATOMIC here; I just mean that the data won't get partially overwritten as it's being read.) And if that is the case, then if the compiler doesn't throw an "unaligned data" warning, are we guaranteed that all variables will be atomically written and read?

Question 2.

Let's say it's either an Intel64 or AMD64 machine (so, not ARM or anything like that), and, once again, the data port is wider than 8 bytes. Does this mean that if a REAL(4) (note - not 8!) variable is aligned (i.e., its memory address is a multiple of 4 - note, not 8! bytes), then necessarily it will be written to and read from atomically? Even though its address is not necessarily a multiple of 8? I ask this because, quoting from the compiler documentation:

Align 32-bit data so that its base address is a multiple of four.

Question 3.

Again, quoting the compiler documentation:

Dynamically allocated data allocated with ALLOCATE is 8-byte aligned.

Does this statement refer only to the beginning of the array, or to every individual element? I.e., let's say we have a dynamically allocated (allocatable) array consisting of several INTEGER(4) entries (or REAL(4)). Once again assuming that the data port is wider than 8 bytes, and that this is an Intel64/AMD64 machine, can we guarantee that every individual element of this array will be written to and read from atomically? Note that here, assuming the data is sequential in memory, only every other element's address will be a multiple of 8.

Question 4.

Does !$OMP ATOMIC [UPDATE, in more recent versions of OpenMP], as it applies to variable "a" (for example), guarantee exclusive read/write of this variable, even if this same variable is accessed outside of an !$OMP ATOMIC region elsewhere? I.e., consider the following Fortran code (again, assume the shared variables are not cached):

a = 0
!$omp parallel do &
!$omp shared(a, N) &
!$omp private(i, t)
do i = 1, N
   if (i > 1000) then
      !$omp atomic
      a = a + 5
   else
      t = a
      ! ... do something with t here, without modifying a
   end if
end do
!$omp end parallel do

Once again recognizing that the result is nondeterministic here: can we guarantee that at no point will the read from "a" occur simultaneously with the atomic update of it? Note that the read itself is not !$OMP ATOMIC, while the update is.

Is this once again dependent on whether "a" is aligned? I.e., the guarantee only present if "a" is aligned?

What if "a" was not aligned? (Let's say it's in a poorly designed COMMON block). Would this guarantee also be present? Or would it then necessarily require the !$OMP ATOMIC READ before "t = a" line to ensure that the read is atomic?

Question 5.

This one's small, and hopefully easy to answer: does !$OMP ATOMIC guarantee that the cached value (if present) is updated first (i.e., via an implied FLUSH(a))? I am fairly certain that it does, but I wanted to make 100% sure.

Thank you once again in advance!

McCalpinJohn
Honored Contributor III

The OpenMP standards have always been clear on the initial issue.....

Any memory location that is written in a parallel section cannot be read by any OpenMP thread in the same parallel region (though of course it can be read by the thread that is doing the writing).   A program that violates this constraint is "non-conforming" and a conforming implementation is allowed to exhibit "unspecified behavior".  Note that the specification does not say that the result must be consistent with some ordering of the underlying store operations -- the standard places no constraints on how a non-conforming program is executed or what behavior may be displayed.

In the real world, implementations are likely to simply generate code that has a race condition.  Whether the incorrect answer will become visible atomically depends on the details of the system and, in particular, on the alignment of the variable.  While 8 Byte writes are often guaranteed to become visible atomically, there are certainly cases in which this guarantee disappears.  Two common examples would be (1) non-temporal writes crossing a cache line boundary, and (2) any kind of write that crosses a 4KiB page boundary.

Intel is very reluctant to guarantee much about atomicity of memory transactions.  The information that is available is in Section 8.1.1 "Guaranteed Atomic Operations", in Volume 3 of the Intel Architecture Software Developer's Manual (document 325384, revision 057).  

Note that an OpenMP compiler's behavior is only guaranteed for conforming code.  If you want to play with low-level shared-memory synchronization, you might be better off using pthreads or System V shared segments, or variables shared between a parent and child process.

Eugeny_Sosnovsky
Beginner

John McCalpin wrote:

The OpenMP standards have always been clear on the initial issue.....

Any memory location that is written in a parallel section cannot be read by any OpenMP thread in the same parallel region (though of course it can be read by the thread that is doing the writing).   A program that violates this constraint is "non-conforming" and a conforming implementation is allowed to exhibit "unspecified behavior".  Note that the specification does not say that the result must be consistent with some ordering of the underlying store operations -- the standard places no constraints on how a non-conforming program is executed or what behavior may be displayed.

Is this entirely true though? Consider OpenMP 4.0 spec, section 1.4.3:

The flush operation provides a guarantee of consistency between a thread’s temporary
view and memory. Therefore, the flush operation can be used to guarantee that a value
written to a variable by one thread may be read by a second thread. To accomplish this,
the programmer must ensure that the second thread has not written to the variable since
its last flush of the variable, and that the following sequence of events happens in the
specified order:
1. The value is written to the variable by the first thread.
2. The variable is flushed by the first thread.
3. The variable is flushed by the second thread.
4. The value is read from the variable by the second thread.

It seems that here, even though the first thread writes and the second thread reads, the program conforms to the standard? Or does "flush" change something about the restriction you stated?
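
For what it's worth, here is how that four-step sequence might look in code (my own sketch, not from the spec; note that the ordering between steps 2 and 3 still has to be enforced by some synchronization, here an atomic "ready" flag):

integer(kind=8) :: a
logical :: ready, r

a = 0_8
ready = .false.
!$omp parallel sections shared(a, ready) private(r)
!$omp section
   a = 5_8                ! step 1: the first thread writes the value
   !$omp flush(a)         ! step 2: the first thread flushes
   !$omp atomic write
   ready = .true.         ! signal that steps 1-2 are done
!$omp section
   do                     ! wait for the writer's signal
      !$omp atomic read
      r = ready
      if (r) exit
   end do
   !$omp flush(a)         ! step 3: the second thread flushes
   print *, a             ! step 4: the second thread reads; it must see 5_8
!$omp end parallel sections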

jimdempseyatthecove
Honored Contributor III

RE: Question 1

Data alignment (to a natural boundary) means that for a 2-byte variable the low address of the variable is a multiple of 2; for a 4-byte variable, a multiple of 4; for an 8-byte variable, a multiple of 8. For a 3-byte, a multiple of 4; for a 5-byte, a multiple of 8. IOW, the natural boundary is the smallest power of 2 that is .GE. the size of the variable. Cache line sizes are powers of 2 (currently 64), and the page size is also a power of 2. Memory controllers read and write in cache line groupings -- usually one cache line, but under some circumstances multiple lines may be read/written in one shot. With the newer TSX/RTM-capable processors, many cache lines may be written atomically. Therefore, on Intel IA-32 and Intel64 processors (AMD64 too), as long as the data does not span a cache line, reads and writes (to RAM) are assured atomic. Read/modify/write is not assured atomic except when protected with LOCK (when possible), a TSX/RTM-protected region (when available), or else a mutex.
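
As a concrete illustration of data that violates its natural boundary (the "poorly designed COMMON block" case raised earlier in the thread): COMMON storage is laid out in declaration order, so unless the compiler is asked to pad the block (ifort has alignment switches such as -align dcommons for this), the 8-byte variable below starts at offset 4 -- off its natural boundary, and potentially straddling a cache line.

integer(kind=4) :: pad
integer(kind=8) :: a
common /badblk/ pad, a   ! "a" lands at offset 4 of the block: not 8-byte aligned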

Please take note of my caveat that some older, and some newer embedded, processors may have registers (and optionally cache lines) that are wider than the memory bus. Some of these do not have an implicit single-transaction read/write of "naturally aligned" variables; they would only support memory-bus-wide transactions. So if you are targeting a Pentium 1 or an ARM multi-processor system, you may have potential issues.

Q2

A 4-byte read or write (but not read/modify/write) of 4-byte aligned data on Intel64 and AMD64 is assured to be atomic. You may find that any alignment within a cache line also works, but that is not assured per specification.

Q3

>>Dynamically allocated data allocated with ALLOCATE is 8-byte aligned

The alignment is a function of the C runtime library heap manager. You could replace this heap manager with one of your own and thus return any alignment. The alignment returned is ordinarily assured to be a multiple of the size of a pointer: on Intel64/AMD64 this would be a multiple of 8 bytes, though you may experience 16-byte alignment; on IA-32 it is assured to be 4 bytes, though you typically experience 8 or 16. Note too that running the debug heap, or running an external heap checker (e.g., Valgrind), may affect the alignment of allocations (while still not violating the multiple-of-pointer-size rule).
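
As a quick illustration of the element-alignment point from Question 3 (a sketch assuming the non-standard but widely supported LOC extension): ALLOCATE aligns the base of the array, but consecutive REAL(4) elements sit on 8-byte boundaries only every other element.

real(kind=4), allocatable :: v(:)
integer(kind=8) :: addr
integer :: i

allocate(v(4))
do i = 1, 4
   addr = loc(v(i))            ! LOC is an extension; ifort and gfortran support it
   print *, i, mod(addr, 8_8)  ! prints 0, 4, 0, 4 when the base is 8-byte aligned
end do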

Q4

If your program has multiple parallel regions running concurrently, and one of them updates the variable only from inside an !$OMP MASTER section, then, because a different parallel region may be running and also updating the same variable, that master-section update will need an atomic directive too.

*** you were missing the !$OMP END ATOMIC

The atomic section is equivalent to a critical section, with the exception that the single statement can optionally be implemented with a LOCK-prefixed instruction or a CAS (compare-and-swap) operation.
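
For illustration, the two forms side by side (my sketch; "a" is the shared integer from the earlier example):

! serializes every thread through a runtime lock:
!$omp critical
a = a + 5
!$omp end critical

! permits the compiler to emit a single LOCK-prefixed
! add or a compare-and-swap loop instead:
!$omp atomic
a = a + 5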

Jim Dempsey

McCalpinJohn
Honored Contributor III

Eugeny raises a good point -- my description of OpenMP restrictions on reading & writing an address within a parallel section was assuming that the parallel section did not contain explicit ATOMIC, CRITICAL, or ORDERED directives, or explicit FLUSH operations.  The original code example does not use any of these.

I would not recommend mixing PARALLEL DO with manually constructed low-level synchronization using FLUSH operations. It might be possible, but it seems risky.   The OpenMP 4.0.0 standard notes (section 1.4.4):

Note – Since flush operations by themselves cannot prevent data races, explicit flush operations are only useful in combination with non-sequentially consistent atomic directives.

PARALLEL DO works fine with ATOMIC, CRITICAL, and ORDERED sections, though CRITICAL and ORDERED can easily undo any performance benefits from the parallelization.  An ATOMIC section will attempt to use hardware mechanisms with lower overhead than CRITICAL, assuming that they are available.  

If you use an ATOMIC directive for the updates, the compiler has the opportunity to generate code that guarantees that the update will become visible atomically to any other thread -- even in cases where the hardware does not provide such guarantees.    The OpenMP 4.0.0 standard summarizes the issue clearly:

[Excerpt from Section 1.4.1] A single access to a variable may be implemented with multiple load or store instructions, and hence is not guaranteed to be atomic with respect to other accesses to the same variable. Accesses to variables smaller than the implementation defined minimum size or to C or C++ bit-fields may be implemented by reading, modifying, and rewriting a larger unit of memory, and may thus interfere with updates of variables or fields in the same unit of memory.

If multiple threads write without synchronization to the same memory unit, including cases due to atomicity considerations as described above, then a data race occurs. Similarly, if at least one thread reads from a memory unit and at least one thread writes without synchronization to that same memory unit, including cases due to atomicity considerations as described above, then a data race occurs. If a data race occurs then the result of the program is unspecified.

The key is the clause "without synchronization" in the second paragraph.  Later discussion of the meaning of the ATOMIC construct (section 2.12.6) indicates that using it should be sufficient to ensure that any lack of atomicity in the hardware will be hidden (i.e., worked around) from the application.
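
Concretely, applying this to the Question 4 loop from earlier in the thread (a sketch of mine using the OpenMP 3.1+ ATOMIC READ/UPDATE forms, not something prescribed by the standard): making the read atomic as well removes the data race entirely, so no torn value can be observed regardless of alignment or hardware.

do i = 1, N
   if (i > 1000) then
      !$omp atomic update
      a = a + 5
   else
      !$omp atomic read
      t = a    ! the read is now synchronized; no torn value is possible
      ! ... do something with t here, without modifying a
   end if
end do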

jimdempseyatthecove
Honored Contributor III

It should also be noted that a data race is not necessarily a flaw in the program. IOW, a profiler may report a data race for different elements being manipulated within a shared cache line. This is not necessarily an error, but it may be a performance issue to address when optimizing the program. A profiler may likewise flag a "LOCK; inc dword ptr [rax]" as having a race condition even though it is fully atomic and safe.

Jim Dempsey

Eugeny_Sosnovsky
Beginner

Thank you both. Jim, John, both your responses raised some questions, and while I think I am getting closer to what I need, I want to be sure, so I will ask them separately.

Jim:

1. I guess the most obvious natural question is then this: how does one ensure that data (specifically, a single variable, or a single element of an allocatable array) does NOT span a cache line boundary? Is this guaranteed to be true as long as the compiler doesn't throw unaligned-data warnings?

2. You explicitly said that "4 byte read or write of 4-byte aligned data <...> is assured to be atomic".
* Is this true per specification, or just a consequence of the fact that single cache line reads and writes are necessarily atomic?
* I take that to also mean that 1/2/8 byte read and write of 1/2/8-byte aligned data is also assured to be atomic, per specification?

3. Regarding your answer to my Q4: does this mean that as long as there is only one parallel region (in which there are multiple threads), and no nested parallelism anywhere in the dynamic extent, and the updates to the shared variable are atomic, then the reads from this shared variable are safe, even if unprotected via !$OMP CRITICAL or anything similar? I guess I should have specified in the original question that I was not considering multiple parallel regions running simultaneously.

Eugeny_Sosnovsky
Beginner

John:

Maybe a more detailed code example will make my situation clearer. Consider the following Fortran code, the idea of which is to tag certain elements of A as .true. (here based on a single array B; in the real code, based on some relatively complex geometric calculations):

allocate( A(1000) )
A = .false.

if_over = .false.

!$omp parallel do &
!$omp shared(A) &
!$omp shared(if_over) &
!$omp private(i,j,k) &
!$omp shared(B)
!...
! B is an integer array, with values ranging from 1 to 995, chosen RANDOMLY
!...
do i = 1, 1000
   if (.not. if_over) then ! Line L0
      j = B(i)
      k = j + 5
      
      !$omp flush(A)     ! Line L1
      A(j:k) = .true.    ! Line L2
      !$omp flush(A)     ! Line L3
   
      if (all(A)) then   ! Line L4
         !$omp atomic
         if_over = if_over .or. .true.
         !$omp end atomic ! Not required in some OpenMP versions
      else
!         !$omp flush(if_over) ! Line L5
      end if
   end if
end do
!$omp end parallel do

Now, in light of what we have discussed, I have the following questions:

1. if_over is updated atomically, but it is not read atomically on Line L0. Still, it IS potentially written by one thread and read by all. Is this non-conforming? If so, would uncommenting Line L5 make the program conforming?

2. Assume that there are cached versions of A for all threads. Consider this scenario: for thread 0, j = 1, k = 6, and for thread 1, j = 2, k = 7. Both threads flush (Line L1), and at this point all elements of A are .false. Then Line L2 happens for both threads, and now their cached versions differ. Then thread 0 flushes (Line L3), then thread 1. What is the state of the main-memory A immediately after this moment, assuming there were no other threads? Are elements 1..7 .true. and the rest .false.? Only 2..7? Is it undefined because the program is non-conforming, despite the synchronization?

3. Is anything else about Lines L1..L3 potentially non-conforming? If so, is making L2 a loop with elementwise !$OMP ATOMIC updates, or surrounding L2 with !$OMP CRITICAL, the only way to make the program conforming? And, in fact, would surrounding Line L2 with !$OMP CRITICAL even make this program conforming, considering that the read is still unprotected (although immediately preceded by synchronization)?

Thank you in advance.

McCalpinJohn
Honored Contributor III

Concerning alignment and atomicity, the only authoritative documentation is Section 8.1.1 "Guaranteed Atomic Operations" in Volume 3 of the Intel Architectures SW Developer's Manual (document 325384).

There are lots of ways to request alignment, but the only way to be sure of alignment is to have a pointer to the variable and examine it.  This is usually done by checking that (((size_t)&object) % sizeof(object)) is zero.   Then you need to double-check Section 8.1.1 of Volume 3 of the SW Developer's Manual to make sure that the size you are using is one of the ones that guarantees atomic read and/or write behavior on the platform you are using.

It should be easy (trivial?) to guarantee 32-bit alignment on a 32-bit variable.  Reading and writing 32-bit values on 32-bit boundaries has been guaranteed to be atomic since the 486 processor.  Once you have a single 32-bit shared variable on a 32-bit boundary, you don't need to worry about atomicity any more.

If your programming language insists on making the logical variable a 64-bit value, you are still in luck with Intel processors.  Reading and writing 64-bit values on 64-bit boundaries has been guaranteed to be atomic since the Pentium processor.  All you have to do is check the address of the variable, make sure that (address%8) is zero, and you won't need to worry about this any more.
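
For Fortran code, a rough analogue of that C check (a sketch of mine using the standard ISO_C_BINDING facilities rather than the non-standard LOC extension):

use, intrinsic :: iso_c_binding
integer(c_int64_t), target :: a      ! the 64-bit shared variable
integer(c_intptr_t)        :: addr

addr = transfer(c_loc(a), addr)      ! the address as an integer
if (mod(addr, int(8, c_intptr_t)) == 0) then
   print *, '8-byte loads/stores of "a" are atomic on Intel64'
end if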

The code could easily break if you try to port it to another platform, but that is a topic for another day.

jimdempseyatthecove
Honored Contributor III

>>checking that (((size_t)&object) % sizeof(object)) is zero

Valid for short, int, float, or double, or for __m64, __m128, __m256, __m512,
but not valid for an object (say, a struct) whose size is not a power of 2.

Jim Dempsey

McCalpinJohn
Honored Contributor III

I am still a bit confused about the motivation behind all of this.

The OpenMP standards that I have looked at have been frustrating, but I think I am finally starting to understand their terminology.  I have been trying to find unambiguous language that says it is possible to write "conforming" code that has non-deterministic behavior due to data races.  I finally found this in Section 1.1 of version 4.5 of the OpenMP standard, which says:

OpenMP-compliant implementations are not required to check for data dependencies, data conflicts, race conditions, or deadlocks, any of which may occur in conforming programs.

I find it unfortunate that the specification uses the term "unspecified" to refer to the behavior of code with data races.  This terminology is frustratingly unhelpful because it does not distinguish between the unbounded meaning of "unspecified" (which may include catching fire, singing "Happy Birthday to You", or other arbitrary behavior) and the bounded meaning of "unspecified" that is more commonly used in the context of data races -- meaning that you will get an answer that corresponds to one of the possible orderings of the execution of the threads, but that you will not be able to predict which answer you get, and the result is free to vary from one execution of the code to the next.

The language from Section 1.1 (above) makes it clear that a conforming program can have data races, so I will assume that "unspecified" means "not deterministic" in the context of data races.  But that is my interpretation, and it may not be shared by compiler writers.....

Concerning Question 5 in post 4 above: The OpenMP 4.5 standard does clearly state that ATOMIC operations are preceded and followed by implicit FLUSH operations.
