Assume image 1 executes the statement: CALL ATOMIC_DEFINE(VAR,VAR_VALUE).
Is VAR immediately updated on image 2 and ready to use by image 2 (in expressions, etc.), or should image 2 always CALL ATOMIC_REF(VAR_VALUE,VAR) to retrieve the updated value (and use VAR_VALUE in expressions)?
No, the call to ATOMIC_REF is not necessary. Note, though, that it may take a while for VAR to take on the new value, even if it is repeatedly referenced without crossing a segment boundary. The standard says:
"Atomic operations shall make asynchronous progress. If a variable X on image P is defined by an atomic subroutine on image Q, image R repeatedly references X by an atomic subroutine in an unordered segment, and no other image defines X in an unordered segment, image R shall eventually receive the value assigned by image Q, even if none of the images P, Q, or R execute an image control statement until after the definition of X by image Q and the reception of that value by image R."
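For illustration, here is a minimal sketch of the situation this rule describes (program and variable names are my own, not from the standard): one image atomically defines an atom residing on another image, which spins on it with atomic_ref and, per the asynchronous-progress rule, eventually sees the new value without any intervening image control statement. It needs at least two images.

```fortran
! Minimal illustration of asynchronous progress; run with >= 2 images.
! Image 1 plays the role of image Q (the definer); image 2 plays P and R.
program async_progress
  use iso_fortran_env, only: atomic_int_kind
  implicit none
  integer(atomic_int_kind), save :: flag[*] = 0
  integer(atomic_int_kind) :: val

  if (this_image() == 1) then
     ! Q defines X (flag) on P (= image 2) atomically.
     call atomic_define(flag[2], 1)
  else if (this_image() == 2) then
     ! R repeatedly references X atomically in an unordered segment;
     ! asynchronous progress guarantees it eventually sees the new value,
     ! with no image control statement executed in between.
     val = 0
     do while (val == 0)
        call atomic_ref(val, flag)
     end do
  end if
  sync all
end program async_progress
```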
Mmmm. That's not what F2018 11.6.2 says...
"A coarray may be referenced or defined by execution of an atomic subroutine during the execution of a segment that is unordered relative to the execution of a segment in which the coarray is referenced or defined by execution of an atomic subroutine. ... Otherwise, ... if a variable is defined or becomes undefined on an image in a segment, it shall not be referenced, defined, or become undefined in a segment on another image unless the segments are ordered..."
That first sentence in 11.6.2p3 you quote is, I think, nonsense. I think it should be "relative to the execution of another segment..." But ultimately it is saying the same thing. I asked another committee member who lives this stuff every day, and he said: "It means that atomic operations do not have to follow the normal segment ordering rules. Of course, if the two atomic operations are on data on the same image, there is no problem anyway. The case where it matters is when two or more images are involved - i.e. at least one of the atomic operations is on a remote image. The “poster child” use case is for a counter to be on one image and various other images increment the counter with atomic_fetch_add operations. None of the participating segments require ordering with respect to each other. This is a core part of the parallel integer sort algorithm."
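As a sketch of that "poster child" use case (all names here are illustrative, not code from the committee member): a counter lives on image 1, and every image increments it with atomic_fetch_add; none of the incrementing segments need ordering with respect to each other.

```fortran
! Shared counter held on image 1, incremented atomically by all images.
program shared_counter
  use iso_fortran_env, only: atomic_int_kind
  implicit none
  integer(atomic_int_kind), save :: counter[*]
  integer(atomic_int_kind) :: old, total

  counter = 0
  sync all                        ! make the initial value visible everywhere

  ! Every image increments the counter on image 1; these atomic operations
  ! need no segment ordering with respect to each other.
  call atomic_fetch_add(counter[1], 1, old)

  sync all
  if (this_image() == 1) then
     call atomic_ref(total, counter)
     print *, 'final count =', total   ! equals num_images()
  end if
end program shared_counter
```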
For clarity, my interpretation of the quoted standard fragment is that atomic operations need to be used on both sides (reference and definition) of a coarray operation between different images for them to be exempt from the normal rules.
If that's not the intent of that first sentence, then it needs to be changed.
Very interesting! I just tried this out with my own code (which implements a customized synchronization using atomic_define / atomic_ref) and can confirm that OpenCoarrays/gfortran does not require the use of atomic_ref for local access to atomic variables (that were remotely defined through atomic_define):
For example, I can use
intSyncValue = Object_CA % mA_atomic_intImageActivityFlag99(intArrIndex)
instead of
call atomic_ref(intSyncValue, Object_CA % mA_atomic_intImageActivityFlag99(intArrIndex), STAT = intAtomicRefStat)
in my own code. The customized synchronization still works without the use of atomic_ref. (Personally, I would still prefer to use atomic_ref.)
Modern Fortran Explained gives a simple example of a spin-wait loop synchronization using atomic_define and atomic_ref. To my understanding, it is exactly this customized synchronization itself that imposes (short-lived) UNORDERED execution segments, with the data transfer through atomic_define happening within the spin-wait loop. This is because the required SYNC MEMORY statement on the receiving image must be executed after (successful) local access through atomic_ref (or access without atomic_ref, as shown here). Execution of the SYNC MEMORY statement(s) is required to maintain segment ordering for the main data transfers (outside the customized synchronization). Therefore, unordered execution segments are a (short-lived) requirement of the spin-wait loop synchronization itself.
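A minimal sketch of such a spin-wait synchronization (my own illustrative names, loosely following the pattern described above, not the book's exact code; needs at least two images): the producer completes the main data transfer, executes SYNC MEMORY, and raises a flag atomically; the consumer spins on the flag and then executes SYNC MEMORY before using the data.

```fortran
! Spin-wait synchronization sketch; run with >= 2 images.
program spin_wait
  use iso_fortran_env, only: atomic_int_kind
  implicit none
  integer(atomic_int_kind), save :: locked[*] = 1
  integer(atomic_int_kind) :: val
  real, save :: buf(100)[*]

  if (this_image() == 1) then
     buf(:)[2] = 3.14           ! the main (non-atomic) data transfer
     sync memory                ! complete the transfer before raising the flag
     call atomic_define(locked[2], 0)
  else if (this_image() == 2) then
     val = 1
     do while (val == 1)
        call atomic_ref(val, locked)   ! unordered segment while spinning
     end do
     sync memory                ! restore segment ordering before using buf
     print *, buf(1)            ! now safe to reference the transferred data
  end if
end program spin_wait
```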
The above is for synchronizing (additional) data transfers. The situation is different if we want to synchronize without any additional data transfer, i.e. with no transfer other than that of a single call to an atomic subroutine (using a simple trick, basically integer arithmetic, to store and transfer more than a single value within a scalar integer). Here too, the atomic data transfer imposes (short-lived) unordered execution segments for as long as it lasts. But since no further data is transferred outside the customized synchronization itself, it may not be required to execute a SYNC MEMORY statement at all, neither on the sending image(s) nor on the receiving image(s). (That is what I currently believe; otherwise, executing SYNC MEMORY would not impose any limitation.)
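The integer-arithmetic trick mentioned above might look like this (a hypothetical encoding; the base and field names are my own assumptions, not the poster's code): two small non-negative values are packed into one scalar integer, so a single ATOMIC_DEFINE transfers both.

```fortran
! Pack two small values into one scalar integer for a single atomic transfer.
program pack_demo
  implicit none
  integer, parameter :: base = 1000   ! both fields assumed non-negative and < base
  integer :: packed, img, stat

  img  = 7
  stat = 42
  packed = img * base + stat          ! encode two values in one integer
  ! a single CALL ATOMIC_DEFINE(flag[p], packed) would now transfer both

  print *, packed / base, mod(packed, base)   ! recovers 7 and 42
end program pack_demo
```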
Practical use-cases for (customized) synchronization without any additional data transfers may arise from synchronizing the code execution itself: implementing a timer with a (customized) synchronization may make it possible to detect and handle any kind of error, failure, or problem with the run-time execution of a parallel algorithm. Another use-case is to employ such a customized synchronization to control the run-time execution of a parallel algorithm itself (i.e. making the customized synchronization an integrated part of the parallel algorithm).
Think of ATOMIC_DEFINE as "send atom to image" and ATOMIC_REF as "fetch atom from image".
Producer                    Consumer
atomic_define(..)           atomic_ref(..)             ! useful when rank is neither producer nor consumer
atomic_define(..)           local (direct) reference   ! producer sending to remote rank
local (direct) reference    atomic_ref(..)             ! consumer receiving from remote rank
Just for completion:
The UPC++ Programmer’s Guide (v2018.3.0), https://escholarship.org/content/qt10g5t8jr/qt10g5t8jr.pdf , has an atomic code example on page 15 containing the comment:
// once a memory location is accessed with atomics, it should only be
// subsequently accessed using atomics to prevent unexpected results
I can't tell whether this applies to Coarray Fortran as well, but I'd feel safer using atomic subroutines in my codes.
I feel that this topic still needs clarification. In particular, I am interested in the latency associated with executing, on image N, a call to ATOMIC_REF that reads the local value of ATOM into VALUE.
Since this call is executed on image N, is VALUE updated "instantaneously"? The use case would be an error-monitoring routine that checks the local value of ATOM in nearly all the procedures of the code (and this routine would be called millions of times, if not more).
It involves calls to the RTL to lock and unlock and another RTL call, so no, not "instantaneous". Keep in mind that the code has no idea that it is on image N until it gets into the support library.
Is this value set only on the same image, or can it be set by other images? If the same image only, there are interlocked access Windows API routines defined in KERNEL32.
Thanks Steve - so it would probably be best, in my case, to use (on image N) the local value of ATOM directly and skip the call to ATOMIC_REF (knowing that, ultimately, ATOM would be updated at some point if ATOMIC_DEFINE'd by another image).
But in this scenario, how is the risk of a race condition between a use of ATOM (the local value on image N) and a simultaneous update of its value (through an atomic update) mitigated?
Keep in mind that unless you use the ATOMIC_xxx routines, a SYNC xxx statement or an image control statement, you're not guaranteed that a change to ATOM from another image will be reflected in your local copy. So you might spin forever waiting for ATOM to be updated and never see it, if you're not doing other coarray stuff. I can pretty much guarantee that in Intel's implementation you won't see it update in this case.
How much of an "overrun" of ATOM being changed are you willing to accept? You might add code to do a SYNC MEMORY every 1000 tests, or whatever. I don't know what your application is doing here that it would test for this so often - that's not a good use of coarrays.
I was thinking along these lines: implementing a timer so that the call to ATOMIC_REF does not occur constantly, but only every nnn seconds (so it's a variation of your idea of having a counter). This is of course not the main reason coarrays are used in this code (the code is an HPC-based, large-scale simulation tool).
It's the complexity of the tool, in fact, that drives the need for an elegant error-handling mechanism (so as to provide information as detailed as possible should an error condition occur, at any time in the code and on any image).
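A hedged sketch of such a timer-gated check (all names and the interval handling here are my own assumptions, not code from this thread): the hot path normally costs one SYSTEM_CLOCK call, and ATOMIC_REF runs only once the interval has elapsed. Each image keeps its own last-check time via SAVE.

```fortran
! Timer-gated error check: call this from anywhere; it only pays for
! ATOMIC_REF when at least `interval` seconds have passed since the
! last real check on this image.
subroutine check_error_flag(error_flag, interval, error_seen)
  use iso_fortran_env, only: atomic_int_kind, int64
  implicit none
  integer(atomic_int_kind), intent(in) :: error_flag[*]  ! set remotely via ATOMIC_DEFINE
  real,    intent(in)  :: interval                       ! seconds between real checks
  logical, intent(out) :: error_seen

  integer(int64), save :: last_check = -1_int64
  integer(int64) :: now, rate
  integer(atomic_int_kind) :: val

  error_seen = .false.
  call system_clock(now, rate)
  if (last_check >= 0 .and. real(now - last_check) / real(rate) < interval) return
  last_check = now
  call atomic_ref(val, error_flag)   ! read the local copy atomically
  error_seen = (val /= 0)
end subroutine check_error_flag
```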