Intel® Fortran Compiler

ATOMIC_DEFINE and ATOMIC_REF

OP1
New Contributor II

Assume image 1 executes the statement: CALL ATOMIC_DEFINE(VAR[2],VAR_VALUE).

Is VAR immediately updated on image 2 and ready to use by image 2 (in expressions, etc.), or should image 2 always CALL ATOMIC_REF(VAR_VALUE,VAR[2]) to retrieve the updated value (and use VAR_VALUE in expressions)?

13 Replies
Steve_Lionel
Honored Contributor III

No, the call to ATOMIC_REF is not necessary. It may take a while for VAR to take on the new value if it is repeatedly referenced, even without crossing a segment boundary. The standard says:

"Atomic operations shall make asynchronous progress. If a variable X on image P is defined by an atomic subroutine on image Q, image R repeatedly references X by an atomic subroutine in an unordered segment, and no other image defines X in an unordered segment, image R shall eventually receive the value assigned by image Q, even if none of the images P, Q, or R execute an image control statement until after the definition of X by image Q and the reception of that value by image R."

IanH
Honored Contributor II

Mmmm.  That's not what F2018 11.6.2 says...

"A coarray may be referenced or defined by execution of an atomic subroutine during the execution of a segment that is unordered relative to the execution of a segment in which the coarray is referenced or defined by execution of an atomic subroutine.  ... Otherwise, ... if a variable is defined or becomes undefined on an image in a segment, it shall not be referenced, defined, or become undefined in a segment on another image unless the segments are ordered..."

Steve_Lionel
Honored Contributor III

That first sentence in 11.6.2p3 you quote is, I think, nonsense. I think it should be "relative to the execution of another segment..." But ultimately it is saying the same thing. I asked another committee member who lives this stuff every day, and he said: "It means that atomic operations do not have to follow the normal segment ordering rules. Of course, if the two atomic operations are on data on the same image, there is no problem anyway. The case where it matters is when two or more images are involved - i.e. at least one of the atomic operations is on a remote image. The “poster child” use case is for a counter to be on one image and various other images increment the counter with atomic_fetch_add operations. None of the participating segments require ordering with respect to each other. This is a core part of the parallel integer sort algorithm."
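
For illustration, the shared-counter use case described in that quote might look roughly like the sketch below. This is not code from the committee member or from the standard; the program and variable names (atomic_counter_sketch, counter, old) are made up for the example, and it assumes a compiler that supports the Fortran 2018 ATOMIC_FETCH_ADD intrinsic.

```fortran
program atomic_counter_sketch
  use, intrinsic :: iso_fortran_env, only: atomic_int_kind
  implicit none

  ! Hypothetical counter; only the copy on image 1 is actually used.
  integer(atomic_int_kind) :: counter[*]
  integer(atomic_int_kind) :: old

  counter = 0
  sync all   ! order the plain initialization before any atomic update

  ! Every image increments the counter on image 1. These calls occur in
  ! segments that are unordered with respect to each other, which is
  ! permitted because they are atomic operations.
  call atomic_fetch_add(counter[1], 1, old)

  sync all   ! order all increments before the plain read below

  if (this_image() == 1) then
    if (counter == num_images()) then
      print *, "counter matches the number of images"
    else
      print *, "unexpected counter value"
    end if
  end if
end program atomic_counter_sketch
```

The value returned in old is the counter value just before this image's increment, so each image receives a distinct number in the range 0 to num_images()-1.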

IanH
Honored Contributor II

For clarity, my interpretation of the quoted standard fragment is that atomic operations need to be used on both sides (reference and definition) of a coarray operation between different images for them to be exempt from the normal rules.

If that's not the intent of that first sentence, then it needs to be changed.

Steve_Lionel
Honored Contributor III

That's not the intent.  One does not need to pair def/ref operations.

Michael_S_17
New Contributor I

Very interesting! I just tried this with my own code (implementing a customized synchronization using atomic_define / atomic_ref) and can confirm that OpenCoarrays/gfortran does not require atomic_ref for local access to atomic values (that were remotely defined through atomic_define):

For example, I can use

intSyncValue = Object_CA % mA_atomic_intImageActivityFlag99(intArrIndex)

instead of

call atomic_ref(intSyncValue, Object_CA % mA_atomic_intImageActivityFlag99(intArrIndex), STAT = intAtomicRefStat)

in my own code. The customized synchronization does still work without use of atomic_ref. (Personally, I would still prefer the use of atomic_ref).

Modern Fortran Explained gives a simple example of a spin-wait loop synchronization using atomic_define and atomic_ref. To my understanding, it is the customized synchronization itself that imposes (short-lived) UNORDERED execution segments on the data transfer through atomic_define within the spin-wait loop. This is because the required SYNC MEMORY statement on the receiving image must be executed after (successful) local access through atomic_ref (or access without atomic_ref, as shown here). Executing the SYNC MEMORY statement(s) is required to maintain segment ordering for the main data transfers (outside the customized synchronization). Therefore, UNORDERED execution segments are a (short-lived) requirement of the spin-wait loop synchronization itself.
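
A minimal sketch of such a spin-wait synchronization is given here for reference. It is not the example from the book; the names (ready, seen, payload) and the choice of images 1 and 2 are assumptions made for illustration, and the program needs to be run with at least two images.

```fortran
program spin_wait_sketch
  use, intrinsic :: iso_fortran_env, only: atomic_logical_kind
  implicit none

  ! Hypothetical flag and payload, chosen for this illustration only.
  logical(atomic_logical_kind) :: ready[*] = .false.
  logical(atomic_logical_kind) :: seen
  integer :: payload[*]

  if (num_images() < 2) error stop "this sketch needs at least 2 images"

  if (this_image() == 1) then
    payload[2] = 42                        ! main data transfer
    sync memory                            ! end the segment that wrote payload
    call atomic_define(ready[2], .true.)   ! signal image 2
  else if (this_image() == 2) then
    seen = .false.
    do while (.not. seen)
      call atomic_ref(seen, ready)         ! spin on the local flag
    end do
    sync memory                            ! start a segment ordered after the write
    print *, "payload =", payload
  end if
end program spin_wait_sketch
```

The two SYNC MEMORY statements are what order the plain write and read of payload; the atomic flag only carries the signal between the two images.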

This is for synchronizing (additional) data transfers. The situation is different if we want to synchronize without any additional data transfer (other than the data transfer of a single call to an atomic subroutine, using a simple algebraic trick to store and transfer more than a single value within a scalar integer). Here too, the atomic data transfer imposes (short-lived) unordered execution segments for as long as the transfer through an atomic subroutine lasts. But now, because we do not transfer any further data outside the customized synchronization itself, it may not be necessary to execute a SYNC MEMORY statement at all, neither on the sending image(s) nor on the receiving image(s). (That's what I currently believe. Else, execution of SYNC MEMORY would not impose any limitation.)

Practical use cases for (customized) synchronization without any additional data transfers may arise from synchronizing the code execution itself: implementing a timer with a (customized) synchronization may make it possible to detect and handle any kind of error, failure, or problem during the run-time execution of a parallel algorithm. Another use case is to use this kind of customized synchronization to control the run-time execution of a parallel algorithm itself (i.e. using the customized synchronization as an integrated part of a parallel algorithm).

cheers

jimdempseyatthecove
Honored Contributor III

FWIW

Think of ATOMIC_DEFINE as a send atom to image
and ATOMIC_REF as fetch atom from image.

Programming methods:

Producer                   Consumer                   Notes
atomic_define(..)          atomic_ref(..)             useful when rank is neither producer nor consumer
atomic_define(..)          local (direct) reference   Producer sending to remote rank
local (direct) reference   atomic_ref(..)             Consumer receiving from remote rank

Jim Dempsey

Michael_S_17
New Contributor I

Just for completeness:
The UPC++ Programmer’s Guide (v2018.3.0), https://escholarship.org/content/qt10g5t8jr/qt10g5t8jr.pdf , has an atomic code example on page 15 below, where they state:

// once a memory location is accessed with atomics, it should only be
// subsequently accessed using atomics to prevent unexpected results

I can't tell whether this applies to Coarray Fortran as well, but I'd feel safer using atomic subroutines in my code.

Regards

OP1
New Contributor II

I feel that this topic still needs clarification. In particular, I am interested in the latency associated with the execution on image N of:

CALL ATOMIC_REF(VALUE,ATOM)

Since this call is executed on image N, is VALUE updated "instantaneously"? The use case would be an error monitoring routine that checks the local value of ATOM in nearly all the procedures of the code (and this routine would be called millions of times, if not more).

Steve_Lionel
Honored Contributor III

It involves calls to the RTL to lock and unlock and another RTL call, so no, not "instantaneous". Keep in mind that the code has no idea that it is on image N until it gets into the support library.

Is this value set only on the same image, or can it be set by other images? If the same image only, there are interlocked access Windows API routines defined in KERNEL32.

OP1
New Contributor II

Thanks Steve - so it would probably be best, in my case, to use the local value of ATOM directly (on image N) and skip the call to ATOMIC_REF (knowing that, ultimately, ATOM would be updated at some point if ATOMIC_DEFINE'd by another image).

But in this scenario, how is the risk of a simultaneous use / race condition of ATOM (local value on image N) and an update of its value (through atomic update) mitigated?

Steve_Lionel
Honored Contributor III

Keep in mind that unless you use the ATOMIC_xxx routines, a SYNC xxx statement or an image control statement, you're not guaranteed that a change to ATOM from another image will be reflected in your local copy. So you might spin forever waiting for ATOM to be updated and never see it, if you're not doing other coarray stuff. I can pretty much guarantee that in Intel's implementation you won't see it update in this case.

How much of an "overrun" of ATOM being changed are you willing to accept? You might add code to do a SYNC MEMORY every 1000 tests, or whatever. I don't know what your application is doing here that it would test for this so often - that's not a good use of coarrays.
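
One way to throttle the check along those lines is sketched below, using ATOMIC_REF every so many calls rather than SYNC MEMORY. This is only an illustration under stated assumptions: the module and procedure names (error_monitor, raise_error, error_pending) and the interval of 1000 are hypothetical, and nothing here reflects how Intel's runtime behaves internally.

```fortran
module error_monitor
  use, intrinsic :: iso_fortran_env, only: atomic_int_kind
  implicit none
  private
  public :: raise_error, error_pending

  ! Hypothetical error flag: 0 means "no error", anything else is a code.
  integer(atomic_int_kind), save :: error_flag[*] = 0
  ! Assumed polling interval; tune to the acceptable "overrun".
  integer, parameter :: check_interval = 1000

contains

  ! Called by the image that detects an error: sets the flag on every image.
  subroutine raise_error(code)
    integer, intent(in) :: code
    integer :: img
    do img = 1, num_images()
      call atomic_define(error_flag[img], code)
    end do
  end subroutine raise_error

  ! Cheap check meant to be called very often: only every
  ! check_interval-th call actually goes through ATOMIC_REF.
  logical function error_pending()
    integer, save :: ncalls = 0
    integer(atomic_int_kind), save :: last_seen = 0
    ncalls = ncalls + 1
    if (mod(ncalls, check_interval) == 0) then
      call atomic_ref(last_seen, error_flag)
    end if
    error_pending = (last_seen /= 0)
  end function error_pending

end module error_monitor
```

Between two real checks the function just returns the last value it saw, so an error raised on another image may go unnoticed for up to check_interval calls.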

OP1
New Contributor II

I was thinking along these lines: implementing a timer so that the call to ATOMIC_REF does not occur constantly, but only every nnn seconds (so it's a variation of your idea of having a counter). This is of course not the main reason why coarrays are used in this code (the code is an HPC-based, large-scale simulation tool).

It's the complexity of the tool, in fact, which drives the need for an elegant error-handling mechanism (so as to provide information as detailed as possible should an error condition occur, at any time in the code and on any image).

Thanks!
