ifort 18 feature request: using F08 atomic subroutines with array components (using an integer array component of a derived type coarray together with atomic_define)
I am currently asking for something from the Fortran 2008 language that may not be explicitly specified by the standard. Also, there might be some reason, that I am not aware of, why such a feature can't be implemented in a safe manner. Nevertheless, ifort as well as gfortran/OpenCoarrays do support the use of derived type coarrays with an (integer) array component together with F2008 atomic subroutines: https://github.com/MichaelSiehl/Atomic_Subroutines--Using_Coarray_Arrays_to_Allow_for_Safe_Remote_Co...
Gfortran/OpenCoarrays fully supports this. But with ifort 18 beta (and probably with ifort 17 as well; I don't have access to it yet), support is only partly implemented: we can use an integer array component with atomic_define for local access to the derived type coarray array component (one array element at a time, of course), but for remote references we get 'error #8583: COARRAY argument of ATOMIC_DEFINE/ATOMIC_REF intrinsic subroutine shall be a coarray.' (I am leaving out atomic_ref here, because I use atomic_ref only for local references to an integer array component of a derived type coarray, which already works well with ifort.)
I set up an example of a customized synchronization procedure using atomic subroutines, here: https://github.com/MichaelSiehl/Atomic_Subroutines--How_the_Parallel_Codes_may_look_like--Part_2 . The code in the src folder compiles and runs using gfortran/OpenCoarrays, but ifort 18 can't handle the following statement in the OOOPimsc_admImageStatus_CA.f90 source file (line 265):
call atomic_define (Object_CA [intImageNumber] % mA_atomic_intImageActivityFlag99(intArrIndex,1), intImageActivityFlag)
(here, 'Object_CA' is a derived type coarray and 'mA_atomic_intImageActivityFlag99' is an integer array component)
ifort 18 beta generates the above mentioned compile time 'error #8583: COARRAY argument of ATOMIC_DEFINE/ATOMIC_REF intrinsic subroutine shall be a coarray' with that.
On the other hand, if we omit only the remote reference '[intImageNumber]', ifort does compile the code into a working application (not really working of course, because there is no remote reference any more):
call atomic_define (Object_CA % mA_atomic_intImageActivityFlag99(intArrIndex,1), intImageActivityFlag)
If we only omit the array index, we get the correct compiler 'error #6360: A scalar-valued argument is required in this context':
call atomic_define (Object_CA [intImageNumber] % mA_atomic_intImageActivityFlag99, intImageActivityFlag)
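For reference, a minimal self-contained reproducer for the three cases above might look like the following sketch (I have reduced the derived type to its essentials; this is not the original repository code, and the literal values are arbitrary):

```fortran
program reproducer
  use, intrinsic :: iso_fortran_env, only: atomic_int_kind
  implicit none

  type :: t
    integer(atomic_int_kind), dimension (1:2, 1:2) :: mA_atomic_intImageActivityFlag99
  end type t

  type (t), codimension[*] :: Object_CA
  integer(atomic_int_kind) :: intImageActivityFlag
  integer :: intImageNumber, intArrIndex

  intImageActivityFlag = 1
  intImageNumber = 1
  intArrIndex = 1

  ! local access to a single array element: accepted by ifort 18 beta
  call atomic_define (Object_CA % mA_atomic_intImageActivityFlag99(intArrIndex,1), intImageActivityFlag)

  ! remote (coindexed) access to a single array element: rejected with error #8583
  call atomic_define (Object_CA [intImageNumber] % mA_atomic_intImageActivityFlag99(intArrIndex,1), intImageActivityFlag)
end program reproducer
```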
Thus, my current main question: Is there any reason why the above remote reference to an array component using atomic_define should not be supported by ifort?
I've already filed a ticket on the lack of support for coarrays with Intel Fortran on Mac, which is likely related to the reliance on Intel MPI, which itself is not supported on Mac. As for your closing question, clear evidence of non-support is the absence of documentation for the "-coarray" flag on Mac.
If you want atomics with Intel Fortran on Mac, you can use OpenMP 4 or MPI-3, or continue to use GCC with OpenCoarrays as you are doing today.
Did you test these features on Linux? If they are not supported on Linux, please file an issue here: https://supporttickets.intel.com/.
Yes, I am using ifort 18 beta (update 1) on Linux Ubuntu. Ifort's support for atomic components (not only array components) with atomic subroutines has come very slowly: ifort 15 did not support atomic components with atomic subroutines at all (if I remember correctly), whereas OpenCoarrays' atomics support excels. I will use your above link to raise the topic there.
Thanks and Best Regards
Thank you for your report! This issue is best reported via our Online Service Center at https://supporttickets.intel.com/
Instructions on how to file a ticket are available here:
Thanks for looking at this. I've already filed a ticket on this topic and got a support request number on June 23rd. On June 24th, Kenneth (from Developer Products Support) confirmed that a feature request was filed for this.
ifort 18.0.0 (initial release) is still not able to compile the codes. Thus, I set up another simple test case (see the code below) for checking the compiler. (I already have a support ticket for the feature request from Intel, by the way, and will give this additional test case to them as well.)
The code shows that ifort (since version 17 or even earlier) already has support implemented for using array components (of derived type coarrays) together with atomic subroutines. The test case already works for local access; it is only the coindexed (remote) syntax that does not work.
ifort -coarray -coarray-num-images=4 Main.f90 -o a.out

program Main
  use, intrinsic :: iso_fortran_env
  implicit none
  !
  type :: atomic_test
    integer(atomic_int_kind), dimension (1:2, 1:2) :: atomic_test_component
  end type atomic_test
  !
  type (atomic_test), codimension
The Intel Fortran team can perhaps look into a related matter as well; I'll submit an incident for this later.
It appears the intrinsic ATOMIC_DEFINE procedure works with a scalar coarray of type integer with kind ATOMIC_INT_KIND as shown below:
use, intrinsic :: iso_fortran_env, only : atomic_int_kind, output_unit
integer(kind=atomic_int_kind) :: foo
When built with 4 images (coarray-num-images=4), the following output is obtained during execution:
Image = 1; foo[Idx] = 42
Image = 2; foo[Idx] = 43
Image = 3; foo[Idx] = 44
Image = 4; foo[Idx] = 45
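A self-contained variant of this scalar-coarray case might look roughly as follows (a sketch only: the loop structure, the value 41 + Idx, and the write statement are my assumptions chosen to match the output above):

```fortran
program scalar_atomic
  use, intrinsic :: iso_fortran_env, only : atomic_int_kind, output_unit
  implicit none
  integer(kind=atomic_int_kind), save :: foo[*]
  integer :: Idx

  Idx = this_image()
  ! ATOM argument is a coindexed scalar coarray; here Idx == this_image()
  call atomic_define (foo[Idx], int(41 + Idx, atomic_int_kind))
  sync all
  write (output_unit, *) "Image = ", this_image(), "; foo[Idx] = ", foo[Idx]
end program scalar_atomic
```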
However, the code variant with a COINDEXED OBJECT raises a runtime exception; this seems to be an Intel Fortran bug:
use, intrinsic :: iso_fortran_env, only : atomic_int_kind, output_unit
type :: t
   integer(kind=atomic_int_kind) :: i
end type
type(t) :: foo
The code compiles OK with the Intel Fortran 18.0 compiler; however, upon execution:
Fatal error in MPI_Get: Invalid displacement argument in RMA call, error stack:
MPI_Get(168): MPI_Get(origin_addr=002EFABC, origin_count=4, MPI_CHAR, target_rank=3, target_disp=-10731512, target_count=4, MPI_CHAR, win=0xa0000000) failed
MPI_Get(112): Invalid displacement argument in RMA call
Fatal error in MPI_Get: Invalid displacement argument in RMA call, error stack:
MPI_Get(168): MPI_Get(origin_addr=002AFABC, origin_count=4, MPI_CHAR, target_rank=2, target_disp=-10993656, target_count=4, MPI_CHAR, win=0xa0000000) failed
MPI_Get(112): Invalid displacement argument in RMA call
Fatal error in MPI_Get: Invalid displacement argument in RMA call, error stack:
MPI_Get(168): MPI_Get(origin_addr=003CF7BC, origin_count=4, MPI_CHAR, target_rank=0, target_disp=-9814776, target_count=4, MPI_CHAR, win=0xa0000000) failed
MPI_Get(112): Invalid displacement argument in RMA call
When you use atomic_define for remote data transfer you must synchronize that data transfer (e.g. using a spin-wait loop together with local atomic_ref and SYNC MEMORY). With my above test case, I did use the atomic subroutines for local access only (sequentially), thus no synchronization with that code.
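A minimal sketch of such a synchronization (all names here are placeholders of my own, not code from the repositories above): image 1 signals image 2 remotely via atomic_define, and image 2 spins on its own local flag with atomic_ref before ending the segment with SYNC MEMORY.

```fortran
program spin_wait_demo
  use, intrinsic :: iso_fortran_env, only : atomic_int_kind
  implicit none
  integer(atomic_int_kind), save :: atomic_flag[*]
  integer(atomic_int_kind) :: intValue

  atomic_flag = 0
  sync all

  if (this_image() == 1) then
    ! remote write: set the flag on image 2 atomically
    call atomic_define (atomic_flag[2], 1_atomic_int_kind)
  else if (this_image() == 2) then
    ! local spin-wait loop: only local access through atomic_ref
    do
      call atomic_ref (intValue, atomic_flag)
      if (intValue == 1_atomic_int_kind) exit
    end do
    sync memory  ! complete the unordered segment before relying on other data
  end if
end program spin_wait_demo
</antml_code>
```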
Also, I would personally suggest not using atomic_ref for remote (read) access but only for local synchronization. (Someone may correct me if they have a safe test case and experience with applying atomic_ref for remote access.)
The main purpose of Fortran 2008 atomic subroutines might be to implement customized synchronization primitives as procedures that extend the possibilities of the Fortran 2008 base language. It is all too easy to introduce subtle coding bugs with atomic subroutines that lead to hard-to-resolve runtime (and logic) failures. Please refer to Modern Fortran Explained, chapter B.10.1. The above runtime failure may not necessarily be a compiler bug.
Michael S. wrote:
.. I would suggest not to use atomic_ref for remote (read) access but only for local synchronization instead. .. Please refer to Modern Fortran explained, chapter B.10.1. ..
Placing the Fortran standard itself as the ultimate reference, ahead of the book Modern Fortran Explained (MFE): can you point to any constraint or note in the standard indicating that the atomic subroutines, particularly ATOMIC_REF, are to be used for local synchronization and not remote access? And to any statement of the need to synchronize the data transfer following an atomic action that defines the variable on a specific image?
I do not see anything in the standard that goes with your description. Starting with the example provided with the description of the intrinsic, "CALL ATOMIC_REF (VAL, I[3]) causes VAL to become defined with the value of I on image 3.", the entire purpose of the atomic subroutines is described as performing an action on the ATOM argument atomically, and the requirement on the ATOM argument is then given as either a scalar coarray or a coindexed object.
The two cases I show in Quote #9 are meant to be minimal working examples (MWE) for the two possibilities of the ATOM argument in the two atomic subroutines under consideration: the one with the scalar coarray works as expected, whereas the second case with the coindexed object fails with the Intel compiler. I will view this as a compiler error unless and until the Intel Fortran team can point to something in the standard document that makes it non-conforming.
MFE is a good place to get an introduction to the subject matter, but for implementations, only the standard matters.
I know your intention is to point to a possible compiler bug, and that may even be the case here, since your code uses a single (ordered) execution segment. Still, due to some bad experiences, I get nervous when I see code applying remote reads through atomic_ref with unordered execution segments.
The critical phrase (in TS 18508) might be "processor dependent". Your code does compile and run with OpenCoarrays/gfortran. From my past experience this is not necessarily a good thing. (The following is derived from my past experience, as far as I remember it correctly, with testing remote access through atomic_ref; anyone with different experiences or opinions, feel free to comment or correct me:)
Some time ago I tried a test case using atomic_ref for remote (read) access with OpenCoarrays/gfortran. (I don't have the test case at hand anymore, and I am not sure if I also used ifort with it.) What I remember was unpredictable remote data transfer through atomic_ref with unordered (customized) execution segments. The problem was that, even with different placements of SYNC MEMORY within the code, there was no guarantee for the transmitted value: sometimes it was the expected value from the expected execution segment, but sometimes it was the value from the preceding execution segment (I believe I remember it that way; maybe it was even impossible to get the value from the desired execution segment at all). Thus, I soon gave up on using remote reads through atomic_ref.
The problem with unpredictable behavior (i.e. unpredictable remote data transfer) is that it may work in some settings (as with your example code), but in a different setting (e.g. unordered execution segments) it may not. Then, with a slightly more sophisticated code structure, it may become rather difficult or even impossible to figure out what went wrong with the code.
That is why I would prefer an early runtime failure of an application over unpredictable remote data transfer in a parallel application. (Even if I doubt that this was intentional with your test case.) I wonder if someone else has similar experiences. To me, programming with remote atomic_ref and unordered execution segments can be a very nasty animal.
Nevertheless, my current experiences with remote writes through atomic_define (and synchronized local atomic_ref) are already very encouraging: there are still some open questions, but also good confidence that we may succeed with it in developing (nearly) unlimited parallel applications (by combining atomic remote writes with unordered segments together with the standard way of coarray programming using ordered execution segments). With one exception: processor dependence. I don't expect it to work with heterogeneous hardware (e.g. between a GPU and a CPU).
The Intel developers did a great job: ifort 18.0.1 (update 1) now works with array components of derived type coarrays (a single array element at a time, of course) together with atomic subroutines. The run-time behaviour is very similar (if not identical) to that of OpenCoarrays/gfortran. This means the remote data transfer through atomic_define can be unstable (the data transfer may not always occur). Nevertheless, we can already cope with such unstable data transfer easily at the low level: https://github.com/MichaelSiehl/Atomic_Subroutines-Part_4--How_To_Cope_With_Unreliable_Data_Transfer.... Further simple strategies will be required to cope with it at a higher level of our PGAS programming as well. We will see whether this is already enough to develop reliable parallel software in the near future, and whether the remote data transfer through atomic_define will become more stable with more adequate parallel hardware.
Just a brief update after some further testing:
Firstly, I was wrong: the remote data transfer through atomic_define works stably and reliably. The ifort developers really did a great job.
So far, all remote transfers of individual array elements of the derived type coarray array component through atomic_define (ifort 18 update 1) complete successfully (seemingly 100% of the time). Thanks to the high speed of the resulting executables, I was able to identify the real source of my problem: accessing the remotely transmitted values locally through atomic_ref. These values are highly transitory (atomic_ref has really earned its name). In practice, this means the programmer must access each of these transmitted values locally through atomic_ref many times at very high speed (a fast-running spin-wait loop). Atomic_ref offers only a small window of time for successfully getting all the remotely transmitted values. Because we are still required to offer synchronization diagnostics and abort functionality from our spin-wait loop (which makes the spin-wait loop as a whole too slow for successful access through atomic_ref), I was able to cope with the problem simply by further nesting the crucial part of the spin-wait loop like this:
do
  do intCount1 = 1, 500  ! to provide a high-speed spin-wait loop with massive access through atomic_ref
    do intCount2 = 1, intNumberOfImages  ! this loop alone would be too slow for atomic_ref
      ! . . .
The outer do loop is the real (slow-running) spin-wait loop. It handles the overall multi-image synchronization and checks for a remote synchronization abort. The innermost loop (intCount2) processes the data transfer from the distinct images within the same synchronization. These two loops alone turned out to be too slow for high-speed repeated access through atomic_ref. Thus, I added another loop (intCount1) to largely separate the abort functionality from the multi-image synchronization, which led to (so far) 100% successful multi-image synchronizations through atomic_ref.
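Filled out a bit, the described structure might look roughly like the following sketch (all names and the abort-flag convention are placeholders of my own, not the actual repository code):

```fortran
synchronization: do
  ! slow part of the outer loop: check for a remote synchronization abort
  call atomic_ref (intAbortFlag, mA_atomic_intSyncAbort)
  if (intAbortFlag == 1) exit synchronization

  ! fast part: massive repeated local access through atomic_ref
  do intCount1 = 1, 500
    do intCount2 = 1, intNumberOfImages
      call atomic_ref (intValue, mA_atomic_intImageActivityFlag(intCount2))
      if (intValue == intExpectedValue) logImageDone(intCount2) = .true.
    end do
  end do

  if (all(logImageDone)) exit synchronization  ! every image has signalled
end do synchronization
```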
I will provide a complete test case shortly.
A quick update:
After some further testing (this time with OpenCoarrays/gfortran), I found a silly synchronization error in my code. After correcting it, everything works well with OpenCoarrays/gfortran. Thus, I am fairly sure that all my problems with atomic_ref were due to my own coding error. I will give more information shortly.
Everything works perfectly with ifort 18 update 1!
I would like to confirm that both atomic_define (for remote transfer) and atomic_ref (for the local check) appear to work 100% stably and reliably with single elements of an array component on a shared memory computer. In my further testing I was unable to corrupt the data transfer itself. (In my earlier test cases, I made the mistake of aborting the synchronization process before it completed.)
This is also true for OpenCoarrays/gfortran, with one exception: OpenCoarrays/gfortran requires proper synchronization at program start (e.g. using SYNC IMAGES), so that an atomic_define does not temporally precede the matching atomic_ref. (I am not yet sure about the exact reason for this.) Nevertheless, thanks to this somewhat faulty runtime behaviour, I was forced to implement (nearly) bulletproof user-defined (or customized) synchronization procedures that can handle nearly any kind of runtime failure, even hardware failures. Only a few lines of Fortran 2008 code were required:
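A start-up synchronization of that kind can be sketched in a few lines like this (a minimal sketch, not the original snippet; the star-pattern over image 1 is one common choice):

```fortran
! make sure every image has finished its initial segment before any
! remote atomic_define can temporally precede the matching atomic_ref
if (this_image() == 1) then
  sync images (*)    ! image 1 waits for all other images
else
  sync images (1)    ! every other image synchronizes with image 1
end if
```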