Intel® Fortran Compiler

Does a STOP'ped image still SYNC with other images?

OP1
New Contributor II
686 Views

I may have asked this question a while back but for some reason I cannot see my contributions to this forum beyond one year back or so.

The following code hangs forever (Intel 19 Update 3, Cluster Edition):

PROGRAM TEST
  IMPLICIT NONE
  SYNC ALL
  ! Image 1 initiates normal termination here...
  IF (THIS_IMAGE()==1) STOP
  ! ...while all other images wait at this barrier.
  SYNC ALL
END PROGRAM TEST

Should I expect all images with a number greater than 1 to execute the SYNC ALL statement successfully? In other words, does a STOP'ped image still SYNC with other active images?

Edit: my understanding is that, in the example above, error termination should be executed. But then, it should not hang!

Same thing here (a better way of handling the stopped image) - and it still hangs:

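PROGRAM TEST
  IMPLICIT NONE
  INTEGER :: ERR
  ! Image 1 stops before it reaches the barrier.
  IF (THIS_IMAGE()==1) STOP
  ! STAT= should let the other images continue instead of error-terminating.
  SYNC ALL (STAT=ERR)
  WRITE(*,*) THIS_IMAGE(),ERR
END PROGRAM TEST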

Honestly, this is such basic coarray stuff that I am not sure what it says about Intel's commitment to implementing (and testing) those features...

8 Replies
Steve_Lionel
Honored Contributor III

No, a stopped image does not sync, and the behavior you show is reasonable per the standard. Fortran 2018 adds the concept of "failed images" and the ability to detect failure on an image control statement. The soon-to-be-released 19.1 has some support for this.

Consider using ERROR STOP if you want the whole program to die. Otherwise, don't use STOP in coarray programs!
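
A minimal sketch of the ERROR STOP variant, for illustration only. The program name and the stop message are hypothetical, and how quickly the other images are brought down is up to the runtime:

```fortran
PROGRAM ERROR_STOP_SKETCH
  IMPLICIT NONE
  SYNC ALL
  ! ERROR STOP initiates error termination, which is meant to take
  ! down every image rather than only the one that executes it.
  IF (THIS_IMAGE() == 1) ERROR STOP 'image 1 is giving up'
  ! The remaining images should be terminated by the runtime
  ! instead of waiting here forever.
  SYNC ALL
END PROGRAM ERROR_STOP_SKETCH
```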

OP1
New Contributor II

Steve - I thought that STOP was meant to allow termination of an image - while allowing notification to other images (through STAT= specifier) that the image that executed the STOP statement has, well, stopped. My understanding is that ERROR STOP is the sledgehammer approach, where once executed by any image, all other images will stop as soon as notified.
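
A minimal sketch of that understanding, assuming a coarray-capable compiler that provides STAT_STOPPED_IMAGE in ISO_FORTRAN_ENV (it is part of Fortran 2008). The program name is hypothetical, and whether a given runtime actually returns from the SYNC ALL instead of hanging is exactly the question here:

```fortran
PROGRAM STOPPED_IMAGE_SKETCH
  USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: STAT_STOPPED_IMAGE
  IMPLICIT NONE
  INTEGER :: ERR

  ! Image 1 initiates normal termination before reaching the barrier.
  IF (THIS_IMAGE() == 1) STOP

  ! With STAT=, the remaining images are expected to get a status
  ! value back rather than being error-terminated.
  SYNC ALL (STAT=ERR)

  IF (ERR == STAT_STOPPED_IMAGE) THEN
    WRITE(*,*) 'Image', THIS_IMAGE(), ': a peer image has stopped'
  ELSE IF (ERR /= 0) THEN
    WRITE(*,*) 'Image', THIS_IMAGE(), ': SYNC ALL failed, STAT =', ERR
  ELSE
    WRITE(*,*) 'Image', THIS_IMAGE(), ': synchronized normally'
  END IF
END PROGRAM STOPPED_IMAGE_SKETCH
```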

Maybe you meant that the SYNC ALL (STAT=...) feature is implemented in the next version, but buggy in 19 Update 3? (I can't check with Update 4 for now).

OP1
New Contributor II

I am looking at p. 342 of the excellent "Modern Fortran Explained - Incorporating Fortran 2018" by Metcalf et al.

Normal termination occurs in three steps: initiation, synchronization, and completion. An image initiates normal termination but its data need to be accessible to other images until they have all initiated termination. Hence there needs to be synchronization, after which all images can complete execution.

Normal termination is initiated by a STOP statement, for instance. I would really not expect the code to hang on the SYNC ALL (STAT=...) line, in light of the above paragraph.

Am I still missing something?

Michael_S_17
New Contributor I

Steve Lionel (Ret.) (Blackbelt) wrote:

No, a stopped image does not sync, and the behavior you show is reasonable per the standard. Fortran 2018 adds the concept of "failed images" and the ability to detect failure on an image control statement. The soon-to-be-released 19.1 has some support for this.

Consider using ERROR STOP if you want the whole program to die. Otherwise, don't use STOP in coarray programs!

I agree. The ability to detect failures on image control statements is essential. It will be very interesting to see how the implementers build this in.

Personally, I have already implemented failure detection and handling in my own customized synchronization routines. There is still much room for improvement, but it works by applying a time limit to every (customized) synchronization on each image and aborting the synchronization completely in case of a failure. (This works even if the remote data transfer is corrupted as a whole, i.e. without any remote data transfer at all, because each image controls its own sync status independently within the customized synchronization.)

Future improvements could include a self-adjusting abort timer and (so far only in theory) a partial synchronization abort that affects only the failed images. That last point could be difficult, though. I can't tell exactly how the built-in synchronization methods (like SYNC ALL) are implemented, but I could imagine that they use atomics in much the same way as we do when implementing basic customized synchronization procedures.

From my own experience (ifort and gfortran/OpenCoarrays/MPICH), atomic data transfer channels (coarrays) as a whole are very sensitive to failures. Even if the failure is on only one image (I did not use STOP to emulate that; I simply skipped the synchronization on a specific image to test the runtime's behavior in case of a failure), it may or may not affect all remaining data transfer channels of that atomic coarray. So the only solution I have found so far is not only to abort the complete synchronization on all involved images, but also to reallocate the data transfer channels (the atomic coarray) to recover from such failures. This requires coarray teams (already supported by OpenCoarrays) to limit the failure and the recovery to a small number of images.
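
A minimal sketch of a time-limited pairwise synchronization built on atomics, to illustrate the general idea only; it is not the actual routines described above. The names TIMED_SYNC_SKETCH, ARRIVED, and SYNC_WITH_TIMEOUT are hypothetical. The sketch assumes a working system clock (a nonzero count rate) and is single-use: the flag is never reset, and there is no recovery or reallocation step.

```fortran
MODULE TIMED_SYNC_SKETCH
  USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: ATOMIC_INT_KIND, INT64
  IMPLICIT NONE
  ! One flag per image; the partner image sets it to signal arrival.
  INTEGER(ATOMIC_INT_KIND), SAVE :: ARRIVED[*] = 0
CONTAINS
  ! Signal image PARTNER, then wait at most LIMIT_SECONDS for its signal.
  ! OK is .TRUE. if the partner arrived in time, .FALSE. on timeout.
  FUNCTION SYNC_WITH_TIMEOUT(PARTNER, LIMIT_SECONDS) RESULT(OK)
    INTEGER, INTENT(IN) :: PARTNER
    REAL, INTENT(IN) :: LIMIT_SECONDS
    LOGICAL :: OK
    INTEGER(ATOMIC_INT_KIND) :: FLAG
    INTEGER(INT64) :: T0, T, RATE

    SYNC MEMORY                               ! complete earlier memory operations
    CALL ATOMIC_DEFINE(ARRIVED[PARTNER], 1)   ! tell the partner we are here

    CALL SYSTEM_CLOCK(T0, RATE)
    OK = .FALSE.
    DO
      CALL ATOMIC_REF(FLAG, ARRIVED)          ! poll our own flag
      IF (FLAG == 1) THEN
        OK = .TRUE.
        EXIT
      END IF
      CALL SYSTEM_CLOCK(T)
      ! Give up once the time limit is exceeded (assumes RATE > 0).
      IF (REAL(T - T0) / REAL(RATE) > LIMIT_SECONDS) EXIT
    END DO
    SYNC MEMORY                               ! order later accesses after the wait
  END FUNCTION SYNC_WITH_TIMEOUT
END MODULE TIMED_SYNC_SKETCH
```

Each image would call SYNC_WITH_TIMEOUT with its partner's image number and treat a .FALSE. result as a failed synchronization, so no image can wait forever on a peer that never arrives.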

Regards

Steve_Lionel
Honored Contributor III

OP1 wrote:

I am looking at p. 342 of the excellent "Modern Fortran Explained - Incorporating Fortran 2018" by Metcalf et al.

Normal termination occurs in three steps: initiation, synchronization, and completion. An image initiates normal termination but its data need to be accessible to other images until they have all initiated termination. Hence there needs to be synchronization, after which all images can complete execution.

Normal termination is initiated by a STOP statement, for instance. I would really not expect the code to hang on the SYNC ALL (STAT=...) line, in light of the above paragraph.

Am I still missing something?

What you're missing is that F2008 provides no help for coarray images that terminate. The assumption for all of the synchronization is that all images are active until they all terminate. Obviously this doesn't reflect the real world, hence the years spent on building "failed image" support into the language. 
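
A minimal sketch of what the Fortran 2018 failed-image facilities look like in use, assuming a compiler and runtime that implement them (support varies). The program name is hypothetical, and FAIL IMAGE is used here only to simulate a failure:

```fortran
PROGRAM FAILED_IMAGE_SKETCH
  USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: STAT_FAILED_IMAGE, &
                                           STAT_STOPPED_IMAGE
  IMPLICIT NONE
  INTEGER :: ERR

  ! Simulate a failure on image 1 (Fortran 2018 FAIL IMAGE statement).
  IF (THIS_IMAGE() == 1) FAIL IMAGE

  ! The surviving images ask for a status instead of error-terminating.
  SYNC ALL (STAT=ERR)

  IF (ERR == STAT_FAILED_IMAGE) THEN
    ! FAILED_IMAGES() returns the image numbers known to have failed.
    WRITE(*,*) 'Image', THIS_IMAGE(), ': failed images:', FAILED_IMAGES()
  ELSE IF (ERR == STAT_STOPPED_IMAGE) THEN
    WRITE(*,*) 'Image', THIS_IMAGE(), ': stopped images:', STOPPED_IMAGES()
  ELSE IF (ERR /= 0) THEN
    WRITE(*,*) 'Image', THIS_IMAGE(), ': SYNC ALL error, STAT =', ERR
  END IF
END PROGRAM FAILED_IMAGE_SKETCH
```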

Michael_S_17
New Contributor I

I just ran the OP's two test cases above, as well as my own test case that includes a STOP statement, using a recent gfortran/OpenCoarrays/MPICH. With the STOP statement, the runtime appears to terminate the whole parallel application (on all images) as soon as the STOP statement is reached, issuing the following termination message:

STOP
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0

This runtime behavior with the STOP statement is very similar to ERROR STOP.

I don't have access to ifort yet, but going by what the OP reports, it might be a good idea to file a support request with the Intel service center using the test cases above.

Regards

Michael_S_17
New Contributor I

Just for completeness: with a more sophisticated test case, using gfortran/OpenCoarrays/MPICH with coarray teams, the runtime now hangs as well when it reaches a STOP statement. But even in this case, the runtime at least prints a STOP message on screen.

Steve_Lionel
Honored Contributor III

I note that compiler 19.1 will give you a nice message if one of the images has stopped - it is less likely to hang.
