termination of coarray images

jerzyg · ‎01-03-2012

I'd like to report problems with termination of CoArray Fortran programs.

If one of the images terminates, whether due to "stop" or "error stop", an attempt to open a non-existent file for reading, or a call to subroutine "exit", the other images do not terminate and the entire group of images hangs forever. Moreover, if there is an explicit barrier in the code right after the place where one of the images is supposed to crash, the other images can still cross the barrier.

Please find a reproducer below. The program simulates an error termination of one image.

This is done in line 13.

[fortran]program coarrCrashTest
implicit none
integer :: ic(10)
integer :: id, lastid, k = 1, n = 0
!
id = this_image()
lastid = num_images()
!
write(*,*) 'Image ', id, ' of ', lastid
sync all    ! Synchronization here, but it does not help...
if (id == 2) then
      write(*,*) 'About to crash:',id
      error stop        ! <-- crash #1
      ! stop            ! <-- crash #2
      ! open(unit=100,file='f.dat',status='old') ! <-- crash #3
      ! call exit(1)    ! <-- crash #4
      ! k = k / n       ! <-- crash #5
endif
sync all    ! Synchronization here, but it does not help either...
write(*,*) 'Finished: ',id
end program[/fortran]

Typical output from a run (8 images are started by setting the environment variable FOR_COARRAY_NUM_IMAGES=8) looks like the listing below:

Image 7 of 8
Image 2 of 8
Image 5 of 8
Image 8 of 8
Image 4 of 8
Image 1 of 8
Image 3 of 8
Image 6 of 8
About to crash: 2
Finished: 5
Finished: 8
Finished: 1
Finished: 7
Finished: 3
Finished: 4
Finished: 6

Then the program hangs and never finishes. The lines containing "Finished: " clearly come from the images that passed the "sync all" barrier in line 19.

The Windows Task Manager shows 9 (nine) "coarrCrashTest.exe" processes running. Eight processes have almost identical memory usage; the remaining one consumes a bit less memory. There is a noticeable CPU utilization pattern: each process consumes approx. 13% of CPU, except for the one with the lowest memory usage, which takes ~0% of CPU. This group of processes runs indefinitely, until I either terminate one of them with the "End process" button in Windows Task Manager or close the console window.

The reproducer also contains other "typical" ways to terminate the program (commented out), cf. lines 13-17 in the code, marked "crash #1" to "crash #5". The lines "crash #1" to "crash #4" all result in the behavior described above.

The only crash condition that actually terminates the execution is the one in line 17 (marked "crash #5"). If we comment out line 13 and uncomment line 17, the program terminates in the expected way. Here is the output from such a crash:

Image 7 of 8
Image 1 of 8
Image 2 of 8
Image 4 of 8
Image 3 of 8
Image 5 of 8
Image 8 of 8
Image 6 of 8
About to crash: 2
job aborted:
rank: node: exit code[: error message]
0: localhost: 123
1: localhost: 3: process 1 exited without calling finalize
2: localhost: 123
3: localhost: 123
4: localhost: 123
5: localhost: 123
6: localhost: 123
7: localhost: 123

Quoting from ISO/IEC JTC1/SC22/WG5 N1824 (April 2010), p. 24:

The statement
error stop
has been introduced. When executed on one image, it initiates error termination there and
hence causes all other images that have not already initiated error termination to initiate error
termination. It thus causes the whole calculation to stop as soon as is practicable.

According to the cited document, the canonical way to abnormally terminate a CAF program is the "error stop" statement. Apparently, ifort's CAF implementation does not terminate the images as one would expect.

The code was tested on WinXP, 32-bit, dual-core Centrino.

It seems that Visual Fortran Compiler XE 12.1.2.278 [IA-32] and earlier versions are affected.

The reproducer was also tested on an 8-core Linux EM64T machine with the following versions of ifort:

* ifort (IFORT) 12.1.0 20111011

* ifort (IFORT) 12.0.0 20101006

It seems that the described behaviour appears also in these versions.

best regards,

Jerzy

jimdempseyatthecove · ‎01-03-2012

The problem you have is that a SYNC ALL entered by one image must have a corresponding SYNC ALL entered by all other images. Exiting an image while any other image will subsequently issue SYNC ALL will cause a lockup.

Add a shared logical coarray variable, TerminateApplication, initialized to .false.:

logical, save :: TerminateApplication[*] = .false.

LOGICAL FUNCTION CheckTerminate()
  integer :: i
  CheckTerminate = .true.
  sync memory
  do i = 1, num_images()
    if (TerminateApplication[i]) return   ! some image has requested termination
  end do
  CheckTerminate = .false.
end FUNCTION CheckTerminate

SUBROUTINE TerminateNow()
  integer :: i
  TerminateApplication = .true.           ! raise this image's own flag
  do while (.true.)
    sync memory
    do i = 1, num_images()
      if (.not. TerminateApplication[i]) exit   ! back to while(.true.)
      if (i == num_images()) then
        error stop 'exit program'         ! every image has raised its flag
      endif
    end do
  end do
end subroutine TerminateNow
...
! periodically in code
if (CheckTerminate()) call TerminateNow()
...
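A variant of the same flag idea, as a rough sketch only: instead of every image polling every other image's flag, the failing image pushes the flag to all images with the Fortran 2008 atomic subroutines, and each image polls just its local copy. The names terminate_flag, term_flag, request_termination and termination_requested are made up for illustration, and this assumes the compiler in use supports atomic_define and atomic_ref, which has not been checked against the affected ifort versions.

```fortran
module terminate_flag
  use iso_fortran_env, only: atomic_logical_kind
  implicit none
  ! Hypothetical flag: one copy per image, initially .false.
  logical(atomic_logical_kind), save :: term_flag[*] = .false.
contains
  ! Called by the image that wants everyone to stop:
  ! raise the flag on every image, including this one.
  subroutine request_termination()
    integer :: i
    do i = 1, num_images()
      call atomic_define(term_flag[i], .true.)
    end do
  end subroutine request_termination

  ! Called periodically by every image: read only the local copy.
  logical function termination_requested()
    logical(atomic_logical_kind) :: val
    call atomic_ref(val, term_flag)
    termination_requested = val
  end function termination_requested
end module terminate_flag
```

An image would then call request_termination() where it would otherwise crash, and every image would test termination_requested() at convenient points and execute stop when it returns .true. Like the polling version, this cannot wake an image that is already blocked inside SYNC ALL.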

Jim Dempsey

Steven_L_Intel1 · ‎01-03-2012

If only it were that easy, Jim. In fact, the Fortran standard explicitly specifies the behavior of ERROR STOP, which is to terminate all images. This feature was added to the language just for this purpose.

jimdempseyatthecove · ‎01-03-2012

Then your sync all is broken, as demonstrated by Jerzy.

Are you implying that Jerzy must wait for the compiler to get fixed?

At least until the compiler is fixed, Jerzy could implement the "hack" I laid out earlier (assuming this works).

Jim

jerzyg · ‎01-03-2012

Jim,

thanks for the hack you proposed. I'm afraid that you want to redevelop something that is already included in the standard. As Steve pointed out, the construct "error stop" has been introduced precisely for the purpose of handling the emergency exits.

But that is not the end of the story; it gets worse. Please read this excerpt from the draft ISO/IEC JTC1/SC22/WG5 N1824, p. 24, sec. 12.7 (sorry for the long quotation):

All the synchronization statements, that is, sync all, sync images, lock, unlock, and sync
memory, have optional stat= and errmsg= specifiers. They have the same role for these statements
as they do for allocate and deallocate in Fortran 2003.
If any of these statements, including allocate and deallocate, encounter an image that has
executed a stop or end program statement and have a stat= specifier, the stat= variable is
given the value of the constant stat_stopped_image in the iso_fortran_env intrinsic module,
and the effect of executing the statement is otherwise the same as that of executing the sync
memory statement. Without a stat= specifier, the execution of such a statement initiates error
termination (Section 13).

So the standard offers a possibility to check if any of the images has prematurely terminated. Basically, that's what your code implements :-) Following the standard-based approach, I implemented a small correction to my original reproducer:

[fortran]program coarrCrashTest
use iso_fortran_env
implicit none
integer :: ic(10)
integer :: id, lastid, k = 1, n = 0
integer :: istat = 0
!
id = this_image()
lastid = num_images()
!
istat = 0
write(*,*) 'Image ', id, ' of ', lastid
sync all    ! Synchronization here, but it does not help...
if (id == 2) then
      write(*,*) 'About to crash:',id
      error stop        ! <-- crash #1
      ! stop            ! <-- crash #2
      ! open(unit=100,file='f.dat',status='old') ! <-- crash #3
      ! call exit(1)    ! <-- crash #4
      ! k = k / n       ! <-- crash #5
endif
sync all(stat=istat)    ! Synchronization here, this time we check exit status
if (istat == stat_stopped_image) then
      write(*,*) 'Stopped image detected by ', id
end if
write(*,*) 'Finished: ',id
end program[/fortran]

Unfortunately, it fails at execution in exactly the same way as the previous, unmodified version. My conclusion: the current CAF implementation in the compiler is not quite right. I guess that the final "sync all" does not actually check how many images are still present in the group; it just waits for all images that were started. But this is only a guess.
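One detail worth noting: the quoted passage speaks of an image that has executed stop or end program, so the stat= check corresponds to "crash #2" rather than "crash #1" (with error stop, the whole program is supposed to terminate anyway). A minimal sketch of that case, assuming a conforming implementation; the program name coarrStopTest is made up for illustration, and this has not been verified against the affected ifort versions:

```fortran
program coarrStopTest
  use iso_fortran_env, only: stat_stopped_image
  implicit none
  integer :: id, istat

  id = this_image()
  sync all
  if (id == 2) then
    write(*,*) 'Stopping normally: ', id
    stop                      ! normal termination of image 2 only
  end if

  ! With stat=, the statement should return instead of waiting
  ! forever for the image that has already stopped.
  sync all (stat=istat)
  if (istat == stat_stopped_image) then
    write(*,*) 'Stopped image detected by ', id
  else if (istat /= 0) then
    write(*,*) 'sync all failed on image ', id, ', stat = ', istat
  end if
  write(*,*) 'Finished: ', id
end program coarrStopTest
```

If the implementation follows the quoted text, every image other than 2 should report the stopped image and then finish; a hang here would point at the same problem as in the reproducer.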

best regards,

Jerzy