Intel® Fortran Compiler

termination of coarray images

jerzyg
Beginner
I'd like to report problems with termination of CoArray Fortran programs.
If one of the images terminates, whether through "stop", "error stop", opening a non-existent file with status='old', or calling the "exit" subroutine, the other images do not terminate and the entire group of images hangs forever. Moreover, if there is an explicit barrier right after the point where one image is supposed to crash, the other images can still cross that barrier.
Please find a reproducer below. The program simulates an error termination of one image.
This is done in line 13.
[fortran]program coarrCrashTest
implicit none
integer :: ic(10)
integer :: id, lastid, k = 1, n = 0
!
id = this_image()
lastid = num_images()
!
write(*,*) 'Image ', id, ' of ', lastid
sync all ! Synchronization here, but it does not help...
if (id == 2) then
  write(*,*) 'About to crash:',id
  error stop ! <-- crash #1
  ! stop ! <-- crash #2
  ! open(unit=100,file='f.dat',status='old') ! <-- crash #3
  ! call exit(1) ! <-- crash #4
  ! k = k / n ! <-- crash #5
endif
sync all ! Synchronization here, but it does not help neither...
write(*,*) 'Finished: ',id
end program[/fortran]
    Typical output from a run (8 images are started by setting the environment variable FOR_COARRAY_NUM_IMAGES=8) looks like the listing below:
    Image 7 of 8
    Image 2 of 8
    Image 5 of 8
    Image 8 of 8
    Image 4 of 8
    Image 1 of 8
    Image 3 of 8
    Image 6 of 8
    About to crash: 2
    Finished: 5
    Finished: 8
    Finished: 1
    Finished: 7
    Finished: 3
    Finished: 4
    Finished: 6
    Then the program hangs and never finishes. The lines containing "Finished: " clearly come from images that passed the "sync all" barrier on line 19. The Windows Task Manager shows 9 (nine) "coarrCrashTest.exe" processes running. Eight of them have almost identical memory usage, and one consumes a bit less memory. There is a noticeable CPU utilization pattern: every process consumes approx. 13% of CPU, except for the one with the lowest memory usage, which takes ~0% of CPU. This group of processes runs indefinitely, until I either terminate one of them with the "End process" button in Windows Task Manager or close the console window.
    The reproducer also contains other "typical" ways to terminate the program (commented out), cf. lines 13-17 in the code, marked "crash #1" to "crash #5". The lines "crash #1" to "crash #4" all result in the behavior described above.
    The only crash condition that actually terminates the execution is the one on line 17 (marked "crash #5"). If we comment out lines 13-16 and uncomment line 17, the program terminates in the expected way. Here is the output from such a crash:
    Image 7 of 8
    Image 1 of 8
    Image 2 of 8
    Image 4 of 8
    Image 3 of 8
    Image 5 of 8
    Image 8 of 8
    Image 6 of 8
    About to crash: 2
    job aborted:
    rank: node: exit code[: error message]
    0: localhost: 123
    1: localhost: 3: process 1 exited without calling finalize
    2: localhost: 123
    3: localhost: 123
    4: localhost: 123
    5: localhost: 123
    6: localhost: 123
    7: localhost: 123
    Quoting from ISO/IEC JTC1/SC22/WG5 N1824 (April 2010), p. 24:
    The statement
    error stop
    has been introduced. When executed on one image, it initiates error termination there and
    hence causes all other images that have not already initiated error termination to initiate error
    termination. It thus causes the whole calculation to stop as soon as is practicable.
    According to the cited document, the canonical way to terminate a CAF program exceptionally is the "error stop" statement. Apparently, the CAF implementation in ifort does not terminate the images as one would expect.

    The code was tested on WinXP, 32-bit, dual-core Centrino.
    It seems that Visual Fortran Compiler XE 12.1.2.278 [IA-32] and earlier versions are affected.
    The reproducer code was also tested on an 8-core Linux EM64T machine with the following versions of ifort:
    * ifort (IFORT) 12.1.0 20111011
    * ifort (IFORT) 12.0.0 20101006
    It seems that the described behaviour appears also in these versions.
    best regards,
    Jerzy
    4 Replies
    jimdempseyatthecove
    Honored Contributor III
    The problem you have is that a SYNC ALL, once entered by one image, must be matched by a corresponding SYNC ALL on all other images. Exiting an image while any other image will subsequently issue a SYNC ALL causes a lockup.

    Add a shared coarray variable, logical :: TerminateApplication[*], initialized to .false.

    logical function CheckTerminate()
      integer :: i
      CheckTerminate = .true.
      sync memory
      do i = 1, num_images()
        if (TerminateApplication[i]) return ! some image has requested termination
      end do
      CheckTerminate = .false.
    end function CheckTerminate

    subroutine TerminateNow()
      integer :: i
      TerminateApplication = .true. ! set the flag on this image
      do while (.true.)
        sync memory
        do i = 1, num_images()
          if (.not. TerminateApplication[i]) exit ! back to while(.true.)
          if (i == num_images()) then
            error stop 'exit program' ! every image has set its flag
          endif
        end do
      end do
    end subroutine TerminateNow
    ...
    ! periodically in code
    if (CheckTerminate()) call TerminateNow()
    ...
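A variation on the same idea, written out as a complete program, might look like the sketch below. It is only an illustration under stated assumptions: the names cooperativeStop, failed and any_failed are hypothetical, the "failure" at step 3 on the last image is simulated, and every image is assumed to execute the same sequence of barriers. Instead of polling, each image publishes its own flag and all images agree at a barrier on whether to leave the loop, so nobody exits while another image is still waiting in SYNC ALL.

[fortran]program cooperativeStop
  implicit none
  logical :: failed[*] = .false.   ! hypothetical per-image failure flag
  logical :: any_failed
  integer :: step, img

  any_failed = .false.
  do step = 1, 5
     ! Simulated failure: the last image gives up at step 3.
     if (this_image() == num_images() .and. step == 3) failed = .true.

     sync all   ! flag updates made before this barrier are visible after it
     do img = 1, num_images()
        if (failed[img]) any_failed = .true.
     end do
     sync all   ! no image changes its flag while others are still reading
     if (any_failed) exit
  end do

  if (any_failed) then
     write(*,'(a,i0,a,i0)') 'Image ', this_image(), ' stopping cleanly at step ', step
  else
     write(*,'(a,i0,a)') 'Image ', this_image(), ' finished all steps'
  end if
end program cooperativeStop[/fortran]

All images read the same flags between the two barriers, so they all leave the loop at the same step and then terminate normally. This only covers failures the program detects itself; it does nothing for an image that dies unexpectedly.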


    Jim Dempsey
    Steven_L_Intel1
    Employee
    If only it were that easy, Jim. In fact, the Fortran standard explicitly states the behavior of ERROR STOP which is to terminate all images. This feature was added to the language just for this purpose.
    jimdempseyatthecove
    Honored Contributor III
    Then your sync all is broken, as demonstrated by Jerzy.

    Are you implying that Jerzy must wait for the compiler to get fixed?

    At least until the compiler is fixed, Jerzy could implement the "hack" I laid out earlier (assuming it works).

    Jim
    jerzyg
    Beginner
    Jim,
    thanks for the hack you proposed. I'm afraid that you want to redevelop something that is already included in the standard. As Steve pointed out, the construct "error stop" has been introduced precisely for the purpose of handling the emergency exits.
    But that is not the end of the story; it gets worse. Please read this excerpt from the draft ISO/IEC JTC1/SC22/WG5 N1824, p. 24, sec. 12.7 (sorry for the long quotation):
    All the synchronization statements, that is, sync all, sync images, lock, unlock, and sync
    memory, have optional stat= and errmsg= specifiers. They have the same role for these statements
    as they do for allocate and deallocate in Fortran 2003.
    If any of these statements, including allocate and deallocate, encounter an image that has
    executed a stop or end program statement and have a stat= specifier, the stat= variable is
    given the value of the constant stat_stopped_image in the iso_fortran_env intrinsic module,
    and the effect of executing the statement is otherwise the same as that of executing the sync
    memory statement. Without a stat= specifier, the execution of such a statement initiates error
    termination (Section 13).
    So the standard offers a possibility to check if any of the images has prematurely terminated. Basically, that's what your code implements :-) Following the standard-based approach, I implemented a small correction to my original reproducer:
[fortran]program coarrCrashTest
use iso_fortran_env
implicit none
integer :: ic(10)
integer :: id, lastid, k = 1, n = 0
integer :: istat = 0
!
id = this_image()
lastid = num_images()
!
istat = 0
write(*,*) 'Image ', id, ' of ', lastid
sync all ! Synchronization here, but it does not help...
if (id == 2) then
  write(*,*) 'About to crash:',id
  error stop ! <-- crash #1
  ! stop ! <-- crash #2
  ! open(unit=100,file='f.dat',status='old') ! <-- crash #3
  ! call exit(1) ! <-- crash #4
  ! k = k / n ! <-- crash #5
endif
sync all(stat=istat) ! Synchronization here, this time we check exit status
if (istat == stat_stopped_image) then
  write(*,*) 'Stopped image detected by ', id
end if
write(*,*) 'Finished: ',id
end program[/fortran]
    Unfortunately, it fails at execution in exactly the same way as the previous unmodified version. My conclusion: the current CAF implementation in the compiler is not quite right. My guess is that the final "sync all" does not check how many images are still alive; it just waits for all the images that were started. But that is only a supposition.
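One detail worth isolating: the quoted passage ties stat_stopped_image to an image that executed "stop" or "end program", whereas "error stop" is supposed to bring down every image, so no image should get as far as inspecting istat. A minimal sketch that exercises only the stat= path would therefore use a plain "stop" (the "crash #2" variant). The program name, the msg variable and the messages are illustrative, and whether a particular ifort version behaves as the draft describes is not something this sketch can confirm.

[fortran]program stoppedImageCheck
  use iso_fortran_env, only: stat_stopped_image
  implicit none
  integer :: id, istat
  character(len=128) :: msg

  id = this_image()
  istat = 0
  msg = ''

  ! Image 2 terminates normally; the quoted text covers stop and
  ! end program, not error stop.
  if (id == 2) stop

  sync all (stat=istat, errmsg=msg)
  if (istat == stat_stopped_image) then
     write(*,*) 'Stopped image detected by ', id
  else if (istat /= 0) then
     write(*,*) 'sync all failed on ', id, ': ', trim(msg)
  else
     write(*,*) 'Barrier passed by ', id
  end if
end program stoppedImageCheck[/fortran]

With two or more images, a conforming implementation would be expected to take the first branch on every image except image 2. With a single image nobody stops early, so the last branch runs.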
    best regards,
    Jerzy