I'd like to report problems with the termination of Coarray Fortran (CAF) programs.
If one of the images terminates, whether through "stop" or "error stop", by opening a non-existent file read-only, or by calling the subroutine "exit", the other images do not terminate and the entire group of images hangs forever. Moreover, if there is an explicit barrier in the code right after the place where one of the images is supposed to crash, the other images can cross the barrier.
Please find a reproducer below. The program simulates an error termination of one image.
This is done in line 13.
[fortran]program coarrCrashTest implicit none integer :: ic(10)
Typical output from the execution (8 images are started by setting the environment variable FOR_COARRAY_NUM_IMAGES=8) looks like the listing below:
Image 7 of 8
Image 2 of 8
Image 5 of 8
Image 8 of 8
Image 4 of 8
Image 1 of 8
Image 3 of 8
Image 6 of 8
About to crash: 2
Finished: 5
Finished: 8
Finished: 1
Finished: 7
Finished: 3
Finished: 4
Finished: 6
Then the program hangs and never finishes. The lines containing "Finished: " clearly come from the images that passed the "sync all" barrier in line 19. The Windows Task Manager shows 9 (nine) "coarrCrashTest.exe" processes running. Eight processes have almost identical memory usage; one of these processes consumes a bit less memory. There is a noticeable CPU utilization pattern: all images consume approx. 13% of CPU, except for the one with the lowest memory utilization, which takes ~0% of CPU. This group of processes runs indefinitely until I either terminate one with the "End process" button in the Windows Task Manager or close the console window.
The reproducer also contains other "typical" ways to terminate the program (commented out), cf. lines 13 - 17 in the code, marked "crash #1" to "crash #5". The lines "crash #1" to "crash #4" result in behavior identical to that described above.
The only crash condition that actually terminates the execution is the one in line 17 (marked "crash #5"). If we comment out lines 12-16 and uncomment line 17, the program terminates in the expected way. Here is the output from such a crash:
Image 7 of 8
Image 1 of 8
Image 2 of 8
Image 4 of 8
Image 3 of 8
Image 5 of 8
Image 8 of 8
Image 6 of 8
About to crash: 2
job aborted:
rank: node: exit code[: error message]
0: localhost: 123
1: localhost: 3: process 1 exited without calling finalize
2: localhost: 123
3: localhost: 123
4: localhost: 123
5: localhost: 123
6: localhost: 123
7: localhost: 123
Quoting from ISO/IEC JTC1/SC22/WG5 N1824 (April 2010), p. 24:
The statement error stop has been introduced. When executed on one image, it initiates error termination there and hence causes all other images that have not already initiated error termination to initiate error termination. It thus causes the whole calculation to stop as soon as is practicable.
According to the cited document, the canonical way to terminate a CAF program exceptionally is the "error stop" statement. Apparently, the ifort CAF implementation does not terminate the images as expected.
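The behavior described in the passage quoted above can be illustrated with a minimal sketch. This is not the reproducer from this post; the program name and the stop-code text are made up for illustration. Under the quoted wording one would expect every image to terminate once image 2 executes "error stop", rather than wait at the barrier:

[fortran]
program error_stop_sketch   ! hypothetical name, for illustration only
  implicit none
  integer :: me

  me = this_image()
  write(*,*) 'Image ', me, ' of ', num_images()

  if (me == 2) then
    ! Error termination on one image; per the quoted passage, all other
    ! images should then initiate error termination as well.
    error stop 'image 2 requested error termination'
  end if

  ! The remaining images wait here. Under the quoted wording the run
  ! should end at this point instead of hanging.
  sync all
  write(*,*) 'Finished: ', me
end program error_stop_sketch
[/fortran]

When started with a single image, the branch for image 2 is never taken and the program simply finishes.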
The code was tested on WinXP, 32-bit, dual-core Centrino.
It seems that Visual Fortran Compiler XE 12.1.2.278 [IA-32] and earlier versions are affected.
The reproducer code was also tested on a Linux EM64T 8-core machine with the following versions of ifort:
* ifort (IFORT) 12.1.0 20111011
* ifort (IFORT) 12.0.0 20101006
It seems that the described behaviour also appears in these versions.
best regards,
Jerzy
4 Replies
The problem you have is that SYNC ALL, if entered by one image, must have a corresponding SYNC ALL entered by all other images. Exiting an image while any other image will subsequently issue SYNC ALL will cause a lockup.
Add a shared coarray logical variable TerminateApplication, initialized to .false.:
[fortran]
! In a module visible to all program units: one flag per image.
logical :: TerminateApplication[*] = .false.

! Returns .true. if any image has requested termination.
logical function CheckTerminate()
  integer :: i
  CheckTerminate = .true.
  sync memory
  do i = 1, num_images()
    if (TerminateApplication[i]) return
  end do
  CheckTerminate = .false.
end function CheckTerminate

! Sets this image's flag, waits until every image has set its own, then exits.
subroutine TerminateNow()
  integer :: i
  TerminateApplication = .true.
  do while (.true.)
    sync memory
    do i = 1, num_images()
      if (.not. TerminateApplication[i]) exit ! back to while(.true.)
      if (i == num_images()) then
        error stop 'exit program'
      end if
    end do
  end do
end subroutine TerminateNow
...
! periodically in code
if (CheckTerminate()) call TerminateNow()
...
[/fortran]
Jim Dempsey
If only it were that easy, Jim. In fact, the Fortran standard explicitly specifies the behavior of ERROR STOP, which is to terminate all images. This feature was added to the language just for this purpose.
Then your sync all is broken, as demonstrated by Jerzy.
Are you implying that Jerzy must wait for the compiler to get fixed?
At least until the compiler is fixed, Jerzy could implement the "hack" I laid out earlier (assuming it works).
Jim
Jim,
thanks for the hack you proposed. I'm afraid that you want to redevelop something that is already included in the standard. As Steve pointed out, the "error stop" construct was introduced precisely for the purpose of handling emergency exits.
But that is not the end of the story; it gets worse. Please read this excerpt from the draft ISO/IEC JTC1/SC22/WG5 N1824, p. 24, sec. 12.7 (sorry for the long quotation):
All the synchronization statements, that is, sync all, sync images, lock, unlock, and sync memory, have optional stat= and errmsg= specifiers. They have the same role for these statements as they do for allocate and deallocate in Fortran 2003. If any of these statements, including allocate and deallocate, encounter an image that has executed a stop or end program statement and have a stat= specifier, the stat= variable is given the value of the constant stat_stopped_image in the iso_fortran_env intrinsic module, and the effect of executing the statement is otherwise the same as that of executing the sync memory statement. Without a stat= specifier, the execution of such a statement initiates error termination (Section 13).
So the standard offers a way to check whether any of the images has terminated prematurely. Basically, that's what your code implements :-) Following the standard-based approach, I made a small correction to my original reproducer:
[fortran]
program coarrCrashTest
  use iso_fortran_env
  implicit none
  integer :: ic(10)
  integer :: id, lastid, k = 1, n = 0
  integer :: istat = 0
  !
  id = this_image()
  lastid = num_images()
  !
  istat = 0
  write(*,*) 'Image ', id, ' of ', lastid
  sync all ! Synchronization here, but it does not help...
  if (id == 2) then
    write(*,*) 'About to crash:',id
    error stop ! <-- crash #1
    ! stop ! <-- crash #2
    ! open(unit=100,file='f.dat',status='old') ! <-- crash #3
    ! call exit(1) ! <-- crash #4
    ! k = k / n ! <-- crash #5
  endif
  sync all(stat=istat) ! Synchronization here, this time we check exit status
  if (istat == stat_stopped_image) then
    write(*,*) 'Stopped image detected by ', id
  end if
  write(*,*) 'Finished: ',id
end program
[/fortran]
Unfortunately, it fails at run time in exactly the same way as the previous, unmodified version. My conclusion: the current CAF implementation in the compiler is somewhat wrong. I guess that the final "sync all" does not actually check how many images are still present in the flock; it just waits for all images that were started. But that is only a supposition.
best regards,
Jerzy
