Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Infinite waiting in coarray critical section

jerzyg
Beginner
Dear all,
I recently started using the Coarray Fortran feature included in Intel Fortran compiler 12.x. It's a great tool!
However, I experienced unexpected behavior when the "critical" construct is used. First, the synchronization at the critical section takes a very long time (a few seconds!). Second, the synchronization leads to a deadlock with active waiting.
Please see the attached reproducer. In principle, the code is based on the ISO/IEC JTC1/SC22/WG5 N1824 document, April 2010, page 20. The program is trivial.
[fortran]program coarrCriticalTest
implicit none
integer :: ic(10)
integer :: id, lastid
id = this_image()
lastid = num_images()
write(*,fmt=101) 'I am ', id, ' of ', lastid
critical
   write(*,fmt=100) 'Image ', id, 'entered critical section'
end critical
write(*,fmt=100) 'I am ', id, ' , just passed critical section'
! Explicit sync before the end, but it does not help...
sync all
100 format(A,1X,I2,1X,A)
101 format(2(A,1X,I2))
end program[/fortran]
Here is an example of the output from an execution on a 2-core CPU, with 8 images requested by setting FOR_COARRAY_NUM_IMAGES=8. Of course, the execution of the program is time dependent. The output from the code typically looks like this:
    I am 2 of 8
    I am 7 of 8
    I am 8 of 8
    I am 1 of 8
    Image 1 entered critical section
    I am 3 of 8
    I am 5 of 8
    I am 6 of 8
    I am 4 of 8
    I am 1 , just passed critical section
    Image 8 entered critical section
    I am 8 , just passed critical section
    Image 6 entered critical section
    I am 6 , just passed critical section
    Image 4 entered critical section
    I am 4 , just passed critical section
    Image 2 entered critical section
    I am 2 , just passed critical section
Usually it takes a couple of seconds to reach such a state, but it never comes to the point where ALL images have passed the critical section. Afterward, the program hangs and never finishes. The Windows Task Manager indicates that:
    a) 9 images coarrCriticalTest.exe are still running,
    b) one of them consumes ~50% of CPU,
    c) three other images take ~15% of CPU each.
The number of images with the lower CPU consumption is the same as the number of images that haven't passed the critical section.
This sort of problem is found in both the Debug and Release configurations. There is no difference whether I set the number of images with the FOR_COARRAY_NUM_IMAGES environment variable or the /Qcoarray-num-images compiler switch.
The code was compiled with Visual Fortran Compiler XE 12.1.2.278 [IA-32] and tested on a dual-core Centrino machine.
    I also checked the following versions:
* Intel Visual Fortran XE 2011, updates 7 and 6, IA-32, win32, dual-core Centrino: the same wrong results as described above.
* Intel Fortran 12.0.0 20101006 + intel_mpi/4.0.1.007, Linux em64t, 8-core Xeon, compiled with "-coarray=shared" and tested with various numbers of images: the code does not work as expected; similar synchronization problems occur as described above.
* Intel Fortran 12.1.0 20111011 + intel_mpi/4.0.3.008, Linux em64t, 8-core Xeon, compiled with "-coarray=shared" and executed with various numbers of images: the code works correctly, i.e. it can finish successfully, with no delays at synchronization.
    best regards,
    Jerzy
    jimdempseyatthecove
    Honored Contributor III
    Interesting.

In the coarray implementation you should use LOCK and UNLOCK (a coarray critical section amongst participating processes) as opposed to CRITICAL (an OpenMP critical section for OMP threads within a single process).

That said, your program should not have reached a deadlock.

Your test program does not have "use omp_lib". The CRITICAL should map to the OpenMP stub library (effectively making the critical a NOP). Whether it is a NOP or an intra-process thread-safe lock I cannot say, as this is an implementation issue. Try adding "use omp_lib" and see what happens (even though your intention may have been to use the LOCK).
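
For reference, a minimal sketch of the LOCK/UNLOCK alternative mentioned above, as defined by Fortran 2008 (untested here; assumes lock_type from the intrinsic iso_fortran_env module and a compiler with coarray support):

[fortran]program lockSketch
use, intrinsic :: iso_fortran_env, only: lock_type
implicit none
! One lock variable per image; all images contend for the copy on image 1
type(lock_type) :: guard[*]

lock(guard[1])    ! only one image at a time past this point
write(*,*) 'Image ', this_image(), ' holds the lock'
unlock(guard[1])

sync all
end program lockSketch[/fortran]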

    Jim Dempsey
    Steven_L_Intel1
    Employee
    I took the program and ran on two different systems. On a quad-core (8-thread) Nehalem, the program ran correctly with 12.1.2.278 every time. When I ran it on a dual-core Centrino machine, leaving the number of images undefined, it also was ok, but when I set the number of images to 8 I saw the behavior you reported. My guess is that there is some subtle synchronization issue that is made worse when the threads are oversubscribed. I will let the developers know.
    Steven_L_Intel1
    Employee
    Jim, this program is using Fortran 2008 coarrays and the critical section feature, not OpenMP.
    jimdempseyatthecove
    Honored Contributor III
The latest IVF Fortran docs (Parallel Studio XE 2011) list CRITICAL only under the OpenMP section, and not as a construct in the coarray section.

FWIW, IMHO, CRITICAL should be intra-process, not inter-process, since you may choose to combine OpenMP with coarrays (or use additional non-OpenMP threads within each process). The coarray LOCK can serve the same purpose as a critical section for a coarray program and, assuming you use the error return argument, for multi-threaded (OpenMP or other) coarray applications. Making CRITICAL reach out inter-process will drastically slow down entry/exit of critical sections used for thread-safe protection within a process.

    Jim Dempsey
    jerzyg
    Beginner
    Jim,
locks and critical sections are two different synchronization mechanisms offered by CAF (coarrays) -- see the WG5 N1824 draft I referred to in my post. Their purposes are subtly different: the former is well-suited to protecting data structures from concurrent modification, while the latter (critical sections) guarantees that exactly one image at a time can execute a given piece of code. I needed the second case. According to the documentation of ifort 12.x, both are supported by the compiler.
OpenMP is not an option for the application I am dealing with. I want to parallelize a legacy code on shared and distributed memory using CAF. This is a very convenient approach for the case at hand, because CAF offers a perfect separation of variables, common blocks, etc. They simply go to separate address spaces, thanks to CAF relying on heavyweight processes.
Skipping OpenMP is not a mistake. My test program does not use OpenMP, so there is no need to import omp_lib. Moreover, the underlying CAF implementation probably does not make calls to OpenMP; as far as I know, it uses Intel MPI as the transport layer.
    best regards,
    Jerzy
    Steven_L_Intel1
    Employee
    Jim,

    This form of critical sections is part of the Fortran 2008 standard and works across "images" in the coarray sense. Our implementation of this provides for both shared-memory and distributed-memory applications. We don't have the luxury of redefining the standard here.

There are of course other implementations of critical sections that may be more appropriate for some applications. But the Fortran standard kind has to work the way the standard describes.
    jimdempseyatthecove
    Honored Contributor III
    Fair enough.

It would also be "fair enough" to document CRITICAL in both the OpenMP and the CAF sections as a critical section spanning both systems when both are in use, as well as what this means for performance.

For users of OpenMP and CAF in the same application, it would also be "fair enough" to document the use of the alternative OpenMP locks as well as the CAF lock, depending on which the user really wants.

It would help to include a sample program (OpenMP with CAF, and the three methods of protecting a critical section).
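
A rough sketch of what such a sample might look like (hypothetical, untested; assumes a compiler supporting both coarrays and OpenMP, e.g. compiled with both the coarray and OpenMP switches enabled):

[fortran]program threeWays
use, intrinsic :: iso_fortran_env, only: lock_type
implicit none
type(lock_type) :: guard[*]   ! lock variable; all images use the copy on image 1

! 1) Fortran 2008 CRITICAL: one image at a time executes the block
critical
   write(*,*) 'CAF critical: image', this_image()
end critical

! 2) Fortran 2008 LOCK/UNLOCK: same mutual exclusion, explicit lock variable
lock(guard[1])
write(*,*) 'CAF lock: image', this_image()
unlock(guard[1])

! 3) OpenMP critical: one thread at a time WITHIN an image
!$omp parallel
!$omp critical
write(*,*) 'OMP critical: image', this_image()
!$omp end critical
!$omp end parallel

sync all
end program threeWays[/fortran]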

    Jim Dempsey
    jerzyg
    Beginner
    Steve,
thanks for your prompt response. Indeed, the problem usually appears when the number of images is higher than the number of cores. On a two quad-core CPU (Nehalem) Linux system that I can use for testing, the problem appears if the code is compiled with 12.0 and executed with e.g. FOR_COARRAY_NUM_IMAGES=16. Surprisingly, if I update the compiler to version 12.1.0 20111011 (I can do that easily on the cluster I use) and do not recompile the code, then running 16 images works fine! This seems to suggest that the run-time libraries are flawed, not the compiler itself.
    best regards,
    Jerzy
    Steven_L_Intel1
    Employee
    There are three components at work here: the compiler, the run-time library and Intel MPI. I agree that the compiler is not likely to be at fault. Whether it is in the library or Intel MPI, though, I don't know. The library makes the same calls no matter how many images there are, so who knows? We will investigate and thank you for the nice test case.