
Coarray Fortran programs hang with certain number of images

On our Slurm cluster, depending on the number of nodes and the number of images specified, even the simplest coarray program either hangs or segfaults. With 2 nodes (12 cores each) I cannot run more than 16 or 20 images, depending on how I launch the program; otherwise it will hang or segfault. With 3 or more nodes the number of images I can use increases, but it always stays below the maximum.

The program is

program hello
      implicit none
      sync all
      write (*,*) "hello from image", this_image()
      sync all
end program hello

 

$ ifort --version
ifort (IFORT) 19.0.1.144 20181018

$ ifort -coarray=distributed -coarray-num-images=20 hello.f90 -o hello

16 images work, 20 hang, and the maximum of 24 crashes:

$ ./hello
MPI startup(): I_MPI_SCALABLE_OPTIMIZATION environment variable is not supported.
MPI startup(): I_MPI_CAF_RUNTIME environment variable is not supported.
MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
In coarray image 20
Image              PC                Routine            Line        Source             
hello              00000000004060D3  Unknown               Unknown  Unknown
libpthread-2.12.s  00007F48D65F67E0  Unknown               Unknown  Unknown
libicaf.so         00007F48D6A8E086  for_rtl_ICAF_BARR     Unknown  Unknown
hello              00000000004051DB  Unknown               Unknown  Unknown
hello              0000000000405182  Unknown               Unknown  Unknown
libc-2.12.so       00007F48D6271D1D  __libc_start_main     Unknown  Unknown
hello              0000000000405029  Unknown               Unknown  Unknown

Abort(0) on node 19 (rank 19 in comm 496): application called MPI_Abort(comm=0x84000002, 0) - process 19

etc.

In this mode, if I set I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so and run ./hello, I get no output at all, but it does not hang. Running it several times occasionally gives:

srun: error: slurm_send_recv_rc_msg_only_one to tev0107:42260 : Transport endpoint is not connected
srun: error: slurm_receive_msg[192.168.76.7]: Zero Bytes were transmitted or received

 

In the no_launch mode:

$ ifort -coarray=distributed -switch no_launch hello.f90 -o hello

I can consistently run 20 images without problems:

$ srun --mpi=pmi2 -n 20 ./hello
MPI startup(): I_MPI_CAF_RUNTIME environment variable is not supported.
MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
 hello from image           3
 hello from image          11

...

but beyond 20 it will either hang or segfault again.
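To pin down exactly which image counts work, the launch line above can be wrapped in a time limit so that a hang is reported instead of blocking the terminal. This is only a rough sketch: the helper names classify_run and sweep_images are made up for illustration, and it assumes GNU coreutils timeout is available on the login node and that srun exits when it receives the signal from timeout.

```shell
#!/bin/sh
# classify_run LIMIT CMD...
# Run CMD under a time limit and print ok, hang, or fail(<exit status>).
# (Hypothetical helper; not part of Slurm or the Intel tools.)
classify_run() {
    limit=$1; shift
    timeout "$limit" "$@" >/dev/null 2>&1
    status=$?
    case $status in
        0)   echo ok ;;
        124) echo hang ;;            # GNU timeout exits with 124 when the limit expires
        *)   echo "fail($status)" ;; # any other non-zero status, e.g. after a segfault
    esac
}

# sweep_images LIMIT COUNT...
# Try each image count with the no_launch srun line from this post.
sweep_images() {
    limit=$1; shift
    for n in "$@"; do
        printf '%s images: %s\n' "$n" \
            "$(classify_run "$limit" srun --mpi=pmi2 -n "$n" ./hello)"
    done
}
```

Inside an allocation, something like `sweep_images 60 16 18 20 22 24` would then print one line per image count. The 60 second limit is arbitrary and would need to be longer than a normal run takes.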

 

If I remove the sync all statements and keep only the write, I see a few "hello from image" messages, but after 5 or 6 the program hangs. Without the sync statements it does not segfault.

 

We also have the 2018 release installed, and the problem does not occur there. But with the 2018 release even this trivial program, with just a write and a sync all, takes 30 seconds of CPU time to finish. An equivalent program using MPI calls directly finishes in a fraction of a second. I can also run any number of images in shared mode on a single machine (with oversubscription).

The guide at https://software.intel.com/en-us/articles/distributed-memory-coarray-fortran-with-the-intel-fortran-... seems to be outdated, since the TCP fabric is no longer supported by Intel MPI.

Am I the first one to test the Coarray Fortran implementation with the 2019 suite, or is all of this just user error? Is the Coarray Fortran implementation considered stable?

Hope you can help,

Tobias

 

 

3 Replies

There are known issues with the coarray support in 19.0.1 and 19.0.2. Supposedly this is all fixed in 19.0.3, which I am told is coming soon. As best as I can tell from the outside, the compiler team was surprised by an update to Intel MPI in the Parallel Studio product that created compatibility issues. I do think that the Intel Product Validation people don't actually test coarrays, which is unfortunate. Maybe that has changed (or will change.)

--
Steve (aka "Doctor Fortran") - https://stevelionel.com/drfortran

As support told me today, 19.0.3 is supposed to come out next week.

Unfortunately, the MPI issue is not fixed in Update 3. The fix is planned for a future release. We should start testing the next compiler version in mid to late April; watch this forum for the test announcement.

Please use the following workaround:

export I_MPI_REMOVED_VAR_WARNING=0
export I_MPI_VAR_CHECK_SPELLING=0

That will turn off the error messages from MPI.
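If you would rather not change your whole login environment, the two settings can be limited to a single launch. A minimal sketch follows; the helper names quiet_mpi_env and run_quiet are made up for illustration. It only silences the startup messages and is not expected to affect the hang or the segfault.

```shell
#!/bin/sh
# quiet_mpi_env: export the two workaround settings given above.
# (Hypothetical helper name, not an Intel MPI tool.)
quiet_mpi_env() {
    export I_MPI_REMOVED_VAR_WARNING=0
    export I_MPI_VAR_CHECK_SPELLING=0
}

# run_quiet CMD...
# Run CMD in a subshell with the settings applied, so the calling
# shell's environment is left unchanged.
run_quiet() {
    ( quiet_mpi_env; "$@" )
}
```

For example, `run_quiet srun --mpi=pmi2 -n 20 ./hello` reuses the launch line from the original post.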

I do not see the segfault in any of the 19.0 versions.

Devorah - Intel® Developer Support