On our Slurm cluster, even the simplest coarray program either hangs or segfaults, depending on the number of nodes and the number of specified images. With 2 nodes (12 cores each) I cannot run more than 16 or 20 images, depending on how I launch it; beyond that it segfaults. With 3 or more nodes the number of images I can use increases, but it always stays below the maximum.
The program is
program hello
  implicit none
  sync all
  write (*,*) "hello from image", this_image()
  sync all
end program hello
$ ifort --version
ifort (IFORT) 19.0.1.144 20181018
$ ifort -coarray=distributed -coarray-num-images=20 hello.f90 -o hello
16 images work, 20 images hang, and the maximum number of 24 images crashes:
$ ./hello
MPI startup(): I_MPI_SCALABLE_OPTIMIZATION environment variable is not supported.
MPI startup(): I_MPI_CAF_RUNTIME environment variable is not supported.
MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
In coarray image 20
Image PC Routine Line Source
hello 00000000004060D3 Unknown Unknown Unknown
libpthread-2.12.s 00007F48D65F67E0 Unknown Unknown Unknown
libicaf.so 00007F48D6A8E086 for_rtl_ICAF_BARR Unknown Unknown
hello 00000000004051DB Unknown Unknown Unknown
hello 0000000000405182 Unknown Unknown Unknown
libc-2.12.so 00007F48D6271D1D __libc_start_main Unknown Unknown
hello 0000000000405029 Unknown Unknown Unknown
Abort(0) on node 19 (rank 19 in comm 496): application called MPI_Abort(comm=0x84000002, 0) - process 19
etc.
In this mode, if I set I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so and run ./hello, I get no output at all, but it does not hang. Running it multiple times sometimes gives:
srun: error: slurm_send_recv_rc_msg_only_one to tev0107:42260 : Transport endpoint is not connected
srun: error: slurm_receive_msg[192.168.76.7]: Zero Bytes were transmitted or received
In the no_launch mode:
$ ifort -coarray=distributed -switch no_launch hello.f90 -o hello
I can consistently run 20 images without problems:
$ srun --mpi=pmi2 -n 20 ./hello
MPI startup(): I_MPI_CAF_RUNTIME environment variable is not supported.
MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
hello from image 3
hello from image 11
...
But beyond 20 images it will either hang or segfault again.
If I remove the sync all statements and keep just the write, I see a couple of "hello from image" messages, but after 5 or 6 the program just hangs. It does not segfault without the sync statements.
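That is, with the program reduced to just
program hello
  implicit none
  write (*,*) "hello from image", this_image()
end program hello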
We also have the 2018 release installed, and the problem does not occur there. With the 2018 release, however, even this trivial program with just a write and a sync all takes 30 seconds of CPU time to finish; an equivalent program using direct MPI calls finishes in a fraction of a second. I can also run any number of images in shared mode on a single machine (with oversubscription).
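For reference, the plain-MPI comparison program is essentially of this form (a minimal sketch rather than the exact code we ran, with barriers where the coarray version has sync all):
program hello_mpi
  use mpi
  implicit none
  integer :: ierr, rank
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  ! barriers roughly mirror the sync all statements of the coarray version
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  write (*,*) "hello from rank", rank
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  call MPI_Finalize(ierr)
end program hello_mpi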
The guide at https://software.intel.com/en-us/articles/distributed-memory-coarray-fortran-with-the-intel-fortran-compiler-for-linux-essential seems to be outdated, since the tcp fabric is no longer supported by Intel MPI.
Am I the first one to test the Coarray Fortran implementation with the 2019 suite, or is all of this just a user error? Is the Coarray Fortran implementation considered stable?
Hope you can help,
Tobias
There are known issues with the coarray support in 19.0.1 and 19.0.2. Supposedly this is all fixed in 19.0.3, which I am told is coming soon. As best I can tell from the outside, the compiler team was surprised by an update to Intel MPI in the Parallel Studio product that created compatibility issues. I do think that the Intel Product Validation people don't actually test coarrays, which is unfortunate. Maybe that has changed (or will change).
As support told me today, 19.0.3 is supposed to come out next week.
Unfortunately, the MPI issue is not fixed in Update 3. The fix is planned for a future release. We should start testing of the next compiler version in mid to late April; watch this forum for the test announcement.
Please use the following workaround:
export I_MPI_REMOVED_VAR_WARNING=0
export I_MPI_VAR_CHECK_SPELLING=0
That will turn off the error messages from MPI.
I do not see the segfault in any of the 19.0 versions.