One of my team members, based in Russia, is accessing an NFS installation of Intel MPI 5.1.0.038 located at a US site. When she runs the simple ring application test.c with four processes, one process per node, she encounters a segmentation fault. This does not happen for the team members based at US sites. The segmentation fault also does not occur when the application is executed on a single node, the login node.
The test.c application was compiled by each team member as follows (in a user-specific scratch space in the US NFS allocation):
mpiicc -g -o testc-intelMPI test.c
To run the executable, we use:
mpirun -n 4 -perhost 1 -env I_MPI_FABRICS tcp -hostfile /nfs/<pathTo>/machines.LINUX ./testc-intelMPI
For the US-based team members, the output is as follows:
Hello world: rank 0 of 4 running on <hostname1>
Hello world: rank 1 of 4 running on <hostname2>
Hello world: rank 2 of 4 running on <hostname3>
Hello world: rank 3 of 4 running on <hostname4>
When my Russian team member executes this in the same manner, the segmentation fault message states:
/nfs/<pathTo>/intel-5.1.0.038/compilers_and_libraries_2016.0.079/linux/mpi/intel64/bin/mpirun: line 241: 7902 Segmentation fault (core dumped) mpiexec.hydra "$@" 0<&0
When using gdb, we learn the following:
Program received signal SIGSEGV, Segmentation fault.
mfile_fn (arg=0x0, argv=0x49cdc8) at ../../ui/mpich/utils.c:448
We do not have the source files with this installation and are unable to inspect utils.c.
Conversely, when we run on just the login node with:
mpirun -n 4 -perhost 1 ./testc-intelMPI
no segmentation fault occurs:
Hello world: rank 0 of 4 running on <loginHostname>
Hello world: rank 1 of 4 running on <loginHostname>
Hello world: rank 2 of 4 running on <loginHostname>
Hello world: rank 3 of 4 running on <loginHostname>
Please let me know if you have any suggestions for changing the environment so that my Russian team member can run this code correctly.
Hello Artem and others,
Thank you for your suggestion. We resolved the issue earlier today. The original execution by the team member had a typo; when repeated today with the '-v' option and the correct mpirun parameters, it ran as expected.
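For anyone who runs into a similar symptom: since the cause here was a typo in the mpirun parameters, a small pre-flight check can catch one class of such mistakes, a mistyped or empty hostfile path, before the launcher is invoked. The following is only a sketch under that assumption. check_hostfile is a hypothetical helper name of my own, not part of Intel MPI, and it does not validate any other mpirun option.

```shell
#!/bin/sh
# Hypothetical helper (not part of Intel MPI): confirm that a hostfile
# exists, is a readable regular file, and lists at least one host.
# On success, prints the number of host lines and returns 0.
check_hostfile() {
    hostfile=$1

    if [ -z "$hostfile" ]; then
        echo "error: no hostfile path given" >&2
        return 1
    fi

    if [ ! -f "$hostfile" ] || [ ! -r "$hostfile" ]; then
        echo "error: hostfile '$hostfile' is missing or unreadable" >&2
        return 1
    fi

    # Count lines that are neither blank nor comments.
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    hosts=$(grep -c -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$hostfile" || true)

    if [ "$hosts" -eq 0 ]; then
        echo "error: hostfile '$hostfile' lists no hosts" >&2
        return 1
    fi

    echo "$hosts"
}
```

One way to use it is to chain the check in front of the original command, for example: check_hostfile /nfs/<pathTo>/machines.LINUX && mpirun -n 4 -perhost 1 -env I_MPI_FABRICS tcp -hostfile /nfs/<pathTo>/machines.LINUX ./testc-intelMPI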