Intel® MPI Library

SIGFPE with mpiexec.hydra for Intel MPI 2019 update 7

Bart_O_

If I use Intel MPI 2019 update 7 in a Slurm configuration on two cores on two separate nodes, I get a SIGFPE at the following location (according to gdb on the generated core file):

#0 0x00000000004436ed in ipl_create_domains (pi=0x0, scale=4786482) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_service.c:2240

This happens only when launching through mpirun / mpiexec.hydra, e.g. "mpirun -n 2 ./test".

I know of three workarounds, any of which lets the job run successfully, but I thought you and others should know about this crash:

1. Set I_MPI_PMI_LIBRARY=libpmi2.so and use "srun -n 2 ./test" (with Slurm configured to use pmi2).

2. Use I_MPI_HYDRA_TOPOLIB=ipl

3. Use the "legacy" mpiexec.hydra.
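
For readers who want to try these, the three workarounds could be sketched as shell functions along the following lines. This is only an illustration under stated assumptions, not a verified recipe: the `run` dry-run helper is a hypothetical convenience, the explicit `--mpi=pmi2` flag is an assumption for sites where pmi2 is not already the Slurm default, and the legacy launcher path under `${I_MPI_ROOT}/intel64/bin/legacy` is an assumption about the install layout that may differ on your system.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Workaround 1: let Slurm launch the ranks through PMI2 instead of mpirun.
# Assumes Slurm has PMI2 support and libpmi2.so can be found by the loader.
workaround_srun_pmi2() {
    run env I_MPI_PMI_LIBRARY=libpmi2.so srun --mpi=pmi2 -n 2 ./test
}

# Workaround 2: keep mpirun, but select the "ipl" topology library.
workaround_topolib_ipl() {
    run env I_MPI_HYDRA_TOPOLIB=ipl mpirun -n 2 ./test
}

# Workaround 3: use the legacy Hydra launcher.
# The path below is an assumed install location; adjust it for your site.
workaround_legacy_hydra() {
    run "${I_MPI_ROOT}/intel64/bin/legacy/mpiexec.hydra" -n 2 ./test
}
```

Calling, say, `workaround_topolib_ipl` only prints the command; `DRY_RUN=0 workaround_topolib_ipl` would actually execute it inside a Slurm allocation.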

Bart_O_

Some more details:

OS: CentOS 7.7, Linux blg8616.int.ets1.calculquebec.ca 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz GenuineIntel GNU/Linux

If I run the test program (built with "mpiicc ${I_MPI_ROOT}/test/test.c -g -o test") with "I_MPI_HYDRA_TOPOLIB=ipl" set, I get this:

[oldeman@blg8616 test]$ I_MPI_DEBUG=5 mpirun -n 2 ./test
[0] MPI startup(): libfabric version: 1.10.0a1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): Rank    Pid      Node name                         Pin cpu
[0] MPI startup(): 0       173711   blg8616.int.ets1.calculquebec.ca  37
[0] MPI startup(): 1       220728   blg8621.int.ets1.calculquebec.ca  29
[0] MPI startup(): I_MPI_CC=icc
[0] MPI startup(): I_MPI_CXX=icpc
[0] MPI startup(): I_MPI_FC=ifort
[0] MPI startup(): I_MPI_F90=ifort
[0] MPI startup(): I_MPI_F77=ifort
[0] MPI startup(): I_MPI_ROOT=/cvmfs/soft.computecanada.ca/easybuild/software/2019/avx2/Compiler/intel2020/intelmpi/2019.7.217
[0] MPI startup(): I_MPI_LINK=opt
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=ipl
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP=slurm
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=5
Hello world: rank 0 of 2 running on blg8616.int.ets1.calculquebec.ca
Hello world: rank 1 of 2 running on blg8621.int.ets1.calculquebec.ca

But without that variable set, I get the following (the same happens when launching just "hostname" instead of ./test):

[oldeman@blg8616 test]$ I_MPI_DEBUG=5 mpirun -n 2 ./test
srun: error: blg8621: task 1: Floating point exception (core dumped)
srun: error: blg8616: task 0: Floating point exception (core dumped)

[mpiexec@blg8616.int.ets1.calculquebec.ca] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:528): downstream from host blg8616 exited with status 136
[mpiexec@blg8616.int.ets1.calculquebec.ca] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:2114): assert (exitcodes != NULL) failed
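
A note on reading the log above: the "exited with status 136" line is consistent with the SIGFPE reported by gdb, assuming the usual convention that a process killed by signal N is reported with exit status 128 + N. A minimal sketch for decoding such a status (the variable names are illustrative only):

```shell
status=136                     # exit status taken from the mpiexec message
signum=$((status - 128))       # death-by-signal is reported as 128 + signal number
signame=$(kill -l "$status")   # POSIX kill -l maps an exit status to a signal name
echo "signal ${signum} (SIG${signame})"   # on Linux this prints: signal 8 (SIGFPE)
```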

 

PrasanthD_intel
Moderator

Hi Bart,

Thanks for reaching out to us.

We will investigate this issue further and will get back to you soon.

 

Thanks

Prasanth

DrAmarpal_K_Intel

Hi Bart,


Intel MPI Library 2019 U8 has just been released. Could you please rerun your experiments with mpiexec.hydra and report your findings?


Best regards,

Amar


DrAmarpal_K_Intel

Hi Bart,


Having not received your response for over a month, I am going ahead and closing this thread. Whenever Intel MPI Library's native process manager is not used, we recommend setting the PMI library explicitly using the I_MPI_PMI_LIBRARY environment variable.


For more details, please refer to the following links:

[1] https://software.intel.com/content/www/us/en/develop/articles/how-to-use-slurm-pmi-with-the-intel-mpi-library-for-linux.html

[2] https://slurm.schedmd.com/mpi_guide.html


This issue will be treated as resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


Best regards,

Amar

