Bart_O_ (Novice)

SIGFPE with mpiexec.hydra for Intel MPI 2019 update 7

If I use Intel MPI 2019 update 7 under Slurm on two cores on two separate nodes, I get a SIGFPE here (according to gdb on the generated core file):

#0 0x00000000004436ed in ipl_create_domains (pi=0x0, scale=4786482) at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_service.c:2240
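
A minimal sketch of how such a backtrace can be extracted from a core dump (the file names are examples only; per the source path above, the faulting code is in the Hydra process manager, e.g. one of the hydra proxy binaries under $I_MPI_ROOT/intel64/bin, not the MPI application itself):

# example names only; substitute whichever binary actually dumped core and its core file
gdb -batch -ex bt $I_MPI_ROOT/intel64/bin/hydra_bstrap_proxy ./core.12345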

This happens only with mpirun / mpiexec.hydra, e.g. "mpirun -n 2 ./test".

I know of three workarounds, any of which lets me run this successfully, but I thought you or others should know about this crash (a rough shell sketch of the workarounds follows the list):

1. Set I_MPI_PMI_LIBRARY=libpmi2.so and use "srun -n 2 ./test" (with Slurm configured to use pmi2).

2. Use I_MPI_HYDRA_TOPOLIB=ipl

3. Use the "legacy" mpiexec.hydra.
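
Roughly, the three workarounds look like this (treat the library and launcher paths as examples; they depend on the installation):

# Workaround 1: skip mpiexec.hydra and let srun launch the ranks directly.
# (--mpi=pmi2 is only needed if pmi2 is not the cluster's Slurm default.)
export I_MPI_PMI_LIBRARY=libpmi2.so    # or a full path such as /usr/lib64/libpmi2.so
srun --mpi=pmi2 -n 2 ./test

# Workaround 2: keep mpirun, but force the "ipl" topology library.
I_MPI_HYDRA_TOPOLIB=ipl mpirun -n 2 ./test

# Workaround 3: use the legacy launcher shipped with Intel MPI 2019
# (exact location may differ per install).
$I_MPI_ROOT/intel64/bin/legacy/mpiexec.hydra -n 2 ./test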

Bart_O_ (Novice)

Some more details:

OS: CentOS 7.7, Linux blg8616.int.ets1.calculquebec.ca 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz GenuineIntel GNU/Linux

If I run the test program (built with "mpiicc ${I_MPI_ROOT}/test/test.c -g -o test") with I_MPI_HYDRA_TOPOLIB=ipl set, I get this:

[oldeman@blg8616 test]$ I_MPI_DEBUG=5 mpirun -n 2 ./test
[0] MPI startup(): libfabric version: 1.10.0a1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): Rank    Pid      Node name                         Pin cpu
[0] MPI startup(): 0       173711   blg8616.int.ets1.calculquebec.ca  37
[0] MPI startup(): 1       220728   blg8621.int.ets1.calculquebec.ca  29
[0] MPI startup(): I_MPI_CC=icc
[0] MPI startup(): I_MPI_CXX=icpc
[0] MPI startup(): I_MPI_FC=ifort
[0] MPI startup(): I_MPI_F90=ifort
[0] MPI startup(): I_MPI_F77=ifort
[0] MPI startup(): I_MPI_ROOT=/cvmfs/soft.computecanada.ca/easybuild/software/2019/avx2/Compiler/intel2020/intelmpi/2019.7.217
[0] MPI startup(): I_MPI_LINK=opt
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=ipl
[0] MPI startup(): I_MPI_HYDRA_BOOTSTRAP=slurm
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=5
Hello world: rank 0 of 2 running on blg8616.int.ets1.calculquebec.ca
Hello world: rank 1 of 2 running on blg8621.int.ets1.calculquebec.ca

but without that variable set, it gives me this (and the same happens with just "hostname"):

[oldeman@blg8616 test]$ I_MPI_DEBUG=5 mpirun -n 2 ./test
srun: error: blg8621: task 1: Floating point exception (core dumped)
srun: error: blg8616: task 0: Floating point exception (core dumped)

[mpiexec@blg8616.int.ets1.calculquebec.ca] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:528): downstream from host blg8616 exited with status 136
[mpiexec@blg8616.int.ets1.calculquebec.ca] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:2114): assert (exitcodes != NULL) failed

 

PrasanthD_intel (Moderator)

Hi Bart,

Thanks for reaching out to us.

We will investigate this issue further and will get back to you soon.

 

Thanks

Prasanth


Hi Bart,


Intel MPI Library 2019 U8 has just been released. Could you please rerun your experiments with mpiexec.hydra and report your findings?


Best regards,

Amar



Hi Bart,


Having not received your response for over a month, I am going ahead and closing this thread. Whenever the Intel MPI Library's native process manager is not used, we recommend setting the PMI library explicitly using the I_MPI_PMI_LIBRARY environment variable.
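
For example, a minimal Slurm job script along these lines (the libpmi2.so path and the --mpi option are site-specific and shown only as an illustration):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
# Point Intel MPI at Slurm's PMI2 library (path is an example) and launch with srun.
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
srun --mpi=pmi2 -n 2 ./test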


For more details, please refer to the following links -

[1] https://software.intel.com/content/www/us/en/develop/articles/how-to-use-slurm-pmi-with-the-intel-mp...

[2] https://slurm.schedmd.com/mpi_guide.html


This issue will be treated as resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


Best regards,

Amar

