Intel® MPI Library

Intel MPI update 7 on Mellanox IB causes mpi processes to hang

SoftWeb_V_
Beginner

The recently released Intel MPI 2019 update 7 does not work at all on our nodes with Mellanox InfiniBand network adapters. All MPI processes hang indefinitely without any error message or warning. This behaviour occurs regardless of FI_PROVIDER; we have tested mlx, verbs and tcp.

 

Characteristics of our system:

  • CPU: 2x Intel(R) Xeon(R) Gold 6126
  • Adapter: Mellanox Technologies MT27700 Family [ConnectX-4]
  • Operating System: CentOS 7.7
  • Related libraries: ICC v2020.1, Intel MPI v2019.7, UCX v1.5.1, OFED v4.7-3.2.9

Steps to reproduce:

  1. Start a job on two different nodes on the Infiniband network
  2. Compile and execute the minimal test program from Intel MPI v2019.7 with mpirun (I also enabled debug output)
    $ mpicc /path/to/impi-2019.7/test/test.c -o test
    $ I_MPI_DEBUG=4 I_MPI_HYDRA_DEBUG=on FI_LOG_LEVEL=debug mpirun ./test

Result:

$ I_MPI_DEBUG=4 I_MPI_HYDRA_DEBUG=on FI_LOG_LEVEL=debug mpirun ./test
[mpiexec@node357.hydra.os] Launch arguments: /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node357.hydra.brussel.vsc --upstream-port 42848 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9 
[mpiexec@node357.hydra.os] Launch arguments: /usr/bin/ssh -q -x node356.hydra.brussel.vsc /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node357.hydra.brussel.vsc --upstream-port 42848 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 1 --node-id 1 --subtree-size 1 /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9 
[proxy:0:0@node357.hydra.os] Warning - oversubscription detected: 1 processes will be placed on 0 cores
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:1@node356.hydra.os] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get_maxes
[proxy:0:1@node356.hydra.os] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get_appnum
[proxy:0:1@node356.hydra.os] PMI response: cmd=appnum appnum=0
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get_my_kvsname
[proxy:0:1@node356.hydra.os] PMI response: cmd=my_kvsname kvsname=kvs_309381_0
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get kvsname=kvs_309381_0 key=PMI_process_mapping
[proxy:0:1@node356.hydra.os] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=barrier_in

(execution does not stop, it just hangs at this point indefinitely)

The system log of the node shows the following entry:

traps: hydra_pmi_proxy[549] trap divide error ip:4436ed sp:7ffed012ef50 error:0 in hydra_pmi_proxy[400000+ab000]
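
For anyone reading a kernel trap line like the one above: "trap divide error" is the x86 divide exception (typically an integer division by zero), which Linux delivers to the process as SIGFPE. The `ip` field is the faulting instruction pointer, and the bracketed pair after the binary name is the base address and size of the mapping. The following is only a sketch, assuming a POSIX shell with sed and a log line laid out exactly like the ones in this thread; `trap_offset` is a made-up helper name, not part of any Intel or kernel tool.

```shell
# trap_offset: hypothetical helper (not an Intel MPI or kernel tool).
# Takes one "traps:" line from the kernel log and prints the offset of the
# faulting instruction pointer (ip) from the start of the mapped binary.
trap_offset() {
    # Hex digits after "ip:".
    ip=$(printf '%s\n' "$1" | sed -n 's/.* ip:\([0-9a-f]*\) .*/\1/p')
    # Hex digits before the "+" inside "[base+size]".
    base=$(printf '%s\n' "$1" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*\].*/\1/p')
    if [ -z "$ip" ] || [ -z "$base" ]; then
        echo "unrecognized trap line" >&2
        return 1
    fi
    printf '0x%x\n' "$(( 0x$ip - 0x$base ))"
}

# Line copied from the system log above; prints 0x436ed
trap_offset 'traps: hydra_pmi_proxy[549] trap divide error ip:4436ed sp:7ffed012ef50 error:0 in hydra_pmi_proxy[400000+ab000]'
```

Because the mapping starts at 0x400000, which is usual for a non-relocatable executable, the raw `ip` value could probably also be passed straight to a symbolizer such as `addr2line -e hydra_pmi_proxy 0x4436ed`. That is only informative if the binary carries symbol information.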

 

Expected result:

$ mpirun ./test
Hello world: rank 0 of 2 running on node357.hydra.os
Hello world: rank 1 of 2 running on node356.hydra.os

 

Could you help us fix this issue with Intel MPI update 7 in our system?

Is there anything that we can do to better troubleshoot it?

 

Regards,

Alex Domingo

PrasanthD_intel
Moderator

Hi,

We tried to reproduce the issue but did not encounter any errors.

Our hardware configuration is:

  Adapter: Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]

  OFED : mlnx-5.0-1.0.0.0

  Operating System: CentOS 7.7

The logs you have provided do not contain enough information to debug the problem.

Can you try compiling with mpiicc instead of mpicc? mpicc uses the gcc compiler rather than icc.

Can you also send us the output of running mpirun and mpiexec.hydra with just hostname:

mpirun hostname

mpiexec.hydra hostname

 

Thanks

Prasanth

 

 

SoftWeb_V_
Beginner

It's true that the logs I can provide do not give much information, but I already enabled all debug flags as far as I know. Is there anything else that can be done to increase the verbosity?

I tried compiling the test program with mpiicc and the result is exactly the same as reported in my original post.

Please find below the output with the hostname command, as suggested:

❯ I_MPI_DEBUG=4 I_MPI_HYDRA_DEBUG=on FI_LOG_LEVEL=debug mpirun hostname
[mpiexec@node377.hydra.os] Launch arguments: /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node377.hydra.brussel.vsc --upstream-port 33996 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[mpiexec@node377.hydra.os] Launch arguments: /usr/bin/ssh -q -x node375.hydra.brussel.vsc /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node377.hydra.brussel.vsc --upstream-port 33996 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 1 --node-id 1 --subtree-size 1 /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[proxy:0:0@node377.hydra.os] Warning - oversubscription detected: 1 processes will be placed on 0 cores
node375.hydra.os
[mpiexec@node377.hydra.os] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:532): downstream from host node377.hydra.brussel.vsc was killed by signal 8 (Floating point exception) 

The result is the same with mpiexec.hydra:

❯ I_MPI_DEBUG=4 I_MPI_HYDRA_DEBUG=on FI_LOG_LEVEL=debug mpiexec.hydra hostname
[mpiexec@node377.hydra.os] Launch arguments: /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node377.hydra.brussel.vsc --upstream-port 35811 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[mpiexec@node377.hydra.os] Launch arguments: /usr/bin/ssh -q -x node375.hydra.brussel.vsc /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node377.hydra.brussel.vsc --upstream-port 35811 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 1 --node-id 1 --subtree-size 1 /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[proxy:0:0@node377.hydra.os] Warning - oversubscription detected: 1 processes will be placed on 0 cores
node375.hydra.os
[mpiexec@node377.hydra.os] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:532): downstream from host node377.hydra.brussel.vsc was killed by signal 8 (Floating point exception)

The previous commands with hostname also generate the same trap divide error in hydra_pmi_proxy, as with the test program.

[2424158.472347] traps: hydra_pmi_proxy[240413] trap divide error ip:4436ed sp:7ffc15963750 error:0 in hydra_pmi_proxy[400000+ab000]
[2424174.583803] traps: hydra_pmi_proxy[240512] trap divide error ip:4436ed sp:7ffdb5f9f350 error:0 in hydra_pmi_proxy[400000+ab000]

These errors are certainly caused by some change in update 7 of Intel MPI, as we did not have any issues with previous releases. Do you know if there is anything specific in update 7 that might affect ConnectX-4 adapters? I have looked at the changelog in https://software.intel.com/articles/intel-mpi-library-release-notes-linux, but the information there is so limited that it is difficult to know what has really changed. For instance, might this be caused by the added support for PMI2? Does it affect Hydra?

Thanks for the help,

Alex

PrasanthD_intel
Moderator

Hi,

 

We are forwarding this issue to the respective team.

 

Thanks

Prasanth

James_T_Intel
Moderator

Please go to https://software.intel.com/en-us/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband and scroll down to the Limitations section at the bottom.  Ensure that you have all of the expected transports, or if not, please try the workaround listed there.  If you have the expected transports, and the workaround does not work for you, please repeat your test with the following environment variables set and provide the output.

I_MPI_DEBUG=16
FI_LOG_LEVEL=debug
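
Since the job hangs rather than exiting, it can help to capture this output under a time limit so that the log file is complete even when the run never returns. The wrapper below is a sketch, not an Intel-provided tool: `run_with_debug`, the log file name and the time limit are made up for illustration, and it assumes the `timeout` command from GNU coreutils is available.

```shell
# run_with_debug: hypothetical wrapper (not an Intel MPI tool).
# Usage: run_with_debug SECONDS LOGFILE COMMAND [ARGS...]
# Runs COMMAND with the two debug variables set, stops it after SECONDS so
# that a hang still returns, and saves stdout and stderr to LOGFILE.
run_with_debug() {
    limit=$1
    log=$2
    shift 2
    status=0
    I_MPI_DEBUG=16 FI_LOG_LEVEL=debug timeout "$limit" "$@" >"$log" 2>&1 || status=$?
    # GNU timeout exits with status 124 when the time limit was reached.
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s (possible hang), see $log" >&2
    fi
    return "$status"
}

# Intended use on the cluster (mpirun and ./test are the commands from this thread):
#   run_with_debug 120 impi_debug.log mpirun ./test
```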

 

SoftWeb_V_
Beginner

Thanks for the suggestion. Our system has an older ConnectX-4 NCA that is indeed missing the dc transport.

$ ucx_info -d | grep Transport
#   Transport: self
#   Transport: tcp
#   Transport: rc
#   Transport: ud
#   Transport: mm
#   Transport: mm
#   Transport: cma

However, setting UCX_TLS to the specific transports available in our system does not change the behaviour of Intel MPI v2019 update 7. Here is the run with I_MPI_DEBUG=16 and FI_LOG_LEVEL=debug, as suggested:

$ env | grep UCX
UCX_TLS=rc,ud,sm,self
$ I_MPI_DEBUG=16 FI_LOG_LEVEL=debug mpirun ./test

At this point nothing happens and the execution hangs. This is exactly the same behaviour as before, including the immediate crash of hydra_pmi_proxy:

[3528708.139705] traps: hydra_pmi_proxy[10504] trap divide error ip:4436ed sp:7ffdd320ef50 error:0 in hydra_pmi_proxy[400000+ab000]

Setting I_MPI_HYDRA_DEBUG=on provides a little bit more output as shown in my previous posts.

SoftWeb_V_
Beginner

We have more information about this issue. The crashes of hydra_pmi_proxy only happen within the scope of a job in Torque, the resource manager in our production systems. I cannot reproduce this issue when running on a single node outside of a job. So we have the following three situations with Intel MPI 2019 update 7:

  1. Multi-node Torque job: the command `mpirun ./test` hangs indefinitely and hydra_pmi_proxy crashes on the nodes with trap divide error
    $ I_MPI_DEBUG=4 I_MPI_HYDRA_DEBUG=on FI_LOG_LEVEL=debug mpirun ./test
    [mpiexec@node357.hydra.os] Launch arguments: /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node357.hydra.brussel.vsc --upstream-port 42848 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
    [mpiexec@node357.hydra.os] Launch arguments: /usr/bin/ssh -q -x node356.hydra.brussel.vsc /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node357.hydra.brussel.vsc --upstream-port 42848 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 1 --node-id 1 --subtree-size 1 /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
    [proxy:0:0@node357.hydra.os] Warning - oversubscription detected: 1 processes will be placed on 0 cores
    [proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=init pmi_version=1 pmi_subversion=1
    [proxy:0:1@node356.hydra.os] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
    [proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get_maxes
    [proxy:0:1@node356.hydra.os] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
    [proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get_appnum
    [proxy:0:1@node356.hydra.os] PMI response: cmd=appnum appnum=0
    [proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get_my_kvsname
    [proxy:0:1@node356.hydra.os] PMI response: cmd=my_kvsname kvsname=kvs_309381_0
    [proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get kvsname=kvs_309381_0 key=PMI_process_mapping
    [proxy:0:1@node356.hydra.os] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
    [proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=barrier_in
  2. Single node Torque job (2 cores): the command `mpirun -n 2 ./test` does not hang but it errors out and hydra_pmi_proxy still crashes on the nodes with a trap divide error

    $ I_MPI_DEBUG=4 I_MPI_HYDRA_DEBUG=on FI_LOG_LEVEL=debug mpirun -n 2 ./test
    [mpiexec@node377.hydra.os] Launch arguments: /user/brussel/101/vsc10122/.local/easybuild-skylake/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node377.hydra.brussel.vsc --upstream-port 46384 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild-skylake/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /user/brussel/101/vsc10122/.local/easybuild-skylake/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9 
    [proxy:0:0@node377.hydra.os] Warning - oversubscription detected: 2 processes will be placed on 0 cores
    [mpiexec@node377.hydra.os] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:532): downstream from host node377.hydra.brussel.vsc was killed by signal 8 (Floating point exception)
    [mpiexec@node377.hydra.os] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:2114): assert (exitcodes != NULL) failed

    The important part here is the warning "oversubscription detected: 2 processes will be placed on 0 cores". This worked well with update 6. What has changed in Intel MPI 2019 update 7 that makes it fail to use the cores allocated to the single-node job?

  3. Single node outside of Torque: the command `mpirun -n 2 ./test` works as expected.

    $ mpirun -n 2 ./test
    Hello world: rank 0 of 2 running on login2.cerberus.os
    Hello world: rank 1 of 2 running on login2.cerberus.os
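
One way to cross-check the "0 cores" warning is to compare it with the CPUs the kernel actually allows processes inside the job to use, which Linux reports as `Cpus_allowed_list` in /proc/self/status. The snippet below is only a sketch: `count_cpus` is a made-up helper name, and whether Torque restricts the job through a cpuset is a site configuration detail that this thread does not confirm.

```shell
# count_cpus: hypothetical helper.
# Counts the CPUs in a kernel CPU list such as "0-3,8", the format used by
# the Cpus_allowed_list field of /proc/<pid>/status on Linux.
count_cpus() {
    total=0
    old_ifs=$IFS
    IFS=,
    for part in $1; do
        case $part in
            # A range "a-b" contributes b - a + 1 CPUs.
            *-*) total=$(( total + ${part#*-} - ${part%-*} + 1 )) ;;
            # A single CPU number contributes 1.
            *)   total=$(( total + 1 )) ;;
        esac
    done
    IFS=$old_ifs
    echo "$total"
}

# Intended use inside and outside the Torque job (Linux only):
#   count_cpus "$(awk '/^Cpus_allowed_list/ {print $2}' /proc/self/status)"
```

If the count inside the job is non-zero while Hydra still reports 0 cores, that would point at the topology detection rather than at the resource manager's allocation.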

     

Could you please clarify if update 7 requires any specific configuration or changes in Torque?

Thanks,

Alex

SoftWeb_V_
Beginner

We have found a workaround to this issue by switching to the native topology detection in Intel MPI 2019 update 7.

$ I_MPI_HYDRA_TOPOLIB=ipl mpirun ./test
WARNING: release_mt library was used but no multi-ep feature was enabled. Please use release library instead.
Hello world: rank 0 of 2 running on node371.hydra.os
Hello world: rank 1 of 2 running on node368.hydra.os

Therefore the cause of this issue seems to lie in the default `hwloc` functions used for topology detection, which fail to work within Torque jobs.
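
Since the problem only appears inside Torque jobs, the workaround can be limited to that case. The wrapper below is a sketch under assumptions: `with_topolib_workaround` is a made-up name, and it relies on Torque exporting `PBS_JOBID` into the job environment to detect that it is running inside a job.

```shell
# with_topolib_workaround: hypothetical wrapper (not an Intel MPI tool).
# Runs the given command with I_MPI_HYDRA_TOPOLIB=ipl when it detects a
# Torque job via PBS_JOBID, and runs it unchanged otherwise.
with_topolib_workaround() {
    if [ -n "${PBS_JOBID:-}" ]; then
        I_MPI_HYDRA_TOPOLIB=ipl "$@"
    else
        "$@"
    fi
}

# Intended use in a job script (mpirun and ./test are the commands from this thread):
#   with_topolib_workaround mpirun ./test
```

Keeping the setting out of the global environment makes it easy to drop the workaround again once a fixed release is installed.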

Would you need other information from us to troubleshoot and fix this issue?

SoftWeb_V_
Beginner

Hi, could you provide an update on the status of this issue? Have you been able to reproduce it?

Thanks,

Alex

Morgan__John
Beginner

Hi there. I am new here. Interesting thread, thanks for the information.

 


James_T_Intel
Moderator

I apologize for the delayed response. We have implemented several fixes related to this; however, you may still need to use the I_MPI_HYDRA_TOPOLIB=ipl workaround.


Additionally, if you are not using multi-endpoint capabilities, you should be using the "release" library instead of the "release_mt" library. Multithreaded support is now included in the default "release" library; "release_mt" is intended only for multi-endpoint use.


This thread is being closed for Intel support. Any further discussion in this thread will be considered community-only.

