Intel® MPI Library

MPI_Comm_spawn hangs

Mark14
Beginner

Hi,

In our distributed application (C++), the main application launches workers on localhost and/or remote machines with MPI_Comm_spawn.
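For reference, a stripped-down sketch of how such a spawn call looks (hypothetical host name, simplified from our real code; "host" is the reserved MPI_Comm_spawn info key used to select the target machine):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* Ask the process manager to place the worker on a specific machine. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "remotehost");   /* hypothetical host name */

    printf("Before spawn\n");
    fflush(stdout);

    /* Spawn one worker; errcodes receives one entry per spawned process. */
    MPI_Comm intercomm;
    int errcodes[1];
    MPI_Comm_spawn("worker.exe", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, &intercomm, errcodes);

    printf("After spawn\n");

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}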

If a machine cannot be reached (network issue, non-existent hostname, …), MPI_Comm_spawn dumps the following messages and then hangs forever, so our distributed application hangs as well:

  [mpiexec@WST18] HYD_sock_connect (..\windows\src\hydra_sock.c:216): getaddrinfo returned error 11001

  [mpiexec@WST18] HYD_connect_to_service (bstrap\service\service_launch.c:76): unable to connect to service at nonexisting:8680

  [mpiexec@WST18] HYDI_bstrap_service_launch (bstrap\service\service_launch.c:417): unable to connect to hydra service

  [mpiexec@WST18] launch_bstrap_proxies (bstrap\src\intel\i_hydra_bstrap.c:564): error launching bstrap proxy

  [mpiexec@WST18] HYD_bstrap_setup (bstrap\src\intel\i_hydra_bstrap.c:754): unable to launch bstrap proxy

  [mpiexec@WST18] do_spawn (mpiexec.c:1129): error setting up the boostrap proxies

 

However, if mpiexec is called directly from the command line, it dumps a similar stack trace and stops with errorlevel=-1, which is the expected behavior:

> mpiexec.exe -host nonexistinghost <command>

[mpiexec@WST18] HYD_sock_connect (..\windows\src\hydra_sock.c:216): getaddrinfo returned error 11001

[mpiexec@WST18] HYD_connect_to_service (bstrap\service\service_launch.c:76): unable to connect to service at nonexistinghost:8680

[mpiexec@WST18] HYDI_bstrap_service_launch (bstrap\service\service_launch.c:417): unable to connect to hydra service

[mpiexec@WST18] launch_bstrap_proxies (bstrap\src\intel\i_hydra_bstrap.c:564): error launching bstrap proxy

[mpiexec@WST18] HYD_bstrap_setup (bstrap\src\intel\i_hydra_bstrap.c:754): unable to launch bstrap proxy

[mpiexec@WST18] wmain (mpiexec.c:1938): error setting up the boostrap proxies

Why does MPI_Comm_spawn hang instead of returning an error? Can this be avoided with a setting or an environment variable?

Environment:
   Windows
   MPI 2019 Update 8 (I_MPI_FABRICS=ofi used)

Thanks

Mark

PrasanthD_intel
Moderator

Hi Mark,


Thanks for reaching out to us.

After you reported the hang, we tested a sample MPI_Comm_spawn program in our Windows environment, and for us the program stopped with exit code -1.

Currently, I am not sure why your code is hanging. Could you please provide a sample reproducer of your spawn code so we can debug the problem?


Regards

Prasanth


Mark14
Beginner

Hi Prasanth,

Thanks for having a look.

I’ve uploaded a zip file containing a test program, the source code, and the MPI DLLs/executables.

If the test program is launched with mpiexec, it hangs:

C:\Temp\mpi>mpiexec.exe -n 1 -localroot -host localhost mpi_test.exe
Before spawn
[mpiexec@WST18] HYD_sock_connect (..\windows\src\hydra_sock.c:216): getaddrinfo returned error 11001
[mpiexec@WST18] HYD_connect_to_service (bstrap\service\service_launch.c:76): unable to connect to service at unknownhost:8680
[mpiexec@WST18] HYDI_bstrap_service_launch (bstrap\service\service_launch.c:417): unable to connect to hydra service
[mpiexec@WST18] launch_bstrap_proxies (bstrap\src\intel\i_hydra_bstrap.c:564): error launching bstrap proxy
[mpiexec@WST18] HYD_bstrap_setup (bstrap\src\intel\i_hydra_bstrap.c:754): unable to launch bstrap proxy
[mpiexec@WST18] do_spawn (mpiexec.c:1129): error setting up the boostrap proxies

This happens with MPI 2019 Update 9. I’ve also tried it with MPI 2018 Update 2, and there the program ends after a few seconds.

Can you have a look at why it hangs with version 2019?

Kr

Mark

Mark14
Beginner

Hi Prasanth,

Have you been able to look into this issue?

Thanks

Mark

PrasanthD_intel
Moderator

Hi Mark,

 

We have tried to reproduce the issue, but we are facing a different error while running the program.

Coming to your issue: it does not look like a pure network problem, since "Before spawn" is printed and only then does the error appear. Did any network changes occur in between?

Could you provide the debug logs by setting I_MPI_DEBUG=10?

 

Regards

Prasanth

Mark14
Beginner

Hi Prasanth,

No, there are no network changes occurring.

This is the debug output:

C:\Temp\mpi>mpiexec.exe -n 1 -localroot -host localhost mpi_test.exe
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20201005
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): Unable to read tuning file for ch4 level
[0] MPI startup(): Unable to read tuning file for net level
[0] MPI startup(): Unable to read tuning file for shm level
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 31276 WST18 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_FABRICS=ofi
[0] MPI startup(): I_MPI_DEBUG=10
Before spawn
[mpiexec@WST18] HYD_sock_connect (..\windows\src\hydra_sock.c:216): getaddrinfo returned error 11001
[mpiexec@WST18] HYD_connect_to_service (bstrap\service\service_launch.c:76): unable to connect to service at unknownhost:8680
[mpiexec@WST18] HYDI_bstrap_service_launch (bstrap\service\service_launch.c:417): unable to connect to hydra service
[mpiexec@WST18] launch_bstrap_proxies (bstrap\src\intel\i_hydra_bstrap.c:564): error launching bstrap proxy
[mpiexec@WST18] HYD_bstrap_setup (bstrap\src\intel\i_hydra_bstrap.c:754): unable to launch bstrap proxy
[mpiexec@WST18] do_spawn (mpiexec.c:1129): error setting up the boostrap proxies

Kind regards,

Mark

PrasanthD_intel
Moderator

Hi Mark,


Could you try -localonly instead of -localroot and see if it helps?

Are you targeting a cluster or running on a single node?


Regards

Prasanth


Mark14
Beginner

Hi Prasanth,

With -localonly instead of -localroot, MPI_Comm_spawn also hangs.

This is the output:

C:\Temp\mpi>mpiexec.exe -n 1 -localonly -host localhost mpi_test.exe
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20201005
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): Unable to read tuning file for ch4 level
[0] MPI startup(): Unable to read tuning file for net level
[0] MPI startup(): Unable to read tuning file for shm level
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 27508 WST18 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_FABRICS=ofi
[0] MPI startup(): I_MPI_DEBUG=10
Before spawn
[proxy:1:0@WST18 ] HYD_spawn (..\windows\src\hydra_spawn.c:245): unable to run process C:\Temp\mpi/worker.exe (error code 2)
[proxy:1:0@WST18 ] launch_processes (proxy.c:569): error creating process (error code 2). The system cannot find the file specified.

[proxy:1:0@WST18 ] main (proxy.c:920): error launching_processes

 

We are targeting a cluster.

Kind regards

Mark

PrasanthD_intel
Moderator

Hi Mark,

 

In the mpi_test code, you pass "worker.exe" as the command argument to MPI_Comm_spawn, but there is no executable with that name in your path, hence the error: "The system cannot find the file specified."

Please compile a sample worker.exe and run it again.

You can use the code below for worker.cpp:

 

#include "mpi.h"

#include <stdio.h>



int main(int argc, char *argv[])

{

  MPI_Init(&argc, &argv);



  MPI_Comm com;

  MPI_Comm_get_parent(&com);



  MPI_Finalize();

  return 0;

}

 

Compile it using mpiicc worker.cpp -o worker
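As a side note (not needed for this simple test): if the worker executable does not sit in the directory the proxy uses, the spawn side can point at it explicitly through the reserved MPI info keys "wdir" and "path". A rough sketch with hypothetical paths follows; how far an implementation honors these keys can vary:

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "wdir", "C:\\Temp\\mpi");   /* working directory for the spawned worker */
MPI_Info_set(info, "path", "C:\\Temp\\mpi");   /* directory to search for worker.exe */
/* ...pass 'info' to MPI_Comm_spawn instead of MPI_INFO_NULL, then MPI_Info_free(&info)... */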

 

Let me know the results.

 

Regards

Prasanth

Mark14
Beginner

Hi Prasanth

With -localonly, it does not hang, but the remote host is ignored.
With -localroot, MPI_Comm_spawn hangs.

This is the logging with -localonly and -localroot.

Kind regards,

Mark

C:\Temp\mpi>mpiexec.exe -n 1 -localonly -host localhost mpi_test.exe
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20201005
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): Unable to read tuning file for ch4 level
[0] MPI startup(): Unable to read tuning file for net level
[0] MPI startup(): Unable to read tuning file for shm level
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 21796 WST18 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_FABRICS=ofi
[0] MPI startup(): I_MPI_DEBUG=10
Before spawn
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20201005
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): Unable to read tuning file for ch4 level
[0] MPI startup(): Unable to read tuning file for net level
[0] MPI startup(): Unable to read tuning file for shm level
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 24932 WST18 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_FABRICS=ofi
[0] MPI startup(): I_MPI_DEBUG=10
After spawn

 

C:\Temp\mpi>mpiexec.exe -n 1 -localroot -host localhost mpi_test.exe
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20201005
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): Unable to read tuning file for ch4 level
[0] MPI startup(): Unable to read tuning file for net level
[0] MPI startup(): Unable to read tuning file for shm level
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 5244 WST18 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_FABRICS=ofi
[0] MPI startup(): I_MPI_DEBUG=10
Before spawn
[mpiexec@WST18 ] HYD_sock_connect (..\windows\src\hydra_sock.c:216): getaddrinfo returned error 11001
[mpiexec@WST18 ] HYD_connect_to_service (bstrap\service\service_launch.c:76): unable to connect to service at unknownhost:8680
[mpiexec@WST18 ] HYDI_bstrap_service_launch (bstrap\service\service_launch.c:417): unable to connect to hydra service
[mpiexec@WST18 ] launch_bstrap_proxies (bstrap\src\intel\i_hydra_bstrap.c:564): error launching bstrap proxy
[mpiexec@WST18 ] HYD_bstrap_setup (bstrap\src\intel\i_hydra_bstrap.c:754): unable to launch bstrap proxy
[mpiexec@WST18 ] do_spawn (mpiexec.c:1129): error setting up the boostrap proxies

PrasanthD_intel
Moderator

Hi Mark,


Here are the differences between -localroot and -localonly:


-localroot

Use this option to launch the root process directly from mpiexec if the host is local. You can use this option to launch GUI applications. The interactive process should be launched before any other process in a job. For example:


> mpiexec -n 1 -host <host2> -localroot interactive.exe : -n 1 -host <host1> background.exe

-localonly

Use this option to run an application on the local node only. If you use this option only for the local node, the Hydra service is not required.


Since you have mentioned localhost as the only host, I don't think there will be a difference for you.

But you said that the remote host is ignored; what do you mean by that?


Regards

Prasanth


Mark14
Beginner

Hi Prasanth

>>> But you said that the remote host is ignored; what do you mean by that?

The mpi_test executable tries to spawn worker.exe on 'unknownhost'; see the provided source code.

The command 'mpiexec.exe -n 1 -hosts localhost -localonly mpi_test.exe' ignores the host setting and launches worker.exe on localhost.

The command 'mpiexec.exe -n 1 -hosts localhost -localroot mpi_test.exe' hangs in MPI_Comm_spawn.

kr

Mark

 

PrasanthD_intel
Moderator

Hi Mark,


We are looking into it and we will get back to you soon.


Regards

Prasanth


SantoshY_Intel
Moderator

Hi,

 

Thank you for your patience. 

 

Intel MPI 2021.5 supports FI_TCP_IFACE=lo to select 127.0.0.1; this works independently of any VPN.

So, could you please try the latest Intel MPI 2021.5 by updating to Intel oneAPI 2022.1?

Also, please set FI_TCP_IFACE=lo before running your code, as shown below:

set FI_TCP_IFACE=lo
mpiexec -n 5 master.exe

 

 

Thanks & Regards,

Santosh

 

Mark14
Beginner

Hi,

I’ve tried this with the simplified setup (see above) and it seems to work.

However, because of another MPI issue (https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/Windows-MPI-2021-4-unable-to-create-process/td-p/1327189) we cannot yet integrate this MPI version into our product and run further tests.

 

Kr

 

Mark

SantoshY_Intel
Moderator

Hi,


Thanks for the confirmation. Since your primary issue has been resolved, we are closing this thread. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


For any updates from Intel, you can keep track of your other issue here: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/Windows-MPI-2021-4-unable-to-create-process/td-p/1327189.


Thanks & Regards,

Santosh

