Hi,
In our distributed application (C++), the main application launches workers on localhost and/or remote machines with MPI_Comm_spawn.
If a machine cannot be reached (network issue, non-existent name, etc.), MPI_Comm_spawn dumps the following messages and hangs forever, so our distributed application hangs as well:
[mpiexec@WST18] HYD_sock_connect (..\windows\src\hydra_sock.c:216): getaddrinfo returned error 11001
[mpiexec@WST18] HYD_connect_to_service (bstrap\service\service_launch.c:76): unable to connect to service at nonexisting:8680
[mpiexec@WST18] HYDI_bstrap_service_launch (bstrap\service\service_launch.c:417): unable to connect to hydra service
[mpiexec@WST18] launch_bstrap_proxies (bstrap\src\intel\i_hydra_bstrap.c:564): error launching bstrap proxy
[mpiexec@WST18] HYD_bstrap_setup (bstrap\src\intel\i_hydra_bstrap.c:754): unable to launch bstrap proxy
[mpiexec@WST18] do_spawn (mpiexec.c:1129): error setting up the boostrap proxies
However, if mpiexec is called directly from the command line, it dumps a similar stack trace and stops with errorlevel=-1, which is the expected behavior:
> mpiexec.exe -host nonexistinghost <command>
[mpiexec@WST18] HYD_sock_connect (..\windows\src\hydra_sock.c:216): getaddrinfo returned error 11001
[mpiexec@WST18] HYD_connect_to_service (bstrap\service\service_launch.c:76): unable to connect to service at nonexistinghost:8680
[mpiexec@WST18] HYDI_bstrap_service_launch (bstrap\service\service_launch.c:417): unable to connect to hydra service
[mpiexec@WST18] launch_bstrap_proxies (bstrap\src\intel\i_hydra_bstrap.c:564): error launching bstrap proxy
[mpiexec@WST18] HYD_bstrap_setup (bstrap\src\intel\i_hydra_bstrap.c:754): unable to launch bstrap proxy
[mpiexec@WST18] wmain (mpiexec.c:1938): error setting up the boostrap proxies
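(The exit status can be checked right after the call with echo %errorlevel%, which prints -1 here.)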
Why does MPI_Comm_spawn hang instead of returning an error? Can this be avoided with a setting or an environment variable?
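For reference, the call pattern is roughly the following minimal sketch ("remotehost" and "worker.exe" are placeholders; the real application passes arguments and spawns several workers):
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* Ask MPI to return error codes instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

    /* "host" is the standard spawn info key naming the target machine. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "remotehost");

    MPI_Comm intercomm;
    int errcode = MPI_SUCCESS;
    printf("Before spawn\n");
    fflush(stdout);
    /* In the failing case this call never returns. */
    MPI_Comm_spawn("worker.exe", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_SELF, &intercomm, &errcode);
    printf("After spawn\n");

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}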
Environment:
Windows
Intel MPI 2019 Update 8 (I_MPI_FABRICS=ofi is used)
Thanks
Mark
Hi Mark,
Thanks for reaching out to us.
After you reported the hang, we tested a sample MPI_Comm_spawn program in our Windows environment; for us, the program stopped with exit code -1.
I am currently not sure why your code is hanging. Could you please provide a sample reproducer of your spawn code so we can debug the problem?
Regards
Prasanth
Hi Prasanth,
Thanks for having a look.
I’ve uploaded a zip file containing a test program, source code, and the MPI DLLs/EXEs.
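In short, mpi_test calls MPI_Comm_spawn for "worker.exe" with the "host" info key set to a non-existent machine ('unknownhost'), similar to the sketch in my first post.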
If the test program is launched with mpiexec, it hangs:
C:\Temp\mpi>mpiexec.exe -n 1 -localroot -host localhost mpi_test.exe
Before spawn
[mpiexec@WST18] HYD_sock_connect (..\windows\src\hydra_sock.c:216): getaddrinfo returned error 11001
[mpiexec@WST18] HYD_connect_to_service (bstrap\service\service_launch.c:76): unable to connect to service at unknownhost:8680
[mpiexec@WST18] HYDI_bstrap_service_launch (bstrap\service\service_launch.c:417): unable to connect to hydra service
[mpiexec@WST18] launch_bstrap_proxies (bstrap\src\intel\i_hydra_bstrap.c:564): error launching bstrap proxy
[mpiexec@WST18] HYD_bstrap_setup (bstrap\src\intel\i_hydra_bstrap.c:754): unable to launch bstrap proxy
[mpiexec@WST18] do_spawn (mpiexec.c:1129): error setting up the boostrap proxies
This happens with MPI 2019 Update 9. I’ve also tried MPI 2018 Update 2, where the program ends after a few seconds.
Can you have a look at why it hangs with version 2019?
Kr
Mark
Hi Prasanth,
Have you been able to look into this issue?
Thanks
Mark
Hi Mark,
We have tried to reproduce the issue, but we are facing a different error while running the program.
Regarding your issue: it does not seem to be purely a network problem, since "Before spawn" is printed before the error appears. Did any network changes occur in between?
Could you provide the debug logs by setting I_MPI_DEBUG=10?
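(On Windows, e.g.: set I_MPI_DEBUG=10 in the console before invoking mpiexec.)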
Regards
Prasanth
Hi Prasanth,
No, there are no network changes occurring.
This is the debug output:
C:\Temp\mpi>mpiexec.exe -n 1 -localroot -host localhost mpi_test.exe
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20201005
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): Unable to read tuning file for ch4 level
[0] MPI startup(): Unable to read tuning file for net level
[0] MPI startup(): Unable to read tuning file for shm level
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 31276 WST18 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_FABRICS=ofi
[0] MPI startup(): I_MPI_DEBUG=10
Before spawn
[mpiexec@WST18] HYD_sock_connect (..\windows\src\hydra_sock.c:216): getaddrinfo returned error 11001
[mpiexec@WST18] HYD_connect_to_service (bstrap\service\service_launch.c:76): unable to connect to service at unknownhost:8680
[mpiexec@WST18] HYDI_bstrap_service_launch (bstrap\service\service_launch.c:417): unable to connect to hydra service
[mpiexec@WST18] launch_bstrap_proxies (bstrap\src\intel\i_hydra_bstrap.c:564): error launching bstrap proxy
[mpiexec@WST18] HYD_bstrap_setup (bstrap\src\intel\i_hydra_bstrap.c:754): unable to launch bstrap proxy
[mpiexec@WST18] do_spawn (mpiexec.c:1129): error setting up the boostrap proxies
Kind regards,
Mark
Hi Mark,
Could you try -localonly instead of -localroot and see if it helps?
Are you targeting a cluster or running on a single node?
Regards
Prasanth
Hi Prasanth,
With -localonly instead of -localroot, MPI_Comm_spawn also hangs.
This is the output:
C:\Temp\mpi>mpiexec.exe -n 1 -localonly -host localhost mpi_test.exe
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20201005
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): Unable to read tuning file for ch4 level
[0] MPI startup(): Unable to read tuning file for net level
[0] MPI startup(): Unable to read tuning file for shm level
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 27508 WST18 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_FABRICS=ofi
[0] MPI startup(): I_MPI_DEBUG=10
Before spawn
[proxy:1:0@WST18 ] HYD_spawn (..\windows\src\hydra_spawn.c:245): unable to run process C:\Temp\mpi/worker.exe (error code 2)
[proxy:1:0@WST18 ] launch_processes (proxy.c:569): error creating process (error code 2). The system cannot find the file specified.
[proxy:1:0@WST18 ] main (proxy.c:920): error launching_processes
We are targeting a cluster.
Kind regards
Mark
Hi Mark,
In the mpi_test code, the executable parameter you pass to MPI_Comm_spawn is "worker.exe", but there is no executable with that name in your path, hence the error: The system cannot find the file specified.
Please compile a sample worker.exe and run it again.
You can use the code below for worker.cpp:
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
MPI_Init(&argc, &argv);
MPI_Comm com;
MPI_Comm_get_parent(&com);
MPI_Finalize();
return 0;
}
Compile it using mpiicc worker.cpp -o worker
Let me know the results.
Regards
Prasanth
Hi Prasanth
With -localonly, it does not hang, but the remote host is ignored.
With -localroot, MPI_Comm_spawn hangs.
Below is the logging with -localonly and with -localroot.
Kind regards,
Mark
C:\Temp\mpi>mpiexec.exe -n 1 -localonly -host localhost mpi_test.exe
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20201005
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): Unable to read tuning file for ch4 level
[0] MPI startup(): Unable to read tuning file for net level
[0] MPI startup(): Unable to read tuning file for shm level
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 21796 WST18 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_FABRICS=ofi
[0] MPI startup(): I_MPI_DEBUG=10
Before spawn
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20201005
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): Unable to read tuning file for ch4 level
[0] MPI startup(): Unable to read tuning file for net level
[0] MPI startup(): Unable to read tuning file for shm level
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 24932 WST18 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_FABRICS=ofi
[0] MPI startup(): I_MPI_DEBUG=10
After spawn
C:\Temp\mpi>mpiexec.exe -n 1 -localroot -host localhost mpi_test.exe
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20201005
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1a1-impi
[0] MPI startup(): libfabric provider: tcp;ofi_rxm
[0] MPI startup(): Unable to read tuning file for ch4 level
[0] MPI startup(): Unable to read tuning file for net level
[0] MPI startup(): Unable to read tuning file for shm level
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 5244 WST18 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_FABRICS=ofi
[0] MPI startup(): I_MPI_DEBUG=10
Before spawn
[mpiexec@WST18 ] HYD_sock_connect (..\windows\src\hydra_sock.c:216): getaddrinfo returned error 11001
[mpiexec@WST18 ] HYD_connect_to_service (bstrap\service\service_launch.c:76): unable to connect to service at unknownhost:8680
[mpiexec@WST18 ] HYDI_bstrap_service_launch (bstrap\service\service_launch.c:417): unable to connect to hydra service
[mpiexec@WST18 ] launch_bstrap_proxies (bstrap\src\intel\i_hydra_bstrap.c:564): error launching bstrap proxy
[mpiexec@WST18 ] HYD_bstrap_setup (bstrap\src\intel\i_hydra_bstrap.c:754): unable to launch bstrap proxy
[mpiexec@WST18 ] do_spawn (mpiexec.c:1129): error setting up the boostrap proxies
Hi Mark,
Here are the differences between -localroot and -localonly:
-localroot
Use this option to launch the root process directly from mpiexec if the host is local. You can use this option to launch GUI applications. The interactive process should be launched before any other process in a job. For example:
> mpiexec -n 1 -host <host2> -localroot interactive.exe : -n 1 -host <host1> background.exe
-localonly
Use this option to run an application on the local node only. If you use this option only for the local node, the Hydra service is not required.
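For example, to run on the local node only (app.exe is a placeholder):
> mpiexec -n 4 -localonly app.exe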
Since you have mentioned localhost as the only host, I don't think there will be a difference for you.
But you said that the remote host is ignored; what do you mean by that?
Regards
Prasanth
Hi Prasanth
>>> But you have said that remote host is ignored? what do you mean by that?
The mpi_test executable wants to spawn worker.exe on 'unknownhost'; see the provided source code.
The command 'mpiexec.exe -n 1 -hosts localhost -localonly mpi_test.exe' ignores the host setting and launches worker.exe on localhost.
The command 'mpiexec.exe -n 1 -hosts localhost -localroot mpi_test.exe' hangs in MPI_Comm_spawn.
Kr
Mark
Hi Mark,
We are looking into it and we will get back to you soon.
Regards
Prasanth
Hi,
Thank you for your patience.
Intel MPI 2021.5 supports FI_TCP_IFACE=lo to select 127.0.0.1; this works independently of any VPN.
So, could you please try the latest Intel MPI 2021.5 by updating to Intel oneAPI 2022.1?
Also, please set FI_TCP_IFACE=lo before running your code, as below:
set FI_TCP_IFACE=lo
mpiexec -n 5 master.exe
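(FI_TCP_IFACE is a libfabric environment variable that tells the tcp provider which network interface to use; lo selects the loopback interface.)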
Thanks & Regards,
Santosh
Hi,
I’ve tried this with the simplified setup (see above) and it seems to work.
However, because of another MPI issue (https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/Windows-MPI-2021-4-unable-to-create-process/td-p/1327189) we cannot integrate this MPI version in our product and do some tests.
Kr
Mark
Hi,
Thanks for the confirmation. Since your primary issue has been resolved, we are closing this thread. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
For any updates from Intel, you can keep track of your other issue here: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/Windows-MPI-2021-4-unable-to-create-process/td-p/1327189.
Thanks & Regards,
Santosh