Intel® MPI Library

Azure multi-agent failure on MPI_Init

Ben_E_

Hey Forum,

I'm trying to run on the Azure cloud using Intel's MPI implementation, but I'm hitting a problem. Everything works as expected on one Agent (8 processors), but any run with 2 or more Agents fails in MPI_Init() roughly 25% of the time. The failure is instantaneous (see the output below). I was also able to reproduce the crash with a simple point-to-point send between all processors. I cannot reproduce the issue on my local system.

Are there any known issues with Intel MPI on Azure's virtual machines? Any idea why it might crash on initialization only some of the time?

My current workaround is simply to use Microsoft's MPI library, but I would really like to find the source of the problem.

Thank you kindly.


Error output:

Master Agent: 10

Information

Secondary Agent: 2

Information

Secondary Agent: 16

Information

Secondary Agent: 12

Information

Fatal error in MPI_Init: Other MPI error, error stack:

Error

job aborted:

Debug

rank: node: exit code[: error message]

Debug

MPIR_Init_thread(658)......................:

Error

MPID_Init(195).............................: channel initialization failed

Error

0: Agent10: 123

Debug

MPIDI_CH3_Init(104)........................:

Error

1: Agent10: 1

Debug

2: Agent10: 1

Debug

MPID_nem_tcp_post_init(345)................:

Error

MPID_nem_newtcp_module_connpoll(3102)......:

Error

3: Agent10: 1

Debug

recv_id_or_tmpvc_info_success_handler(1330): read from socket failed - No error

Error

4: Agent10: 1: process 4 exited without calling finalize

Debug

Fatal error in MPI_Init: Other MPI error, error stack:

Error

5: Agent10: 1: process 5 exited without calling finalize

Debug

6: Agent10: 1: process 6 exited without calling finalize

Debug

MPIR_Init_thread(658)................:

Error

7: Agent10: 1: process 7 exited without calling finalize

Debug

MPID_Init(195).......................: channel initialization failed

Error

MPIDI_CH3_Init(104)..................:

Error

8: Agent2: 123

Debug

MPID_nem_tcp_post_init(345)..........:

Error

9: Agent2: 1: process 9 exited without calling finalize

Debug

MPID_nem_newtcp_module_connpoll(3102):

Error

10: Agent2: 123

Debug

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

11: Agent2: 123

Debug

Fatal error in MPI_Init: Other MPI error, error stack:

Error

12: Agent2: 123

Debug

MPIR_Init_thread(658)................:

Error

13: Agent2: 1: process 13 exited without calling finalize

Debug

MPID_Init(195).......................: channel initialization failed

Error

14: Agent2: 123

Debug

MPIDI_CH3_Init(104)..................:

Error

15: Agent2: 123

Debug

MPID_nem_tcp_post_init(345)..........:

Error

16: Agent12: 123

Debug

MPID_nem_newtcp_module_connpoll(3102):

Error

17: Agent12: 123

Debug

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

18: Agent12: 123

Debug

19: Agent12: 123

Debug

Fatal error in MPI_Init: Other MPI error, error stack:

Error

MPIR_Init_thread(658)................:

Error

20: Agent12: 123

Debug

MPID_Init(195).......................: channel initialization failed

Error

21: Agent12: 123

Debug

MPIDI_CH3_Init(104)..................:

Error

22: Agent12: 123

Debug

23: Agent12: 123

Debug

MPID_nem_tcp_post_init(345)..........:

Error

MPID_nem_newtcp_module_connpoll(3102):

Error

24: Agent16: 1: process 24 exited without calling finalize

Debug

25: Agent16: 1: process 25 exited without calling finalize

Debug

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

Fatal error in MPI_Init: Other MPI error, error stack:

Error

26: Agent16: 1: process 26 exited without calling finalize

Debug

MPIR_Init_thread(658)................:

Error

27: Agent16: 1: process 27 exited without calling finalize

Debug

MPID_Init(195).......................: channel initialization failed

Error

28: Agent16: 1: process 28 exited without calling finalize

Debug

MPIDI_CH3_Init(104)..................:

Error

29: Agent16: 1: process 29 exited without calling finalize

Debug

MPID_nem_tcp_post_init(345)..........:

Error

30: Agent16: 1: process 30 exited without calling finalize

Debug

MPID_nem_newtcp_module_connpoll(3102):

Error

31: Agent16: 1: process 31 exited without calling finalize

Debug

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

Fatal error in MPI_Init: Other MPI error, error stack:

Error

MPIR_Init_thread(658)................:

Error

MPID_Init(195).......................: channel initialization failed

Error

MPIDI_CH3_Init(104)..................:

Error

MPID_nem_tcp_post_init(345)..........:

Error

MPID_nem_newtcp_module_connpoll(3102):

Error

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

Fatal error in MPI_Init: Other MPI error, error stack:

Error

MPIR_Init_thread(658)................:

Error

MPID_Init(195).......................: channel initialization failed

Error

MPIDI_CH3_Init(104)..................:

Error

MPID_nem_tcp_post_init(345)..........:

Error

MPID_nem_newtcp_module_connpoll(3102):

Error

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

Fatal error in MPI_Init: Other MPI error, error stack:

Error

MPIR_Init_thread(658)................:

Error

MPID_Init(195).......................: channel initialization failed

Error

MPIDI_CH3_Init(104)..................:

Error

MPID_nem_tcp_post_init(345)..........:

Error

MPID_nem_newtcp_module_connpoll(3102):

Error

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

Fatal error in MPI_Init: Other MPI error, error stack:

Error

MPIR_Init_thread(658)................:

Error

MPID_Init(195).......................: channel initialization failed

Error

MPIDI_CH3_Init(104)..................:

Error

MPID_nem_tcp_post_init(345)..........:

Error

MPID_nem_newtcp_module_connpoll(3102):

Error

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

Fatal error in MPI_Init: Other MPI error, error stack:

Error

MPIR_Init_thread(658)................:

Error

MPID_Init(195).......................: channel initialization failed

Error

MPIDI_CH3_Init(104)..................:

Error

MPID_nem_tcp_post_init(345)..........:

Error

MPID_nem_newtcp_module_connpoll(3102):

Error

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

Fatal error in MPI_Init: Other MPI error, error stack:

Error

MPIR_Init_thread(658)................:

Error

MPID_Init(195).......................: channel initialization failed

Error

MPIDI_CH3_Init(104)..................:

Error

MPID_nem_tcp_post_init(345)..........:

Error

MPID_nem_newtcp_module_connpoll(3102):

Error

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

Fatal error in MPI_Init: Other MPI error, error stack:

Error

MPIR_Init_thread(658)................:

Error

MPID_Init(195).......................: channel initialization failed

Error

MPIDI_CH3_Init(104)..................:

Error

MPID_nem_tcp_post_init(345)..........:

Error

MPID_nem_newtcp_module_connpoll(3102):

Error

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

Fatal error in MPI_Init: Other MPI error, error stack:

Error

MPIR_Init_thread(658)................:

Error

MPID_Init(195).......................: channel initialization failed

Error

MPIDI_CH3_Init(104)..................:

Error

MPID_nem_tcp_post_init(345)..........:

Error

MPID_nem_newtcp_module_connpoll(3102):

Error

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

Fatal error in MPI_Init: Other MPI error, error stack:

Error

MPIR_Init_thread(658)................:

Error

MPID_Init(195).......................: channel initialization failed

Error

MPIDI_CH3_Init(104)..................:

Error

MPID_nem_tcp_post_init(345)..........:

Error

MPID_nem_newtcp_module_connpoll(3102):

Error

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

Fatal error in MPI_Init: Other MPI error, error stack:

Error

MPIR_Init_thread(658)................:

Error

MPID_Init(195).......................: channel initialization failed

Error

MPIDI_CH3_Init(104)..................:

Error

MPID_nem_tcp_post_init(345)..........:

Error

MPID_nem_newtcp_module_connpoll(3102):

Error

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

Fatal error in MPI_Init: Other MPI error, error stack:

Error

MPIR_Init_thread(658)................:

Error

MPID_Init(195).......................: channel initialization failed

Error

MPIDI_CH3_Init(104)..................:

Error

MPID_nem_tcp_post_init(345)..........:

Error

MPID_nem_newtcp_module_connpoll(3102):

Error

gen_read_fail_handler(1196)..........: read from socket failed - The specified network name is no longer available.

Error

Fatal error in PMPI_Isend: Other MPI error, error stack:

Error

PMPI_Isend(161).................: MPI_Isend(buf=000000D8B7D7F89C, count=1, MPI_INT, dest=1, tag=0, MPI_COMM_WORLD, request=000000D8B7D7F8A0) failed

Error

MPIDI_CH3_EagerContigIsend(554).: failure occurred while attempting to send an eager message

Error

MPID_nem_newtcp_iSendContig(440):

Error

MPIU_SOCKW_Writev(454)..........:  Unable to write to a socket, An existing connection was forcibly closed by the remote host.

Error

(errno 10054)

Error

Fatal error in PMPI_Isend: Other MPI error, error stack:

Error

PMPI_Isend(161).................: MPI_Isend(buf=0000001C452DFA5C, count=1, MPI_INT, dest=1, tag=0, MPI_COMM_WORLD, request=0000001C452DFA60) failed

Error

MPIDI_CH3_EagerContigIsend(554).: failure occurred while attempting to send an eager message

Error

MPID_nem_newtcp_iSendContig(440):

Error

MPIU_SOCKW_Writev(454)..........:  Unable to write to a socket, An existing connection was forcibly closed by the remote host.

Error

(errno 10054)

Error
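A note for anyone reading the output above: it appears to be two streams interleaved. The lines tagged Error carry the MPI error stacks, and the lines tagged Debug carry the per-rank exit codes in the "rank: node: exit code[: error message]" format. Below is a minimal Python sketch for pulling them apart. It assumes that each message is followed by its tag (Information, Debug, or Error) on a later line of its own, which is how the paste above looks. The function name `split_by_level` and the sample text are purely illustrative and are not part of any MPI tooling.

```python
from collections import defaultdict

# Tags seen in the pasted output; assumed to be severity labels added by the log viewer.
LEVELS = {"Information", "Debug", "Error"}


def split_by_level(log_text):
    """Group message lines by the tag line that follows each of them."""
    streams = defaultdict(list)
    pending = None  # most recent message still waiting for its tag
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank separator lines carry no information
        if line in LEVELS:
            if pending is not None:
                streams[line].append(pending)
                pending = None
        else:
            # A message with no tag before the next message (e.g. a heading)
            # is overwritten here and therefore dropped.
            pending = line
    return dict(streams)


# A short excerpt in the same layout as the output above.
sample = """\
MPID_Init(195).............................: channel initialization failed

Error

0: Agent10: 123

Debug

MPIDI_CH3_Init(104)........................:

Error
"""

for level, messages in split_by_level(sample).items():
    print(level)
    for message in messages:
        print("   ", message)
```

With the streams separated, the Debug lines give one exit code per rank, and the Error lines can be read as consecutive error stacks, each starting at a "Fatal error in ..." line.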

 



 
