Intel® MPI Library

Intel MPI/OpenMP hybrid programming on a cluster

bahla_t_
Beginner

Hello!
I'm using the Intel Cluster Studio toolkit, and I'm trying to run a hybrid (MPI+OpenMP) program on 25 compute nodes. I compile my program with -mt_mpi -openmp and set the environment variables I_MPI_PIN_DOMAIN=omp and OMP_NUM_THREADS=2, which means every MPI process gets 2 OpenMP threads. The program runs without errors on up to 14 compute nodes, but beyond 14 nodes it fails with the output below (a minimal sketch of my setup follows the error output).

Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(659)......................: 
MPID_Init(195).............................: channel initialization failed
MPIDI_CH3_Init(106)........................: 
MPID_nem_tcp_post_init(344)................: 
MPID_nem_newtcp_module_connpoll(3099)......: 
recv_id_or_tmpvc_info_success_handler(1328): read from socket failed - No error
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(659)................: 
MPID_Init(195).......................: channel initialization failed
MPIDI_CH3_Init(106)..................: 
MPID_nem_tcp_post_init(344)..........: 
MPID_nem_newtcp_module_connpoll(3099): 
gen_read_fail_handler(1194)..........: read from socket failed - The specified network name is no longer available.

[the same error stack is repeated by the remaining failing ranks, with only the reported source line numbers varying, e.g. MPID_nem_tcp_post_init(337) and MPID_nem_newtcp_module_connpoll(3113)]
job aborted:
rank: node: exit code[: error message]
0: HPC-01: 1: process 0 exited without calling finalize
1: HPC-02: 123
2: HPC-03: 1: process 2 exited without calling finalize
3: HPC-04: 1: process 3 exited without calling finalize
4: HPC-05: 1: process 4 exited without calling finalize
5: HPC-06: 1: process 5 exited without calling finalize
6: HPC-07: 1: process 6 exited without calling finalize
7: HPC-08: 1: process 7 exited without calling finalize
8: HPC-09: 1: process 8 exited without calling finalize
9: HPC-10: 1: process 9 exited without calling finalize
10: HPC-11: 1: process 10 exited without calling finalize
11: HPC-12: 1: process 11 exited without calling finalize
12: HPC-13: 1: process 12 exited without calling finalize
13: HPC-14: 1: process 13 exited without calling finalize
14: HPC-16: 1: process 14 exited without calling finalize
15: HPC-17: 1: process 15 exited without calling finalize
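
For reference, my program boils down to something like this minimal hybrid sketch (hybrid.c is a placeholder name; the sketch only mirrors the setup I described, not my real code):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request full thread support, to match the thread-safe library
       selected by the -mt_mpi compile flag. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Two OpenMP threads per MPI rank, as set by OMP_NUM_THREADS=2. */
    #pragma omp parallel
    {
        printf("rank %d: thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

I compile it with the mpiicc wrapper and the -mt_mpi -openmp flags mentioned above, and launch it across the nodes with mpiexec, with I_MPI_PIN_DOMAIN=omp and OMP_NUM_THREADS=2 set as described.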

 

TimP
Honored Contributor III

Are these dual-core compute nodes? What is the total number of cores? If you wish to oversubscribe, you will need to disable I_MPI_PIN_DOMAIN.

Do you have a reason for using the thread-safe (mt) MPI library?

bahla_t_
Beginner

Thanks for your reply, Tim Prince! Every compute node has 1 package with 4 cores, and every core has 2 hardware threads (Hyper-Threading). My program runs without errors on up to 14 compute nodes; beyond 14 nodes, I get the errors above. I want to run my hybrid (MPI/OpenMP) program, which is why I'm using the mt library and the I_MPI_PIN_DOMAIN environment variable. Are you suggesting that I need to disable Hyper-Threading in the BIOS, and that it can cause errors when pinning threads within an MPI process? My understanding of the intended per-node layout is sketched below.
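
This is how I expect the pinning to work, assuming I_MPI_PIN_DOMAIN=omp sizes each domain to OMP_NUM_THREADS=2 (the logical CPU numbers are illustrative; the actual numbering may differ):

node: 1 package, 4 cores, 8 logical CPUs with HT enabled
  domain 0: logical CPUs 0,1 -> one MPI rank with 2 OpenMP threads
  domain 1: logical CPUs 2,3 -> one MPI rank with 2 OpenMP threads
  domain 2: logical CPUs 4,5 -> one MPI rank with 2 OpenMP threads
  domain 3: logical CPUs 6,7 -> one MPI rank with 2 OpenMP threads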

TimP
Honored Contributor III

I_MPI_PIN_DOMAIN spreads the work across cores by default on a supported Intel CPU, so it may not be necessary to disable HT. If you are trying an odd selection of ranks and threads, I_MPI_DEBUG=5 may shed light on what happens.
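
For example (where <nprocs> and hybrid.exe are placeholders for your own process count and executable):

mpiexec -env I_MPI_DEBUG 5 -env I_MPI_PIN_DOMAIN omp -env OMP_NUM_THREADS 2 -n <nprocs> hybrid.exe

The debug output at startup shows which logical processors each rank's domain is pinned to.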

bahla_t_
Beginner

Thanks, Tim Prince! Unfortunately, that does not solve my problem. I want to run my program using both hardware threads of every core, which is exactly why I set the environment variables I_MPI_PIN_DOMAIN=omp and OMP_NUM_THREADS=2. I posted my error messages in the original post above. Can you briefly comment on what those error messages mean?
