Jeon__ByoungSeon

Intel MPI crashes at large rank counts

Hi,

We're testing Intel MPI (intel19, patch1) on CentOS 7.5, on a Linux cluster with an InfiniBand network.

Running the Intel MPI Benchmarks, we found that they work well at small scale (400 MPI ranks on 10 nodes), but at larger scale, such as 100 nodes (100 * 40 = 4000 MPI ranks), the run crashes with the message shown at the bottom. I recompiled libopenfabric, but that did not improve the situation. Setting I_MPI_DEBUG to 5 doesn't give us any details either. Is there any way to track down the cause of the crash? The fi_info output is shown below for reference. Any comments are appreciated.

Thanks,

BJ

PS1.

$ fi_info 
provider: verbs;ofi_rxm
    fabric: IB-0xfe80000000000000
    domain: mlx5_0
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXM
provider: verbs
    fabric: IB-0xfe80000000000000
    domain: mlx5_0
    version: 1.0
    type: FI_EP_MSG
    protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
    fabric: IB-0xfe80000000000000
    domain: mlx5_0-dgram
    version: 1.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_IB_UD
 

PS2. Crash message (mpirun -np 4000 -genv I_MPI_DEBUG 5 -machinefile hosts ./IMB-EXT)

[proxy:0:

# Bidir_Get
# Bidir_Put
# Accumulate
Abort(743005711) on node 3856 (rank 3856 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3856, new_comm=0x27f9f44) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3856]: readline failed
Abort(407461391) on node 3872 (rank 3872 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3872, new_comm=0xc81e9e4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(407461391) on node 2782 (rank 2782 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=2782, new_comm=0x1e978b4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(1011441167) on node 3906 (rank 3906 in comm 0): Fatal error in PMPI_Comm_s
plit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3906, new_comm=0xb944f14) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3906]: readline failed
Abort(810114575) on node 3907 (rank 3907 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3907, new_comm=0xc1eff14) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(608787983) on node 3306 (rank 3306 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3306, new_comm=0x2014034) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3306]: readline failed
Abort(541679119) on node 2542 (rank 2542 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=2542, new_comm=0x2aeb954) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_2542]: readline failed
Abort(743005711) on node 3380 (rank 3380 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3380, new_comm=0x1879a04) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3380]: readline failed
Abort(273243663) on node 3782 (rank 3782 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3782, new_comm=0x257b8e4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(4808207) on node 1072 (rank 1072 in comm 0): Fatal error in PMPI_Comm_spli
t: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=1072, new_comm=0x1e9d794) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_1072]: readline failed
[cli_3782]: readline failed
Abort(273243663) on node 1664 (rank 1664 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=1664, new_comm=0xb14a534) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Connection timed out)
[cli_1664]: readline failed
Abort(71917071) on node 2942 (rank 2942 in comm 0): Fatal error in PMPI_Comm_spl
it: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=2942, new_comm=0x28c68b4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_2942]: readline failed
Abort(474570255) on node 2958 (rank 2958 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=2958, new_comm=0x2527ff4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(474570255) on node 3552 (rank 3552 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3552, new_comm=0xc3fa3e4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3552]: readline failed
Abort(139025935) on node 3630 (rank 3630 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3630, new_comm=0x1859ff4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3630]: readline failed
Abort(541679119) on node 3634 (rank 3634 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3634, new_comm=0x1d2d8b4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(474570255) on node 2822 (rank 2822 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=2822, new_comm=0x32a68b4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_2822]: readline failed
Abort(474570255) on node 2704 (rank 2704 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=2704, new_comm=0xbf11584) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_2704]: readline failed
Abort(810114575) on node 2100 (rank 2100 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=2100, new_comm=0x141f8b4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_2100]: readline failed
Abort(1011441167) on node 3348 (rank 3348 in comm 0): Fatal error in PMPI_Comm_s
plit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3348, new_comm=0xb111504) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3348]: readline failed
Abort(4808207) on node 3446 (rank 3446 in comm 0): Fatal error in PMPI_Comm_spli
t: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3446, new_comm=0x2c95724) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3446]: readline failed
Abort(608787983) on node 3450 (rank 3450 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3450, new_comm=0x2c84724) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(340352527) on node 3824 (rank 3824 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3824, new_comm=0x1b4a8a4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3824]: readline failed
Abort(474570255) on node 3937 (rank 3937 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3937, new_comm=0xb9e7eb4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3937]: readline failed
Abort(340352527) on node 3979 (rank 3979 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3979, new_comm=0xbea1d94) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3979]: readline failed
Abort(810114575) on node 3826 (rank 3826 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3826, new_comm=0x32af8b4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(1011441167) on node 3982 (rank 3982 in comm 0): Fatal error in PMPI_Comm_s
plit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3982, new_comm=0xd005f74) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(407461391) on node 3975 (rank 3975 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3975, new_comm=0xc95ddb4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(4808207) on node 3572 (rank 3572 in comm 0): Fatal error in PMPI_Comm_spli
t: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3572, new_comm=0x1552874) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3572]: readline failed
[proxy:0:82@atom84] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:82@atom84] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:82@atom84] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:82@atom84] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:82@atom84] main (../../../../../src/pm/i_hydra/proxy/proxy.c:1035): err
or waiting for event
[proxy:0:88@atom90] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:88@atom90] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:88@atom90] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:88@atom90] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:88@atom90] main (../../../../../src/pm/i_hydra/proxy/proxy.c:1035): err
or waiting for event
[proxy:0:70@atom72] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:70@atom72] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:70@atom72] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:70@atom72] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:70@atom72] main (../../../../../src/pm/i_hydra/proxy/proxy.c:1035): err
or waiting for event
[proxy:0:21@atom23] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:21@atom23] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:21@atom23] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:21@atom23] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:21@atom23] main (../../../../../src/pm/i_hydra/proxy/proxy.c:989): erro
r waiting for event
[proxy:0:40@atom42] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:40@atom42] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:40@atom42] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:40@atom42] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:40@atom42] main (../../../../../src/pm/i_hydra/proxy/proxy.c:989): erro
r waiting for event
[proxy:0:64@atom66] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:64@atom66] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:64@atom66] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:64@atom66] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:64@atom66] main (../../../../../src/pm/i_hydra/proxy/proxy.c:989): erro
r waiting for event
[proxy:0:58@atom60] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:58@atom60] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:58@atom60] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:58@atom60] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:58@atom60] main (../../../../../src/pm/i_hydra/proxy/proxy.c:989): erro
r waiting for event
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:34@atom36] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:34@atom36] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:34@atom36] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:34@atom36] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:34@atom36] main (../../../../../src/pm/i_hydra/proxy/proxy.c:989): erro
r waiting for event
[proxy:0:46@atom48] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:46@atom48] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:46@atom48] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:46@atom48] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:46@atom48] main (../../../../../src/pm/i_hydra/proxy/proxy.c:989): erro
r waiting for event
[proxy:0:28@atom30] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): [proxy:0:14@atom16] HYD_sock_write (../../../../../src/
pm/i_hydra/libhydra/sock/hydra_sock_intel.c:353): write error (Bad file descript
or)
[proxy:0:14@atom16] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:14@atom16] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:14@atom16] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:14@atom16] main (../../../../../src/pm/i_hydra/proxy/proxy.c:989): erro
r waiting for event
write error (Bad file descriptor)
[proxy:0:28@atom30] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:28@atom30] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:28@atom30] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:28@atom30] main (../../../../../src/pm/i_hydra/proxy/proxy.c:1035): err
or waiting for event
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:7@atom9] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hy
dra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:7@atom9] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/proxy_
cb.c:33): error reading command
[proxy:0:7@atom9] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/proxy
/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:7@atom9] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/li
bhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:7@atom9] main (../../../../../src/pm/i_hydra/proxy/proxy.c:1035): error
 waiting for event
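
Since the output above is long and repetitive, a small script can condense it into "which ranks failed, and with which OFI error". The following is only a minimal sketch, not an Intel tool: the function name `summarize` and the `SAMPLE` excerpt are made up for illustration, and it assumes the log keeps the `Abort(...) on node N` and `MPIDI_OFI_send_handler:<reason>)` shapes seen in this post, with long lines wrapped by the terminal.

```python
import re

# One record: the "Abort(...) on node N" header, followed by the first
# "MPIDI_OFI_send_handler:<reason>)" that appears after it.
RECORD = re.compile(
    r"Abort\(\d+\) on node (\d+) .*?MPIDI_OFI_send_handler:([^)]+)\)"
)

def summarize(log_text):
    """Map each OFI error reason to the list of ranks that reported it."""
    # The terminal wrapped long lines, splitting words such as
    # "PMPI_Comm_sp" / "lit", so drop all line breaks before matching.
    # This also glues unrelated lines together, which this pattern tolerates.
    joined = log_text.replace("\r", "").replace("\n", "")
    ranks_by_reason = {}
    for rank, reason in RECORD.findall(joined):
        ranks_by_reason.setdefault(reason, []).append(int(rank))
    return ranks_by_reason

# Hypothetical input: two shortened records copied from the log above.
SAMPLE = (
    "Abort(743005711) on node 3856 (rank 3856 in comm 0): Fatal error in PMPI_Comm_sp\n"
    "lit: Other MPI error, error stack:\n"
    "MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72\n"
    "6:MPIDI_OFI_send_handler:Invalid argument)\n"
    "[cli_3856]: readline failed\n"
    "Abort(273243663) on node 1664 (rank 1664 in comm 0): Fatal error in PMPI_Comm_sp\n"
    "lit: Other MPI error, error stack:\n"
    "MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72\n"
    "6:MPIDI_OFI_send_handler:Connection timed out)\n"
)

for reason, ranks in summarize(SAMPLE).items():
    print(f"{reason}: {len(ranks)} rank(s): {sorted(ranks)}")
```

In practice one would pass the captured mpirun output (for example, read from a file) to `summarize` instead of `SAMPLE`. If output from several ranks is interleaved within a single record, the pairing of rank and reason may be wrong, so the result should be treated as a rough overview only.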
