We have 6 Intel(R) Xeon(R) CPU D-1557 @ 1.50GHz nodes, each with 12 cores. hpcc version 1.5.0 has been compiled with Intel MPI and MKL. We are able to run hpcc successfully when configuring mpirun for 6 nodes and 2 cores per node. However, attempting to specify more than 2 cores per node (we have 12) causes the error "invalid error code ffffffff (Ring Index out of range) in MPIR_Alltoall_intra:204".
Any ideas as to what could be causing this issue?
The following environment variables have been set:
I_MPI_FABRICS=tcp
I_MPI_DEBUG=5
I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,4,5,6,7,8,9,10,11
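For reference, a minimal sketch of how these variables are set in the launching shell (assuming bash; they could equally be passed with mpirun's -genv option):

export I_MPI_FABRICS=tcp
export I_MPI_DEBUG=5
export I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,4,5,6,7,8,9,10,11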
The MPI library version is:
Intel(R) MPI Library for Linux* OS, Version 2017 Update 3 Build 20170405 (id: 17193)
hosts.txt contains a list of the 6 node hostnames.
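The actual hostnames are not shown; purely as an illustration, the file holds one hostname per line, along these lines (node01 through node06 are hypothetical placeholders):

node01
node02
node03
node04
node05
node06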
The line below shows how mpirun is invoked to run hpcc on all 6 nodes with 3 cores per node:
mpirun -print-rank-map -n 18 -ppn 3 --hostfile hosts.txt hpcc
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in MPIR_Alltoall_intra:204
Fatal error in PMPI_Alltoall: Other MPI error, error stack:
PMPI_Alltoall(974)......: MPI_Alltoall(sbuf=0x7fcdb107f010, scount=2097152, dtype=USER<contig>, rbuf=0x7fcdd1080010, rcount=2097152, dtype=USER<contig>, comm=0x84000004) failed
MPIR_Alltoall_impl(772).: fail failed
MPIR_Alltoall(731)......: fail failed
MPIR_Alltoall_intra(204): fail failed
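For comparison, a sketch of the launch that does run to completion (6 nodes, 2 ranks per node, 12 ranks total), assuming the same hosts.txt:

mpirun -print-rank-map -n 12 -ppn 2 --hostfile hosts.txt hpcc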
Thanks!
The following I_MPI_* values are reported in the MPI startup output (with I_MPI_DEBUG=5):
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=tcp
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=mlx5_0:0,mlx5_1:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=1
[0] MPI startup(): I_MPI_PIN_MAPPING=3:0 0,1 1,2 2
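These lines come from the library's startup output. A sketch of how to capture just them when reproducing, assuming the same launch as above:

I_MPI_DEBUG=5 mpirun -n 18 -ppn 3 --hostfile hosts.txt hpcc 2>&1 | grep 'MPI startup'

The I_MPI_PIN_MAPPING=3:0 0,1 1,2 2 line appears to show 3 ranks per node pinned to cores 0, 1, and 2, consistent with the -ppn 3 launch.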
We are seeing the same error after installing the Parallel Studio XE 2017 Update 4 Cluster Edition. Does anyone have any suggestions?
Thanks!
We have the same error. Help please.
Fatal error in PMPI_Alltoallv: Other MPI error, error stack:
PMPI_Alltoallv(665).............: MPI_Alltoallv(sbuf=0x2ad248b24180, scnts=0x2ad24376b5e0, sdispls=0x2ad24376b6a0, MPI_INTEGER, rbuf=0x2ad248c6d140, rcnts=0x2ad24376b4c0, rdispls=0x2ad24376b580, MPI_INTEGER, comm=0xc4000003) failed
MPIR_Alltoallv_impl(416)........: fail failed
MPIR_Alltoallv(373).............: fail failed
MPIR_Alltoallv_intra(226).......: fail failed
MPIR_Waitall_impl(221)..........: fail failed
PMPIDI_CH3I_Progress(623).......: fail failed
pkt_RTS_handler(317)............: fail failed
do_cts(662).....................: fail failed
MPID_nem_lmt_dcp_start_recv(288): fail failed
dcp_recv(154)...................: Internal MPI error! cannot read from remote process
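As a general Intel MPI diagnostic for failures inside MPI_Alltoall/MPI_Alltoallv (a sketch on my part, not something suggested in this thread), the collective can be pinned to a specific internal algorithm with the documented I_MPI_ADJUST_* variables before re-running; the value-to-algorithm mapping is listed in the Intel MPI reference:

export I_MPI_ADJUST_ALLTOALL=1
export I_MPI_ADJUST_ALLTOALLV=1
mpirun -n 18 -ppn 3 --hostfile hosts.txt hpcc

If one algorithm value completes while another fails, that narrows the problem to a particular collective implementation rather than the fabric or the benchmark itself.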