We have 6 Intel(R) Xeon(R) CPU D-1557 @ 1.50GHz nodes, each with 12 cores. hpcc version 1.5.0 has been compiled with Intel MPI and MKL. We are able to run hpcc successfully when configuring mpirun for 6 nodes and 2 cores per node. However, attempting to specify more than 2 cores per node (we have 12) causes the error "invalid error code ffffffff (Ring Index out of range) in MPIR_Alltoall_intra:204".
Any ideas as to what could be causing this issue?
The following environment variables have been set:
I_MPI_FABRICS=tcp
I_MPI_DEBUG=5
I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,4,5,6,7,8,9,10,11
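For reference, a minimal sketch of how these variables are set in the launching shell (assuming bash; they could equally be passed with mpirun's -genv option):

export I_MPI_FABRICS=tcp
export I_MPI_DEBUG=5
export I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,4,5,6,7,8,9,10,11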
The MPI library version is:
Intel(R) MPI Library for Linux* OS, Version 2017 Update 3 Build 20170405 (id: 17193)
hosts.txt contains a list of the 6 node hostnames.
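The actual hostnames are not shown; purely as an illustration, the file holds one hostname per line, along these lines (node01 through node06 are hypothetical placeholders):

node01
node02
node03
node04
node05
node06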
The line below shows how mpirun is invoked to run hpcc on all 6 nodes with 3 cores per node:
mpirun -print-rank-map -n 18 -ppn 3 --hostfile hosts.txt hpcc
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in MPIR_Alltoall_intra:204
Fatal error in PMPI_Alltoall: Other MPI error, error stack:
PMPI_Alltoall(974)......: MPI_Alltoall(sbuf=0x7fcdb107f010, scount=2097152, dtype=USER<contig>, rbuf=0x7fcdd1080010, rcount=2097152, dtype=USER<contig>, comm=0x84000004) failed
MPIR_Alltoall_impl(772).: fail failed
MPIR_Alltoall(731)......: fail failed
MPIR_Alltoall_intra(204): fail failed
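For comparison, a sketch of the launch that does run to completion (6 nodes, 2 ranks per node, 12 ranks total), assuming the same hosts.txt:

mpirun -print-rank-map -n 12 -ppn 2 --hostfile hosts.txt hpcc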
Thanks!
The following I_MPI_* values are reported in the MPI startup output (with I_MPI_DEBUG=5):
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=tcp
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=mlx5_0:0,mlx5_1:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=1
[0] MPI startup(): I_MPI_PIN_MAPPING=3:0 0,1 1,2 2
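These lines come from the library's startup output. A sketch of how to capture just them when reproducing, assuming the same launch as above:

I_MPI_DEBUG=5 mpirun -n 18 -ppn 3 --hostfile hosts.txt hpcc 2>&1 | grep 'MPI startup'

The I_MPI_PIN_MAPPING=3:0 0,1 1,2 2 line appears to show 3 ranks per node pinned to cores 0, 1, and 2, consistent with the -ppn 3 launch.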
We are seeing the same error after installing the Parallel Studio XE 2017 Update 4 Cluster Edition. Does anyone have any suggestions?
Thanks!
We have the same error. Help please.
Fatal error in PMPI_Alltoallv: Other MPI error, error stack:
PMPI_Alltoallv(665).............: MPI_Alltoallv(sbuf=0x2ad248b24180, scnts=0x2ad24376b5e0, sdispls=0x2ad24376b6a0, MPI_INTEGER, rbuf=0x2ad248c6d140, rcnts=0x2ad24376b4c0, rdispls=0x2ad24376b580, MPI_INTEGER, comm=0xc4000003) failed
MPIR_Alltoallv_impl(416)........: fail failed
MPIR_Alltoallv(373).............: fail failed
MPIR_Alltoallv_intra(226).......: fail failed
MPIR_Waitall_impl(221)..........: fail failed
PMPIDI_CH3I_Progress(623).......: fail failed
pkt_RTS_handler(317)............: fail failed
do_cts(662).....................: fail failed
MPID_nem_lmt_dcp_start_recv(288): fail failed
dcp_recv(154)...................: Internal MPI error! cannot read from remote process
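As a general Intel MPI diagnostic for failures inside MPI_Alltoall/MPI_Alltoallv (a sketch on my part, not something suggested in this thread), the collective can be pinned to a specific internal algorithm with the documented I_MPI_ADJUST_* variables before re-running; the value-to-algorithm mapping is listed in the Intel MPI reference:

export I_MPI_ADJUST_ALLTOALL=1
export I_MPI_ADJUST_ALLTOALLV=1
mpirun -n 18 -ppn 3 --hostfile hosts.txt hpcc

If one algorithm value completes while another fails, that narrows the problem to a particular collective implementation rather than the fabric or the benchmark itself.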