We are seeing the following error on our cluster running Intel MPI, oneAPI version 2021.5.1:
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in MPIDI_NM_mpi_allgather:202
The error occurs on every task when running on more than 4096 tasks.
I am attaching a session log that includes the listing for a short (43-line) reproducer program.
Please let me know if you have questions or need additional information.
John Michalakes, UCAR
Thank you for posting in Intel Communities.
Could you please provide the following details so that we can investigate your issue further?
- Operating system and CPU details.
- The number of nodes you are using to launch the MPI job.
- The OFI provider (tcp, mlx, psm2, etc.) you are using.
Thanks & Regards,
Thank you for the quick reply.
1. The OS is Red Hat Enterprise Linux release 8, running on compute nodes with dual AMD EPYC 7713 64-Core Processors.
2. With 32 nodes (4096 tasks) the code succeeds. With 33 nodes (4224 tasks) the code generates the errors listed in my original report above.
3. The OFI provider is mlx, as shown in the output below from a run with I_MPI_DEBUG=4:
 MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
 MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
 MPI startup(): library kind: release
 MPI startup(): libfabric version: 1.13.2rc1-impi
 MPI startup(): libfabric provider: mlx
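The node and task counts in item 2 are consistent with one MPI task per core on dual 64-core nodes (128 tasks per node). The one-task-per-core layout is an assumption inferred from those numbers, not something stated in the thread, and the helper names below are hypothetical. A minimal sketch of the arithmetic:

```python
# Assumption: one MPI task per physical core on fully populated nodes.
SOCKETS_PER_NODE = 2          # "dual" AMD EPYC 7713, from the reply above
CORES_PER_SOCKET = 64         # EPYC 7713 64-Core Processor, from the reply above
LARGEST_PASSING_TASKS = 4096  # largest task count reported to succeed


def tasks_for(nodes: int) -> int:
    """Total MPI tasks on `nodes` fully populated nodes."""
    return nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET


def reported_to_fail(nodes: int) -> bool:
    """True when the task count exceeds the largest count reported to succeed."""
    return tasks_for(nodes) > LARGEST_PASSING_TASKS


if __name__ == "__main__":
    for nodes in (32, 33):
        status = "fails" if reported_to_fail(nodes) else "succeeds"
        print(f"{nodes} nodes -> {tasks_for(nodes)} tasks ({status})")
```

Only the 4096-task and 4224-task runs were reported, so the sketch says nothing about task counts between those two values.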
Thank you for your inquiry. We offer support for hardware platforms that the Intel® oneAPI product supports. These platforms include those that are part of the Intel® Core™ processor family or higher, the Intel® Xeon® processor family, the Intel® Xeon® Scalable processor family, and others listed in the following documents: Intel® oneAPI Base Toolkit System Requirements, Intel® oneAPI HPC Toolkit System Requirements, and Intel® oneAPI IoT Toolkit System Requirements.
If you wish to use oneAPI on hardware that is not listed in one of the documents above, we encourage you to visit and contribute to the open oneAPI specification: https://www.oneapi.io/spec/
We are closing this issue. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Thanks & Regards,