Intel® MPI Library

Intel MPI INTERNAL ERROR: invalid error code (Ring Index out of range) in MPIDI_NM_mpi_allgather

John_Michalakes

We are seeing the following error on our cluster running Intel MPI, oneAPI version 2021.5.1:

 

INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in MPIDI_NM_mpi_allgather:202

 

The error occurs on every task when running on more than 4096 tasks.

I am attaching a session log which includes the listing for a short (43 line) reproducer program.

Please let me know if you have questions or need additional information.

Thank you,

John Michalakes, UCAR

SantoshY_Intel
Moderator

Hi,

 

Thank you for posting in Intel Communities.

 

Could you please provide the following details so that we can investigate your issue further?

  1. Operating system and CPU details.
  2. The number of nodes you are using to launch the MPI job.
  3. The OFI provider you are using (tcp, mlx, psm2, etc.).

 

Thanks & Regards,

Santosh

 

John_Michalakes

Thank you for the quick reply.  

1. The OS and CPU are: Red Hat Enterprise Linux release 8, running on compute nodes with dual AMD EPYC 7713 64-Core Processors.

2. With 32 nodes (4096 tasks) the code succeeds. With 33 nodes (4224 tasks) the code generates the errors listed in my original report above.

3. The OFI provider is mlx, as shown in the output below from a run with I_MPI_DEBUG=4:

[0] MPI startup(): Intel(R) MPI Library, Version 2021.5  Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): libfabric provider: mlx

 Thank you!

John
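
For reference, the task counts quoted in item 2 can be reproduced with the short sketch below. It assumes one MPI task per physical core, which the thread does not state explicitly; the constant and function names are illustrative only.

```python
# Assumption: one MPI task per physical core (not stated in the thread).
SOCKETS_PER_NODE = 2   # "dual" AMD EPYC 7713
CORES_PER_SOCKET = 64  # 64-core processor


def total_tasks(nodes: int) -> int:
    """Return the number of MPI tasks when every core on every node runs one task."""
    return nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET


if __name__ == "__main__":
    for nodes in (32, 33):
        print(f"{nodes} nodes -> {total_tasks(nodes)} tasks")
```

Under that assumption, 32 nodes is the largest whole-node job that stays at or below 4096 tasks, which matches the reported threshold.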

JyotsnaK_Intel
Moderator

Hi John,

Thank you for your inquiry. We offer support for hardware platforms that the Intel® oneAPI product supports. These platforms include those that are part of the Intel® Core™ processor family or higher, the Intel® Xeon® processor family, the Intel® Xeon® Scalable processor family, and others, which can be found here: Intel® oneAPI Base Toolkit System Requirements, Intel® oneAPI HPC Toolkit System Requirements, and Intel® oneAPI IoT Toolkit System Requirements.

If you wish to use oneAPI on hardware that is not listed at one of the sites above, we encourage you to visit and contribute to the open oneAPI specification: https://www.oneapi.io/spec/


Best regards,

Jyotsna


SantoshY_Intel
Moderator

Hi,


We are closing this issue. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


Thanks & Regards,

Santosh

