We are seeing the following error on our cluster running Intel MPI, oneAPI version 2021.5.1:
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in MPIDI_NM_mpi_allgather:202
The error occurs on every task when running on more than 4096 tasks.
I am attaching a session log which includes the listing for a short (43 line) reproducer program.
Please let me know if you have questions or need additional information.
Thank you,
John Michalakes, UCAR
Hi,
Thank you for posting in Intel Communities.
Could you please provide the following details so that we can investigate your issue further?
- Operating system and CPU details.
- The number of nodes you are using to launch the MPI job.
- The OFI provider you are using (tcp, mlx, psm2, etc.).
Thanks & Regards,
Santosh
Thank you for the quick reply.
1. The OS and CPU are: Red Hat Enterprise Linux release 8 running on AMD EPYC 7713 64-Core Processor (dual) compute nodes
2. With 32 nodes (4096 tasks) the code succeeds. With 33 nodes (4224 tasks) the code generates the errors listed in my original report, above.
3. The OFI provider is mlx, as shown in the output below from a run with I_MPI_DEBUG=4:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): libfabric provider: mlx
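When comparing a passing run with a failing one, it can be handy to pull the library kind, libfabric version, and provider out of the startup lines automatically rather than by eye. The following is only a minimal sketch, assuming the `key: value` line format shown in the output above. The helper name `parse_startup` is hypothetical and not part of Intel MPI.

```python
import re

# Sample startup lines copied from the I_MPI_DEBUG=4 output above.
STARTUP_LOG = """\
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): libfabric provider: mlx
"""

# Only the three "key: value" fields seen in the sample are recognised.
# Other startup lines (version banner, copyright) are skipped on purpose.
_FIELD = re.compile(
    r"^\[\d+\] MPI startup\(\): "
    r"(library kind|libfabric version|libfabric provider): (.*)$"
)


def parse_startup(log):
    """Hypothetical helper: collect the recognised startup fields into a dict."""
    fields = {}
    for line in log.splitlines():
        match = _FIELD.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


info = parse_startup(STARTUP_LOG)
print(info["libfabric provider"])
print(info["libfabric version"])
```

Running the same helper over the logs from the 32-node and 33-node jobs would confirm that both used the same provider and libfabric build, assuming the startup lines keep this format.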
Thank you!
John
Hi John,
Thank you for your inquiry. We offer support for hardware platforms that the Intel® oneAPI product supports. These platforms include those that are part of the Intel® Core™ processor family or higher, the Intel® Xeon® processor family, the Intel® Xeon® Scalable processor family, and others which can be found here – Intel® oneAPI Base Toolkit System Requirements, Intel® oneAPI HPC Toolkit System Requirements, Intel® oneAPI IoT Toolkit System Requirements
If you wish to use oneAPI on hardware that is not listed at one of the sites above, we encourage you to visit and contribute to the open oneAPI specification - https://www.oneapi.io/spec/
Best regards,
Jyotsna
Hi,
We are closing this issue. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Thanks & Regards,
Santosh