Hi,
I have been testing an electronic structure code on a supercomputer with Intel(R) Xeon(R) Platinum 8259CL processors and an Ethernet interconnect.
I have seen failures in multi-node calculations, which I believe are due to the interconnect/libfabric, as we have seen similar failures on other architectures and interconnects (e.g. EFA, Mellanox) that could be resolved by an appropriate choice of tuning file, see e.g. the post here.
However, for the Ethernet/TCP libfabric provider, no choice of tuning file seems to remedy the situation.
The MPI debug output for the default choice is:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.12 Build 20240213 (id: 4f55822)
[0] MPI startup(): Copyright (C) 2003-2024 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.1-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: tcp
[48] MPI startup(): shm segment size (118 MB per rank) * (48 local ranks) = 5674 MB total
[0] MPI startup(): shm segment size (118 MB per rank) * (48 local ranks) = 5674 MB total
[0] MPI startup(): Load tuning file: "/work/shared/intel/mpi/2021.12/opt/mpi/etc/tuning_skx_shm-ofi_tcp.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 19 (TAG_UB value: 524287)
[0] MPI startup(): source bits available: 20 (Maximal number of rank: 1048575)
[0] MPI startup(): Number of NICs: 1
Does anyone have any idea what may be causing the issue, or any suggestions for anything else to try?
Thank you
@Green_James
Please use the latest release, 2021.13.
If you still encounter the error there, please try to provide a small and simple reproducer so that we can take a look at it. If the supercomputing center you are using has a valid support contract, please use the priority support channel for your request. That way we have more means to help you.
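For reference, a reproducer along the following lines is usually enough for us to start with. This is only a minimal sketch that exercises collective and point-to-point traffic across ranks; it assumes the failures show up in ordinary MPI communication and is not taken from your application.

/* Hypothetical minimal reproducer: exercises MPI_Allreduce and a ring
 * exchange across all ranks. A sketch only, not distilled from the
 * application discussed above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Message size in doubles; large enough to leave the eager path. */
    const int n = 1 << 20;
    double *sendbuf = malloc(n * sizeof(double));
    double *recvbuf = malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i)
        sendbuf[i] = (double)rank;

    /* Collective traffic across all nodes. */
    MPI_Allreduce(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Point-to-point ring exchange. */
    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;
    MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, next, 0,
                 recvbuf, n, MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("completed on %d ranks, recvbuf[0] = %f\n", size, recvbuf[0]);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Compiled with the Intel MPI compiler wrapper (mpiicc or mpiicx) and launched with mpirun across several nodes with I_MPI_DEBUG enabled, as in the output above, something like this would show whether the failure reproduces outside the application.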
