
Errors in TCP libfabric for Intel(R) Xeon(R) Platinum 8259CL

Green_James

Hi,

I have been testing an electronic structure code on a supercomputer with Intel(R) Xeon(R) Platinum 8259CL processors and an Ethernet interconnect.

I have seen failures in multi-node calculations, which I believe are due to the interconnect/libfabric: we have seen similar failures on other architectures and interconnects (e.g. EFA, Mellanox) that could be resolved by an appropriate choice of tuning file (see e.g. the post here).

However, for the Ethernet/TCP libfabric provider, no choice of tuning file seems to remedy the situation.
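
For reference, the sketch below shows the kind of provider/tuning-file selection involved (the tuning-file path is a placeholder). In practice these variables are exported in the job script before mpirun; setting them in code as shown here is only meant to illustrate which variables are in play and may not be honoured by every MPI library:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    /* Illustration only: select the libfabric provider and an Intel MPI
     * tuning file via environment variables. These are read during MPI_Init
     * and would normally be exported in the job script instead.
     * The tuning-file path below is a placeholder. */
    setenv("FI_PROVIDER", "tcp", 1);
    setenv("I_MPI_TUNING_BIN", "/path/to/tuning_skx_shm-ofi_tcp.dat", 1);
    setenv("I_MPI_DEBUG", "10", 1);   /* verbose startup output, like the log below */

    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        char ver[MPI_MAX_LIBRARY_VERSION_STRING];
        int len;
        MPI_Get_library_version(ver, &len);
        printf("Running with: %s\n", ver);
    }

    MPI_Finalize();
    return 0;
}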

The MPI debug output with the default settings is:

[0] MPI startup(): Intel(R) MPI Library, Version 2021.12 Build 20240213 (id: 4f55822)
[0] MPI startup(): Copyright (C) 2003-2024 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.1-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: tcp
[48] MPI startup(): shm segment size (118 MB per rank) * (48 local ranks) = 5674 MB total
[0] MPI startup(): shm segment size (118 MB per rank) * (48 local ranks) = 5674 MB total
[0] MPI startup(): Load tuning file: "/work/shared/intel/mpi/2021.12/opt/mpi/etc/tuning_skx_shm-ofi_tcp.dat"
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 19 (TAG_UB value: 524287)
[0] MPI startup(): source bits available: 20 (Maximal number of rank: 1048575)
[0] MPI startup(): Number of NICs: 1

Does anyone have any idea what may be causing the issue, or any suggestions for anything else to try?

Thank you

TobiasK (Moderator)

@Green_James 
Please use the latest release, 2021.13.

If you still encounter the error there, please try to provide a small and simple reproducer so that we can take a look at it. If the supercomputing center you are using has a valid support contract, please use the priority support channel for your request; that way we have more means to help you.
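
For example, a reproducer along the lines of the sketch below is usually enough; the message size, iteration count, and communication pattern here are arbitrary placeholders, not taken from your application:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal multi-node reproducer sketch: repeated ring exchanges between
 * neighbouring ranks followed by an allreduce, with messages large enough
 * to leave the eager path. All sizes are placeholders. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1 << 20;                /* 1 Mi doubles = 8 MB per message */
    double *sendbuf = malloc(count * sizeof(double));
    double *recvbuf = malloc(count * sizeof(double));
    for (int i = 0; i < count; ++i) sendbuf[i] = rank + 1.0;

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    for (int iter = 0; iter < 100; ++iter) {
        /* Ring exchange: send to the right neighbour, receive from the left. */
        MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, right, 0,
                     recvbuf, count, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Collective on top of the point-to-point traffic. */
        double local = recvbuf[0], global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0 && iter % 10 == 0)
            printf("iter %d: allreduce result %f\n", iter, global);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Built with mpiicx (or mpicc) and launched with mpirun across at least two nodes with I_MPI_DEBUG=10, this should show whether the failure can be triggered outside the electronic structure code.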
