Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel MPI 2021.3.1 hangs on launch with AMD Epyc

RMcGinnis
Beginner

We are attempting to integrate an HP DL385G10 - AMD Epyc 7543 based server into an existing Intel server cluster. The existing cluster is made up of Intel Xeon Gold 6248, 6244, and E5-2650 based servers.

 

To verify the installation we are attempting to run the Intel MPI benchmark application (IMB-MPI1). When we launch IMB-MPI1 using only Intel servers, the benchmark runs to completion with no issues. When we attempt to launch the benchmark using the AMD server along with Intel server(s), the benchmark hangs on launch and fails to run.

 

All Intel servers (10.103.103.27 & 10.103.103.28):

> mpirun -genv I_MPI_DEBUG=16 -host 10.103.103.27,10.103.103.28 -n 2 -ppn 1 IMB-MPI1

[0] MPI Startup(): Load tuning file: "/opt/intel/oneapi/lib/intel64/etc/tuning_skx_shm-ofi.dat"

[0] MPI Startup(): Rank  Pid          Node name

[0] MPI Startup(): 0          6954       dd11a.local {0,20,40,60}

[0] MPI Startup(): 1          7863       dd12a.local {0,20,40,60}

[0] MPI Startup(): I_MPI_ROOT=/opt/intel/oneapi/lib/intel64

[0] MPI Startup(): I_MPI_MPIRUN=mpirun

[0] MPI Startup(): I_MPI_HYDRA_DEBUG=on

[0] MPI Startup(): I_MPI_HYDRA_TOPOLIB=hwloc

[0] MPI Startup(): I_MPI_INTERNAL_MEM_POLICY=default

[0] MPI Startup(): I_MPI_DEBUG=16

<<<< Standard benchmark application output here >>>>

 

1 Intel server (10.103.103.28) and 1 AMD server (10.103.103.37):

> mpirun -genv I_MPI_DEBUG=16 -host 10.103.103.37,10.103.103.28 -n 2 -ppn 1 IMB-MPI1

[0] MPI Startup(): Load tuning file: "/opt/intel/oneapi/lib/intel64/etc/tuning_generic_shm-ofi.dat"

---- APPLICATION HANGS HERE (Ctrl-C to exit) ----

 

Logging into the servers, I can see in 'top' that the IMB-MPI1 application is running on both servers, but it is hung right before the printout of the rank information. The benchmark runs fine if I only call it on the AMD-based server ("mpirun -genv I_MPI_DEBUG=16 -host 10.103.103.37 -n 1 -ppn 1 IMB-MPI1").
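
One thing we can try as an isolation test (not a fix) is forcing a different libfabric provider for the mixed-vendor run, to see whether the hang is specific to the mlx provider path. The tcp and verbs providers below are assumptions about what is available in this installation; 'fi_info -l' lists what is actually present:

> fi_info -l

> mpirun -genv I_MPI_DEBUG=16 -genv FI_PROVIDER=tcp -host 10.103.103.37,10.103.103.28 -n 2 -ppn 1 IMB-MPI1

> mpirun -genv I_MPI_DEBUG=16 -genv FI_PROVIDER=verbs -host 10.103.103.37,10.103.103.28 -n 2 -ppn 1 IMB-MPI1

If either run completes, that would point at the mlx/UCX layer rather than the benchmark or the host setup.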

 

SW:

- Mellanox OFED 4.9-3.1.5.0

- Intel MPI 2021.03.1, FI_PROVIDER=mlx, 100 GbE - RoCEv2

 

 

Server Configs:

- AMD Epyc 7543 x 2 sockets, Mellanox ConnectX5, RHEL 8.3

- Xeon Gold 6248 x 2 sockets, Mellanox ConnectX5, RHEL 7.7

- Xeon Gold 6244 x 2 sockets, Mellanox ConnectX5, RHEL 7.7

- Xeon E5-2650 x 2 sockets, Mellanox ConnectX5, RHEL 8.3
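
Since the nodes run different RHEL releases (7.7 and 8.3), a sanity check we can run on every node is to confirm that the same libfabric and UCX builds are being picked up; the package names below are assumptions and may differ depending on how OFED was installed:

> fi_info --version

> ucx_info -v

> rpm -q libfabric ucx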

 

 

Notes:

1.  IMB-MPI1 works fine with any combination of Intel servers. (No errors and runs successfully to completion)

2.  We have tried upgrading OFED to the latest version (5.4-1.0.3.0); the same hang is observed.

3.  The firewall is disabled on all of the servers.   

4.  All servers can log in successfully with ssh keys (i.e. no password prompts).

5.  Running strace on the 'hung' benchmark applications shows they are stuck calling epoll_wait (a gdb sketch for digging further is below).
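
For anyone hitting the same hang, attaching gdb to the hung rank on each node and dumping all thread backtraces gives more detail than strace about where startup stops; the PID placeholder below is whatever 'top' reports for IMB-MPI1 on that node:

> gdb -q -batch -p <IMB-MPI1 pid> -ex "thread apply all bt"

(In batch mode gdb detaches when done, so the benchmark process is left running.)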

 

 

 

 

3 Replies
JyotsnaK_Intel
Moderator

Thank you for your inquiry. We offer support for hardware platforms that the Intel® oneAPI product supports. These platforms include the Intel® Core™ processor family or higher, the Intel® Xeon® processor family, the Intel® Xeon® Scalable processor family, and others, which can be found here: Intel® oneAPI Base Toolkit System Requirements, Intel® oneAPI HPC Toolkit System Requirements, and Intel® oneAPI IoT Toolkit System Requirements.

If you wish to use oneAPI on hardware that is not listed at one of the sites above, we encourage you to visit and contribute to the open oneAPI specification - https://www.oneapi.io/spec/


JyotsnaK_Intel
Moderator

This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

 

hakostra1
New Contributor II

I believe I have the same problem (except my cluster is a pure Epyc cluster, no vendor mixes). Did you find any solution yet? I tried all recent 2021 releases of Intel MPI and they all behave in the same way.
