Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.
1963 Discussions

Intel MPI 2021.3.1 hangs on launch/startup

RMcGinnis
Beginner
234 Views

We are attempting to integrate an HP DL385G10 - AMD Epyc 7543 based server into an existing Intel server cluster.  The existing cluster is made of of Intel Xeon Gold 6248, 6244 and E5-2650 based servers.  

 

To verify the installation we are attempting to run the Intel MPI benchmark application (IMB-MPI1).   When we launch IMB-MPI1 using only Intel servers the benchmark runs to completion with no issues.  When we attempt to launch the benchmark using the AMD server along with Intel server(s) the benchmark will hang on launch and fails to run.

 

All Intel servers (10.103.103.27 & 10.103.103.28):

> mpirun -genv I_MPI_DEBUG=16 -host 10.103.103.27,10.103.103.28 -n 2 -ppn 1 IMB-MPI1

[0] MPI Startup(): Load tuning file: “/opt/intel/oneapi/lib/intel64/etc/tuning_skx_shm-ofi.dat”

[0] MPI Startup(): Rank  Pid          Node name

[0] MPI Startup(): 0          6954       dd11a.local {0,20,40,60}

[0] MPI Startup(): 1          7863       dd12a.local {0,20,40,60}

[0] MPI Startup():  I_MPI_ROOT=/opt/intel/oneapi/lib/intel64

[0] MPI Startup(): I_MPI_MPIRUN=mpirun

[0] MPI Startup(): I_MPI_HYDRA_DEBUG=on

[0] MPI Startup(): I_MPI_HYDRA_TOPOLIB=hwloc

[0] MPI Startup(): I_MPI_INTERNAL_MEM_POLICY=default

[0] MPI Startup(): I_MPI_DEBUG=16

<<<< Standard benchmark application output here >>>>

 

1 Intel server (10.103.103.28) and 1 AMD server (10.103.103.37)

> mpirun -genv I_MPI_DEBUG=16 -host 10.103.103.37,10.103.103.28 -n 2 -ppn 1 IMB-MPI1

[0] MPI Startup():Load tuning file: “/opt/intel/oneapi/lib/intel64/etc/tuning_generic_shm-ofi.dat”

---- APPLICATION HANGS HERE (Ctrl-C to exit) ----

 

Logging into the servers I can see the IMB-MPI1 application running on both servers looking at 'top' but the application is hung right before the printout the of rank information.  The benchmark application runs fine if I only call the benchmark on the AMD based server ("mpirun -genv I_MPI_DEBUG=16 -host 10.103.103.37 -n 1 -ppn 1 IMB-MPI1")  

 

SW:

- Mellanox OFED 4.9-3.1.5.0

- Intel MPI 2021.03.1, FI_PROVIDER=mlx, 100 GbE - RoCEV2

 

 

Server Configs:

- AMD Epyc 7543 x 2 sockets, Mellanox ConnectX5, RHEL 8.3

- Xeon Gold 6248 x 2 sockets, Mellanox ConnectX5, RHEL 7.7

- Xeon Gold 6244 x 2 sockets, Mellanox ConnectX5, RHEL 7.7

- Xeon E-2650 x 2 sockets, Mellanox ConnectX5, RHEL 8.3

 

 

Notes:

1.  IMB-MPI1 works fine with any combination of Intel servers. (No errors and runs successfully to completion)

2.  We have tried upgrading OFED to the latest version (5.4-1.0.3.0) the same hang is observed.

3.  The firewall is disabled on all of the servers.   

4.  All servers can login successfully with ssh keys (i.e. no passwords prompts)

5.  Running strace on the 'hung' benchmark applications they appear to be stuck calling epoll_wait.

 

Labels (1)
0 Kudos
0 Replies
Reply