Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Debug an mp_linpack case over RoCEv2 with oneAPI 2025.0 + Intel MPI

weasley
Beginner

Dear Developer,

I am running a two-node oneAPI 2025.0 mp_linpack benchmark over a RoCEv2 network. When I run runme_intel64_dynamic, it runs on both nodes, but when it finishes there is an error in HPL.out:

[Screenshot attachment weasley_0-1733911669572.png showing the HPL.out error]

How can I debug this or get more information about the mp_linpack test?

My environment is set up like this:

source /share/apps/oneapi/25.0/setvars.sh
export FI_VERBS_IFACE=ens255np0   # RoCE interface used by the verbs provider
export FI_PROVIDER=verbs

export I_MPI_DEBUG=6              # verbose Intel MPI startup/runtime output
export FI_LOG_LEVEL=debug         # verbose libfabric logging

Then I run the ./runme_intel64_dynamic script.
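For reference, one way to keep the I_MPI_DEBUG / FI_LOG_LEVEL output around for inspection is to tee it into a log file; this is only a sketch, and the log file name is arbitrary:

./runme_intel64_dynamic 2>&1 | tee linpack_debug.log
# the I_MPI_DEBUG=6 output should include a "libfabric provider:" line showing which provider was actually selected
grep -i "libfabric provider" linpack_debug.log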

Thanks!

 

TobiasK
Moderator

Please provide more information: HW, OS, etc. Does it fail on a single node? FI_PROVIDER=verbs is not supported anymore.
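As a first check you can list the providers that the libfabric bundled with Intel MPI actually exposes and then try the mlx (UCX-based) provider instead of verbs. Treat the lines below as a sketch to adapt, not a verified configuration; I am assuming the mlx provider is available on your installation:

fi_info -l                 # list the libfabric providers available on this node
export FI_PROVIDER=mlx     # UCX-based provider for Mellanox NICs, if it appears in the list above
export I_MPI_DEBUG=6       # keep debug output on to confirm which provider is selected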

weasley
Beginner

Hi TobiasK,

Thanks for your reply!

The HW/OS/other details of the two servers are listed below:

OS: RHEL 8.6 x86_64

CPU: 2*Platinum 8468

Mem: 2TB DDR5 ECC

I have multiple mlx5_* cards connected through a RoCE switch.

# ibdev2netdev
mlx5_0 port 1 ==> ens255np0 (Up)
mlx5_1 port 1 ==> enp41s0np0 (Up)
mlx5_10 port 1 ==> enp218s0np0 (Up)
mlx5_2 port 1 ==> enp59s0np0 (Up)
mlx5_3 port 1 ==> enp83s0np0 (Up)
mlx5_4 port 1 ==> enp86s0f0np0 (Up)
mlx5_5 port 1 ==> enp86s0f1np1 (Up)
mlx5_6 port 1 ==> enp92s0np0 (Up)
mlx5_7 port 1 ==> enp155s0np0 (Up)
mlx5_8 port 1 ==> enp170s0np0 (Up)
mlx5_9 port 1 ==> enp187s0np0 (Up)

mlx5_3 is a 10Gb Ethernet card; mlx5_4 and mlx5_5 are 25Gb Ethernet.

The other 8 cards are 400Gb Mellanox Technologies MT2910 Family [ConnectX-7] cards.

I use MLNX_OFED_LINUX-5.8-3.0.7.0-rhel8.6-x86_64 and have installed Intel oneAPI 2025.0.
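With that many RDMA-capable ports per node, it may also matter that both nodes use the same 400Gb port rather than the 10Gb or 25Gb cards. The device names below come from the ibdev2netdev output above; this is only a sketch of how the job could be pinned to a single port:

# if staying on the verbs provider: select the 400Gb interface by its netdev name
export FI_VERBS_IFACE=ens255np0
# if switching to the mlx provider: pin UCX to the matching RDMA device and port instead
export UCX_NET_DEVICES=mlx5_0:1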

TobiasK
Moderator

@weasley are you able to run it on a single node?

weasley
Beginner

Sorry for the late reply; I was away on business travel.

As you say, the single-node linpack runs fine, because it runs over shm, not RoCE.
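To exercise the RoCE path without the second node, one thing that could be tried (a sketch on my side, assuming I_MPI_FABRICS behaves as documented) is to disable shm so even the single-node run goes through libfabric:

export I_MPI_FABRICS=ofi        # force all communication through libfabric, bypassing shm
export I_MPI_DEBUG=6
./runme_intel64_dynamic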
