Dear Developer,
I am running the oneAPI 2025.0 mp_linpack benchmark on two nodes connected over a RoCEv2 network. When I run runme_intel64_dynamic, it runs on both nodes, but when it finishes, HPL.out contains an error.
How can I debug this, or get more information about the mp_linpack test?
My environment is set up like this:
source /share/apps/oneapi/25.0/setvars.sh
export FI_VERBS_IFACE=ens255np0
export FI_PROVIDER=verbs
export I_MPI_DEBUG=6
export FI_LOG_LEVEL=debug
Then I run the ./runme_intel64_dynamic command.
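With I_MPI_DEBUG and FI_LOG_LEVEL set as above, the run prints a lot of diagnostic text, so it helps to keep all of it in one file that can be searched or attached to a post. A minimal sketch, assuming a bash-compatible shell; the helper name `run_logged` and the log file name are hypothetical and not part of oneAPI:

```shell
# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND, merges its stderr into stdout, and copies everything
# to LOGFILE while still showing it on the terminal.
# Note: the return status is that of tee unless "set -o pipefail" is active.
run_logged() {
    local log=$1
    shift
    "$@" 2>&1 | tee "$log"
}

# Intended use (hypothetical log file name):
#   run_logged linpack_debug.log ./runme_intel64_dynamic
```

The log file can then be searched with grep for the provider and interface names set in the environment, alongside HPL.out.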
Thanks!
Please provide more information: hardware, OS, etc. Does it fail with a single node? FI_PROVIDER=verbs is not supported anymore.
Hi TobiasK,
Thanks for your reply!
The hardware, OS, and other details of the two servers are listed below:
OS: RHEL 8.6 x86_64
CPU: 2*Platinum 8468
Mem: 2TB DDR5 ECC
I have multiple mlx5_* cards connected through a RoCE switch.
# ibdev2netdev
mlx5_0 port 1 ==> ens255np0 (Up)
mlx5_1 port 1 ==> enp41s0np0 (Up)
mlx5_10 port 1 ==> enp218s0np0 (Up)
mlx5_2 port 1 ==> enp59s0np0 (Up)
mlx5_3 port 1 ==> enp83s0np0 (Up)
mlx5_4 port 1 ==> enp86s0f0np0 (Up)
mlx5_5 port 1 ==> enp86s0f1np1 (Up)
mlx5_6 port 1 ==> enp92s0np0 (Up)
mlx5_7 port 1 ==> enp155s0np0 (Up)
mlx5_8 port 1 ==> enp170s0np0 (Up)
mlx5_9 port 1 ==> enp187s0np0 (Up)
mlx5_3 is 10Gb Ethernet; mlx5_4 and mlx5_5 are 25Gb Ethernet.
The other 8 cards are 400Gb Mellanox Technologies MT2910 Family [ConnectX-7] cards.
I use MLNX_OFED_LINUX-5.8-3.0.7.0-rhel8.6-x86_64 and have installed Intel oneAPI 2025.0.
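To make the split between the 400Gb ConnectX-7 ports and the slower Ethernet ports explicit, the ibdev2netdev listing can be filtered with a small shell sketch. The function name `list_fabric_ports` is hypothetical, the input format is assumed to match the lines shown above, and the sample is an abbreviated copy of that output:

```shell
# list_fabric_ports EXCLUDED_DEVICE...
# Reads ibdev2netdev-style lines on stdin and prints "device interface"
# for every port that is Up and whose device is not in the exclusion list.
list_fabric_ports() {
    local skip=" $* "
    awk -v skip="$skip" '
        $2 == "port" && $4 == "==>" && $6 == "(Up)" {
            # Pad with spaces so that mlx5_1 does not match mlx5_10.
            if (index(skip, " " $1 " ") == 0) print $1, $5
        }'
}

# Abbreviated sample taken from the ibdev2netdev output in this post.
sample='mlx5_0 port 1 ==> ens255np0 (Up)
mlx5_3 port 1 ==> enp83s0np0 (Up)
mlx5_4 port 1 ==> enp86s0f0np0 (Up)
mlx5_5 port 1 ==> enp86s0f1np1 (Up)
mlx5_6 port 1 ==> enp92s0np0 (Up)'

# Exclude the 10Gb and 25Gb Ethernet devices.
printf '%s\n' "$sample" | list_fabric_ports mlx5_3 mlx5_4 mlx5_5
```

On a live node the sample would be replaced by piping `ibdev2netdev` itself into the function; the interface set in FI_VERBS_IFACE (ens255np0) should appear in the result.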
Sorry for the late reply; I was on a business trip.
As you said, the single-node linpack run works fine, because it communicates over shm rather than RoCE.