
Fail to run Intel MPI between two machines through RDMA with Intel adapter E810

racky
Beginner
834 Views

Hi, I am following the MPI command given in RDMA/Linux/irdma-1.13.43/README_irdma.txt, trying to run Intel MPI over RDMA between two computers that each have an Intel Ethernet Network Adapter E810, but it doesn't work:


Environment:

  • Ubuntu 20.04.2
  • Intel® Ethernet E810
  • Kernel: 5.15.0-105-generic
  • RDMA driver 29.0

Below is the command I run on zy@192.168.0.2 (machine name e810a). The other machine is user@192.168.0.11 (machine name e810b). I have already checked that passwordless ssh is established between zy@192.168.0.2 and user@192.168.0.11. I also confirmed that ibv_devices lists the devices correctly and that communication over RDMA (rping) works. I also tried other MPI code that I wrote myself.

zy@e810a:~/code/zy_mpi_share/mpi-benchmarks/src_c$ mpirun -l -n 2 -ppn 1 -host user@192.168.0.11,zy@192.168.0.2 -genv I_MPI_DEBUG=1 -genv FI_VERBS_MR_CACHE_ENABLE=1 -genv FI_VERBS_IFACE=rocep23s0f1 -genv FI_OFI_RXM_USE_SRX=0 -genv FI_PROVIDER='psm3' ./IMB-MPI1 Sendrecv
[mpiexec@e810a] Error: Unable to run bstrap_proxy on user@192.168.0.11 (pid 5644, exit code 768)
[mpiexec@e810a] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:157): check exit codes error
[mpiexec@e810a] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:206): poll for event error
[mpiexec@e810a] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1063): error waiting for event
[mpiexec@e810a] Error setting up the bootstrap proxies
[mpiexec@e810a] Possible reasons:
[mpiexec@e810a] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@e810a] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts.
[mpiexec@e810a] Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@e810a] 3. Firewall refused connection.
[mpiexec@e810a] Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@e810a] 4. Ssh bootstrap cannot launch processes on remote host.
[mpiexec@e810a] Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@e810a] You may try using -bootstrap option to select alternative launcher.
[bstrap:0:0@e810b] HYD_sock_connect (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:209): getaddrinfo returned error -3 (Temporary failure in name resolution)
[bstrap:0:0@e810b] main (../../../../../src/pm/i_hydra/libhydra/bstrap/src/hydra_bstrap_proxy.c:532): unable to connect to server e810a at port 41741 (check for firewalls!)

(e810a is the current machine; I don't understand why it fails to connect to the local machine.)
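
One reading of the log above, offered as an assumption rather than a confirmed diagnosis: the failing lookup happens on e810b, not on e810a. The `[bstrap:0:0@e810b]` lines show the proxy on e810b trying to connect back to the launcher by its hostname `e810a`, and `getaddrinfo` cannot resolve that name there. A minimal sketch for checking this on each machine follows; the helper names `resolves`, `hosts_line` and `report_resolution` are hypothetical, and the addresses are the ones quoted in this post.

```shell
# Hypothetical helpers for checking hostname resolution between the two nodes.

# Succeeds if the given hostname resolves through the system resolver
# (/etc/hosts, DNS, ...); prints nothing.
resolves() {
    getent hosts "$1" > /dev/null 2>&1
}

# Prints one /etc/hosts-style line for an address and a hostname.
hosts_line() {
    printf '%s %s\n' "$1" "$2"
}

# Reports, for each hostname given, whether it resolves on this machine.
report_resolution() {
    for name in "$@"; do
        if resolves "$name"; then
            echo "$name: resolves"
        else
            echo "$name: does NOT resolve"
        fi
    done
}
```

Run `report_resolution e810a e810b` on both machines. If a name does not resolve, `hosts_line 192.168.0.2 e810a` prints a line that could be appended (as root) to /etc/hosts on e810b, and likewise for e810b's address on e810a.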

 

I also tried the command with -bootstrap:
zy@e810a:~/code/zy_mpi_share/mpi-benchmarks/src_c$ mpirun -l -n 2 -ppn 1 -host user@192.168.0.11,zy@192.168.0.2 -genv I_MPI_DEBUG=1 -genv FI_VERBS_MR_CACHE_ENABLE=1 -genv FI_VERBS_IFACE=rocep23s0f1 -genv FI_OFI_RXM_USE_SRX=0 -genv FI_PROVIDER='psm3' -bootstrap ./IMB-MPI1 Sendrecv
[mpiexec@e810a] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:993): unrecognized launcher ./IMB-MPI1
[mpiexec@e810a] Error setting up the bootstrap proxies
[mpiexec@e810a] Possible reasons:
[mpiexec@e810a] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@e810a] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts.
[mpiexec@e810a] Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@e810a] 3. Firewall refused connection.
[mpiexec@e810a] Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@e810a] 4. ./IMB-MPI1 bootstrap cannot launch processes on remote host.
[mpiexec@e810a] You may try using -bootstrap option to select alternative launcher.
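
The `unrecognized launcher ./IMB-MPI1` message suggests that `-bootstrap` expects an argument: with nothing after it, the next word (`./IMB-MPI1`) was taken as the launcher name. A sketch of the same command with the launcher named explicitly is below, using `ssh`, the launcher the first log already mentions. The function name `imb_cmd` is hypothetical, and this only addresses the option parsing, not the original bootstrap failure.

```shell
# Prints the mpirun command line from the post with an explicit launcher.
# $1: launcher name passed to -bootstrap (defaults to ssh).
imb_cmd() {
    launcher="${1:-ssh}"
    printf '%s ' \
        mpirun -l -n 2 -ppn 1 \
        -bootstrap "$launcher" \
        -host user@192.168.0.11,zy@192.168.0.2 \
        -genv I_MPI_DEBUG=1 \
        -genv FI_VERBS_MR_CACHE_ENABLE=1 \
        -genv FI_VERBS_IFACE=rocep23s0f1 \
        -genv FI_OFI_RXM_USE_SRX=0 \
        -genv FI_PROVIDER=psm3
    printf '%s\n' './IMB-MPI1 Sendrecv'
}
```

`imb_cmd` only prints the command so that it can be inspected before being run by hand.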

14 Replies
IntelSupport
Community Manager
786 Views

Hi Racky,


Greetings for the day!


I am reaching out to inquire about the current functionality of the 100GbE Intel® Ethernet Network Adapter E810.

Could you please provide an update on whether the adapter is working properly or encountering any issues?


Please feel free to reply to this email. We're here to assist you every step of the way.


Regards,

Vishal


racky
Beginner
777 Views

The adapter is working properly. As I mentioned, ibv_devices lists the devices correctly and the rping tests between the two machines succeeded. In addition, Intel MPI works over RDMA on a single machine (using two interfaces connected by a cable, with each interface acting as a host).

The problem is that Intel MPI doesn't work across the two machines over RDMA.

Simon-Intel
Employee
753 Views

Hi Racky,


Thank you for your prompt response.


We appreciate you sharing your concern with us. Please rest assured that we understand the importance of your situation and are fully committed to assisting you.


Please allow us some time to check this with the internal team, and we will get back to you as soon as we have an update.


If you have any additional information or questions, please feel free to share them with us. We're here to support you every step of the way.


Best regards,

Simon


Simon-Intel
Employee
740 Views

Hi Racky,


Thank you for your patience.


Could you kindly provide the following details?

  1. System Model:
  2. Is the Ethernet card embedded on the board?


Your prompt response with this information will greatly assist us in diagnosing and resolving the issue as quickly as possible.


We look forward to hearing from you soon.


Best regards,

Simon


racky
Beginner
720 Views

Hi Simon,

Thank you for your response. Here are the details you requested:

  1. System Model: Danhe. The system is assembled by a third-party vendor.

  2. Is the Ethernet card embedded on the board?: no

Please let me know if you need any additional information.

Racky

Sachinks
Employee
710 Views

Hello Racky,


Thank you for the reply. You gave the system model as Danhe. Could you specify the motherboard model you are using, so that we can check compatibility?


Regards,

Sachin KS


racky
Beginner
682 Views

Hi,
Motherboard brand: Supermicro

Model: X12DPI-NT6

Sachinks
Employee
679 Views

Hello Racky,


Thank you for the motherboard details. Could you confirm exactly which E810 Ethernet card you are using?

You can find our Ethernet E810 cards at the link below: https://www.intel.com/content/www/us/en/products/details/ethernet/800-network-adapters/e810-network-adapters/products.html


Regards,

Sachin KS


racky
Beginner
651 Views

E810-CAM2

Update: I also tried with the two computers on different subnets, but it still doesn't work.

Hayat
Employee
638 Views

Hi racky,


Please let us know whether you are using the latest driver, and tell us the driver version you currently have installed.


Please also tell us which link you usually use to download driver updates.


Please also provide the output of the following commands:

  • ifconfig
  • ibv_devices
  • ssh user@192.168.0.11
  • ssh zy@192.168.0.2
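
The requested outputs can be gathered in one pass. A minimal sketch, assuming a hypothetical helper `collect` and a report file name of your own choosing held in the variable `REPORT`:

```shell
# Runs each command string given as an argument and appends its output,
# preceded by a header line, to the report file named in $REPORT.
# A non-zero exit status is noted in the report instead of stopping the run.
collect() {
    for cmd in "$@"; do
        printf '===== %s =====\n' "$cmd" >> "$REPORT"
        sh -c "$cmd" >> "$REPORT" 2>&1 ||
            printf '(exit status %s)\n' "$?" >> "$REPORT"
    done
}
```

For example, on e810a: `REPORT=e810a-report.txt; collect "ifconfig" "ibv_devices" "ssh -o BatchMode=yes user@192.168.0.11 true"`. Running `true` remotely with `BatchMode=yes` is an addition to what was asked: it makes the ssh check fail instead of prompting if passwordless login is not actually set up.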


Kindly let us know if you have any questions.


Regards,

Hayat_Intel




racky
Beginner
622 Views

Hi,

I assume you are referring to the driver for the Ethernet controller? I am using 29.0.1 from https://www.intel.cn/content/www/cn/zh/download/15084/intel-ethernet-adapter-complete-driver-pack.html?wapkw=26_4_.zip

root@e810a:# sudo ethtool -i ens81f0
driver: ice
version: 5.15.0-107-generic
firmware-version: 4.30 0x8001af3f 1.3429.0
...

Same as the other computer.
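
In the `ethtool -i` output above, the `version` field is a kernel release string (`5.15.0-107-generic`). In-tree drivers on recent kernels typically report the running kernel's release there, whereas Intel's out-of-tree `ice` packages report their own version number, so this may mean the distribution's in-tree `ice` driver is loaded rather than the one from the 29.0.1 pack. That is an inference, not something confirmed in this thread. A sketch for checking it, with hypothetical helper names `driver_version` and `looks_in_tree`:

```shell
# Reads `ethtool -i <iface>` output on stdin and prints the "version:" value.
driver_version() {
    awk -F': *' '$1 == "version" { print $2; exit }'
}

# Succeeds when a driver version string equals a kernel release string,
# which is what an in-tree driver typically reports.
looks_in_tree() {
    [ "$1" = "$2" ]
}
```

Usage on each machine: `v="$(ethtool -i ens81f0 | driver_version)"; looks_in_tree "$v" "$(uname -r)" && echo "in-tree driver ($v)"`. Comparing `uname -r` on both machines is also worthwhile, since the login banners in this thread show 5.15.0-105 on one and 5.15.0-107 on the other.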

 

root@e810a:/home/zhaoyue# ifconfig
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
ether 02:42:df:fd:b0:7a txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eno1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether 3c:ec:ef:ab:66:c4 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.100.107.104 netmask 255.255.255.0 broadcast 10.100.107.255
inet6 fe80::b556:de96:817a:a1bd prefixlen 64 scopeid 0x20<link>
ether 3c:ec:ef:ab:66:c5 txqueuelen 1000 (Ethernet)
RX packets 2201585 bytes 331241246 (331.2 MB)
RX errors 314 dropped 62104 overruns 0 frame 314
TX packets 482469 bytes 423338900 (423.3 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

ens81f0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 192.168.0.1 netmask 255.255.255.0 broadcast 0.0.0.0
ether 6c:b3:11:21:c5:44 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

ens81f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.0.2 netmask 255.255.255.0 broadcast 0.0.0.0
ether 6c:b3:11:21:c5:45 txqueuelen 1000 (Ethernet)
RX packets 63291414 bytes 95750281682 (95.7 GB)
RX errors 0 dropped 16 overruns 0 frame 0
TX packets 1000688 bytes 67157819 (67.1 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 28829 bytes 3397126 (3.3 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 28829 bytes 3397126 (3.3 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

usb0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 169.254.3.1 netmask 255.255.255.0 broadcast 169.254.3.255
inet6 fe80::c811:13ff:febe:6052 prefixlen 64 scopeid 0x20<link>
ether ca:11:13:be:60:52 txqueuelen 1000 (Ethernet)
RX packets 49 bytes 3560 (3.5 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 992 bytes 141320 (141.3 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

 

root@e810b:# ifconfig
eno1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether 3c:ec:ef:ab:66:a8 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.100.107.156 netmask 255.255.255.0 broadcast 10.100.107.255
inet6 fe80::57d7:6d1b:e919:3fed prefixlen 64 scopeid 0x20<link>
ether 3c:ec:ef:ab:66:a9 txqueuelen 1000 (Ethernet)
RX packets 3096967 bytes 263984126 (263.9 MB)
RX errors 620 dropped 99429 overruns 0 frame 620
TX packets 456760 bytes 400824686 (400.8 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

ens81f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.1 netmask 255.255.255.0 broadcast 0.0.0.0
ether 6c:b3:11:21:c5:48 txqueuelen 1000 (Ethernet)
RX packets 991986 bytes 66087147 (66.0 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 63292544 bytes 95750534988 (95.7 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

ens81f1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 192.168.1.2 netmask 255.255.255.0 broadcast 0.0.0.0
ether 6c:b3:11:21:c5:49 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 7961 bytes 1049470 (1.0 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 7961 bytes 1049470 (1.0 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

usb0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 169.254.3.1 netmask 255.255.255.0 broadcast 169.254.3.255
inet6 fe80::f8ee:2318:3ca7:87de prefixlen 64 scopeid 0x20<link>
ether be:cf:35:33:e5:a8 txqueuelen 1000 (Ethernet)
RX packets 70 bytes 4736 (4.7 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1016 bytes 262649 (262.6 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

 

root@e810a:# ibv_devices
device node GUID
------ ----------------
rocep23s0f0 6eb311fffe21c544
rocep23s0f1 6eb311fffe21c545

root@e810b:# ibv_devices
device node GUID
------ ----------------
rocep23s0f0 6eb311fffe21c548
rocep23s0f1 6eb311fffe21c549
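
`ibv_devices` lists the RDMA device names but not which Ethernet interface each one sits on. Going by the `ifconfig` output above, the cabled (RUNNING) port is ens81f1 on e810a but ens81f0 on e810b, so a single device or interface name passed to both ranks may not fit both machines; that is an inference from the listings, not something confirmed in this thread. A sketch for listing the pairing, assuming the iproute2 `rdma` tool is installed and prints its usual `link <dev>/<port> ... netdev <ifname>` lines; `rdma_netdevs` is a hypothetical helper name.

```shell
# Reads `rdma link show` output on stdin and prints "<rdma-device> <netdev>"
# for every link line that names a network interface.
rdma_netdevs() {
    awk '$1 == "link" {
        split($2, dev, "/")              # "rocep23s0f0/1" -> "rocep23s0f0"
        for (i = 3; i < NF; i++)
            if ($i == "netdev") print dev[1], $(i + 1)
    }'
}
```

Usage on each machine: `rdma link show | rdma_netdevs`.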

 

zy@e810a:~$ ssh user@192.168.0.11
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-105-generic x86_64)

* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/pro

...

user@e810b:~$ ssh zhaoyue@192.168.0.2
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-107-generic x86_64)

* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/pro

...

 

Poojitha
Employee
612 Views

Hi Racky,


Greetings for the day!

 

Thank you for sharing the details with us. We greatly appreciate your cooperation. Please flash the latest firmware using the link below:

 

https://www.intel.com/content/www/us/en/download/19624/non-volatile-memory-nvm-update-utility-for-intel-ethernet-network-adapter-e810-series.html

 

Once you have completed the firmware update, please let us know the status so that we can assist you further.

 

Thank you for your understanding and cooperation throughout this process.

 

Regards,

Poojitha


IntelSupport
Community Manager
483 Views

Hello Racky,


Greetings for the day!


This is the first follow-up regarding the issue you reported to us, "Fail to run Intel MPI on two machines through RDMA on Intel E810".


We wanted to inquire whether you had the opportunity to review the plan of action (POA) we provided.


 

Please feel free to respond to this email at your earliest convenience.


Best Regards,

Vishal


IntelSupport
Community Manager
376 Views

Hello Racky,


Greetings for the day!


This is the final follow-up regarding the reported issue. We're committed to ensuring a swift resolution and would greatly appreciate any updates or additional information you can provide.


If we don't hear back from you soon, we'll assume the issue has been resolved and will proceed to close the case.


Please feel free to respond to this email at your earliest convenience.


Best Regards,

Vishal

