Intel MPI with Mellanox RoCE

miahw

I am using Intel MPI 2021.10 and get runtime errors when running an application across two compute nodes. Each compute node has two Mellanox RoCE cards, each with two ports:

# lspci
0000:01:00.0 Ethernet controller: Mellanox Technologies MT2894 Family [ConnectX-6 Lx]
0000:01:00.1 Ethernet controller: Mellanox Technologies MT2894 Family [ConnectX-6 Lx]
0001:3f:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
0001:3f:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]

Only one of them is connected to the network:

# ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         26.43.1014
        node_guid:                      b8e9:2403:00e9:4b30
        sys_image_guid:                 b8e9:2403:00e9:4b30
        vendor_id:                      0x02c9
        vendor_part_id:                 4127
        hw_ver:                         0x0
        board_id:                       MT_0000000547
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)

I have set the following environment variables:

export I_MPI_OFI_PROVIDER_DUMP=1
export I_MPI_DEBUG=10
export FI_PROVIDER="mlx"
export FI_MLX_DEVICES="mlx5_0:1"

When I start an MPI application that has been built with Intel MPI, I get the output:

[0] MPI startup(): Intel(R) MPI Library, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: mlx

But the application fails with the following errors:

[cn-0902-01:2383829:0:2383829] ib_mlx5_log.c:179 Local QP operation error on mlx5_0:1/RoCE (synd 0x2 vend 0x68 hw_synd 0/66)
[cn-0902-01:2383829:0:2383829] ib_mlx5_log.c:179 DCI QP 0x8787 wqe[2]: SEND s-e [rqpn 0x10ee6 rmac b8:e9:24:e9:4c:20 sgix 3 dgid ::ffff:10.128.2.2 tc 106] [va 0x7f78ea1fd600 len 32 lkey 0x1b99100]
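
As far as I can tell, the `sgix 3` field in the second line is the source GID table index that UCX picked for the queue pair, so it may be worth checking what sits at index 3 on mlx5_0 port 1. Below is a minimal sketch for dumping the populated GID entries. It assumes the standard Linux sysfs layout under /sys/class/infiniband; `list_gids` is a hypothetical helper name I made up, and the optional third argument (an alternative sysfs root) exists only so the function can be tried against a copy of the tree.

```shell
#!/usr/bin/env bash
# list_gids DEV PORT [ROOT]
# Hypothetical helper: prints "index<TAB>gid<TAB>type" for every populated
# GID table entry of an RDMA device port.
# Assumed sysfs layout (standard on Linux, but verify on your system):
#   ROOT/DEV/ports/PORT/gids/<index>
#   ROOT/DEV/ports/PORT/gid_attrs/types/<index>
list_gids() {
    local dev="$1" port="$2" root="${3:-/sys/class/infiniband}"
    local base="$root/$dev/ports/$port"
    local f idx gid type
    for f in "$base"/gids/*; do
        [ -r "$f" ] || continue
        idx="${f##*/}"
        gid="$(cat "$f" 2>/dev/null)" || continue
        # Skip unpopulated (empty or all-zero) entries.
        [ -n "$gid" ] || continue
        [ "$gid" = "0000:0000:0000:0000:0000:0000:0000:0000" ] && continue
        # The type file may be unreadable for some entries; fall back to "unknown".
        type="$(cat "$base/gid_attrs/types/$idx" 2>/dev/null || echo unknown)"
        printf '%s\t%s\t%s\n' "$idx" "$gid" "$type"
    done
}
```

Running `list_gids mlx5_0 1` on each node should show whether index 3 holds the RoCE version and IP address that the fabric expects. If it does not, UCX's `UCX_IB_GID_INDEX` environment variable is one setting that could be tried to select a different entry, although I have not confirmed that this is the cause of the error here.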

Any help will be greatly appreciated, as I have exhausted ChatGPT and Google!

Thanks,
