I am using Intel MPI 2021.10 and am getting runtime errors for an application running across two compute nodes. Each compute node has two Mellanox RoCE cards, each with two ports:
# lspci
0000:01:00.0 Ethernet controller: Mellanox Technologies MT2894 Family [ConnectX-6 Lx]
0000:01:00.1 Ethernet controller: Mellanox Technologies MT2894 Family [ConnectX-6 Lx]
0001:3f:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
0001:3f:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
Only one of these ports is connected to the network:
# ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 26.43.1014
node_guid: b8e9:2403:00e9:4b30
sys_image_guid: b8e9:2403:00e9:4b30
vendor_id: 0x02c9
vendor_part_id: 4127
hw_ver: 0x0
board_id: MT_0000000547
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
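Each node has two dual-port adapters, so it is worth confirming on both nodes that the device named in FI_MLX_DEVICES is the one with an active port. The sketch below is minimal and rests on two assumptions: ibv_devinfo produces its default (non-verbose) output layout, as shown above, and the function name `list_active_ports` is hypothetical. It prints each active port in the same device:port form that FI_MLX_DEVICES uses:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every active HCA port as "device:port",
# the same form used for FI_MLX_DEVICES.
# Assumes the default ibv_devinfo layout: one "hca_id:" line per device,
# followed by "port:" and "state:" lines for each of its ports.
list_active_ports() {
  awk '
    $1 == "hca_id:" { hca = $2 }                               # current device, e.g. mlx5_0
    $1 == "port:"   { port = $2 }                              # current port number on that device
    $1 == "state:" && $2 == "PORT_ACTIVE" { print hca ":" port }
  '
}

# Usage on each node: ibv_devinfo | list_active_ports
```

For the listing above it prints `mlx5_0:1`. If the two nodes disagree, or the value differs from FI_MLX_DEVICES, that mismatch would be worth ruling out first.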
I have set the following environment variables:
export I_MPI_OFI_PROVIDER_DUMP=1
export I_MPI_DEBUG=10
export FI_PROVIDER="mlx"
export FI_MLX_DEVICES="mlx5_0:1"
When I start an application built with Intel MPI, the startup output is:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: mlx
However, the application then fails with the following errors:
[cn-0902-01:2383829:0:2383829] ib_mlx5_log.c:179 Local QP operation error on mlx5_0:1/RoCE (synd 0x2 vend 0x68 hw_synd 0/66)
[cn-0902-01:2383829:0:2383829] ib_mlx5_log.c:179 DCI QP 0x8787 wqe[2]: SEND s-e [rqpn 0x10ee6 rmac b8:e9:24:e9:4c:20 sgix 3 dgid ::ffff:10.128.2.2 tc 106] [va 0x7f78ea1fd600 len 32 lkey 0x1b99100]
Any help would be greatly appreciated; I have exhausted ChatGPT and Google!
Thanks,