Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI Isend/Recv with Waitall using RoCE protocol hangs with large message size

KarenD
Beginner

A colleague wrote a small MPI Isend/Recv test case to try to reproduce a performance issue with an application when using RoCE, but the test case hangs with large message sizes when run with 2 or more processes per node across 2 or more nodes. The same test case runs successfully with large message sizes in an environment with Infiniband.

Initially it hung with message sizes larger than 16K, but using the FI_OFI_RXM_BUFFER_SIZE variable allowed increasing the message size to about 750K.  We were trying to get to 1 MB, but no matter how large FI_OFI_RXM_BUFFER_SIZE is set, the test hangs with a message size of 1 MB.  Are there other MPI settings or OS settings that may need to be increased?  I also tried setting FI_OFI_RXM_SAR_LIMIT, but that didn't help. Here is the current set of MPI options for the test when using RoCE:

mpi_flags='-genv I_MPI_OFI_PROVIDER=verbs -genv FI_VERBS_IFACE=vlan50 -genv I_MPI_OFI_LIBRARY_INTERNAL=1 -genv I_MPI_FABRICS=shm:ofi -genv FI_OFI_RXM_BUFFER_SIZE=2000000 -genv FI_OFI_RXM_SAR_LIMIT=4000000 -genv I_MPI_DEBUG=30 -genv FI_LOG_LEVEL=debug'

The environment is SLES 15 SP2 with Intel OneAPI Toolkit version 2021.2, with Mellanox CX-6 network adapters in Ethernet mode and 100 Gb Aruba switches.  The NICs and switches have been configured for RoCE traffic per guidelines from Mellanox and our Aruba engineering team.

A screenshot of the main loop of MPI code (I will get the full source code from my colleague), along with the output of the test when the message size is 1 MB and  I_MPI_DEBUG=30 and FI_LOG_LEVEL=debug are attached. The script that is used to run the test is shown below.  The script input parameters are the number of repetitions and message size.

#!/bin/bash
cf_args=()
while [ $# -gt 0 ]; do
    cf_args+=("$1")
    shift
done

source /opt/intel/oneapi/mpi/2021.3.0/env/vars.sh

set -e
set -u

mpirun --version 2>&1 | grep -i "intel.*mpi"

hostlist='-hostlist perfcomp3,perfcomp4'

mpi_flags='-genv I_MPI_OFI_PROVIDER=verbs -genv FI_VERBS_IFACE=vlan50 -genv I_MPI_OFI_LIBRARY_INTERNAL=1 -genv I_MPI_FABRICS=shm:ofi -genv FI_OFI_RXM_BUFFER_SIZE=2000000 -genv FI_OFI_RXM_SAR_LIMIT=4000000 -genv I_MPI_DEBUG=30 -genv FI_LOG_LEVEL=debug -genv I_MPI_OFI_PROVIDER_DUMP=1'

echo "$hostlist"

mpirun -ppn 1 \
$hostlist $mpi_flags \
hostname

num_nodes=$(mpirun -ppn 1 $hostlist $mpi_flags hostname | sort -u | wc -l)
echo "num_nodes=$num_nodes"

mpirun -ppn 2 \
$hostlist $mpi_flags \
singularity run -H `pwd` \
/var/tmp/paulo/gromacs/gromacs_tau.sif \
v4/mpi_isend_recv "${cf_args[@]}"

MPI_Isend_Recv main loop
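For anyone reproducing this, a sweep over message sizes can be scripted along the lines of the sketch below (the launch command passed in is a placeholder for the real launch script; `timeout` kills a hung run so the sweep continues instead of blocking):

```shell
# Sketch: probe which message sizes hang, killing each run after a timeout.
# The first argument is a placeholder for the real launch command; the
# remaining arguments are the message sizes to try.
probe_sizes() {
    cmd=$1; shift
    for size in "$@"; do
        # timeout exits 124 if the run hangs past 120 s, so it lands
        # in the hang/fail branch rather than stalling the sweep
        if timeout 120 $cmd "$size" >/dev/null 2>&1; then
            echo "size=$size ok"
        else
            echo "size=$size hang/fail"
        fi
    done
}
# e.g.: probe_sizes "sh launch_mpi_isend_recv_v4_2ppn.sh --repetitions 1 --size" 262144 524288 786432 1048576
```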

KarenD
Beginner

Attaching the OS kernel settings for the servers, 99-sysctl.conf.txt (.txt extension added to allow it to be uploaded)

VarshaS_Intel
Moderator

Hi,


Thanks for posting in Intel Communities.


We are working on your issue internally and will get back to you soon.


Thanks & Regards,

Varsha


KarenD
Beginner

Attaching the full source code for the Isend/Recv test.  Also, here are the settings configured on the Mellanox CX-6 adapter:

 

perfcomp4:~ # cat mellanox_advanced_tuning.sh

mlxconfig -d 43:00.0 --yes s ADVANCED_PCI_SETTINGS=1

mlxconfig -d 43:00.1 --yes s ADVANCED_PCI_SETTINGS=1

mlxconfig -d 43:00.1 --yes set MAX_ACC_OUT_READ=32

mlxconfig -d 43:00.0 --yes set MAX_ACC_OUT_READ=32

mlxconfig -d 43:00.0 --yes set PCI_WR_ORDERING=1

mlxconfig -d 43:00.1 --yes set PCI_WR_ORDERING=1

mlxfwreset -d 0000:43:00.0 --level 3 -y reset

 

perfcomp4:~ # cat config_mellanox_roce.sh

# Use ip link to set priority 3 on vlan_50

ip link set vlan50 type vlan egress 2:3

# set DSCP to 26 and PRIO to 3

mlxconfig -y -d 0000:43:00.0 set CNP_DSCP_P1=26 CNP_802P_PRIO_P1=3

mlxconfig -y -d 0000:43:00.1 set CNP_DSCP_P2=26 CNP_802P_PRIO_P2=3

mlxfwreset -d 0000:43:00.0 --level 3 -y reset

# set tos to 106

echo 106 > /sys/class/infiniband/mlx5_bond_0/tc/1/traffic_class

cma_roce_tos -d mlx5_bond_0 -t 106

# set trust pcp and set pfc for queue 3

mlnx_qos -i ens2f0 --trust pcp

mlnx_qos -i ens2f0 --pfc 0,0,0,1,0,0,0,0

mlnx_qos -i ens2f1 --trust pcp

mlnx_qos -i ens2f1 --pfc 0,0,0,1,0,0,0,0

# enable ECN for TCP

sysctl -w net.ipv4.tcp_ecn=1

KarenD
Beginner

My colleague built the test case using Open MPI, and it runs to completion with Open MPI.  So it appears that we're missing some setting needed for it to work with Intel MPI, or there is a network or kernel setting that affects the test case when using Intel MPI.  We'd appreciate any insight that you may have.

KarenD
Beginner

Some additional info when the MPI provider is changed from verbs to the default setting of psm3:

If the following MPI flags are removed from the script to run the testcase, allowing the default settings to be selected, the testcase runs to completion with a message size of 1 MB:

mpi_flags='-genv I_MPI_OFI_PROVIDER=verbs -genv FI_VERBS_IFACE=vlan50 -genv I_MPI_OFI_LIBRARY_INTERNAL=1 -genv I_MPI_FABRICS=shm:ofi'

Setting I_MPI_DEBUG=10 shows that the provider when no flags are set is psm3:

[0] MPI startup(): libfabric provider: psm3

However, the Intel MPI documentation says that psm3 is for Intel NICs, and I am using Mellanox CX-6 NICs:

  • PSM3 supports standard Ethernet networks and leverages standard RoCEv2 protocols as implemented by the Intel® Ethernet Fabric Suite NICs

Looking at tcpdump data collected during the run of the testcase, the tcpdump data with the psm3 provider differs from tcpdump data with the verbs provider:

  • The protocol is shown as RoCE, whereas with the verbs provider the protocol is shown as RRoCE.
  • With psm3, there are far fewer entries in the tcpdump data marked as RoCE compared to the RRoCE entries seen with the verbs provider.  All of the psm3 entries are shown as “UD Send Only”, whereas with the verbs provider there are many types of entries, including “RD Send Only”, “RD Acknowledge”, “RDMA Read Request”, “RDMA Read Response First”, “RDMA Read Response Middle”, and “RDMA Read Response End”.
  • In addition, both the NICs and the Aruba switches are configured to set RoCE traffic to priority 3, which is allowed more bandwidth on the switch than other traffic. However, when the psm3 provider is used, the RoCE traffic seen in the tcpdump data is shown with priority 0.

The attached file named mpi_psm3.pcap.png is a sample of the tcpdump data shown in wireshark for the case where the provider is psm3. Note that the only RoCE traffic is “UD Send Only” and the priority is 0.

The attached file named mpi_verbs.pcap.png is a sample of the tcpdump data when the provider is verbs.  Notice that there are many types of RRoCE entries and the priority is set to 3.

My understanding was that I should be using the verbs provider to use the RoCE protocol with the Mellanox CX-6 adapter. What am I missing to get the testcase to run to completion with a message size of 1 MB?

KarenD
Beginner

Here are the screenshots of the tcpdump data mentioned in the last post (see attached files)

Vinutha_SV
Moderator

Hi Karen,

Can you check the UCX version? With Intel MPI 2021.2 and above, the UCX version needs to be greater than 1.6.

to check UCX version:

~> ucx_info -v


KarenD
Beginner

The UCX version is 1.10.0:

perfcomp3:~ # ucx_info -v
# UCT version=1.10.0 revision 7477e81
# configured with: --host=x86_64-suse-linux-gnu --build=x86_64-suse-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/lib --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-dependency-tracking --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --without-cm --with-knem --with-rdmacm --without-rocm --without-xpmem --without-ugni --disable-numa

KarenD
Beginner

I ran some additional tests, varying the number of processes per node (ppn), and running multiple iterations of the test. With 2 ppn, the test hangs on the first iteration.  With 4 or 16 ppn, multiple iterations can be run, but eventually the test hangs:

 

            2 nodes                    4 nodes
2 ppn       Hangs on 1st repetition    Hangs on 1st repetition
4 ppn       Hangs on 28th repetition   Hangs on 8th repetition
16 ppn      Hangs on 10th repetition   Hangs on 2nd repetition

 

When the test hangs, all of the processes are consuming 100% CPU, so perhaps they are all spinning while waiting on some resource. Do you have any recommendations on how to determine what it's waiting for?

Note that the same test case runs successfully with Open MPI.  The 4-node, 16-ppn test was getting very high latency, but that was resolved by setting UCX_RNDV_THRESH to a value higher than the message size (in this case the message size is 1 MB and I set UCX_RNDV_THRESH to 2 MB).  With Intel MPI, I have already set FI_OFI_RXM_BUFFER_SIZE=2000000, but that only allowed running with a message size up to about 750 KB for the 2-node, 2-ppn test.

Is it possible to use the UCX options with Intel MPI and the verbs provider? Or is the mlx provider (which I believe requires Infiniband) required to use UCX?

Vinutha_SV
Moderator

Please try the MLX provider, as it covers RoCE as well as InfiniBand. With the MLX provider you may use the same set of UCX-level controls you are using with OpenMPI+UCX, if needed.

 

Can you run the command ibv_devinfo and send me the details? (We need to know which transport it is using.)


KarenD
Beginner

Here is the output from the ibv_devinfo command:

perfcomp3:~ # ibv_devinfo

hca_id: mlx5_bond_0

        transport:                      InfiniBand (0)

        fw_ver:                         20.30.1004

        node_guid:                      9440:c9ff:ffa9:e258

        sys_image_guid:                 9440:c9ff:ffa9:e258

        vendor_id:                      0x02c9

        vendor_part_id:                 4123

        hw_ver:                         0x0

        board_id:                       MT_0000000453

        phys_port_cnt:                  1

                port:   1

                        state:                  PORT_ACTIVE (4)

                        max_mtu:                4096 (5)

                        active_mtu:             4096 (5)

                        sm_lid:                 0

                        port_lid:               0

                        port_lmc:               0x00

                        link_layer:             Ethernet

 

I tried using the UCX options with the mlx provider, but I am getting errors from that provider:

 

perfcomp3:/var/tmp/paulo/mpi_tests # sh launch_mpi_isend_recv_v4_2ppn.sh --repetitions 1 --size 1000000

Intel(R) MPI Library for Linux* OS, Version 2021.3 Build 20210601 (id: 6f90181f1)

-hostlist perfcomp3,perfcomp4

perfcomp3

perfcomp4

num_nodes=2

[0] MPI startup(): Intel(R) MPI Library, Version 2021.3  Build 20210601 (id: 6f90181f1)

[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation.  All rights reserved.

[0] MPI startup(): library kind: release

[0] MPI startup(): shm segment size (1580 MB per rank) * (2 local ranks) = 3160 MB total

[2] MPI startup(): shm segment size (1580 MB per rank) * (2 local ranks) = 3160 MB total

[0] MPI startup(): libfabric version: 1.12.1-impi

Abort(1091215) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:

MPIR_Init_thread(138)........:

MPID_Init(1169)..............:

MPIDI_OFI_mpi_init_hook(1419): OFI addrinfo() failed (ofi_init.c:1419:MPIDI_OFI_mpi_init_hook:No data available)

 

Here are the UCX options that I tried:

mpi_flags='-genv I_MPI_OFI_PROVIDER=mlx -genv I_MPI_FABRICS=shm:ofi -genv UCX_TLS=self,rc,dc,ud -genv UCX_NET_DEVICES=mlx5_bond_0:1 -genv UCX_RNDV_THRESH=2000000 -genv UCX_IB_SL=3 -genv UCX_IB_TRAFFIC_CLASS=106 -genv UCX_IB_GID_INDEX=3 -genv I_MPI_DEBUG=10'

 

All of the UCX options except UCX_TLS have been used successfully with the Open MPI version of this testcase.  I added UCX_TLS because I found the following info, which says that UCX_TLS must be explicitly set when using Intel MPI with AMD processors (the test case was also failing with the same error shown above before I added UCX_TLS):

 

https://github.com/openucx/ucx/issues/6253

Here are the available transports from the ucx_info command:

perfcomp4:/var/tmp/pcap # ucx_info -d | grep Transport

#      Transport: posix

#      Transport: sysv

#      Transport: self

#      Transport: tcp

#      Transport: tcp

#      Transport: tcp

#      Transport: rc_verbs

#      Transport: rc_mlx5

#      Transport: dc_mlx5

#      Transport: ud_verbs

#      Transport: ud_mlx5

#      Transport: cma

 

Here is the mpirun command line when using the Open MPI version of the test case:

mpirun -np $num_proc \
--machinefile hfile_karen \
--prefix /var/tmp/paulo/openmpi-4.1.2 \
--map-by ppr:8:socket --report-bindings \
--mca btl ^openib \
--mca pml ucx -x UCX_IB_SL=3 -x UCX_NET_DEVICES=mlx5_bond_0:1 -x UCX_IB_TRAFFIC_CLASS=106 -x UCX_IB_GID_INDEX=3 \
--allow-run-as-root \
./runme.sh

What are the proper options to use to allow UCX to be used with Intel MPI?

KarenD
Beginner

There have been several email exchanges in the past week, but none of the suggestions have resolved the issue. I tried using I_MPI_OFI_PROVIDER=mlx and FI_MLX_NET_DEVICES=mlx5_bond_0:1, but this results in the message "no mlx device is found". So it's not clear that the mlx provider can be used with the Mellanox CX-6 adapter in Ethernet mode with RoCE protocol. So I'm still trying to understand how to use UCX options with Intel MPI. My test case runs successfully with Open MPI and UCX options, but hangs with Intel MPI with the provider set to verbs if the message size is 1 MB, and it fails with an error with the provider set to mlx. Further details are provided in the email exchange shown below, and the output from setting log level to debug with the mlx provider is attached. 


Sent: Thursday, March 31, 2022 11:34 AM

Hi Julia,

Thanks for the suggestion.  The error message remains the same when using FI_MLX_NET_DEVICES. The output with log level set to debug is attached. The MPI options that were used and some of the output with the error message are shown below.  I’m still puzzled about the suggestion to use the mlx provider, since the Intel MPI documentation indicates that it is for usage with Mellanox Infiniband hardware.  I am using the Mellanox CX-6 adapter, but it is configured in Ethernet mode, not Infiniband, and I don’t have Infiniband switches. The network adapters and switches are configured for RoCE. I was trying to use the UCX options with Intel MPI because this test case that fails with Intel MPI is working fine with Open MPI and UCX options on this same hardware. I’m trying to determine if it’s possible to use the UCX options with Intel MPI. Note that originally I was using I_MPI_OFI_PROVIDER=verbs but the test case was hanging with a message size of 1 MB. Any further advice on how to get the test to work with Intel MPI and UCX (or other options) is much appreciated.  The full details of the original case are posted here:

 

https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/MPI-Isend-Recv-with-Waitall-using-RoCE-proto...

 

Thanks,

Karen

 

mpi_flags='-genv I_MPI_OFI_PROVIDER=mlx -genv I_MPI_FABRICS=shm:ofi -genv FI_MLX_NET_DEVICES=mlx5_bond_0:1 -genv I_MPI_DEBUG=10 -genv FI_LOG_LEVEL=debug'

perfcomp3:/var/tmp/paulo/mpi_tests # sh launch_mpi_isend_recv_v4_2ppn.sh --repetitions 1 --size 1000000 2>&1 | tee launch_mpi_isend_recv_v4_2ppn_ompi_debug_MLXdev.out

libfabric:16416:verbs:fabric:verbs_devs_print():880<info> list of verbs devices found for FI_EP_MSG:

libfabric:16416:verbs:fabric:verbs_devs_print():884<info> #1 mlx5_bond_0 - IPoIB addresses:

libfabric:16416:verbs:fabric:verbs_devs_print():894<info>       172.30.217.84

libfabric:16416:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_bond_0: first found active port is 1

 

libfabric:16419:mlx:core:mlx_getinfo():171<info> no mlx device is found.

libfabric:16419:core:core:ofi_layering_ok():964<info> Need core provider, skipping ofi_rxm

Abort(1091215) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:

MPIR_Init_thread(138)........:

MPID_Init(1169)..............:

MPIDI_OFI_mpi_init_hook(1419): OFI addrinfo() failed (ofi_init.c:1419:MPIDI_OFI_mpi_init_hook:No data available)

 

Sent: Thursday, March 31, 2022 6:56 AM

 

Hi Karen,
Could you please replace the UCX_NET_DEVICES environment variable with FI_MLX_NET_DEVICES and rerun with FI_LOG_LEVEL=debug and FI_PROVIDER=mlx?

BRs,
Julia


Sent: Wednesday, March 30, 2022 8:25 PM

Hi Alex,

 

Thanks for the suggestions.  The output with log level set to debug is attached.  With Open MPI, I was using UCX_NET_DEVICES=mlx5_bond_0:1, but when using Intel MPI with that setting and I_MPI_OFI_PROVIDER=mlx, I see that it says no Mellanox device is found:

 

libfabric:9588:mlx:core:mlx_getinfo():171<info> no mlx device is found.

 

Looks like it is seeing mlx5_bond_0 as a verbs device:

 

libfabric:31635:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_bond_0: first found active port is 1

 

But I am trying to use the mlx provider so that I could specify the UCX options. The servers have dual-port CX-6 Mellanox adapters configured with Ethernet mode and the two ports are configured in a bond. The NICs and Aruba 100Gb switches are configured for RoCE. Is that compatible with using the mlx provider? If so, what device name should be specified? With Intel MPI and the verbs provider, I was specifying  FI_VERBS_IFACE=vlan50. The vlan device vlan50 is configured on top of the bond for the Mellanox adapter. With OpenMPI, I was using UCX_IB_GID_INDEX=3, which corresponds to the vlan50 device with RoCE v2:

 

perfcomp4: # /usr/sbin/show_gids

DEV     PORT    INDEX   GID                                     IPv4           VER      DEV

---     ----    -----   ---                                     ------------   ---      ---

mlx5_bond_0     1       0       fe80:0000:0000:0000:9640:c9ff:fea9:e248        v1       bond0

mlx5_bond_0     1       1       fe80:0000:0000:0000:9640:c9ff:fea9:e248        v2       bond0

mlx5_bond_0     1       2       0000:0000:0000:0000:0000:ffff:ac1e:d954 172.30.217.84   v1      vlan50

mlx5_bond_0     1       3       0000:0000:0000:0000:0000:ffff:ac1e:d954 172.30.217.84   v2      vlan50

n_gids_found=4
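The same RoCE v2 GID indexes can also be read straight from sysfs instead of show_gids; a sketch (this assumes the standard rdma sysfs layout, and the path in the usage line is just an example for this system):

```shell
# Hypothetical helper: list the GID indexes whose type is "RoCE v2", given a
# port's gid_attrs/types directory from the standard rdma sysfs layout
# (each file there contains "IB/RoCE v1" or "RoCE v2").
roce_v2_gids() {
    dir=$1
    for f in "$dir"/*; do
        [ -r "$f" ] || continue
        grep -q "RoCE v2" "$f" 2>/dev/null && echo "${f##*/}"
    done
}
# e.g.: roce_v2_gids /sys/class/infiniband/mlx5_bond_0/ports/1/gid_attrs/types
```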

 

I tried setting UCX_NET_DEVICES=vlan50 but that gives the same errors as when using UCX_NET_DEVICES=mlx5_bond_0:1. Also, not specifying UCX_NET_DEVICES at all yields the same error.

 

Also, you requested, “please align UCX settings for IMPI and OpenMPI”. Which settings are you suggesting that I add or remove? This is what I am using with Intel MPI and UCX options:

 

mpi_flags='-genv I_MPI_OFI_PROVIDER=mlx -genv I_MPI_FABRICS=shm:ofi -genv UCX_TLS=self,rc,dc,ud -genv UCX_NET_DEVICES=mlx5_bond_0:1 -genv UCX_RNDV_THRESH=2000000 -genv UCX_IB_SL=3 -genv UCX_IB_TRAFFIC_CLASS=106 -genv UCX_IB_GID_INDEX=3 -genv I_MPI_DEBUG=10 -genv FI_LOG_LEVEL=debug'

 

This is what I am using with Open MPI:

 

        --mca pml ucx -x UCX_IB_SL=3 -x UCX_NET_DEVICES=mlx5_bond_0:1 -x UCX_IB_TRAFFIC_CLASS=106 -x UCX_IB_GID_INDEX=3 -x UCX_RNDV_THRESH=2000000

 

Note that UCX_TLS=self,rc,dc,ud was only added after receiving errors using I_MPI_OFI_PROVIDER=mlx with Intel MPI. The errors are the same regardless of whether that option is used or not.

 

Thanks,

Karen


Sent: Wednesday, March 30, 2022 4:21 AM

 

Hi Karen,

Could you please add FI_LOG_LEVEL=debug and provide a log for this run? That would help identify the root cause of the problem with the mlx provider.

Additionally, could you please align the UCX settings for IMPI and OpenMPI?
Thanks in advance.

 

> What are the proper options to use to allow UCX to be used with Intel MPI?

In general, FI_PROVIDER=mlx or I_MPI_OFI_PROVIDER=mlx should be enough.

 

-

With best regards, Alex.

Rafael_L_Intel
Employee

Hi Karen,

  

 1) Can you try a newer libfabric? The one shipped with IMPI is 1.12.1, but the newest is 1.15.0.

 2) Can you reproduce this issue on an Intel machine with the newest libfabric?

 3) If yes, can you show where the call hangs, e.g. by using gstack?
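For 3), something along these lines could be a first look on each node (a sketch: the pgrep pattern in the usage line is a placeholder, and gstack ships with gdb; wchan from /proc gives a quick hint before attaching a debugger):

```shell
# Sketch: for each PID, print the process state and kernel wait channel.
# A wchan of "0"/"-" together with 100% CPU suggests user-space spinning,
# so follow up with gstack/gdb for a full user-space backtrace.
show_wait() {
    for pid in "$@"; do
        state=$(awk '{print $3}' "/proc/$pid/stat")
        wchan=$(cat "/proc/$pid/wchan" 2>/dev/null || echo '?')
        echo "pid=$pid state=$state wchan=$wchan"
        # gstack "$pid"   # uncomment if gdb/gstack is installed
    done
}
# e.g.: show_wait $(pgrep -f mpi_isend_recv)
```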

  

 Cheers,

 Rafael


KarenD
Beginner

According to my benchmark output, Intel MPI 2021.4 is using libfabric 1.13.0-impi and 2021.5 is using 1.13.2rc1-impi. Where can I get the Intel version of 1.15? I tried downloading and building libfabric 1.15.1 from github, and decided to first try it with Intel MPI 2021.4 because 2021.4 was already working with my test case. I modified

/opt/intel/oneapi/mpi/2021.4.0/env/vars.sh

to point to the libfabric 1.15.1 directory, but the test is getting a segfault and says that the provider mlx is unknown. The github version of 1.15.1 didn't have a libfabric/lib/prov directory, so I had just left FI_PROVIDER_PATH pointing to the original 2021.4 directory.  Here's what I changed in vars.sh:

 

case "$i_mpi_ofi_library_internal" in

    0|no|off|disable)

        ;;

    *)

        PATH="${I_MPI_ROOT}/${PLATFORM}/libfabric/bin:${PATH}"; export PATH

#        LD_LIBRARY_PATH="${I_MPI_ROOT}/${PLATFORM}/libfabric/lib:${LD_LIBRARY_PATH}"; export LD_LIBRARY_PATH

        LD_LIBRARY_PATH="/root/libfabric/libfabric-1.15.1/src/.libs:${LD_LIBRARY_PATH}"; export LD_LIBRARY_PATH

        if [ -z "${LIBRARY_PATH}" ]

        then

#            LIBRARY_PATH="${I_MPI_ROOT}/${PLATFORM}/libfabric/lib"; export LIBRARY_PATH

            LIBRARY_PATH="/root/libfabric/libfabric-1.15.1/src/.libs"; export LIBRARY_PATH

        else

#            LIBRARY_PATH="${I_MPI_ROOT}/${PLATFORM}/libfabric/lib:${LIBRARY_PATH}"; export LIBRARY_PATH

            LIBRARY_PATH="/root/libfabric/libfabric-1.15.1/src/.libs:${LIBRARY_PATH}"; export LIBRARY_PATH

        fi

        FI_PROVIDER_PATH="${I_MPI_ROOT}/${PLATFORM}/libfabric/lib/prov:/usr/lib64/libfabric"; export FI_PROVIDER_PATH

        ;;

esac
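An alternative to editing vars.sh at all: Intel MPI documents I_MPI_OFI_LIBRARY_INTERNAL=0 for switching to an external libfabric. A sketch of what that could look like (the paths are examples from my build; the FI_PROVIDER_PATH line only applies to providers built as DL plugins):

```shell
# Sketch: point Intel MPI at an external libfabric without editing vars.sh.
# Paths below are example values; adjust to the actual install prefix.
export I_MPI_OFI_LIBRARY_INTERNAL=0   # tell vars.sh not to use the bundled libfabric
# source /opt/intel/oneapi/mpi/2021.4.0/env/vars.sh   # then source the env as usual
export LD_LIBRARY_PATH="/root/libfabric/libfabric-1.15.1/src/.libs${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# If providers were built as DL plugins (--enable-<prov>=dl), also point at
# their install location, e.g. <prefix>/lib/libfabric after 'make install':
# export FI_PROVIDER_PATH=/usr/local/lib/libfabric
```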

The output from the test case when using libfabric 1.15.1 is attached. The output is showing:

[0] MPI startup(): libfabric version: 1.15.1

 

libfabric:56808:1653261977::core:core:verify_filter_names():562<warn> provider mlx is unknown, misspelled or DL provider?

 

[perfcomp3:56808:0:56808] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))

==== backtrace (tid:  56808) ====

 0  /usr/lib64/libucs.so.0(ucs_handle_error+0xe4) [0x15548926e6d4]

 1  /usr/lib64/libucs.so.0(+0x21a4c) [0x15548926ea4c]

 2  /usr/lib64/libucs.so.0(+0x21c02) [0x15548926ec02]

 3  /lib64/libpthread.so.0(+0x132d0) [0x1555534962d0]

KarenD
Beginner

I found out that it's necessary to enable the providers when building libfabric. Following the directions found here:

https://github.com/ofiwg/libfabric

--enable-<provider>=[yes|no|auto|dl|<directory>]

Using ./configure --enable-verbs=dl builds the verbs provider, but the mlx provider does not exist (it isn't documented on the site listed above, and an error is received when attempting to use --enable-mlx=dl):

perfcomp3:~/libfabric/libfabric-1.15.1 # ./configure --enable-mlx=dl --enable-verbs=dl

configure: WARNING: unrecognized options: --enable-mlx

***

*** Built-in providers: opx dmabuf_peer_mem hook_hmem hook_debug perf rstream shm rxd mrail rxm tcp udp sockets psm2

*** DSO providers:      verbs

Where can I get a copy of libfabric 1.15 that has the mlx provider?

Thanks,

Karen

Rafael_L_Intel
Employee

Hi Karen!


That is correct: the mlx provider is available in our build of libfabric and is not present in the open-source repository. Can you please test with your original settings, i.e. I_MPI_OFI_PROVIDER=verbs, or unset?


Cheers!

Rafael


KarenD
Beginner

I was able to get the test case to work with the open source version of libfabric 1.15.1 by either

1) specifying the verbs provider with FI_PROVIDER=verbs

2) not explicitly specifying a provider (in which case the verbs provider was used).

Note that using I_MPI_OFI_PROVIDER=verbs didn't work with the open-source libfabric; it throws the error "MPIDI_OFI_mpi_init_hook:No data available".

So it appears that the initial problem of the test case hanging with the verbs provider is related to the use of libfabric 1.13 that ships with Intel MPI.

Note that even though the test case works with the verbs provider and libfabric 1.15.1, the performance is very poor compared to using the mlx provider with Intel MPI 2021.4 and libfabric 1.13.  For the latter combination, the test case with a message size of 1 MB had a latency of 325 us, while the same test case with the verbs provider and libfabric 1.15.1 had a latency of 3820 us, over a factor of 10 higher than with the mlx provider (see output below).

Does Intel have plans to update Intel MPI to use libfabric 1.15.1? Will the mlx provider be included?

Thanks,

Karen

Here is the output of the test case when using Intel MPI 2021.4 and the mlx provider. Note that latency is 325us.

perfcomp3:/var/tmp/paulo/mpi_tests # sh launch_mpi_isend_recv_v4_2ppn_BM.sh --repetitions 1 --size 1000000 2>&1 | tee launch_mpi_isend_recv_v4_2ppn_BM_2021.4.log

[0] MPI startup(): libfabric version: 1.13.0-impi

[0] MPI startup(): libfabric provider: mlx

[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.4.0

[0] MPI startup(): I_MPI_MPIRUN=mpirun

[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc

[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default

[0] MPI startup(): I_MPI_FABRICS=shm:ofi

[0] MPI startup(): I_MPI_OFI_PROVIDER=mlx

[0] MPI startup(): I_MPI_DEBUG=10

nrep = 1

sendcount = 1000000

recvcount = 1000000

wtime = 0.001

Bandwidth per receiver process = 3080.514 MB/s

MPI_Recv latency = 324.621 us

Here is the output of the test case when using Intel MPI 2021.4 with libfabric 1.15.1 and not explicitly specifying a provider, which results in the verbs provider being used. Note that latency is 3821us, an increase of over a factor of 10.

mpi_flags='-genv I_MPI_DEBUG=10'

perfcomp3:/var/tmp/paulo/mpi_tests # sh launch_mpi_isend_recv_v4_2ppn_BM_libfabric1.15.1.sh --repetitions 1 --size 1000000 | tee launch_mpi_isend_recv_v4_2ppn_BM_libfabric1.15.1_noprovider.log

[0] MPI startup(): libfabric version: 1.15.1

[0] MPI startup(): libfabric provider: verbs;ofi_rxm

[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.4.0

[0] MPI startup(): I_MPI_MPIRUN=mpirun

[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc

[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default

[0] MPI startup(): I_MPI_DEBUG=10

nrep = 1

sendcount = 1000000

recvcount = 1000000

wtime = 0.011

Bandwidth per receiver process = 261.718 MB/s

MPI_Recv latency = 3820.901 us

 

KarenD
Beginner

I had been running a single iteration of the mpi_isend_recv test case for debugging purposes, but recalled that running with more iterations (e.g. 100 iterations) significantly reduces the latency reported. With 100 iterations, using the verbs provider with libfabric 1.15.1 actually gives lower latency than the mlx provider with libfabric 1.13. In both cases Intel MPI 2021.4 was used, but it's not strictly an apples-to-apples comparison because different versions of libfabric are used. But I don't have a version of libfabric 1.15.1 with the mlx provider, and the test hangs with the verbs provider and libfabric 1.13.

Here are the results with the verbs provider and libfabric 1.15.1:

perfcomp3:/var/tmp/paulo/mpi_tests # sh launch_mpi_isend_recv_v4_2ppn_BM_libfabric1.15.1.sh --repetitions 100 --size 1000000 

[0] MPI startup(): Intel(R) MPI Library, Version 2021.4  Build 20210831 (id: 758087adf)

[0] MPI startup(): libfabric version: 1.15.1

[0] MPI startup(): libfabric provider: verbs;ofi_rxm

nrep = 100

sendcount = 1000000

recvcount = 1000000

wtime = 0.031

Bandwidth per receiver process = 9744.114 MB/s

MPI_Recv latency = 102.626 us

Here are the results with the mlx provider and libfabric 1.13.0:

perfcomp3:/var/tmp/paulo/mpi_tests # sh launch_mpi_isend_recv_v4_2ppn_BM.sh --repetitions 100 --size 1000000

Intel(R) MPI Library for Linux* OS, Version 2021.4 Build 20210831 (id: 758087adf)

[0] MPI startup(): libfabric version: 1.13.0-impi
[0] MPI startup(): libfabric provider: mlx

nrep = 100

sendcount = 1000000

recvcount = 1000000

wtime = 0.047

Bandwidth per receiver process = 6403.591 MB/s

MPI_Recv latency = 156.162 us

KarenD
Beginner

Good news: we noticed that Intel MPI 2021.6 is now available, and the mpi_isend_recv test case runs successfully with both the verbs and the mlx providers, with similar performance from each.  Interestingly, 2021.6 ships with the same version of libfabric as 2021.5.1 (libfabric 1.13.2rc1-impi), so it's not clear that libfabric was the cause of the segfaults with Intel MPI 2021.5.1.  Here's a summary of which combinations of Intel MPI, libfabric, and providers worked and which ones failed:

Intel MPI version   libfabric version   provider   result
2021.4.0            1.13.0-impi         mlx        runs
                                        verbs      hangs
2021.5.1            1.13.2rc1-impi      mlx        segfault
                                        verbs      segfault
2021.6.0            1.13.2rc1-impi      mlx        runs
                                        verbs      runs

 

With 100 iterations of the test, performance (latency) is similar for the mlx and verbs provider with Intel MPI 2021.6:

perfcomp3:/var/tmp/paulo/mpi_tests # sh launch_mpi_isend_recv_v4_2ppn_BM.sh --repetitions 100 --size 1000000

[0] MPI startup(): Intel(R) MPI Library, Version 2021.6  Build 20220227 (id: 28877f3f32)

[0] MPI startup(): libfabric version: 1.13.2rc1-impi

[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)

[0] MPI startup(): libfabric provider: mlx

nrep = 100

sendcount = 1000000

recvcount = 1000000

wtime = 0.039

Bandwidth per receiver process = 7623.593 MB/s

MPI_Recv latency = 131.172 us

perfcomp3:/var/tmp/paulo/mpi_tests # sh launch_mpi_isend_recv_v4_2ppn_BM.sh --repetitions 100 --size 1000000

Intel(R) MPI Library for Linux* OS, Version 2021.6 Build 20220227 (id: 28877f3f32)

[0] MPI startup(): libfabric version: 1.13.2rc1-impi

[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)

[0] MPI startup(): libfabric provider: verbs;ofi_rxm

nrep = 100
sendcount = 1000000
recvcount = 1000000
wtime = 0.040
Bandwidth per receiver process = 7443.520 MB/s
MPI_Recv latency = 134.345 us

Rafael_L_Intel
Employee

Hello Karen,


First, I would like to apologize for the delay in replying. Second, that sounds like great news! I will forward the results of your tests to the engineering team.


If I understood it correctly, the issue has been resolved. Is that correct?


Cheers!

Rafael

