Hello,
We're getting errors with our hybrid MPI-OpenMP code on an Intel Cluster when using anything other than TCP.
Abort(1091215) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(136).......:
MPID_Init(904)..............:
MPIDI_OFI_mpi_init_hook(986): OFI addrinfo() failed (ofi_init.c:986:MPIDI_OFI_mpi_init_hook:No data available)
We are using Intel Fortran 2019 (update 1) and Intel MPI 2019 (update 7) to compile.
We set I_MPI_OFI_PROVIDER to OFI and pass I_MPI_OFI_INTERNAL=1 to mpivars.sh. At this time only FI_PROVIDER=tcp seems to work correctly.
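For reference, a minimal sketch of the environment setup we are describing (the install path and process counts are placeholders; only the FI_PROVIDER=tcp line currently works for us):
source /opt/intel/impi/2019.7/intel64/bin/mpivars.sh  # placeholder install path
export I_MPI_OFI_LIBRARY_INTERNAL=1                   # use the libfabric shipped with Intel MPI
export FI_PROVIDER=tcp                                # the only provider that works so far
export I_MPI_DEBUG=10                                 # print fabric/provider selection at startup
mpirun -n 96 -ppn 48 ./executable                     # placeholder process counts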
We have also tried the transport options that ucx_info reports, using the following command:
ucx_info -d | grep Transport
transport options: tcp, rc, rc_mlx5, dc, dc_mlx5, ud, ud_mlx5, cm, cuda, mm, cma, knem, self
We've set UCX_TLS to all of those and again only tcp appears to work correctly.
ucx_info -v
# UCT version=1.3.0 revision 18cee9d
We saw on an Intel webpage (https://www.intel.com/content/www/us/en/developer/articles/technical/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html) that UCX v1.4 or higher is required. Has anyone seen this before? Can someone confirm that this upgrade is needed (we are on UCX 1.3) and that it is the cause of the errors we are getting?
Thank you.
Hi,
Thank you for posting in Intel Communities.
>>"We set I_MPI_OFI_PROVIDER to OFI and pass I_MPI_OFI_INTERNAL=1 to mpivars.sh."
You can specify the OFI provider with the IMPI environment variable:
export I_MPI_OFI_PROVIDER=<name>
Where <name> is the OFI provider to load.
Since your ucx_info output shows Mellanox (mlx5) transports, you can use "mlx" as the I_MPI_OFI_PROVIDER.
So, could you please use the command below before running your hybrid MPI-OpenMP code and let us know if it works?
export I_MPI_OFI_PROVIDER=mlx
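For example, a minimal sketch of a run that verifies which provider was actually selected (process counts and the executable name are placeholders):
export I_MPI_OFI_PROVIDER=mlx
export I_MPI_DEBUG=10                 # prints the selected libfabric provider at startup
mpirun -n 96 -ppn 48 ./executable
# In the startup output, look for a line such as:
#   [0] MPI startup(): libfabric provider: mlx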
If you still face any issues, we recommend using the Mellanox UCX* Framework v1.4 or higher.
If this resolves your issue, please accept it as a solution; this helps others with a similar issue. Thank you!
Best Regards,
Santosh
Thank you for the response. We tried the export I_MPI_OFI_PROVIDER=mlx setting, but it didn't work. I'll paste some output from the debug log in case it helps.
[0] MPI startup(): I_MPI_FABRICS=shm, but multi node launch is detected. Fallback to shm:ofi fabric.
[0] MPI startup(): libfabric version: 1.10.0a1-impi
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
[0] MPI startup(): detected verbs;ofi_rxm provider, set device name to "verbs-ofi-rxm"
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 1, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrname_len: 16, addrname_firstlen: 16
[0] MPI startup(): selected platform: clx
[0] MPI startup(): I_MPI_OFI_LIBRARY_INTERNAL=1
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_RMK=lsf
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_PIN_DOMAIN=socket
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_FABRICS=shm
[0] MPI startup(): I_MPI_OFI_PROVIDER=verbs
Setting FI_SOCKETS_IFACE=ib0 makes it use the IB network (when FI_PROVIDER=verbs), but we don't see the expected scaling. From your webpage (https://www.intel.com/content/www/us/en/developer/articles/technical/mpi-library-2019-over-libfabric.html) it sounds like FI_SOCKETS_IFACE is meant for Windows and will use TCP. Is that correct?
Is there anything else we can try besides updating the UCX framework to v1.4 or greater?
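For completeness, a sketch of what we are currently running (process counts and the executable name are placeholders; whether these are the right knobs is exactly what we are asking):
# Uses the IB network, but scales poorly:
export FI_PROVIDER=verbs
export FI_SOCKETS_IFACE=ib0
export I_MPI_DEBUG=10
mpirun -n 96 -ppn 48 ./executable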
Thank you.
Hi,
>>" it sounds like FI_SOCKETS_IFACE is meant for Windows and will use TCP. Is that correct?"
Yes. The sockets provider is a general-purpose provider, mainly intended for Windows use of the Intel MPI Library; it can be used on any system that supports TCP sockets and implements the full set of libfabric provider requirements and interfaces.
Could you please provide us with the Operating system & CPU details?
Also, please provide the output of the below command:
fi_info -l
Thanks & Regards,
Santosh
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)
Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz (2 CPUs per node)
# fi_info -l
ofi_rxm:
version: 1.0
verbs:
version: 1.0
tcp:
version: 1.0
sockets:
version: 2.0
ofi_hook_noop:
version: 1.0
Hi,
Thank you for providing the details.
Please try the other fabric providers listed in the output of the "fi_info -l" command.
Example:
I_MPI_OFI_PROVIDER=<name> I_MPI_DEBUG=10 mpirun -n <total-num-of-processes> -ppn <processes-per-node> ./executable
where <name> is any one of the following OFI providers, as listed in the "fi_info -l" output:
- verbs
- tcp
- sockets
Could you please try the above command and let us know if it works? If you still face the same error, please provide us with the complete debug log.
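A minimal shell sketch that cycles through the listed providers (process counts, the executable name, and log file names are placeholders):
for prov in verbs tcp sockets; do
    echo "=== Trying OFI provider: $prov ==="
    I_MPI_OFI_PROVIDER=$prov I_MPI_DEBUG=10 \
        mpirun -n 96 -ppn 48 ./executable 2>&1 | tee run_$prov.log
done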
Note: Since "mlx" is not listed in "fi_info -l" output, we can't use mlx as OFI provider. Please recheck your UCX configuration using one of the following:
ibv_devinfo
lspci | grep Mellanox
Thanks & Regards,
Santosh
None of the OFI providers listed in fi_info work except for I_MPI_OFI_PROVIDER=tcp.
The following combination of settings also appears to work:
export FI_PROVIDER=verbs
export FI_SOCKETS_IFACE=ib0
However, none of these appear to use InfiniBand correctly, and the code runs much slower than expected.
Here's the output for Mellanox:
Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
hca_id: mlx5_1
    transport:          InfiniBand (0)
    fw_ver:             16.26.1040
    node_guid:          0c42:a103:0091:38bb
    sys_image_guid:     0c42:a103:0091:38ba
    vendor_id:          0x02c9
    vendor_part_id:     4119
    hw_ver:             0x0
    board_id:           MT_0000000008
    phys_port_cnt:      1
    Device ports:
        port:   1
            state:      PORT_DOWN (1)
            max_mtu:    4096 (5)
            active_mtu: 4096 (5)
            sm_lid:     0
            port_lid:   65535
            port_lmc:   0x00
            link_layer: InfiniBand
hca_id: mlx5_0
    transport:          InfiniBand (0)
    fw_ver:             16.26.1040
    node_guid:          0c42:a103:0091:38ba
    sys_image_guid:     0c42:a103:0091:38ba
    vendor_id:          0x02c9
    vendor_part_id:     4119
    hw_ver:             0x0
    board_id:           MT_0000000008
    phys_port_cnt:      1
    Device ports:
        port:   1
            state:      PORT_ACTIVE (4)
            max_mtu:    4096 (5)
            active_mtu: 4096 (5)
            sm_lid:     2
            port_lid:   364
            port_lmc:   0x00
            link_layer: InfiniBand
Hi,
Since the Mellanox UCX Framework v1.4 or higher is recommended, could you please try running your MPI applications with UCX v1.4 or later?
Thanks & Regards,
Santosh
Hello,
I have a very similar issue with Intel MPI 2021.6 with all code built with oneAPI 2022.2. My script:
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export MKL_DYNAMIC=FALSE
export UCX_TLS=sm,rc_mlx5,dc_mlx5,ud_mlx5,self
export LD_PRELOAD=$I_MPI_ROOT/lib/libmpi_shm_heap_proxy.so
export I_MPI_HYDRA_BOOTSTRAP=lsf
export I_MPI_HYDRA_RMK=lsf
export I_MPI_HYDRA_TOPOLIB=hwloc
export I_MPI_HYDRA_IFACE=ib0
export I_MPI_PLATFORM=clx-ap
export I_MPI_EXTRA_FILESYSTEM=1
export I_MPI_EXTRA_FILESYSTEM_FORCE=gpfs
export I_MPI_FABRICS=shm:ofi
export I_MPI_SHM=clx-ap
export I_MPI_SHM_HEAP=1
export I_MPI_OFI_PROVIDER=mlx
export I_MPI_PIN_CELL=core
export I_MPI_DEBUG=6
mpirun -n 96 ./executable
The output:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.6 Build 20220227 (id: 28877f3f32)
[0] MPI startup(): Copyright (C) 2003-2022 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
impi_shm_mbind_local(): mbind(p=0x14ad3ea72000, size=4294967296) error=1 "Operation not permitted"
//SNIP//
impi_shm_mbind_local(): mbind(p=0x1458ca7f7000, size=4294967296) error=1 "Operation not permitted"
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
Abort(2139535) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(178)........:
MPID_Init(1532)..............:
MPIDI_OFI_mpi_init_hook(1512):
open_fabric(2566)............:
find_provider(2684)..........: OFI fi_getinfo() failed (ofi_init.c:2684:find_provider:No data available)
Abort(2139535) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(178)........:
MPID_Init(1532)..............:
MPIDI_OFI_mpi_init_hook(1512):
open_fabric(2566)............:
find_provider(2684)..........: OFI fi_getinfo() failed (ofi_init.c:2684:find_provider:No data available)
I do have Mellanox UCX Framework v1.8 installed and it is recognized:
[dipasqua@ec-hub1-sc1 ~]$ ucx_info -v
# UCT version=1.8.0 revision
# configured with: --prefix=/apps/rocs/2020.08/cascadelake/software/UCX/1.8.0-GCCcore-9.3.0 --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --with-rdmacm=/apps/rocs/2020.08/prefix/usr --with-verbs=/apps/rocs/2020.08/prefix/usr --with-knem=/apps/rocs/2020.08/prefix/usr --enable-optimizations --enable-cma --enable-mt --without-java --disable-doxygen-doc
[dipasqua@ec-hub1-sc1 ~]$ fi_info -l
psm2:
version: 113.20
mlx:
version: 1.4
psm3:
version: 1102.0
ofi_rxm:
version: 113.20
verbs:
version: 113.20
tcp:
version: 113.20
sockets:
version: 113.20
shm:
version: 114.0
ofi_hook_noop:
version: 113.20
[dipasqua@ec-hub1-sc1 ~]$ ucx_info -d | grep Transport
# Transport: posix
# Transport: sysv
# Transport: self
# Transport: tcp
# Transport: tcp
# Transport: rc_verbs
# Transport: rc_mlx5
# Transport: dc_mlx5
# Transport: ud_verbs
# Transport: ud_mlx5
# Transport: rc_verbs
# Transport: rc_mlx5
# Transport: ud_verbs
# Transport: ud_mlx5
# Transport: rc_verbs
# Transport: rc_mlx5
# Transport: dc_mlx5
# Transport: ud_verbs
# Transport: ud_mlx5
# Transport: cma
# Transport: knem
However, everything works just fine with oneAPI 2022.1 (Intel MPI 2021.5), with all the same settings. Any ideas, or do we have a bug?
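If it helps narrow things down, here is a sketch of the checks I can run to compare the two environments (these are standard libfabric/Intel MPI knobs, but treat the exact values as my assumptions):
# Is the mlx provider visible to the libfabric that Intel MPI loads?
fi_info -p mlx
# Which libfabric is being used (internal vs. external)?
echo "I_MPI_OFI_LIBRARY_INTERNAL=$I_MPI_OFI_LIBRARY_INTERNAL  FI_PROVIDER_PATH=$FI_PROVIDER_PATH"
# Re-run with maximum provider diagnostics:
I_MPI_DEBUG=10 FI_LOG_LEVEL=debug mpirun -n 96 ./executable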
Regards,
Antonio
Hi @agg23,
We haven't heard back from you. Could you please provide us an update on your issue?
Hi @Antonio_D,
Could you please post a new thread for your query at the link given below? We will be glad to help you there.
Intel® oneAPI HPC Toolkit community URL: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/bd-p/oneapi-hpc-toolkit
Thanks & Regards,
Santosh
We have asked the user to update the UCX version, but we are not sure whether that will happen at this time. Thanks for your help, Santosh.
Hi,
Thanks for the update on your issue. Can we go ahead and close this issue for the time being?
Thanks & Regards,
Santosh
Hi,
We are closing this issue. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Thanks & Regards,
Santosh