
MPI_IRECV sporadically hangs for Intel 2019/2020 but not Intel 2018

We are experiencing strange behavior where MPI_IRECV calls sometimes hang with Intel 2019 and Intel 2020 but not with Intel 2018.  The issue seems to be related to the fabric: with Intel 2018 we could use the DAPL or OFA fabrics, but these were removed in Intel 2019/2020, which require OFI.

I've attached a small test case that exhibits the problem on our Linux cluster.  The test case is for 2 MPI processes.  The issue only occurs if the 2 MPI processes are on two distinct physical nodes in the cluster.  If you assign the 2 MPI processes to a single physical node, the hang does not occur.  The run.sh script drives the test cases, and you can select different Intel versions.  I've attached the output we see on our cluster in the screen*.txt files for the different Intel versions.

We've scoured the code and it appears to be correct.  Our production code runs flawlessly with Intel 2018 over a wide range of problems and numbers of MPI processes/cluster nodes, but quite a few of these problems hang with Intel 2019/2020.

We know that Intel MPI 2019 had a lot of changes from 2018, so we are wondering if there is some default setting that changed, e.g., MPI buffer sizes, that might be the cause of the problem. 

Thanks,

John


John,

I am seeing similar problems in the 2020 MPI libraries when executing on multiple physical nodes (but not on a single physical node).  I have not been able to find a solution.

Rob


Has someone from Intel been able to take a look at this yet?  This is currently a showstopper bug for our code.  We have confirmed it on our cluster as well as on one of our customers' clusters (with Intel 2020).

Thanks,

John


Hi John,


We have tested your code and ran it several times in our environment with both 2019 and 2020 versions.

The only change I made was switching the provider from verbs to mlx; since the 2019 Update 5 release, mlx has been recommended over verbs on InfiniBand.

To change the provider, set FI_PROVIDER=mlx; if you don't set a value, the latest versions select mlx automatically.
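For reference, provider selection is done through environment variables before launching; a minimal sketch (the variables below are standard Intel MPI/libfabric settings, and the debug level is only an optional aid):

```shell
# Select the mlx libfabric provider explicitly (recent Intel MPI
# versions pick it automatically on InfiniBand systems).
export FI_PROVIDER=mlx

# Optional: print fabric/provider details in the startup banner so
# the chosen provider can be confirmed in the job output.
export I_MPI_DEBUG=5
```

With I_MPI_DEBUG set, the startup banner reports which libfabric provider was actually selected.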

I ran it over fifty times and found no hang using the command below:

for run in {1..50}; do mpirun -env I_MPI_PIN_DOMAIN auto -env I_MPI_FABRICS=shm:ofi -f hosts -n 2 -ppn 1 ./a.out ;done


Please check with mlx in the meantime; we will get back to you after further investigation.


Regards

Prasanth



Prasanth,


1. Thank you for looking at this.  I just want to confirm whether you ran the two MPI processes on a single physical node or on two distinct physical nodes (one process per node).  The hang does not occur for us if you run the two MPI processes on a single physical node; it requires two distinct nodes (which I guess rules out shared-memory communication).

2. I tried setting FI_PROVIDER=mlx, but the code crashes at startup.  The output I see is below (also attached).  Does this indicate some issue with our cluster setup?

John

[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20200923 (id: abd58e492)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1-impi
libfabric:107610:core:mr:ofi_default_cache_size():56<info> default cache size=2109042048
libfabric:22236:core:mr:ofi_default_cache_size():56<info> default cache size=2109042048
libfabric:107610:core:core:ofi_register_provider():418<info> registering provider: sockets (110.10)
libfabric:107610:core:core:ofi_register_provider():446<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:22236:core:core:ofi_register_provider():418<info> registering provider: sockets (110.10)
libfabric:22236:core:core:ofi_register_provider():446<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:107610:core:core:ofi_reg_dl_prov():578<warn> dlopen(/opt/ohpc/pub/intel/compilers_and_libraries_2020.4.304/linux/mpi/intel64/libfabric/lib/prov/libpsmx2-fi.so): libpsm2.so.2: cannot open shared object file: No such file or directory
libfabric:22236:core:core:ofi_reg_dl_prov():578<warn> dlopen(/opt/ohpc/pub/intel/compilers_and_libraries_2020.4.304/linux/mpi/intel64/libfabric/lib/prov/libpsmx2-fi.so): libpsm2.so.2: cannot open shared object file: No such file or directory
libfabric:107610:core:core:ofi_register_provider():418<info> registering provider: ofi_rxm (110.10)
libfabric:22236:core:core:ofi_register_provider():418<info> registering provider: ofi_rxm (110.10)
libfabric:107610:core:core:ofi_reg_dl_prov():578<warn> dlopen(/opt/ohpc/pub/intel/compilers_and_libraries_2020.4.304/linux/mpi/intel64/libfabric/lib/prov/libefa-fi.so): libefa.so.1: cannot open shared object file: No such file or directory
libfabric:22236:core:core:ofi_reg_dl_prov():578<warn> dlopen(/opt/ohpc/pub/intel/compilers_and_libraries_2020.4.304/linux/mpi/intel64/libfabric/lib/prov/libefa-fi.so): libefa.so.1: cannot open shared object file: No such file or directory
libfabric:107610:core:core:ofi_register_provider():418<info> registering provider: tcp (110.10)
libfabric:107610:core:core:ofi_register_provider():446<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:22236:core:core:ofi_register_provider():418<info> registering provider: tcp (110.10)
libfabric:22236:core:core:ofi_register_provider():446<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:107610:core:core:ofi_register_provider():418<info> registering provider: shm (110.10)
libfabric:107610:core:core:ofi_register_provider():446<info> "shm" filtered by provider include/exclude list, skipping
libfabric:22236:core:core:ofi_register_provider():418<info> registering provider: shm (110.10)
libfabric:22236:core:core:ofi_register_provider():446<info> "shm" filtered by provider include/exclude list, skipping
libfabric:107610:core:mr:ofi_default_cache_size():56<info> default cache size=2109042048
libfabric:107610:verbs:fabric:verbs_devs_print():869<info> list of verbs devices found for FI_EP_MSG:
libfabric:107610:verbs:fabric:verbs_devs_print():873<info> #1 mlx4_0 - IPoIB addresses:
libfabric:107610:verbs:fabric:verbs_devs_print():883<info> 10.30.18.103
libfabric:107610:verbs:fabric:verbs_devs_print():883<info> fe80::202:c903:14:de71
libfabric:107610:verbs:fabric:vrb_get_device_attrs():615<info> device mlx4_0: first found active port is 1
libfabric:107610:verbs:fabric:vrb_get_device_attrs():615<info> device mlx4_0: first found active port is 1
libfabric:107610:verbs:fabric:vrb_get_device_attrs():615<info> device mlx4_0: first found active port is 1
libfabric:107610:core:core:ofi_register_provider():418<info> registering provider: verbs (110.10)
libfabric:107610:core:core:ofi_register_provider():446<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:107610:core:core:ofi_register_provider():418<info> registering provider: mlx (1.4)
libfabric:107610:core:core:ofi_register_provider():418<info> registering provider: ofi_hook_noop (110.10)
libfabric:107610:core:core:fi_getinfo_():1092<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:107610:mlx:core:mlx_getinfo():172<info> used inject size = 1024
libfabric:107610:mlx:core:mlx_getinfo():219<info> Loaded MLX version 1.6.0
libfabric:107610:mlx:core:mlx_getinfo():266<warn> MLX: spawn support 0
libfabric:107610:core:core:fi_getinfo_():1092<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:107610:mlx:core:mlx_getinfo():172<info> used inject size = 1024
libfabric:107610:mlx:core:mlx_getinfo():219<info> Loaded MLX version 1.6.0
libfabric:107610:mlx:core:mlx_getinfo():266<warn> MLX: spawn support 0
[0] MPI startup(): libfabric provider: mlx
libfabric:107610:mlx:core:mlx_fabric_open():172<info>
libfabric:107610:core:core:fi_fabric_():1372<info> Opened fabric: mlx
libfabric:107610:mlx:core:ofi_check_rx_attr():782<info> Tx only caps ignored in Rx caps
libfabric:107610:mlx:core:ofi_check_tx_attr():880<info> Rx only caps ignored in Tx caps
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 1, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
libfabric:107610:mlx:core:ofi_check_rx_attr():782<info> Tx only caps ignored in Rx caps
libfabric:107610:mlx:core:ofi_check_tx_attr():880<info> Rx only caps ignored in Tx caps
[0] MPI startup(): addrnamelen: 1024
libfabric:107610:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [127]...
libfabric:22236:core:mr:ofi_default_cache_size():56<info> default cache size=2109042048
libfabric:22236:verbs:fabric:verbs_devs_print():869<info> list of verbs devices found for FI_EP_MSG:
libfabric:22236:verbs:fabric:verbs_devs_print():873<info> #1 mlx4_0 - IPoIB addresses:
libfabric:22236:verbs:fabric:verbs_devs_print():883<info> 10.30.18.104
libfabric:22236:verbs:fabric:verbs_devs_print():883<info> fe80::202:c903:14:ddf1
libfabric:22236:verbs:fabric:vrb_get_device_attrs():615<info> device mlx4_0: first found active port is 1
libfabric:22236:verbs:fabric:vrb_get_device_attrs():615<info> device mlx4_0: first found active port is 1
libfabric:22236:verbs:fabric:vrb_get_device_attrs():615<info> device mlx4_0: first found active port is 1
libfabric:22236:core:core:ofi_register_provider():418<info> registering provider: verbs (110.10)
libfabric:22236:core:core:ofi_register_provider():446<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:22236:core:core:ofi_register_provider():418<info> registering provider: mlx (1.4)
libfabric:22236:core:core:ofi_register_provider():418<info> registering provider: ofi_hook_noop (110.10)
libfabric:22236:core:core:fi_getinfo_():1092<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:22236:mlx:core:mlx_getinfo():172<info> used inject size = 1024
libfabric:22236:mlx:core:mlx_getinfo():219<info> Loaded MLX version 1.6.0
libfabric:22236:mlx:core:mlx_getinfo():266<warn> MLX: spawn support 0
libfabric:22236:core:core:fi_getinfo_():1092<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:22236:mlx:core:mlx_getinfo():172<info> used inject size = 1024
libfabric:22236:mlx:core:mlx_getinfo():219<info> Loaded MLX version 1.6.0
libfabric:22236:mlx:core:mlx_getinfo():266<warn> MLX: spawn support 0
libfabric:22236:mlx:core:mlx_fabric_open():172<info>
libfabric:22236:core:core:fi_fabric_():1372<info> Opened fabric: mlx
libfabric:22236:mlx:core:ofi_check_rx_attr():782<info> Tx only caps ignored in Rx caps
libfabric:22236:mlx:core:ofi_check_tx_attr():880<info> Rx only caps ignored in Tx caps
libfabric:22236:mlx:core:ofi_check_rx_attr():782<info> Tx only caps ignored in Rx caps
libfabric:22236:mlx:core:ofi_check_tx_attr():880<info> Rx only caps ignored in Tx caps
libfabric:22236:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [127]...
libfabric:22236:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x7f2000132a00
[1605269643.201248] [cnode003:22236:0] select.c:410 UCX ERROR no active messages transport to <no debug data>: mm/posix - Destination is unreachable, mm/sysv - Destination is unreachable, self/self - Destination is unreachable
Abort(1091215) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(1149)..............:
MPIDI_OFI_mpi_init_hook(1657): OFI get address vector map failed
[1605269643.201281] [cnode002:107610:0] select.c:410 UCX ERROR no active messages transport to <no debug data>: mm/posix - Destination is unreachable, mm/sysv - Destination is unreachable, self/self - Destination is unreachable
libfabric:107610:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x7f200002cb80
libfabric:107610:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:107610:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x7f200002cb80
Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(1149)..............:
MPIDI_OFI_mpi_init_hook(1657): OFI get address vector map failed


Also, here is the output of ucx_info, since the error mentions active message transports.  We do not have all the transport methods that I see in some other related posts.

~/scratch/>ucx_info -d | grep Transport
7:# Transport: mm
43:# Transport: mm
79:# Transport: self
113:# Transport: tcp


Hi John,


Your system does not have all the transports required to use mlx.  This might be due to a driver misconfiguration, missing libraries, or other fabric software problems.

Could you please check your UCX configuration or contact your system administrator about installing the required transports?
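As a sketch of what to look for: on an InfiniBand node with a working UCX installation, the transport list would be expected to include RDMA transports such as rc and ud (or dc), not only the shared-memory and TCP entries shown earlier in this thread.  Exact output varies by system:

```shell
# List the transports UCX can use on this node.  On a healthy
# InfiniBand setup this should include rc/ud (or dc) entries in
# addition to mm, self, and tcp.
ucx_info -d | grep Transport
```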

For more information about the required transports, please refer to: https://software.intel.com/content/www/us/en/develop/articles/improve-performance-and-stability-with...


Regards

Prasanth



Prasanth,

Thank you for your help.  Our cluster administrator installed the UCX library (v1.9.0) and enabled the compile-time InfiniBand features.  I can now use FI_PROVIDER=mlx, and the hang in the test case seems to be resolved.  I will verify shortly that the issue is also resolved in our production code.
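For anyone following along, one way to confirm which provider actually got picked up after a change like this is the startup banner with debugging enabled; a sketch, assuming the same two-node hostfile as in the earlier mpirun command:

```shell
# With I_MPI_DEBUG >= 1 the startup banner reports the libfabric
# provider in use, e.g. "[0] MPI startup(): libfabric provider: mlx".
export FI_PROVIDER=mlx
export I_MPI_DEBUG=5
mpirun -f hosts -n 2 -ppn 1 ./a.out 2>&1 | grep "libfabric provider"
```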

We have seen this issue on two separate clusters.  Maybe the Intel documentation should be updated to clarify (or emphasize) that these libraries need to be installed separately for proper operation.  It may not be clear to all cluster administrators that this is important, since you can get programs to run without these libraries, but sub-optimally and with (apparently) sporadic run-time issues.

Thanks again.  You've been a great help in resolving this issue.

John


Hi John,


These transport requirements relate more to the hardware than to Intel MPI, but I will forward your suggestion about the documentation to the internal team.

Have you verified the fix in your production code?  If so, let us know the results so we can proceed.


Regards

Prasanth



Hi John,


We haven't heard back from you.

Please confirm whether your problem is resolved or not.


Regards

Prasanth




Prasanth,

Yes, our production code runs fine now after installing the UCX transports. Here is a summary of what we did to avoid the discussed MPI hangs.


Intel MPI 2018:

Pass "-env I_MPI_FABRICS shm:dapl" to mpirun. This is all we needed to do, and we never observed any hangs.

Intel MPI 2019:

We were unable to use the 'mlx' provider with Intel 2019; I don't know if this is due to our cluster or to Intel MPI 2019 itself. The hang always occurs unless we choose the 'sockets' provider:

load the UCX 1.9.0 module

export UCX_TLS=rc,ud,sm,self # This doesn't seem to be necessary anymore.
export FI_PROVIDER=sockets # This IS necessary on our cluster

and pass "-env I_MPI_FABRICS shm:ofi" to mpirun

Intel MPI 2020:

load the UCX 1.9.0 module

export UCX_TLS=rc,ud,sm,self # This doesn't seem to be necessary anymore, but doesn't cause any issues.
export FI_PROVIDER=mlx # This also no longer seems to be necessary, but doesn't cause any issues.

and pass "-env I_MPI_FABRICS shm:ofi" to mpirun


So, the primary issue seems to have been not having the UCX library installed. Our cluster admin built the UCX module with


./contrib/configure-release --with-rc \
--with-ud \
--with-dc \
--with-mlx5-dv \
--with-ib-hw-tm \
--with-dm \
--with-cm \
--prefix=$INSTALL_LOCATION
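Putting the 2019/2020 steps above together, our launch sequence can be sketched as a small wrapper script (the module name and binary are placeholders for our site; adjust the FI_PROVIDER line per the version notes above):

```shell
#!/bin/sh
# Site-specific sketch: load the locally built UCX, apply the fabric
# settings described above, then launch two ranks on two nodes.
module load ucx/1.9.0            # placeholder module name

export UCX_TLS=rc,ud,sm,self     # optional on recent versions
export FI_PROVIDER=mlx           # 'sockets' on Intel MPI 2019 in our case

mpirun -env I_MPI_FABRICS shm:ofi -f hosts -n 2 -ppn 1 ./a.out
```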

 

Best,

John


Hi John,


Thanks for providing the steps you have followed.

It has been mentioned in the release notes that the minimum required UCX version is 1.5+ (Intel® MPI Library Release Notes for Linux* OS).

Since your issue has been resolved, we are closing this thread. Please raise a new thread for any further assistance from Intel.


Regards

Prasanth

