Beginner

MPI_IRECV sporadically hangs for Intel 2019/2020 but not Intel 2018

We are experiencing strange behavior where MPI_IRECV calls sometimes hang with Intel 2019 and Intel 2020 but not with Intel 2018.  The issue seems to be related to the fabric: with Intel 2018 we could use the DAPL or OFA fabrics, but in Intel 2019/2020 those fabrics were removed and OFI must be used.

I've attached a small test case that exhibits the problem on our Linux cluster.  The test case is for 2 MPI processes.  The issue only occurs if the 2 MPI processes are on two distinct physical nodes in the cluster.  If you assign the 2 MPI processes to a single physical node, the hang does not occur.  The run.sh script drives the test cases, and you can select different Intel versions.  I've attached the output we see on our cluster in the screen*.txt files for the different Intel versions.

We've scoured the code and it seems to be correct.  Our production code runs flawlessly with Intel 2018 over a wide range of problems and numbers of MPI processes/cluster nodes, but quite a few of these problems hang with Intel 2019/2020.

We know that Intel MPI 2019 had a lot of changes from 2018, so we are wondering if there is some default setting that changed, e.g., MPI buffer sizes, that might be the cause of the problem. 
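One way to compare the two versions' defaults is to ask the runtime to report its fabric/provider decisions at startup. A minimal sketch, using standard Intel MPI and libfabric environment variables (the exact output format varies by release; the host file and binary name below are the ones from this thread's test case):

```shell
# Surface the fabric/provider defaults that changed between
# Intel MPI 2018 and 2019/2020.

# Print provider/fabric selection and process pinning at startup:
export I_MPI_DEBUG=5

# Make libfabric log which providers it registers, filters, and picks:
export FI_LOG_LEVEL=debug

# Or pin the provider explicitly instead of relying on auto-selection:
export FI_PROVIDER=verbs   # e.g. verbs, mlx, tcp

mpirun -f hosts -n 2 -ppn 1 ./a.out
```

Diffing this startup output between an Intel 2018 run and an Intel 2019/2020 run should show which provider each version actually selects.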

Thanks,

John

8 Replies

John,

I am seeing similar problems in the 2020 MPI libraries when executing on multiple physical nodes (but not on a single physical node).  I have not been able to find a solution.

Rob

Beginner

Has someone from Intel been able to take a look at this yet?  This is currently a showstopper bug for our code.  We have confirmed it on our cluster as well as on one of our customers' clusters (with Intel 2020).

Thanks,

John

Moderator

Hi John,


We have tested your code and ran it several times in our environment with both the 2019 and 2020 versions.

The only change I made was switching the provider from verbs to mlx. Since version 2019 Update 5, mlx has been recommended over verbs on InfiniBand.

To change the provider, set FI_PROVIDER=mlx; if you don't set a value, recent versions will select mlx automatically.

I ran it over fifty times and saw no hang using the command below:

for run in {1..50}; do mpirun -env I_MPI_PIN_DOMAIN auto -env I_MPI_FABRICS=shm:ofi -f hosts -n 2 -ppn 1 ./a.out ;done


Please check with mlx in the meantime; we will get back to you after further investigation.


Regards

Prasanth


Beginner

Prasanth,


1. Thank you for looking at this.  I just want to confirm whether you ran the two MPI processes on a single physical node or on two distinct physical nodes (one process per node).  The hang does not occur for us if we run the two MPI processes on a single physical node; it requires two distinct nodes (which I guess rules out shared-memory communication).

2. I tried setting FI_PROVIDER=mlx, but the code crashes at startup.  The output I see is below (also attached).  Does this indicate some issue with our cluster setup?

John

[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20200923 (id: abd58e492)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1-impi
libfabric:107610:core:mr:ofi_default_cache_size():56<info> default cache size=2109042048
libfabric:22236:core:mr:ofi_default_cache_size():56<info> default cache size=2109042048
libfabric:107610:core:core:ofi_register_provider():418<info> registering provider: sockets (110.10)
libfabric:107610:core:core:ofi_register_provider():446<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:22236:core:core:ofi_register_provider():418<info> registering provider: sockets (110.10)
libfabric:22236:core:core:ofi_register_provider():446<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:107610:core:core:ofi_reg_dl_prov():578<warn> dlopen(/opt/ohpc/pub/intel/compilers_and_libraries_2020.4.304/linux/mpi/intel64/libfabric/lib/prov/libpsmx2-fi.so): libpsm2.so.2: cannot open shared object file: No such file or directory
libfabric:22236:core:core:ofi_reg_dl_prov():578<warn> dlopen(/opt/ohpc/pub/intel/compilers_and_libraries_2020.4.304/linux/mpi/intel64/libfabric/lib/prov/libpsmx2-fi.so): libpsm2.so.2: cannot open shared object file: No such file or directory
libfabric:107610:core:core:ofi_register_provider():418<info> registering provider: ofi_rxm (110.10)
libfabric:22236:core:core:ofi_register_provider():418<info> registering provider: ofi_rxm (110.10)
libfabric:107610:core:core:ofi_reg_dl_prov():578<warn> dlopen(/opt/ohpc/pub/intel/compilers_and_libraries_2020.4.304/linux/mpi/intel64/libfabric/lib/prov/libefa-fi.so): libefa.so.1: cannot open shared object file: No such file or directory
libfabric:22236:core:core:ofi_reg_dl_prov():578<warn> dlopen(/opt/ohpc/pub/intel/compilers_and_libraries_2020.4.304/linux/mpi/intel64/libfabric/lib/prov/libefa-fi.so): libefa.so.1: cannot open shared object file: No such file or directory
libfabric:107610:core:core:ofi_register_provider():418<info> registering provider: tcp (110.10)
libfabric:107610:core:core:ofi_register_provider():446<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:22236:core:core:ofi_register_provider():418<info> registering provider: tcp (110.10)
libfabric:22236:core:core:ofi_register_provider():446<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:107610:core:core:ofi_register_provider():418<info> registering provider: shm (110.10)
libfabric:107610:core:core:ofi_register_provider():446<info> "shm" filtered by provider include/exclude list, skipping
libfabric:22236:core:core:ofi_register_provider():418<info> registering provider: shm (110.10)
libfabric:22236:core:core:ofi_register_provider():446<info> "shm" filtered by provider include/exclude list, skipping
libfabric:107610:core:mr:ofi_default_cache_size():56<info> default cache size=2109042048
libfabric:107610:verbs:fabric:verbs_devs_print():869<info> list of verbs devices found for FI_EP_MSG:
libfabric:107610:verbs:fabric:verbs_devs_print():873<info> #1 mlx4_0 - IPoIB addresses:
libfabric:107610:verbs:fabric:verbs_devs_print():883<info> 10.30.18.103
libfabric:107610:verbs:fabric:verbs_devs_print():883<info> fe80::202:c903:14:de71
libfabric:107610:verbs:fabric:vrb_get_device_attrs():615<info> device mlx4_0: first found active port is 1
libfabric:107610:verbs:fabric:vrb_get_device_attrs():615<info> device mlx4_0: first found active port is 1
libfabric:107610:verbs:fabric:vrb_get_device_attrs():615<info> device mlx4_0: first found active port is 1
libfabric:107610:core:core:ofi_register_provider():418<info> registering provider: verbs (110.10)
libfabric:107610:core:core:ofi_register_provider():446<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:107610:core:core:ofi_register_provider():418<info> registering provider: mlx (1.4)
libfabric:107610:core:core:ofi_register_provider():418<info> registering provider: ofi_hook_noop (110.10)
libfabric:107610:core:core:fi_getinfo_():1092<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:107610:mlx:core:mlx_getinfo():172<info> used inject size = 1024
libfabric:107610:mlx:core:mlx_getinfo():219<info> Loaded MLX version 1.6.0
libfabric:107610:mlx:core:mlx_getinfo():266<warn> MLX: spawn support 0
libfabric:107610:core:core:fi_getinfo_():1092<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:107610:mlx:core:mlx_getinfo():172<info> used inject size = 1024
libfabric:107610:mlx:core:mlx_getinfo():219<info> Loaded MLX version 1.6.0
libfabric:107610:mlx:core:mlx_getinfo():266<warn> MLX: spawn support 0
[0] MPI startup(): libfabric provider: mlx
libfabric:107610:mlx:core:mlx_fabric_open():172<info>
libfabric:107610:core:core:fi_fabric_():1372<info> Opened fabric: mlx
libfabric:107610:mlx:core:ofi_check_rx_attr():782<info> Tx only caps ignored in Rx caps
libfabric:107610:mlx:core:ofi_check_tx_attr():880<info> Rx only caps ignored in Tx caps
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 1, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
libfabric:107610:mlx:core:ofi_check_rx_attr():782<info> Tx only caps ignored in Rx caps
libfabric:107610:mlx:core:ofi_check_tx_attr():880<info> Rx only caps ignored in Tx caps
[0] MPI startup(): addrnamelen: 1024
libfabric:107610:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [127]...
libfabric:22236:core:mr:ofi_default_cache_size():56<info> default cache size=2109042048
libfabric:22236:verbs:fabric:verbs_devs_print():869<info> list of verbs devices found for FI_EP_MSG:
libfabric:22236:verbs:fabric:verbs_devs_print():873<info> #1 mlx4_0 - IPoIB addresses:
libfabric:22236:verbs:fabric:verbs_devs_print():883<info> 10.30.18.104
libfabric:22236:verbs:fabric:verbs_devs_print():883<info> fe80::202:c903:14:ddf1
libfabric:22236:verbs:fabric:vrb_get_device_attrs():615<info> device mlx4_0: first found active port is 1
libfabric:22236:verbs:fabric:vrb_get_device_attrs():615<info> device mlx4_0: first found active port is 1
libfabric:22236:verbs:fabric:vrb_get_device_attrs():615<info> device mlx4_0: first found active port is 1
libfabric:22236:core:core:ofi_register_provider():418<info> registering provider: verbs (110.10)
libfabric:22236:core:core:ofi_register_provider():446<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:22236:core:core:ofi_register_provider():418<info> registering provider: mlx (1.4)
libfabric:22236:core:core:ofi_register_provider():418<info> registering provider: ofi_hook_noop (110.10)
libfabric:22236:core:core:fi_getinfo_():1092<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:22236:mlx:core:mlx_getinfo():172<info> used inject size = 1024
libfabric:22236:mlx:core:mlx_getinfo():219<info> Loaded MLX version 1.6.0
libfabric:22236:mlx:core:mlx_getinfo():266<warn> MLX: spawn support 0
libfabric:22236:core:core:fi_getinfo_():1092<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:22236:mlx:core:mlx_getinfo():172<info> used inject size = 1024
libfabric:22236:mlx:core:mlx_getinfo():219<info> Loaded MLX version 1.6.0
libfabric:22236:mlx:core:mlx_getinfo():266<warn> MLX: spawn support 0
libfabric:22236:mlx:core:mlx_fabric_open():172<info>
libfabric:22236:core:core:fi_fabric_():1372<info> Opened fabric: mlx
libfabric:22236:mlx:core:ofi_check_rx_attr():782<info> Tx only caps ignored in Rx caps
libfabric:22236:mlx:core:ofi_check_tx_attr():880<info> Rx only caps ignored in Tx caps
libfabric:22236:mlx:core:ofi_check_rx_attr():782<info> Tx only caps ignored in Rx caps
libfabric:22236:mlx:core:ofi_check_tx_attr():880<info> Rx only caps ignored in Tx caps
libfabric:22236:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [127]...
libfabric:22236:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x7f2000132a00
[1605269643.201248] [cnode003:22236:0] select.c:410 UCX ERROR no active messages transport to <no debug data>: mm/posix - Destination is unreachable, mm/sysv - Destination is unreachable, self/self - Destination is unreachable
Abort(1091215) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(1149)..............:
MPIDI_OFI_mpi_init_hook(1657): OFI get address vector map failed
[1605269643.201281] [cnode002:107610:0] select.c:410 UCX ERROR no active messages transport to <no debug data>: mm/posix - Destination is unreachable, mm/sysv - Destination is unreachable, self/self - Destination is unreachable
libfabric:107610:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x7f200002cb80
libfabric:107610:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:107610:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x7f200002cb80
Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(1149)..............:
MPIDI_OFI_mpi_init_hook(1657): OFI get address vector map failed

Beginner

Also, here is the output of ucx_info, since I also see a UCX error message about the messaging transport.  We do not have all the transport methods that I see in some other related posts.

~/scratch/>ucx_info -d | grep Transport
7:# Transport: mm
43:# Transport: mm
79:# Transport: self
113:# Transport: tcp

Moderator

Hi John,


Your system does not have all the transports required to use mlx. This might be due to a driver misconfiguration, missing libraries, or other fabric software problems.

Could you please check your UCX configuration or contact your system administrator about installing the required transports?
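As a concrete check, one can compare the transports UCX reports against what an InfiniBand-capable build would provide. A sketch only; the expected transport names assume a typical Mellanox/UCX installation and may differ on other hardware:

```shell
# List the transports this UCX build can use. The output earlier in this
# thread shows only mm, self, and tcp; an InfiniBand-capable build would
# normally also list verbs-based transports such as rc_verbs/ud_verbs
# (or rc_mlx5/ud_mlx5/dc_mlx5 on mlx5-generation HCAs).
ucx_info -d | grep Transport

# Show the UCX version and configure-time options, to confirm the library
# was actually built with InfiniBand/verbs support:
ucx_info -v
```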

For more information regarding the required transports, please refer to: https://software.intel.com/content/www/us/en/develop/articles/improve-performance-and-stability-with...


Regards

Prasanth


Beginner

Prasanth,

Thank you for your help.  Our cluster administrator installed the UCX library (v1.9.0) and enabled the compile-time InfiniBand features.  I can now use FI_PROVIDER=mlx, and the hang in the test case seems to be resolved.  I will verify that the issue is resolved with our production code shortly.
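For anyone verifying a similar fix, the repeated-run loop from earlier in the thread can be combined with a timeout so a hang fails fast instead of blocking. A sketch; the host file, binary name, and 120-second limit are assumptions based on this thread's test case:

```shell
# Re-run the two-node test case many times; any run that blocks for more
# than 120 s is killed by timeout and reported as a failure, instead of
# hanging the whole loop indefinitely.
export FI_PROVIDER=mlx
for run in $(seq 1 50); do
  timeout 120 mpirun -f hosts -n 2 -ppn 1 ./a.out \
    || echo "run $run: hang or error"
done
```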

We have seen this issue on two separate clusters.  Perhaps the Intel documentation should be updated to clarify (or emphasize) that these libraries need to be installed separately for proper operation.  It may not be obvious to all cluster administrators that this is important, since you can get programs to run without these libraries, just sub-optimally and with (apparently) sporadic run-time issues.

Thanks again.  You've been a great help in resolving this issue.

John

Moderator

Hi John,


These transport requirements relate more to the hardware than to Intel MPI, but I will forward your suggestion about the documentation to the internal team.

Have you verified the fix in your production code? If yes, please let us know the results so we can decide how to proceed.


Regards

Prasanth

