Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI_Alltoallw performs poorly after upgrading the Intel MPI Library to Version 2021.2

SVDB
Beginner

On our cluster we are testing an upgrade of our Intel MPI Library to Version 2021.2, and we observe something similar to the original post. Specifically for MPI_Alltoallw, performance is significantly worse than with previous Intel MPI versions. To simplify the code, I wrote a single-core program that performs a matrix transpose by constructing a strided MPI datatype that switches between row-major and column-major storage. For this case it would be possible to use MPI_Alltoall (or even a simple Fortran transpose), but our actual code requires MPI_Alltoallw.
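For reference, here is a stripped-down single-rank sketch of that construction (this is not the attached bench.f90; the array names, the hard-coded size, and the missing timing loop make it an illustration only). It builds a strided receive datatype, resizes it so consecutive copies start one element apart, and lets MPI_Alltoallw perform the transpose:

program alltoallw_transpose_sketch
  use mpi
  implicit none
  integer, parameter :: n = 512
  double precision, allocatable :: a(:,:), b(:,:)
  integer :: ierr, col_t, colres_t
  integer :: scounts(1), sdispls(1), stypes(1)
  integer :: rcounts(1), rdispls(1), rtypes(1)
  integer(kind=MPI_ADDRESS_KIND) :: lb, extent

  call MPI_Init(ierr)
  allocate(a(n,n), b(n,n))
  call random_number(a)

  ! Receive-side datatype: n doubles with stride n (one "row" of the
  ! column-major array), resized so consecutive copies start one element apart.
  call MPI_Type_get_extent(MPI_DOUBLE_PRECISION, lb, extent, ierr)
  call MPI_Type_vector(n, 1, n, MPI_DOUBLE_PRECISION, col_t, ierr)
  call MPI_Type_create_resized(col_t, 0_MPI_ADDRESS_KIND, extent, colres_t, ierr)
  call MPI_Type_commit(colres_t, ierr)

  ! Single rank: send n*n contiguous doubles and receive them through the
  ! strided type, which yields b = transpose(a). Displacements are in bytes.
  scounts(1) = n*n; sdispls(1) = 0; stypes(1) = MPI_DOUBLE_PRECISION
  rcounts(1) = n;   rdispls(1) = 0; rtypes(1) = colres_t

  call MPI_Alltoallw(a, scounts, sdispls, stypes, &
                     b, rcounts, rdispls, rtypes, MPI_COMM_SELF, ierr)

  print *, 'max |b - transpose(a)| =', maxval(abs(b - transpose(a)))

  call MPI_Type_free(colres_t, ierr)
  call MPI_Type_free(col_t, ierr)
  call MPI_Finalize(ierr)
end program alltoallw_transpose_sketch

Moving the strided type to the sending side only requires swapping the send and receive count/displacement/type arrays.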


Here are the timings (in seconds) for transposing a [512x512x512] array along the first two dimensions on an Intel(R) Xeon(R) Gold 6140:

                         TRANSPOSE   ALLTOALL   ALLTOALLW
Version 2018 Update 5       1.29       1.50       1.30
Version 2021.2              1.28       1.49       2.12

 

A first interesting observation is that ALLTOALL is only significantly slower than TRANSPOSE when the strided MPI datatype is on the receiving side. If the sender has the strided MPI datatype, the difference is only a few percent.

The more important issue for us is the serious slowdown (a timing increase of more than 50%) of ALLTOALLW when switching to the new Intel MPI Library.
I attached the code used to obtain these numbers. It can be compiled with "mpiifort -O2 -xHost bench.f90" and run with "I_MPI_PIN_PROCESSOR_LIST=0 mpirun -np 1 ./a.out 512". Here is the output with I_MPI_DEBUG=12 for the latest version:

 

[0] MPI startup(): Intel(R) MPI Library, Version 2021.2  Build 20210302 (id: f4f7c92cd)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (2084 MB per rank) * (1 local ranks) = 2084 MB total
[0] MPI startup(): libfabric version: 1.11.0-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 1, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 1024
[0] MPI startup(): File "/vsc-hard-mounts/leuven-apps/skylake/2021a/software/impi/2021.2.0-intel-compilers-2021.2.0/mpi/2021.2.0/etc/tuning_skx_shm-ofi_mlx.dat" not found
[0] MPI startup(): Load tuning file: "/vsc-hard-mounts/leuven-apps/skylake/2021a/software/impi/2021.2.0-intel-compilers-2021.2.0/mpi/2021.2.0/etc/tuning_skx_shm-ofi.dat"
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       28548    r22i13n16  0
[0] MPI startup(): I_MPI_ROOT=/vsc-hard-mounts/leuven-apps/skylake/2021a/software/impi/2021.2.0-intel-compilers-2021.2.0/mpi/2021.2.0
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_RMK=pbs
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_PIN_PROCESSOR_LIST=0
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=12

 



ShanmukhS_Intel
Moderator

Hi,


Thanks for reaching out to us.


We need more information to investigate your issue. Could you please provide the I_MPI_DEBUG output after running the code with 2018 Update 5?


Best Regards,

Shanmukh.SS


SVDB
Beginner

Hello ShanmukhS,

 

Here is the output from 2018 Update 5 with I_MPI_DEBUG=12:

I_MPI_DEBUG=12 I_MPI_PIN_PROCESSOR_LIST=0 mpirun -np 1 ./a.out 512
[0] MPI startup(): Intel(R) MPI Library, Version 2018 Update 5  Build 20190404 (id: 18839)
[0] MPI startup(): Copyright (C) 2003-2019 Intel Corporation.  All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[0] MPI startup(): Device_reset_idx=8
[0] MPI startup(): Allgather: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allgatherv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allreduce: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoall: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Barrier: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Bcast: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gather: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gatherv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatterv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       5936     r22i13n01  0
[0] MPI startup(): Recognition=2 Platform(code=512 ippn=0 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[0] MPI startup(): Topology split mode = 1

| rank | node | space=1
|  0  |  0  |
[0] MPI startup(): I_MPI_DEBUG=12
[0] MPI startup(): I_MPI_INFO_BRAND=Intel(R) Xeon(R) Gold 6140
[0] MPI startup(): I_MPI_INFO_CACHE1=0,1,2,3,4,8,9,10,11,16,17,18,19,20,24,25,26,27,32,33,34,35,36,40,41,42,43,48,49,50,51,52,56,57,58,59
[0] MPI startup(): I_MPI_INFO_CACHE2=0,1,2,3,4,8,9,10,11,16,17,18,19,20,24,25,26,27,32,33,34,35,36,40,41,42,43,48,49,50,51,52,56,57,58,59
[0] MPI startup(): I_MPI_INFO_CACHE3=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
[0] MPI startup(): I_MPI_INFO_CACHES=3
[0] MPI startup(): I_MPI_INFO_CACHE_SHARE=2,2,64
[0] MPI startup(): I_MPI_INFO_CACHE_SIZE=32768,1048576,25952256
[0] MPI startup(): I_MPI_INFO_CORE=0,1,2,3,4,8,9,10,11,16,17,18,19,20,24,25,26,27,0,1,2,3,4,8,9,10,11,16,17,18,19,20,24,25,26,27
[0] MPI startup(): I_MPI_INFO_C_NAME=Unknown
[0] MPI startup(): I_MPI_INFO_DESC=1342177280
[0] MPI startup(): I_MPI_INFO_FLGB=-744488965
[0] MPI startup(): I_MPI_INFO_FLGC=2147417087
[0] MPI startup(): I_MPI_INFO_FLGCEXT=24
[0] MPI startup(): I_MPI_INFO_FLGD=-1075053569
[0] MPI startup(): I_MPI_INFO_FLGDEXT=-1677712384
[0] MPI startup(): I_MPI_INFO_LCPU=36
[0] MPI startup(): I_MPI_INFO_MODE=263
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=mlx5_0:0,mlx5_1:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_INFO_PACK=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
[0] MPI startup(): I_MPI_INFO_SIGN=329300
[0] MPI startup(): I_MPI_INFO_STATE=0
[0] MPI startup(): I_MPI_INFO_THREAD=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
[0] MPI startup(): I_MPI_INFO_VEND=1
[0] MPI startup(): I_MPI_PIN_INFO=0
[0] MPI startup(): I_MPI_PIN_MAPPING=1:0 0
           TRANSPOSE    N   512  FWD    0.645729  BCK    0.644933  TOT    1.290661
            ALLTOALL    N   512  FWD    0.653792  BCK    0.850857  TOT    1.504649
           ALLTOALLW    N   512  FWD    0.657849  BCK    0.657091  TOT    1.314940
ShanmukhS_Intel
Moderator

Hi,


Thanks for providing the I_MPI_DEBUG information. However, we did not find any libfabric details in the debug log you provided.


Could you please confirm your environment and hardware details?

Also, please provide the interconnect hardware and the OFI provider used for both the 2018u5 and 2021.2 versions.


Could you please confirm whether you are executing the code on the same machine with both the 2018 Update 5 and 2021.2 versions?


Best Regards,

Shanmukh.SS


SVDB
Beginner

Could you please confirm your environment and hardware details?

The operating system is CentOS Linux release 7.9.2009. Tests are performed on a node with two Intel(R) Xeon(R) Gold 6140 CPUs @ 2.3 GHz (Skylake). Whenever I compare timings, they were obtained on the same machine.

Also, please provide the interconnect hardware...

Nodes are connected via an InfiniBand EDR network. I am not sure whether that is relevant for this test, which runs on a single core of a single node.

...and the OFI provider used for both the 2018u5 and 2021.2 versions?

I thought that prior to version 2019 the OFA fabric was used instead of OFI? Again, I am not sure whether this is relevant, as the debug output for 2018u5 indicates "[0] MPI startup(): shm data transfer mode".

The OFI provider for 2021.2 is mlx running over UCX 1.10.0, but so far I have assumed that communication goes through shm for this single-core example (based on the debug output mentioning "shm segment size (2084 MB per rank) * (1 local ranks) = 2084 MB total").
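If it helps to rule out the provider layer, the run can also be forced onto the intra-node shared-memory path and repeated with debug output (assuming I_MPI_FABRICS is honored as documented for the 2019+ releases):

I_MPI_FABRICS=shm I_MPI_DEBUG=12 I_MPI_PIN_PROCESSOR_LIST=0 mpirun -np 1 ./a.out 512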

Could you please confirm whether you are executing the code on the same machine with both the 2018 Update 5 and 2021.2 versions?

Yes.

ShanmukhS_Intel
Moderator

Hi,


Thanks for sharing the information.


We are currently working on your issue.


Meanwhile, could you please try the latest Intel oneAPI version, 2021.3, and let us know whether you are facing the same issue?


Best Regards,

Shanmukh.SS


SVDB
Beginner

Hello Shanmukh,

 

I tried the same example with oneAPI 2021.3, but there is no significant difference compared to 2021.2. For consistency, I reran the example with the different versions on the same node; these are the timings:

 

                         TRANSPOSE   ALLTOALL   ALLTOALLW
Version 2018 Update 5       1.27       1.48       1.28
Version 2021.2              1.27       1.48       2.10
Version 2021.3              1.27       1.48       2.09

 

This is the verbose output when using oneAPI version 2021.3:

 

[0] MPI startup(): Intel(R) MPI Library, Version 2021.3  Build 20210601 (id: 6f90181f1)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): shm segment size (2084 MB per rank) * (1 local ranks) = 2084 MB total
[0] MPI startup(): libfabric version: 1.12.1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 1, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 1024
[0] MPI startup(): File "/vsc-hard-mounts/leuven-data/337/vsc33716/Software/myapps/genius/skylake/2021a/software/impi/2021.3.0-intel-compilers-2021.3.0/mpi/2021.3.0/etc/tuning_skx_shm-ofi_mlx.dat" not found
[0] MPI startup(): Load tuning file: "/vsc-hard-mounts/leuven-data/337/vsc33716/Software/myapps/genius/skylake/2021a/software/impi/2021.3.0-intel-compilers-2021.3.0/mpi/2021.3.0/etc/tuning_skx_shm-ofi.dat"
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       21923    r23i13n21  0
[0] MPI startup(): I_MPI_ROOT=/vsc-hard-mounts/leuven-data/337/vsc33716/Software/myapps/genius/skylake/2021a/software/impi/2021.3.0-intel-compilers-2021.3.0/mpi/2021.3.0
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_RMK=pbs
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_PIN_PROCESSOR_LIST=0
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=12

 

 

ShanmukhS_Intel
Moderator

Hi,


Thanks for sharing the required details.


We reproduced the issue on our end and are working on it internally. We will get back to you soon.


Best Regards,

Shanmukh.SS


Jennifer_D_Intel
Moderator

This is a known issue, and your regression report should help the developers fix it.


SVDB
Beginner

Will the 2022.1 version be made available as standalone components? At the moment I cannot see them at https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html The oneAPI Base Toolkit offline installer is not working for me (it is stuck on "Wait while the installer is preparing..."), whereas with the individual components I usually don't have a problem.

Xiao_Z_Intel
Employee

Hi Steven,

 

Yes, Intel® MPI Library (version 2021.5) is available as a standalone component. It is available at the link you posted earlier and is also included in the Intel® oneAPI HPC Toolkit (version 2022.1). In addition, the required changes for the reported MPI_Alltoallw regression will not be ready in time to be included in the upcoming Intel® MPI Library release, version 2021.6. Thank you very much for your patience.

 

Best,

Xiao


Xiao_Z_Intel
Employee

Hi Steven,


Please refer to the Intel® MPI Library Release Notes for the fix of the reported regression (https://www.intel.com/content/www/us/en/developer/articles/release-notes/mpi-library-release-notes.html). I have also addressed your question about the availability of the standalone Intel® MPI Library. We will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


Best,

Xiao


