Huge performance drop on Windows with new Intel oneAPI MPI

Frank_R_1 · 10-11-2022

Topic:

Huge performance drop on Windows with new Intel oneAPI MPI 2021.6/2021.7 libraries and hang in mellanox driver.

Hi,

We have used the Intel C/C++/Fortran compilers and the Intel MPI library for many years on both platforms (Linux and Windows).

At the moment we use the following setup in our product:
OS:
Windows 10/11 64bit
Linux RHEL 7.9 64bit

Hardware (Linux and Windows) with no Hyper Threading:
Dual socket Intel(R) Xeon(R) Gold 6354 CPU @ 3.00GHz, 3000 Mhz, 18 Core(s), 18 Logical Processor(s)

current C/C++/Fortran compiler suite (Intel oneAPI 2022.2/2022.3 not usable at the moment due to compiler issues):
parallel_studio_xe_2020.1.086 with compilers_and_libraries_2020.1.216 (Windows)
parallel_studio_xe_2020.1.102 with compilers_and_libraries_2020.1.217 (Linux)

current mpi library:
Intel MPI 2018.3.210 (Windows)
Intel MPI 2018.3.222 (Linux)

new libraries we want to use:
Intel oneAPI MPI 2021.6/2021.7

On Linux (workstation and cluster, simulations with 16 and 32 cores) we see a very good performance boost with Intel MPI 2021.6/2021.7 compared to Intel MPI 2018.3.222.
On Windows (workstation only, simulation with 16 cores) we see the opposite: a performance drop with Intel MPI 2021.6/2021.7 compared to Intel MPI 2018.3.210.

On Windows we use the following command to call mpi:
2018.3.210
mpiexec.hydra.exe -delegate -localroot -genvall -print-all-exitcodes -genv I_MPI_ADJUST_ALLREDUCE 7 -genv I_MPI_ADJUST_REDUCE_SCATTER 4 -genv I_MPI_ADJUST_REDUCE 2 -envall -np #cpus #path_to_product
2021.6/2021.7
mpiexec.hydra.exe -delegate -localroot -genvall -print-all-exitcodes -genv I_MPI_HYDRA_BSTRAP_KEEP_ALIVE 1 -genv I_MPI_CBWR 2 -envall -np #cpus #path_to_product

On Linux we use the following command to call mpi:
2018.3.222
mpiexec.hydra -genvall -print-all-exitcodes -genv I_MPI_ADJUST_ALLREDUCE 7 -genv I_MPI_ADJUST_REDUCE_SCATTER 4 -genv I_MPI_ADJUST_REDUCE 2 -envall -np #cpus #path_to_product
2021.6/2021.7
mpiexec.hydra -genvall -print-all-exitcodes -genv I_MPI_HYDRA_BSTRAP_KEEP_ALIVE 1 -genv I_MPI_CBWR 2 -envall -np #cpus #path_to_product
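As a sketch only, the four command lines above could be assembled by a small helper so that the per-version differences are visible in one place. The function and constant names are hypothetical; the flags and variables are exactly the ones quoted above, nothing more.

```python
# Hypothetical helper: rebuilds the mpiexec command lines quoted in this post.
# Only flags and variables that appear above are used; all names are illustrative.

TUNING_2018 = {
    "I_MPI_ADJUST_ALLREDUCE": "7",
    "I_MPI_ADJUST_REDUCE_SCATTER": "4",
    "I_MPI_ADJUST_REDUCE": "2",
}
TUNING_2021 = {
    "I_MPI_HYDRA_BSTRAP_KEEP_ALIVE": "1",
    "I_MPI_CBWR": "2",
}


def build_mpiexec_args(mpi_version, platform, ncpus, program):
    """Return the argument list for one MPI version ("2018..." or "2021...")
    and one platform ("windows" or "linux")."""
    windows = platform == "windows"
    args = ["mpiexec.hydra.exe" if windows else "mpiexec.hydra"]
    if windows:
        # These two flags appear only in the Windows commands above.
        args += ["-delegate", "-localroot"]
    args += ["-genvall", "-print-all-exitcodes"]
    tuning = TUNING_2018 if mpi_version.startswith("2018") else TUNING_2021
    for name, value in tuning.items():
        args += ["-genv", name, value]
    args += ["-envall", "-np", str(ncpus), program]
    return args
```

Calling `build_mpiexec_args("2021.7", "windows", 16, "product.exe")` yields a list that can be passed to `subprocess.run` without shell quoting.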

We compared the CPU utilization of 2018.3.210 and 2021.6/2021.7 with 16 cores in Task Manager on Windows.
With 2021.6/2021.7 a large share of the CPU time is kernel time, whereas with 2018.3.210 the kernel share is rather small.

As a result, the simulation time doubles with 2021.6/2021.7 compared to 2018.3.210.

Is this behavior known?
Is there an MPI parameter that we are missing with the new 2021.6/2021.7 libraries?

Another problem occurs on Linux when running on more than one cluster node. Here the MPI libraries 2021.6/2021.7 hang in the Mellanox driver (call stack):
redhat driver
libibverbs-22.4-6.el7_9.x86_64
(gdb) up
#1 0x00007f4d8b50ebef in uct_rc_verbs_iface_progress () from /usr/lib64/libuct.so.0
(gdb) down
#0 0x00007f4d8aea3945 in mlx5_poll_cq_v1 () from /usr/lib64/libmlx5.so.1
(gdb) bt
#0 0x00007f4d8aea3945 in mlx5_poll_cq_v1 () from /usr/lib64/libmlx5.so.1
#1 0x00007f4d8b50ebef in uct_rc_verbs_iface_progress () from /usr/lib64/libuct.so.0
#2 0x00007f4d8b98aee2 in ucp_worker_progress () from /lib64/libucp.so.0
#3 0x00007f4d8bbc37a1 in mlx_ep_progress () from /clusterhead/software/MAGMA/MAGMA55/v5.5.1.5-22660/v5.5.1////LINUX64/bin/../impi/lib/libmlx-fi.so
#4 0x00007f4d8bbdbb0d in ofi_cq_progress () from /clusterhead/software/MAGMA/MAGMA55/v5.5.1.5-22660/v5.5.1////LINUX64/bin/../impi/lib/libmlx-fi.so
#5 0x00007f4d8bbdba97 in ofi_cq_readfrom () from /clusterhead/software/MAGMA/MAGMA55/v5.5.1.5-22660/v5.5.1////LINUX64/bin/../impi/lib/libmlx-fi.so
#6 0x00007f502b872f3e in MPIDI_OFI_progress () from /clusterhead/software/MAGMA/MAGMA55/v5.5.1.5-22660/v5.5.1////LINUX64/bin/../impi/lib/libmpi.so.12
#7 0x00007f502b446181 in MPID_Progress_wait () from /clusterhead/software/MAGMA/MAGMA55/v5.5.1.5-22660/v5.5.1////LINUX64/bin/../impi/lib/libmpi.so.12
#8 0x00007f502b9d9441 in MPIR_Wait_impl () from /clusterhead/software/MAGMA/MAGMA55/v5.5.1.5-22660/v5.5.1////LINUX64/bin/../impi/lib/libmpi.so.12
#9 0x00007f502b5bc390 in MPIC_Ssend () from /clusterhead/software/MAGMA/MAGMA55/v5.5.1.5-22660/v5.5.1////LINUX64/bin/../impi/lib/libmpi.so.12
#10 0x00007f502b561160 in MPIR_Gatherv_allcomm_linear () from /clusterhead/software/MAGMA/MAGMA55/v5.5.1.5-22660/v5.5.1////LINUX64/bin/../impi/lib/libmpi.so.12
#11 0x00007f502b56056e in MPIR_Gatherv_intra_auto () from /clusterhead/software/MAGMA/MAGMA55/v5.5.1.5-22660/v5.5.1////LINUX64/bin/../impi/lib/libmpi.so.12
#12 0x00007f502b3e4dce in MPIDI_coll_invoke () from /clusterhead/software/MAGMA/MAGMA55/v5.5.1.5-22660/v5.5.1////LINUX64/bin/../impi/lib/libmpi.so.12
#13 0x00007f502b3b0a80 in MPIDI_coll_select () from /clusterhead/software/MAGMA/MAGMA55/v5.5.1.5-22660/v5.5.1////LINUX64/bin/../impi/lib/libmpi.so.12
#14 0x00007f502b4b2bbc in MPIR_Gatherv () from /clusterhead/software/MAGMA/MAGMA55/v5.5.1.5-22660/v5.5.1////LINUX64/bin/../impi/lib/libmpi.so.12
#15 0x00007f502b566f2e in PMPI_Gatherv () from /clusterhead/software/MAGMA/MAGMA55/v5.5.1.5-22660/v5.5.1////LINUX64/bin/../impi/lib/libmpi.so.12
#16 0x0000000000ba9ed8 in ?? ()
#17 0x0000000000b30d31 in ?? ()
#18 0x00000000007bcec8 in ?? ()
#19 0x00000000007b8127 in ?? ()
#20 0x00000000007c1f67 in ?? ()
#21 0x0000000000745bd1 in ?? ()
#22 0x0000000000743a5a in ?? ()
#23 0x000000000074346e in ?? ()
#24 0x000000000043f079 in ?? ()
#25 0x000000000043b860 in ?? ()
#26 0x000000000043f1a9 in ?? ()
#27 0x00007f5021ecb555 in __libc_start_main () from /lib64/libc.so.6
#28 0x0000000000438227 in ?? ()

Is this a known issue?
Our workaround for this is to use -genv I_MPI_ADJUST_GATHERV 3 instead of -genv I_MPI_CBWR 2.

Best regards,
Frank

Frank_R_1 · 10-14-2022

No one else experienced this behavior on Windows?

With no code changes, merely swapping the MPI libraries leads to the performance drop.

How can I get MPI statistics in Intel MPI 2021.6/2021.7 to analyze the problem?

Frank

ShivaniK_Intel · 10-17-2022

Hi,

Thanks for posting in the Intel forums.

Could you please provide us with the sample reproducer and steps to reproduce the issue at our end?

>>>Another problem occurs on Linux on more than one cluster node. Here the mpi libraries 2021.6/2021.7 hang in the Mellanox driver (call stack):

Redhat driver

Could you please provide us with the complete debug log exporting I_MPI_DEBUG=30?
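One possible way to capture such a log is sketched below. It assumes that setting I_MPI_DEBUG in the launcher's environment is sufficient for it to reach the ranks (the commands in this thread pass -genvall); the function name and the log file name are hypothetical.

```python
import os
import subprocess


def run_with_debug_log(command, log_path, debug_level=30):
    """Run `command` with I_MPI_DEBUG set and write stdout and stderr
    to `log_path`. Returns the exit code of the command."""
    # Copy the current environment and add the debug variable to it.
    env = dict(os.environ, I_MPI_DEBUG=str(debug_level))
    with open(log_path, "w") as log:
        result = subprocess.run(
            command, env=env, stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode
```

For example, `run_with_debug_log(["mpiexec.hydra", "-genvall", "-np", "16", "./app"], "mpi_debug.txt")` would leave the complete output in `mpi_debug.txt`, where `./app` stands in for the real executable.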

>>>How can I get MPI statistics in Intel MPI 2021.6/2021.7 to analyze the problem?

Could you please try using Intel Trace Analyzer and Collector and let us know the output?

Thanks & Regards

Shivani

Frank_R_1 · 10-18-2022

Hi,

It seems that I_MPI_STATS only works on Linux, since the aps program is missing on Windows!

mpiexec.exe -np 4 -genv I_MPI_STATS 0 -delegate -localroot mpi_test_intel.exe
[proxy:0:0@aws1de051] HYD_spawn (..\windows\src\hydra_spawn.c:286): unable to create process aps --collection-mode=omp,mpi mpi_test_intel.exe (error code 2)
[proxy:0:0@aws1de051] launch_processes (proxy.c:596): error creating process (error code 2). The system cannot find the file specified.

[proxy:0:0@aws1de051] main (proxy.c:969): error launching_processes
[mpiexec@aws1de051] wmain (mpiexec.c:2165): assert (pg->intel.exitcodes != NULL) failed
[mpiexec@aws1de051] HYD_sock_write (..\windows\src\hydra_sock.c:387): write error (errno = 34)

Providing a sample reproducer is difficult at the moment. I want to use MPI statistics on Windows to see which MPI 2021.7 functions are called and which are slow compared to MPI 2018.3. See above for the problem with statistics...

Nevertheless the same code runs faster on Linux with mpi 2021.7 compared to mpi 2018.3 but slows down on Windows!

We suspect the gather algorithms, but to confirm that we need the statistics!

Concerning the hang in the Mellanox driver, we will try to get the MPI debug output and post it here.

Frank

ShivaniK_Intel · 10-21-2022

Hi,

Could you please try using Intel Trace Analyzer and Collector to analyze your application?

For more details regarding Intel Trace Analyzer and Collector, please refer to the link below.

https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-itac/top/trace-your-mpi-application.html

Thanks & Regards

Shivani

Frank_R_1 · 10-21-2022

Hi,

I uploaded a file "intel_mpi_comparison.zip"

There you will find:

mpi2018.3_debug.txt (mpi debug output)

mpi2021.6_debug.txt (mpi debug output)

2018.3_16cpu_win11.png (task manager which shows cpu utilization with kernel times)

2021.7_16cpu_win11.png (task manager which shows cpu utilization with kernel times)

2018.3vs2021.6_16cpu_win11.png (vtune comparison impi.dll utilization)

2018.3vs2021.6_16cpu_win11_ws2_32.png (vtune comparison impi.dll utilization against ws2_32.dll)

For completeness:

OS Name Microsoft Windows 11 Enterprise (same problem occurs on Windows 10)
Version 10.0.22000 Build 22000

Hardware (no Hyper Threading):
Dual socket Intel(R) Xeon(R) Gold 6354 CPU @ 3.00GHz, 3000 Mhz, 18 Core(s), 18 Logical Processor(s)

MPI calls:

/e/p4ws/fro_ms5.5.1_w/vobs/install/W/R/WINDOWS64/impi_2018.3/bin/mpiexec.exe -np 16 -delegate -genvall -print-all-exitcodes -genv I_MPI_DEBUG 500 -genv I_MPI_HYDRA_DEBUG 1 -genv I_MPI_HYDRA_BSTRAP_KEEP_ALIVE 1 -genv I_MPI_CBWR 2 -genv I_MPI_ADJUST_GATHERV 3 -envall -localroot /e/p4ws/fro_ms5.5.1_w/vobs/install/W/R/WINDOWS64/bin/MAGMAsimulation2_intel.exe &> mpi2018.3_debug.txt

/e/p4ws/fro_ms5.5.1_w/vobs/install/W/R/WINDOWS64/impi_2021.6/bin/mpiexec.exe -np 16 -delegate -genvall -print-all-exitcodes -genv I_MPI_DEBUG 500 -genv I_MPI_HYDRA_DEBUG 1 -genv I_MPI_HYDRA_BSTRAP_KEEP_ALIVE 1 -genv I_MPI_CBWR 2 -genv I_MPI_ADJUST_GATHERV 3 -envall -localroot /e/p4ws/fro_ms5.5.1_w/vobs/install/W/R/WINDOWS64/bin/MAGMAsimulation2_intel.exe &> mpi2021.6_debug.txt

This is really a severe issue. The first customers are already complaining about the poor performance, and we can't offer them a workaround at the moment. So please take this opportunity to investigate it further.

Best regards and thanks in advance

Frank

ALaza1 · 10-21-2022

I just installed 2022.3.0.9564 and like you, I've run into a serious network performance drop. I was hoping to replace team (adaptive load balancing) with multiple-subnet usage, only to find this variant runs slower than using a single (non-teamed) subnet. My best performance has been with w_mpi_p_2018.5.287 and adaptive load balancing.

Separately I'm preparing my own problem documentation. The most current oneapi package appears to be w_mpi_oneapi_p_2021.7.0.9549_offline and the matching toolkit w_HPCKit_p_2022.3.0.9564_offline.

Each of my nodes has a 2 NIC 10gig-e card, and the first NIC of each node is connected to 10gig-e switch #1 and the second NIC of each node is connected to another 10gig-e switch #2. In a non-teamed configuration, each NIC #1 is on the same subnet as the other node's NIC #1. Same for NIC #2s. The teaming configuration pairs each node's NICs into a single virtual NIC. Timings using adaptive load balancing manage to produce almost 20 gbps. Timings on the non-teamed configuration don't even get close to 10gbps, and timings using non-teamed NICs (multirail?) are slower than on a single NIC. Using something like FI_SOCKETS_IFACE=eth0 constrains multirail to a single NIC and produces better performance.

I'm running fully updated Windows 10 Pro on each of the nodes. Paging is disabled, several TCP/IP tuning options include jumbo frames, 8x sized receive and transmit buffers, and anti-virus checking is disabled for the MPI process and for the application itself.

The application I'm using for this test is one of the NAS parallel benchmarks, FT. The code does 2 matrix inversions per timestep on very large arrays using calls to MPI_ALLTOALL on about 115GB of array data.

I was hoping a multirail approach might be the solution for Intel dropping support for teaming.

Art

Frank_R_1 · 10-24-2022

Hi,

We found out that with the option

-env FI_PROVIDER sockets

we get much better performance than with the default provider (which I believe is tcp).

It is even faster than the old mpi2018.3.

Why is the standard behavior so slow???

With -env FI_PROVIDER sockets, Task Manager shows very low kernel time, compared to the high kernel time of the default behavior without -env FI_PROVIDER sockets.
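A rough way to quantify this difference is to time the same run once with the default provider and once with sockets. The sketch below is only an illustration: the helper names are hypothetical, and placing `-env FI_PROVIDER <name>` directly after the launcher is an assumption about option ordering.

```python
import subprocess
import time


def with_provider(mpiexec_args, provider):
    """Return a copy of the command with '-env FI_PROVIDER <provider>'
    inserted after the launcher. provider=None keeps the default."""
    if provider is None:
        return list(mpiexec_args)
    return [mpiexec_args[0], "-env", "FI_PROVIDER", provider, *mpiexec_args[1:]]


def wall_time(command):
    """Run the command once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


def compare_providers(mpiexec_args, providers=(None, "sockets")):
    """Time one run per provider; the key 'default' means no override."""
    return {
        provider or "default": wall_time(with_provider(mpiexec_args, provider))
        for provider in providers
    }
```

A single run per provider is noisy, so repeating each measurement a few times and comparing the minimum or median would give a more trustworthy picture.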

It would be helpful if this were documented in the release notes or elsewhere.

Best regards

Frank

ShivaniK_Intel · 10-28-2022

Hi,

We have observed that you are using I_MPI_CBWR with the 2021.x version but not with the 2018 version (it is probably not available in 2018). This option is known to impact performance because it switches off several optimizations.

For a fair performance comparison, could you please either set this option in both cases or omit it in both cases?
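A like-for-like comparison could be scripted by deriving the variant without I_MPI_CBWR from an existing argument list. The sketch below is illustrative only; `without_genv` is a hypothetical helper, and it assumes the variable is passed as `-genv NAME VALUE` as in the commands earlier in this thread.

```python
def without_genv(args, name):
    """Return a copy of an mpiexec argument list with every
    '-genv <name> <value>' triple for the given variable removed."""
    result = []
    i = 0
    while i < len(args):
        is_target = (
            args[i] == "-genv"
            and i + 2 < len(args)      # a name and a value must follow
            and args[i + 1] == name
        )
        if is_target:
            i += 3                     # skip '-genv', the name and the value
            continue
        result.append(args[i])
        i += 1
    return result
```

Running the original list and `without_genv(args, "I_MPI_CBWR")` against the same MPI version separates the cost of the option from the cost of the library change.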

Could you please let us know why -genv I_MPI_HYDRA_BSTRAP_KEEP_ALIVE 1 is used for the 2021.x version?

Thanks & Regards

Shivani

Frank_R_1 · 11-02-2022

Hi,

I_MPI_CBWR is crucial for us to get binary identical results on Linux and Windows, in debug and release builds.

In 2018.3 we fixed some algorithms to achieve this. In 2021.6/7 I_MPI_CBWR does the right thing.

-genv I_MPI_OFI_PROVIDER sockets solves the performance problem!

2021.6/7 is really faster than 2018.3! But it is not documented that one needs this on Windows!

-genv I_MPI_HYDRA_BSTRAP_KEEP_ALIVE 1 is used to get a process tree which is killable!

The Ctrl-C/kill behavior of mpiexec differs between 2021.6/7 and 2018.3.

Best regards

Frank

Frank_R_1 · 11-02-2022

Concerning -genv I_MPI_HYDRA_BSTRAP_KEEP_ALIVE 1

have a look here:

https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/InteloneAPI-MPI-2021-2-0-behavior-on-Linux-and-Windows-differ/td-p/1290020

ShivaniK_Intel · 11-14-2022

Hi,

We are working on it and will get back to you.

Thanks & Regards

Shivani

ShivaniK_Intel · 11-16-2022

Hi,

>>> "-genv I_MPI_OFI_PROVIDER sockets solves the performance problem!"

As this resolves your issue, could you please let us know whether we can close this issue?

>>>But it is not documented that one needs this on Windows!

Thank you for your feedback.

Thanks & Regards

Shivani

ShivaniK_Intel · 11-27-2022

Hi,

As we have not heard back from you, could you please provide us with an update on this issue?

Thanks & Regards

Shivani

ShivaniK_Intel · 12-01-2022

Hi,

As your issue is resolved, we are going ahead and closing this thread. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.

Thanks & Regards

Shivani
