Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Time-consuming issue at MPI startup

youn__kihang
Novice

 

Hello All,

MPI startup takes a long time on our system, so I would like to ask for advice.
The startup log goes through these stages: library kind -> libfabric version -> libfabric provider -> Load tuning file. The step from "libfabric provider" to "Load tuning file" takes the most time.

 

Tue Apr 13 13:50:08 UTC 2021
[0] MPI startup(): Intel(R) MPI Library, Version 2021.2 Build 20210302 (id: f4f7c92cd)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.11.0-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): Load tuning file: "/opt/local/mpi/2021.2.0/etc/tuning_icx_shm-ofi_mlx.dat"
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 276153 ****0721.maru 0
[0] MPI startup(): 1 276154 ****0721.maru 1
[0] MPI startup(): 2 276155 ****0721.maru 2

 

The executable is IMB-MPI1, and the launch script is as follows.

export I_MPI_HYDRA_PMI_CONNECT=alltoall
export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:ofi
export I_MPI_PIN=1
export I_MPI_PIN_PROCESSOR_LIST=0-75
export FI_PROVIDER=mlx
export UCX_TLS=rc,dc_mlx5,sm,self

{ time mpiexec.hydra -genvall -f ./hostlist -n 33972 -ppn 76 IMB-MPI1 Bcast Allreduce -npmin 33972; } >> ${OUTFILE} 2>&1
#{ time mpiexec.hydra -genvall -f ./hostlist -n 67944 -ppn 76 IMB-MPI1 Bcast Allreduce -npmin 67944; } >> ${OUTFILE} 2>&1
#{ time mpiexec.hydra -genvall -f ./hostlist -n 131328 -ppn 76 IMB-MPI1 Bcast Allreduce -npmin 67944; } >> ${OUTFILE} 2>&1

 

I tested three rank counts (33,972, 67,944, and 131,328 ranks), and startup took about 33, 79, and 131 seconds respectively. Startup accounts for a large part of the overall execution time, so I would appreciate your advice on how to reduce it.

*Intel MPI version is 2021.2.0, UCX is 1.10.0, and MOFED is 5.2-1.0.4.0.
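
As a rough check on how startup time scales, the three measurements above can be divided by the rank count. This is only a sketch built from the numbers reported in this post; `per_rank_ms` is a hypothetical helper name, it assumes `awk` is available, and it says nothing about where the time is actually spent.

```shell
#!/bin/sh
# Startup cost per rank, from the rank counts and startup times reported above.
# per_rank_ms is a hypothetical helper name used only for this illustration.
per_rank_ms() {
    # $1 = number of ranks, $2 = startup time in seconds
    awk -v n="$1" -v t="$2" \
        'BEGIN { printf "%d ranks: %.2f ms per rank\n", n, t * 1000 / n }'
}

per_rank_ms 33972 33     # prints "33972 ranks: 0.97 ms per rank"
per_rank_ms 67944 79     # prints "67944 ranks: 1.16 ms per rank"
per_rank_ms 131328 131   # prints "131328 ranks: 1.00 ms per rank"
```

All three cases come out at roughly 1 ms per rank, which suggests the startup cost grows about linearly with the number of ranks.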

 

Thanks, Kihang

 

6 Replies
youn__kihang
Novice

 

For information,

I can reproduce this with a program that only calls MPI_Init.
With 76,000 MPI ranks, MPI startup (MPI_Init) takes 68 to 72 seconds.
Here is a more detailed log (I_MPI_DEBUG=1000).

[0] MPI startup(): libfabric version: 1.11.0-impi
libfabric:361875:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:361875:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:361875:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ZE not supported
libfabric:361875:core:core:ofi_register_provider():427<info> registering provider: ofi_rxm (111.0)
libfabric:361875:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:361875:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:361875:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ZE not supported
libfabric:361875:core:core:ofi_register_provider():427<info> registering provider: verbs (111.0)
libfabric:361875:core:core:ofi_register_provider():455<info> "verbs" filtered by provider include/exclude list, skipping
libfabric:361875:core:core:ofi_register_provider():427<info> registering provider: tcp (111.0)
libfabric:361875:core:core:ofi_register_provider():455<info> "tcp" filtered by provider include/exclude list, skipping
libfabric:361875:core:core:ofi_register_provider():427<info> registering provider: mlx (1.4)
libfabric:361875:core:core:ofi_register_provider():427<info> registering provider: shm (111.0)
libfabric:361875:core:core:ofi_register_provider():455<info> "shm" filtered by provider include/exclude list, skipping
libfabric:361875:core:core:ofi_register_provider():427<info> registering provider: sockets (111.0)
libfabric:361875:core:core:ofi_register_provider():455<info> "sockets" filtered by provider include/exclude list, skipping
libfabric:361875:core:core:ofi_register_provider():427<info> registering provider: ofi_hook_noop (111.0)
libfabric:361875:core:core:fi_getinfo_():1117<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:361875:core:core:fi_getinfo_():1117<info> Found provider with the highest priority mlx, must_use_util_prov = 0
[0] MPI startup(): libfabric provider: mlx
libfabric:361875:core:core:fi_fabric_():1406<info> Opened fabric: mlx
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 1, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 1024
<This is the most time-consuming part>
[3323] MPI startup(): selected platform: icx
[2667] MPI startup(): selected platform: icx
[3561] MPI startup(): selected platform: icx
[2743] MPI startup(): selected platform: icx

 

Please let me know if there are any suggestions.
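
One way to locate the slow step more precisely is to prefix every output line with the time at which it arrives, so that a long gap between two consecutive lines stands out. The following is a minimal sketch and not something taken from the Intel MPI documentation: `stamp_lines` is a hypothetical helper name, it assumes `date +%s` is supported, and because the launcher may buffer output, the timestamps are only approximate.

```shell
#!/bin/sh
# Prefix each line read from stdin with the seconds elapsed since the helper started.
# stamp_lines is a hypothetical helper name, not an Intel MPI tool.
stamp_lines() {
    stamp_start=$(date +%s)
    # The "|| [ -n ... ]" part keeps a final line that has no trailing newline.
    while IFS= read -r stamp_line || [ -n "$stamp_line" ]; do
        stamp_now=$(date +%s)
        printf '[+%4ds] %s\n' "$((stamp_now - stamp_start))" "$stamp_line"
    done
}

# Stand-in input for illustration. In a real run the launcher output would be
# piped in instead, e.g.:  mpiexec.hydra ... 2>&1 | stamp_lines
printf '%s\n' 'libfabric provider: mlx' 'selected platform: icx' | stamp_lines
```

With one-second resolution this is coarse, but it is enough to tell a gap of a minute apart from a gap of a second.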

ShivaniK_Intel
Moderator

Hi,


Thanks for reaching out to us.


We are working on it and will get back to you soon.


Thanks & Regards

Shivani



Heinrich_B_Intel
Employee

Hi Kihang,


I have sent a request to the Intel MPI architect asking for better startup parameters.


My first idea would be:

Do you have a file system that is faster than /opt/kma_local/mpi/2021.2.0/etc/ ?

It may help to read the tuning file from another file system. You can use the autotuner variables to point to a tuning file in a different location:


$ export I_MPI_TUNING_BIN=<tuning-results.dat>


see:

https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/tuning-environment-variables/autotuning.html
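
A minimal sketch of that idea: copy the tuning file to a node-local, memory-backed location and point `I_MPI_TUNING_BIN` at the copy. `stage_tuning_file` is a hypothetical helper name, `/dev/shm` is only an assumed example of a faster location, and the copy would have to exist on every node in the host list before launch.

```shell
#!/bin/sh
# Copy a tuning file to another directory and point I_MPI_TUNING_BIN at the copy.
# stage_tuning_file is a hypothetical helper; /dev/shm is an assumed default.
stage_tuning_file() {
    # $1 = tuning file to copy, $2 = destination directory
    src=$1
    dst_dir=${2:-/dev/shm}
    dst="$dst_dir/$(basename "$src")"
    cp -- "$src" "$dst" || return 1
    export I_MPI_TUNING_BIN="$dst"
}

# Path taken from the "Load tuning file" line in the startup log of the original post.
tuning=/opt/local/mpi/2021.2.0/etc/tuning_icx_shm-ofi_mlx.dat
if [ -r "$tuning" ]; then
    stage_tuning_file "$tuning" && echo "I_MPI_TUNING_BIN=$I_MPI_TUNING_BIN"
fi
```

If startup time does not change with the file staged this way, reading the tuning file is probably not the bottleneck.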


best regards,

Heinrich





Heinrich_B_Intel
Employee

Reply from our architect:

 

  1. I would recommend trying UD (you can simply remove UCX_TLS) as a way to improve startup time.
  2. Could you please ask them to clarify how they concluded that reading the tuning file is the slow step?
  3. Please ask them to try:

export I_MPI_STARTUP_MODE=pmi_shm_netmod
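
Applied to the launch script from the original post, suggestions 1 and 3 might look like the sketch below. It only changes the environment and leaves the `mpiexec.hydra` line as it was. Whether either setting helps on a given cluster is something to measure rather than assume.

```shell
# Settings kept from the original script.
export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=mlx

# Suggestion 1: remove the explicit UCX transport list
# (per the reply, a simple way to try UD).
unset UCX_TLS

# Suggestion 3: select the startup mode named in the reply.
export I_MPI_STARTUP_MODE=pmi_shm_netmod

# Then launch exactly as before, for example:
# { time mpiexec.hydra -genvall -f ./hostlist -n 33972 -ppn 76 IMB-MPI1 Bcast Allreduce -npmin 33972; } >> ${OUTFILE} 2>&1
```

Changing one setting at a time makes it easier to see which of the two is responsible for any improvement.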

 

 

best regards,

Heinrich


youn__kihang
Novice

 

Hi Heinrich,


The option "I_MPI_STARTUP_MODE=pmi_shm_netmod" you recommended works!

Could you explain what pmi_shm_netmod means? Or is there any documentation about it?

 

Thanks, Kihang
