Dear all,
I am currently looking into the problem of memory consumption for all-to-all based MPI software.
As far as I understand, for releases before Intel MPI 2017 we could use the DAPL UD mode through the variables I_MPI_DAPL_UD and I_MPI_DAPL_UD_PROVIDER.
Since DAPL support has been removed in the 2019 version, what should we use for an InfiniBand interconnect?
As an additional question, which variables reproduce the same behavior for an Omni-Path interconnect?
Thank you for your help.
Best,
Thomas
- Tags:
- Cluster Computing
- General Support
- Intel® Cluster Ready
- Message Passing Interface (MPI)
- Parallel Computing
Hi Thomas,
Thanks for reaching out to us.
We are working on this issue and will get back to you.
Prasanth
Hi Thomas,
As you are aware, the DAPL, TMI, and OFA fabrics are deprecated starting with Intel MPI 2019.
You can use the ofi fabric (OpenFabrics Interfaces* (OFI)-capable network fabrics). It uses a library called libfabric, which provides a fixed application-facing API while talking to one of several "OFI providers" that communicate with the interconnect hardware.
To select the particular fabric to be used:
Syntax:
I_MPI_FABRICS=<ofi | shm:ofi | shm>
Intel® MPI Library supports psm2, sockets, verbs, and RxM OFI* providers. Each OFI provider is built as a separate dynamic library to ensure that a single libfabric* library can be run on top of different network adapters.
To define the name of the OFI provider to load:
Syntax:
I_MPI_OFI_PROVIDER=<name>
For an InfiniBand interconnect use the verbs provider, and for an Omni-Path interconnect use the psm2 provider.
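For example, a minimal sketch of the environment settings for each interconnect (assuming the shm:ofi fabric; adjust to your cluster) would be:
# InfiniBand: select the verbs OFI provider
export I_MPI_FABRICS=shm:ofi
export I_MPI_OFI_PROVIDER=verbs
# Omni-Path: select the psm2 OFI provider
export I_MPI_FABRICS=shm:ofi
export I_MPI_OFI_PROVIDER=psm2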
For more information, please refer to the following links:
https://software.intel.com/en-us/mpi-developer-reference-windows-communication-fabrics-control
https://software.intel.com/en-us/mpi-developer-guide-linux-ofi-providers-support
Hope my answer solves your query.
Hey Dwadasi,
Thanks for your feedback.
Is there a way to reduce memory consumption with OFI?
Similar to what was possible with DAPL by switching to UD behavior?
Thanks a lot,
Thomas
Hi Thomas,
Could you please advise which Intel MPI release you are running?
Is there any chance that you could try Intel MPI 2019 update 6, which has improved support for Mellanox InfiniBand? If you can, please try the following settings.
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=mlx
export UCX_TLS=ud,sm,self
Could you please also describe the memory-consumption problem in more detail, for example the symptoms and impact? What is your system configuration?
Thanks,
Zhiqi
Hi Zhiqi,
Thank you for your detailed answer. I haven't tried the 2019.6 version yet (I couldn't find it actually).
Could you also give me the same kind of commands for Omni-Path? (I am running on both InfiniBand and Omni-Path.)
In particular, what is the equivalent of UCX_TLS? How can I request a less memory-consuming transport protocol?
For the systems I am running on, I tried both Juwels (https://www.top500.org/system/179424) and MareNostrum (https://www.bsc.es/discover-bsc/the-centre/marenostrum).
On Juwels, using ParaStation MPI, we could reach 73k processes, but it failed with Intel MPI 2019.4 (at around 9k, if I remember correctly).
We have the same kind of issue on MareNostrum; we are not able to go higher than 9k.
Despite the use of sub-communicators etc, the error is always the same on both clusters: the job freezes in an all-to-all MPI communication, without any additional messages.
Thank you for your help.
Best,
Thomas
Hi Thomas,
So the issue is that you can't get more than 9K ranks with Intel MPI.
Could you please try to use -genv I_MPI_DEBUG=5? It would print more debug info.
https://software.intel.com/en-us/mpi-developer-guide-windows-displaying-mpi-debug-information
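For example (a sketch; the rank counts and application binary are placeholders for your actual job):
mpirun -genv I_MPI_DEBUG=5 -n <nranks> -ppn <ppn> ./your_app   # startup output shows the libfabric version and selected provider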
Thanks,
Zhiqi
Hi Zhiqi,
Sure, I will give it a try on MareNostrum (Omni-Path).
Beforehand, could you give me the configuration equivalent to UD for Omni-Path, so I can use it together with the debug mode?
Thanks,
Thomas
Hello Zhiqi,
Should I assume that OFI can now use UCX as a transport?
What is the recommended provider for Mellanox EDR/HDR? Any suggestions on MOFED versions?
Intel MPI 2019 update 6 is not out yet. Any idea as to the release date?
thanks
--Michael
PS: Did you use to work for the Lustre team? :)
Hi Thomas,
There is no "UD" mode on Omni-Path. It would be best to use "FI_PROVIDER=psm2" when you run Intel MPI on an Omni-Path cluster; PSM2 has the lowest memory footprint.
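A minimal sketch of those settings for an Omni-Path run (assuming the shm:ofi fabric) would be:
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=psm2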
I discussed this with the Omni-Path engineering team. They recommended that you open a fabric support case by emailing fabricsupport@intel.com, so we can make sure that the Omni-Path configuration is optimal.
Best Regards,
Zhiqi
Hi Zhiqi,
Thanks a lot for your answers and explanation.
I will give it a try.
Best,
Thomas
Hi Michael,
When running Intel MPI with Mellanox EDR/HDR, please use Intel MPI 2019 update 5 or later.
Please use "export FI_PROVIDER=mlx".
OFI/mlx requires UCX 1.5+.
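For example (a sketch; it assumes the UCX runtime and its ucx_info tool are installed on the nodes):
ucx_info -v                      # check that the installed UCX is 1.5 or newer
export I_MPI_FABRICS=shm:ofi     # OFI fabric
export FI_PROVIDER=mlx           # mlx provider on top of UCX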
I just realized that the Intel MPI 2019 release notes https://software.intel.com/en-us/articles/intel-mpi-library-release-notes have not been updated to show the 2019 update 6 version. I have reported this issue to the release-notes owner.
In the meantime, you can download Intel MPI alone from https://software.intel.com/en-us/mpi-library/choose-download, where you can find Intel MPI 2019 update 6. I have validated it myself.
Yes, I was part of the Lustre team. :)
Best Regards,
Zhiqi
Hi Michael,
Parallel Studio XE 2020 was released today: https://software.intel.com/en-us/parallel-studio-xe
It includes Intel MPI 2019 update 6: https://software.intel.com/sites/default/files/managed/b9/6e/IPSXE_2020_Release_Notes_EN.pdf (page 4).
Best Regards,
Zhiqi
Zhiqi T. (Intel) wrote: Hi Michael,
When running Intel MPI with Mellanox EDR/HDR, please use Intel MPI 2019 update 5 or later.
Please use "export FI_PROVIDER=mlx".
OFI/mlx requires UCX 1.5+.
I just realized that the Intel MPI 2019 release notes https://software.intel.com/en-us/articles/intel-mpi-library-release-notes have not been updated to show the 2019 update 6 version. I have reported this issue to the release-notes owner.
In the meantime, you can download Intel MPI alone from https://software.intel.com/en-us/mpi-library/choose-download, where you can find Intel MPI 2019 update 6. I have validated it myself.
Yes, I was part of the Lustre team. :)
Best Regards,
Zhiqi
Hi Zhiqi,
So Intel MPI may now leverage UCX as one of the providers for the OFI framework? This is great. UCX is a quite capable transport and is optimized for the Mellanox hardware stack. Can we also use HCOLL from Mellanox's HPC-X? I understand that Intel MPI has its own optimized collectives. HCOLL and UCX can leverage all the hardware accelerators built into the Mellanox hardware.
So I assume we need to install the UCX 1.5+ runtime libraries somewhere and point OFI to them? Looking at the MPI docs, I am not able to find complete instructions on how to integrate Intel MPI 2019/2020 to use Mellanox gear efficiently.
We cannot launch Intel MPI 2019 update 5 on hosts with the mlx FI_PROVIDER on a Mellanox network. The verbs provider runs, but it is quite inefficient:
$ FI_PROVIDER=verbs I_MPI_DEBUG=1000 $(which mpiexec.hydra) -hosts sntc0008,sntc0009 -np 2 -ppn 1 $I_MPI_ROOT/intel64/bin/IMB-MPI1
[0] MPI startup(): libfabric version: 1.7.2a-impi
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 1, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrname_len: 16, addrname_firstlen: 16
[1] MPI startup(): selected platform: hsw
[0] MPI startup(): selected platform: hsw
[0] MPI startup(): Load tuning file: /vend/intel/parallel_studio_xe_2019_update5/compilers_and_libraries_2019.5.281/linux/mpi/intel64/etc/tuning_skx_shm-ofi.dat
[0] MPI startup(): Rank    Pid    Node name    Pin cpu
[0] MPI startup(): 0       4245   sntc0008     0
[0] MPI startup(): 1       744    sntc0009     0
[0] MPI startup(): I_MPI_ROOT=/vend/intel/parallel_studio_xe_2019_update5/compilers_and_libraries_2019.5.281/linux/mpi
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_PIN_PROCESSOR_LIST=allcores:map=scatter
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_FABRICS=shm:ofi
[0] MPI startup(): I_MPI_DEBUG=1000
#------------------------------------------------------------
# Intel(R) MPI Benchmarks 2019 Update 4, MPI-1 part
#------------------------------------------------------------
# Date : Wed Dec 18 18:05:03 2019
...
# Barrier
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         2.70         0.00
            1         1000         2.67         0.38
            2         1000         2.41         0.83
            4         1000         2.25         1.78
            8         1000         2.24         3.57
           16         1000         2.58         6.20
           32         1000         2.37        13.50
           64         1000         2.33        27.52
          128         1000         2.58        49.53
          256         1000         3.05        83.81
          512         1000         4.19       122.33
         1024         1000         9.09       112.66
         2048         1000        14.20       144.22
         4096         1000        29.37       139.44
         8192         1000        48.97       167.28
        16384         1000        93.37       175.48
        32768         1000       142.47       230.01
        65536          640       264.65       247.63
       131072          320       483.01       271.37
       262144          160      2970.47        88.25
       524288           80      3433.73       152.69
      1048576           40      6836.81       153.37
      2097152           20      9197.29       228.02
      4194304           10     10543.25       397.82
#---------------------------------------------------

$ FI_PROVIDER=mlx I_MPI_DEBUG=1000 $(which mpiexec.hydra) -hosts sntc0008,sntc0009 -np 2 -ppn 1 $I_MPI_ROOT/intel64/bin/IMB-MPI1
[0] MPI startup(): libfabric version: 1.7.2a-impi
Abort(1091471) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(703).......:
MPID_Init(923)..............:
MPIDI_OFI_mpi_init_hook(846): OFI addrinfo() failed (ofi_init.c:846:MPIDI_OFI_mpi_init_hook:No data available)
Thanks!
Michael
PS: "Long time no see"
Hi Michael,
The very first recommendation is to switch to IMPI 2019 U6 if possible.
We have HCOLL support at the Intel MPI level starting with IMPI 2019 U5 (available via I_MPI_COLL_EXTERNAL=1).
The following algorithms will be redirected to HCOLL:
I_MPI_ADJUST_ALLREDUCE=24, I_MPI_ADJUST_BARRIER=11, I_MPI_ADJUST_BCAST=16, I_MPI_ADJUST_REDUCE=13, I_MPI_ADJUST_ALLGATHER=6, I_MPI_ADJUST_ALLTOALL=5, I_MPI_ADJUST_ALLTOALLV=5
The minimal requirement for the OFI/mlx provider is UCX 1.4+ (starting with IMPI 2019 U6).
Yes, you have to have the UCX runtime available on the nodes in order to use FI_PROVIDER=mlx.
There are no additional requirements/knobs for EDR/HDR.
If you have FDR (and not Connect-IB) on the nodes, you may need to set UCX_TLS=ud,sm,self for large-scale runs; for small-scale runs you can experiment with UCX_TLS=rc,sm,self.
We are working on a way to make it smoother.
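Putting those pieces together, a hedged sketch for a large-scale run on FDR hardware (the rank counts and application name are placeholders) might be:
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=mlx
export I_MPI_COLL_EXTERNAL=1        # hand the supported collectives off to HCOLL
export I_MPI_ADJUST_ALLTOALL=5      # HCOLL-backed all-to-all algorithm from the list above
export UCX_TLS=ud,sm,self           # UD transport, recommended for large-scale runs on FDR
mpirun -n <nranks> -ppn <ppn> ./your_app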
BR,
Dmitry
Dmitry,
Thanks for the details on the interaction between IMPI 2019 U6 and HCOLL and UCX. It's a nice ability to choose the HCOLL or the Intel collective implementation just by specifying the "algorithm" in I_MPI_ADJUST_XXX.
I understand that UCX/HCOLL are not Intel software stacks, but there is a large base of HPC users that use them with Mellanox hardware. They are already optimized and can use the low-level hardware accelerators on Mellanox hardware. Leveraging them via Intel MPI is quite beneficial for all, as we won't have to resort to different MPI stacks.
Thanks!
Michael
