Intel® MPI Library

Memory footprint of MPI_Alltoallv

Jaapw
Beginner

Hello

I performed simple tests on our cluster, which has 28 CPU cores on each node. I used MPI_Alltoallv to send and receive "datasize" 32-bit integers, so sendcounts == datasize for all processes. The figures below plot the maximum memory load on a node (measured with VTune, APS, or via /proc/self/status). The memory footprint is huge and limits the number of nodes that can be used. Setting I_MPI_ADJUST_ALLTOALLV=2 reduces the memory usage significantly, but only for small data sizes (second figure below; below approximately datasize == 300 the memory stays limited). How can I reduce the memory footprint?
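
For reference, a minimal sketch of such a test (not the actual source; the buffer setup and the VmRSS readout from /proc/self/status are illustrative assumptions) could look like this:

/* alltoallv_mem.c - sketch of an MPI_Alltoallv memory test (illustrative only).
 * Every rank exchanges "datasize" 32-bit integers with every other rank,
 * then the maximum resident set size per rank is reported.
 * Build: mpiicc (or mpicc) alltoallv_mem.c -o alltoallv_mem
 * Run:   mpirun -n <ranks> ./alltoallv_mem <datasize>
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long vmrss_kb(void)
{
    /* Read the resident set size of this process from /proc/self/status. */
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    if (!f) return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "VmRSS: %ld", &kb) == 1) break;
    fclose(f);
    return kb;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    int datasize = (argc > 1) ? atoi(argv[1]) : 1000;  /* elements per peer */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int *sendcounts = malloc(nprocs * sizeof(int));
    int *recvcounts = malloc(nprocs * sizeof(int));
    int *sdispls    = malloc(nprocs * sizeof(int));
    int *rdispls    = malloc(nprocs * sizeof(int));
    for (int i = 0; i < nprocs; ++i) {
        sendcounts[i] = recvcounts[i] = datasize;      /* equal counts for all peers */
        sdispls[i]    = rdispls[i]    = i * datasize;
    }

    size_t bytes = (size_t)nprocs * datasize * sizeof(int);
    int *sendbuf = malloc(bytes);
    int *recvbuf = malloc(bytes);
    memset(sendbuf, 0, bytes);

    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    long kb = vmrss_kb(), maxkb = 0;
    MPI_Reduce(&kb, &maxkb, 1, MPI_LONG, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("datasize=%d  max VmRSS per rank = %ld kB\n", datasize, maxkb);

    free(sendbuf); free(recvbuf);
    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    MPI_Finalize();
    return 0;
}

Running this for increasing datasize, with and without I_MPI_ADJUST_ALLTOALLV=2, should give the kind of comparison shown in the figures.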

I use oneapi/mpi/2021.4.0. 

best regards

Jaap

[Figures: maximum memory load per node vs. datasize, with the default settings and with I_MPI_ADJUST_ALLTOALLV=2]

 

 

TobiasK
Moderator

@Jaapw 
We need more information to look into this. Could you please provide the OS, CPU, network, workload manager, and the source code of your example?
Also, please check with the latest 2021.11 release.

Best
Tobias

Jaapw
Beginner

Dear Tobias

 

We are using:

 

  • Rocky Linux, version 8.8
  • 2 Intel E5-2660 v4 (2.00 GHz) 14-core CPUs (28 cores per node)
  • FDR InfiniBand HBA
  • slurm/23.02.4

 

I have not been able to test 2021.11 yet, but I found that the problem is introduced by setting FI_PROVIDER=verbs. The Slurm output file contains the results with sendcounts == 10, 100, or 1000, run with and without FI_PROVIDER=verbs.

best regards

 

Jaap

 

 

TobiasK
Moderator

We strongly recommend using either the MLX provider or the PSM3 provider; the latter is our new and preferred provider for supporting NVIDIA Mellanox hardware. Why do you use the verbs provider?

Please refer to the InfiniBand-related information here:

https://www.intel.com/content/www/us/en/developer/articles/technical/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html

 

Best
Tobias

Jaapw
Beginner

Dear Tobias,

 

Thanks for the advice. We had problems with some tests at high core counts (between 1000 and 2000). I will run the tests again using psm3.

 

best regards

 

Jaap

Jaapw
Beginner

The problems were on another cluster with:

  • CentOS Linux, version 8
  • 2 Intel(R) Xeon(R) Gold 6126 (2.60 GHz) 12-core CPUs
  • InfiniBand HBA
  • slurm/19.05.8

 

lspci | grep Mell gives:

d8:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

 

On the cluster used above (the Rocky Linux cluster), lspci gives:

82:00.0 InfiniBand: Mellanox Technologies MT25408A0-FCC-QI ConnectX, Dual Port 40Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In... (rev b0)

Jaapw
Beginner

Dear Tobias,

Using psm3 solves all the memory problems, which is nice. Unfortunately, the performance is not very reliable. The figure on the left shows the scaling result (a performance test with our code, solving the flow around a propeller) for a 21-million-cell case ("Current" line) using verbs. The figure on the right shows the result using psm3. I repeated the calculation on 1440 cores using both verbs and psm3, and the results are consistent/repeatable, i.e. they show the same scaling. Any suggestions as to what the reason could be?

best regards

Jaap

[Figures: strong scaling of the 21-million-cell propeller case, left with verbs, right with psm3]
TobiasK
Moderator

@Jaapw


Sorry, but I cannot assist you here with this kind of in-depth application support/debugging.


Best

Tobias


Jaapw
Beginner

Dear Tobias,

 

I understand that. I have another question, related to using

 

export FI_PROVIDER=psm3

 

as an environment variable.

 

If I use "mpirun -n 1 ./executable" it works fine, but if I just run "./executable" I get the message "[0] MPI startup(): FI_PSM3_UUID was not generated, please set it to avoid possible resources ownership conflicts between MPI processes".

How can I avoid this?

 

best regards

 

Jaap

TobiasK
Moderator

If you run a single task, you can just set I_MPI_FABRICS=shm so that OFI is not initialized at all.
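
If the same binary also needs to run standalone, one possible pattern (a sketch only; it assumes that Intel MPI's launcher exports PMI_SIZE and that I_MPI_FABRICS set via setenv() before MPI_Init is honored) is to select the shared-memory fabric when no launcher is detected:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Heuristic (assumption): when started through mpirun/Hydra the PMI_SIZE
     * environment variable is usually present; for a plain "./executable" it
     * is not. In that case restrict Intel MPI to shared memory so that OFI
     * (and hence FI_PSM3_UUID) is not involved. */
    if (getenv("PMI_SIZE") == NULL)
        setenv("I_MPI_FABRICS", "shm", 0);  /* do not overwrite if already set */

    MPI_Init(&argc, &argv);
    /* ... application ... */
    MPI_Finalize();
    return 0;
}

Simply exporting I_MPI_FABRICS=shm in the shell before a standalone run has the same effect.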



Jaapw
Beginner

Thanks.

 

I did a series of tests with grids from 10 million to 350 million cells on 240 to 1920 cores, and everything works well. I am quite happy about that. However, at the end of the tests I sometimes get an error:

 

Fatal Send CQ Async Event on mlx5_0 port 1: CQ error

 

Can this be related to selecting psm3?

 

best regards

 

Jaap

TobiasK
Moderator

@Jaapw


Can you please be more explicit about when it appears and how often? "Sometimes and at the end" usually needs in-depth debugging of the application. Do you also see that behavior with the IMB-MPI1 benchmarks?


Does it work with the MLX provider?


If it really happens at or after MPI_Finalize, please make sure that you meet all the conditions required to call MPI_Finalize correctly.
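
For example, every request returned by a nonblocking call should be completed or explicitly freed before MPI_Finalize; a minimal illustrative pattern (not taken from the application) is:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, msg = 42, recv = 0;
    MPI_Request req = MPI_REQUEST_NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0)
            MPI_Isend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        else if (rank == 1)
            MPI_Irecv(&recv, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

        /* Completing (or at least freeing, via MPI_Request_free) every request
         * before MPI_Finalize is one of those conditions; pending requests are
         * what -check-mpi flags as LOCAL:REQUEST:NOT_FREED. */
        if (req != MPI_REQUEST_NULL)
            MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}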


You can also run with "-check-mpi" and see if there are any correctness issues in your program.


Jaapw
Beginner

Running a 173-million-cell case on 480 cores crashed once and ran well once. It is too slow and expensive to run this test multiple times. I have not tried mlx yet. The crashing case gives:

 

node001:rank0: Fatal Send CQ Async Event on mlx5_0 port 1: CQ error

Loguru caught a signal: SIGABRT
Stack trace:
5 0x15554496cdc3 clone + 67
4 0x155547d8217a /lib64/libpthread.so.0(+0x817a) [0x155547d8217a]
3 0x15544c2aebbc /cm/shared/apps/intel/oneapi/mpi/2021.4.0/libfabric/lib/prov/libpsm3-fi.so(+0xd2bbc) [0x15544c2aebbc]
2 0x15544c275076 /cm/shared/apps/intel/oneapi/mpi/2021.4.0/libfabric/lib/prov/libpsm3-fi.so(+0x99076) [0x15544c275076]
1 0x155544891db5 abort + 295
0 0x1555448a737f gsignal + 271
(2624.425s) [ 4659F700] :0 FATL| Signal: SIGABRT
srun: error: node007: task 3: Broken pipe

 

I did run a smaller case and, as usual, check-mpi does not report any errors. If I run on more than 100 cores it still runs well, but I get a warning coming from an external library that we use:

[1] WARNING: LOCAL:REQUEST:NOT_FREED: warning
[1] WARNING: The current number of requests in this process is 100.
[1] WARNING: This matches the CHECK-MAX-REQUESTS threshold
[1] WARNING: and may indicate that the application is not freeing or
[1] WARNING: completing all requests that it creates.
[1] WARNING: 1. 1 time:
[1] WARNING: MPI_Isend(*buf=0x35f82c0, count=4812, datatype=MPI_INT, dest=0, tag=1, comm=0xffffffff84000006 DUP COMM_WORLD [0:479], *request=0x309ce90)

and finally:

[0] INFO: LOCAL:REQUEST:NOT_FREED: found 1 time (0 errors + 1 warning), 0 reports were suppressed
[0] INFO: Found 1 problem (0 errors + 1 warning), 0 reports were suppressed.

Is this important, or is it just caused by the CHECK-MAX-REQUESTS threshold? (And how can I specify that setting?)

 

The IMB-MPI1 benchmarks (2021.7) run well. On 24 cores they take 129 s (psm3), 138 s (mlx), 137 s (verbs); on 48 cores 390 s (psm3), 349 s (mlx), 361 s (verbs); and on 96 cores 944 s (psm3), 807 s (mlx), 755 s (verbs).

 

best regards

 

Jaap

TobiasK
Moderator

@Jaapw


Whether this is expected or not is something you have to clarify with the vendor of the external library; I would be suspicious, at least.

