Intel® MPI Library

Memory footprint of MPI_Alltoallv

Jaapw
Beginner

Hello

I performed simple tests on our cluster, which has 28 CPU cores on each node. I used MPI_Alltoallv to send and receive "datasize" 32-bit integers, so sendcounts == datasize for all processes. The figures below plot the maximum memory load on a node (measured with VTune, APS, or via /proc/self/status). The memory footprint is huge and limits the number of nodes that can be used. Setting I_MPI_ADJUST_ALLTOALLV=2 reduces the memory usage significantly, but only for small data sizes (second figure below; below approximately datasize == 300 the memory stays limited). How can I reduce the memory footprint?
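
For reference, a minimal sketch of such a test (not the actual source; the buffer setup and the VmRSS readout from /proc/self/status are illustrative assumptions) could look like this:

/* alltoallv_mem.c - sketch of an MPI_Alltoallv memory test (illustrative only).
 * Every rank exchanges "datasize" 32-bit integers with every other rank,
 * then the maximum resident set size per rank is reported.
 * Build: mpiicc (or mpicc) alltoallv_mem.c -o alltoallv_mem
 * Run:   mpirun -n <ranks> ./alltoallv_mem <datasize>
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long vmrss_kb(void)
{
    /* Read the resident set size of this process from /proc/self/status. */
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    if (!f) return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "VmRSS: %ld", &kb) == 1) break;
    fclose(f);
    return kb;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    int datasize = (argc > 1) ? atoi(argv[1]) : 1000;  /* elements per peer */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int *sendcounts = malloc(nprocs * sizeof(int));
    int *recvcounts = malloc(nprocs * sizeof(int));
    int *sdispls    = malloc(nprocs * sizeof(int));
    int *rdispls    = malloc(nprocs * sizeof(int));
    for (int i = 0; i < nprocs; ++i) {
        sendcounts[i] = recvcounts[i] = datasize;      /* equal counts for all peers */
        sdispls[i]    = rdispls[i]    = i * datasize;
    }

    size_t bytes = (size_t)nprocs * datasize * sizeof(int);
    int *sendbuf = malloc(bytes);
    int *recvbuf = malloc(bytes);
    memset(sendbuf, 0, bytes);

    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    long kb = vmrss_kb(), maxkb = 0;
    MPI_Reduce(&kb, &maxkb, 1, MPI_LONG, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("datasize=%d  max VmRSS per rank = %ld kB\n", datasize, maxkb);

    free(sendbuf); free(recvbuf);
    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    MPI_Finalize();
    return 0;
}

Running this for increasing datasize, with and without I_MPI_ADJUST_ALLTOALLV=2, should give the kind of comparison shown in the figures.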

I use oneapi/mpi/2021.4.0. 

best regards

Jaap

[Figures: maximum memory load per node vs. datasize, with the default settings and with I_MPI_ADJUST_ALLTOALLV=2]

 

 

TobiasK
Moderator

@Jaapw 
We need more information to look into this. Could you please provide the OS, CPU, network, workload manager, and the source code of your example?
Also, please check with the latest 2021.11 release.

Best
Tobias

Jaapw
Beginner

Dear Tobias

 

We are using:

 

  • Rocky Linux, version 8.8
  • 2 Intel E5-2660 v4 (2.00 GHz) 14-core CPUs (28 cores per node)
  • FDR InfiniBand HBA
  • slurm/23.02.4

 

I have not been able to test 2021.11 yet, but I found that the problem is introduced by setting FI_PROVIDER=verbs. The Slurm output file contains the results with sendcounts == 10, 100, or 1000, run with and without FI_PROVIDER=verbs.

best regards

 

Jaap

 

 

TobiasK
Moderator

We strongly recommend using either the MLX provider or the PSM3 provider; the latter is our new and preferred provider for supporting NVIDIA Mellanox hardware. Why do you use the verbs provider?

Please refer to the InfiniBand-related information here:

https://www.intel.com/content/www/us/en/developer/articles/technical/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html

 

Best
Tobias

Jaapw
Beginner

Dear Tobias,

 

Thanks for the advice. We had problems with some tests at high core counts (between 1000 and 2000). I will run the tests again using psm3.

 

best regards

 

Jaap

Jaapw
Beginner

The problems were on another cluster with:

  • CentOS Linux, version 8
  • 2 Intel(R) Xeon(R) Gold 6126 (2.60 GHz) 12-core CPUs
  • InfiniBand HBA
  • slurm/19.05.8

 

lspci | grep Mell gives:

d8:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

 

On the cluster used above (the Rocky Linux cluster), lspci gives:

82:00.0 InfiniBand: Mellanox Technologies MT25408A0-FCC-QI ConnectX, Dual Port 40Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In... (rev b0)

Jaapw
Beginner

Dear Tobias,

Using psm3 solves all the memory problems, which is nice. Unfortunately, the performance is not very reliable. The figure on the left shows the scaling result (a performance test with our code, solving the flow around a propeller) for a 21-million-cell case ("Current" line) using verbs. The figure on the right shows the result using psm3. I repeated the calculation on 1440 cores using both verbs and psm3, and the results are consistent/repeatable, i.e. they show the same scaling. Any suggestions as to what the reason could be?

best regards

Jaap

[Figures: strong scaling of the 21-million-cell propeller case, left with verbs, right with psm3]
TobiasK
Moderator

@Jaapw


Sorry, but I cannot assist you here with this kind of in-depth application support/debugging.


Best

Tobias


Jaapw
Beginner

Dear Tobias,

 

I understand that. I have another question, related to using

 

export FI_PROVIDER=psm3

 

as an environment variable.

 

If I use "mpirun -n 1 ./executable" it works fine, but if I just run "./executable" I get the message "[0] MPI startup(): FI_PSM3_UUID was not generated, please set it to avoid possible resources ownership conflicts between MPI processes".

How can I avoid this?

 

best regards

 

Jaap

TobiasK
Moderator

If you run a single task, you can just set I_MPI_FABRICS=shm so that OFI is not initialized at all.
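
If the same binary also needs to run standalone, one possible pattern (a sketch only; it assumes that Intel MPI's launcher exports PMI_SIZE and that I_MPI_FABRICS set via setenv() before MPI_Init is honored) is to select the shared-memory fabric when no launcher is detected:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Heuristic (assumption): when started through mpirun/Hydra the PMI_SIZE
     * environment variable is usually present; for a plain "./executable" it
     * is not. In that case restrict Intel MPI to shared memory so that OFI
     * (and hence FI_PSM3_UUID) is not involved. */
    if (getenv("PMI_SIZE") == NULL)
        setenv("I_MPI_FABRICS", "shm", 0);  /* do not overwrite if already set */

    MPI_Init(&argc, &argv);
    /* ... application ... */
    MPI_Finalize();
    return 0;
}

Simply exporting I_MPI_FABRICS=shm in the shell before a standalone run has the same effect.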



Jaapw
Beginner

Thanks.

 

I did a series of tests with grids from 10 million to 350 million cells on 240 to 1920 cores, and everything works well. I am quite happy about that. However, at the end of the tests I sometimes get an error:

 

Fatal Send CQ Async Event on mlx5_0 port 1: CQ error

 

Can this be related to selecting psm3?

 

best regards

 

Jaap

TobiasK
Moderator

@Jaapw


Can you please be more explicit about when it appears and how often? "Sometimes and at the end" usually needs in-depth debugging of the application. Do you also see that behavior with the IMB-MPI1 benchmarks?


Does it work with the MLX provider?


If it really happens at or after MPI_Finalize, please make sure that you meet all the conditions required to call MPI_Finalize correctly.
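
For example, every request returned by a nonblocking call should be completed or explicitly freed before MPI_Finalize; a minimal illustrative pattern (not taken from the application) is:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, msg = 42, recv = 0;
    MPI_Request req = MPI_REQUEST_NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0)
            MPI_Isend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        else if (rank == 1)
            MPI_Irecv(&recv, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

        /* Completing (or at least freeing, via MPI_Request_free) every request
         * before MPI_Finalize is one of those conditions; pending requests are
         * what -check-mpi flags as LOCAL:REQUEST:NOT_FREED. */
        if (req != MPI_REQUEST_NULL)
            MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}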


You can also run with "-check-mpi" and see if there are any correctness issues in your program.


Jaapw
Beginner

Running a 173-million-cell case on 480 cores crashed once and ran well once. It is too slow and expensive to run this test multiple times. I have not tried mlx yet. The crashing case gives:

 

node001:rank0: Fatal Send CQ Async Event on mlx5_0 port 1: CQ error

Loguru caught a signal: SIGABRT
Stack trace:
5 0x15554496cdc3 clone + 67
4 0x155547d8217a /lib64/libpthread.so.0(+0x817a) [0x155547d8217a]
3 0x15544c2aebbc /cm/shared/apps/intel/oneapi/mpi/2021.4.0/libfabric/lib/prov/libpsm3-fi.so(+0xd2bbc) [0x15544c2aebbc]
2 0x15544c275076 /cm/shared/apps/intel/oneapi/mpi/2021.4.0/libfabric/lib/prov/libpsm3-fi.so(+0x99076) [0x15544c275076]
1 0x155544891db5 abort + 295
0 0x1555448a737f gsignal + 271
(2624.425s) [ 4659F700] :0 FATL| Signal: SIGABRT
srun: error: node007: task 3: Broken pipe

 

I did run a smaller case and, as usual, check-mpi does not report any errors. If I run on more than 100 cores it still runs well, but I get a warning coming from an external library that we use:

[1] WARNING: LOCAL:REQUEST:NOT_FREED: warning
[1] WARNING: The current number of requests in this process is 100.
[1] WARNING: This matches the CHECK-MAX-REQUESTS threshold
[1] WARNING: and may indicate that the application is not freeing or
[1] WARNING: completing all requests that it creates.
[1] WARNING: 1. 1 time:
[1] WARNING: MPI_Isend(*buf=0x35f82c0, count=4812, datatype=MPI_INT, dest=0, tag=1, comm=0xffffffff84000006 DUP COMM_WORLD [0:479], *request=0x309ce90)

and finally:

[0] INFO: LOCAL:REQUEST:NOT_FREED: found 1 time (0 errors + 1 warning), 0 reports were suppressed
[0] INFO: Found 1 problem (0 errors + 1 warning), 0 reports were suppressed.

Is this important, or is it just caused by the CHECK-MAX-REQUESTS threshold? (And how can I specify that setting?)

 

The IMB-MPI1 benchmarks (2021.7) run well. On 24 cores they take 129 s (psm3), 138 s (mlx), 137 s (verbs); on 48 cores 390 s (psm3), 349 s (mlx), 361 s (verbs); and on 96 cores 944 s (psm3), 807 s (mlx), 755 s (verbs).

 

best regards

 

Jaap

TobiasK
Moderator

@Jaapw


Whether this is expected or not is something you have to clarify with the vendor of the external library; I would be suspicious, at least.

