Beginner

MPI with large message sizes (again)


Hi,

We have a program that needs to use MPI to send large messages - up to many GBs - of double-precision floating-point numbers. Our first problem was that the MPI standard requires message element counts to fit in a 32-bit int, which should still allow us to send ~16 GB of doubles in a single call (2^31 elements of 8 bytes each). To send more than that, we simply split the messages into chunks of up to 16 GB. This seemed to work in some cases and on some machines, but we then got wrong results on certain machines. After days/weeks of debugging we tracked it down to MPI_Allreduce silently truncating messages above ~6 GB.
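For reference, our chunking loop looks roughly like the sketch below. The per-chunk reduction is written here as a plain element-wise sum so the sketch compiles without an MPI installation; in the real program each chunk would instead go through MPI_Allreduce(MPI_IN_PLACE, data + offset, (int)n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD). The helper name and the parameterized chunk cap are illustrative, not our actual code:

```c
#include <limits.h>
#include <stddef.h>

/* Sketch of a chunked reduction: process `count` doubles in chunks whose
   element count always fits in an int, as the MPI_Allreduce signature
   requires.  The inner loop is a stand-in for the per-chunk MPI call. */
static void chunked_sum(double *data, const double *other, size_t count,
                        size_t max_chunk)
{
    for (size_t offset = 0; offset < count; offset += max_chunk) {
        size_t n = count - offset;
        if (n > max_chunk)
            n = max_chunk;                 /* last (partial) chunk */
        for (size_t i = 0; i < n; i++)     /* stand-in for MPI_Allreduce */
            data[offset + i] += other[offset + i];
    }
}
```

In the real code max_chunk would be at most INT_MAX elements (and, given the truncation we are seeing, probably much smaller).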

I now see that others have had similar problems a long time ago, e.g.:

https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/MPI-Allgatherv-with-large-message-sizes/m-p/...

https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/BCAST-error-for-message-size-greater-than-2-...

https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/PMPI-Bcast-Message-truncated/td-p/953850

I also see that Intel MPI supports 64-bit integer counts through the ILP64 interface (https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/com...), but only for Fortran compiled with the Intel Fortran compiler. Unfortunately our program is written in C++ and we compile with GCC on Unix and MSVC on Windows, so this isn't really any help.

I have attached a simple program that showcases the problem. I compile it with MPICH-3.2 wrappers around GCC 5.4.0 like so:

mpicc -g mpi-allreduce-size.c -o mpi-allreduce-size

Then I run it on a cluster of Huawei XH620 v3 nodes, each with two Xeon E5-2650 v4 CPUs, connected by an Omni-Path network. I run the program like so:

mpirun -bootstrap slurm -n 2 -ppn 1 ./mpi-allreduce-size 805306368

where the number 805306368 is the message size in doubles (805306368 × 8 bytes = exactly 6 GiB). The Intel MPI executable is "Intel(R) MPI Library for Linux* OS, Version 2018 Update 5 Build 20190404 (id: 18839)".

The program runs, but the data summed across all the processes is wrong. It is correct for a data size one element smaller, i.e. 805306367.

 

So my question is: 1) Is this a bug in Intel MPI? 2) Is it a hardware problem? Or 3) is it simply an undocumented limitation of Intel MPI?

In case of 3), what exactly is the largest message size? The other posts hinted at 2 GB, but we clearly have it working up to 6 GB. Is there a message size that is guaranteed to work on all platforms, on all versions of Intel MPI and other MPI implementations (like MPICH), and on all hardware? We have to hardcode a chunk size, so we have to know! Also, if this is the case, please inform your users about this restriction on your website and in your manual.
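In the meantime we are considering hardcoding a conservative cap and deriving the per-chunk element count from it. The 2 GB figure below is only an assumption taken from the older threads, not a documented guarantee:

```c
#include <stddef.h>

/* Hypothetical conservative cap of 2 GB per message, taken from the older
   forum threads; NOT a documented Intel MPI guarantee. */
#define MAX_MSG_BYTES ((size_t)2 * 1024 * 1024 * 1024)

/* Number of elements of a given size that fit under the cap. */
static size_t max_elems_per_chunk(size_t elem_size)
{
    return MAX_MSG_BYTES / elem_size;
}
```

For doubles (8 bytes) this gives 268435456 elements per chunk.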

 

Also, it's 2020! One would think that problems with large data sizes would have been solved long ago in HPC; those problems belong to the floppy-disk era.


Accepted Solutions
Moderator

Hi Filip,

 

Many factors contribute to these limits and we cannot give a standard answer or a formula for them, so we cannot state the largest and smallest message sizes that will work everywhere.

For every collective operation, Intel MPI has several algorithms to choose from, selected at run time based on parameters such as message size, hardware, and interconnect.

 

In your case there may be a bug: for sizes around 6 GB, MPI might be choosing an incompatible algorithm, which would cause the issue.

If you check this page https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top... , MPI_Reduce has a total of 11 algorithms to choose from.

You can choose the algorithm manually by setting I_MPI_ADJUST_ALLREDUCE=<algorithm number>.

 

For more information on how to use the I_MPI_ADJUST family, please refer to the link above.

 

I have written a sample script that runs the program once for every algorithm; please report back which ones fail. That will help us pin down the bug and let you continue with the 2018 version.

Please change the extension of test.txt to test.sh and add a shebang (#!/bin/bash) at the top before running. Also give it execute permission: chmod +x test.sh.

Hope this helps.

 

Thanks and regards

Prasanth

Moderator

Hi Filip,


We have tested your program with 10 GB of data and got a valid result (output posted below).

Could you please check and answer the following:


1) At what size do you start getting incorrect results?

2) What file system are you using?

3) Could you update to the latest version (2019u8) and check?


Also, post the debug info from running the program after setting

export I_MPI_DEBUG=10



u48346@s001-n052:~/mpi$ I_MPI_DEBUG=10 mpirun -n 4 -f hostfile ./allred 1305306368

[0] MPI startup(): Intel(R) MPI Library, Version 2021.1-beta08 Build 20200715 (id: b94b8b058)

[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.

[0] MPI startup(): library kind: release

[0] MPI startup(): libfabric version: 1.10.1-impi

[0] MPI startup(): libfabric provider: tcp;ofi_rxm

size=1305306368

size=1305306368

[0] MPI startup(): Rank  Pid   Node name Pin cpu

[0] MPI startup(): 0    1381   s001-n052 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}

[0] MPI startup(): 1    11101  s001-n009 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}

[0] MPI startup(): 2    9996   s001-n055 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}

[0] MPI startup(): 3    25756  s001-n013 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}

[0] MPI startup(): I_MPI_LIBRARY_KIND=release_mt

[0] MPI startup(): I_MPI_ROOT=/glob/development-tools/versions/oneapi/beta08/inteloneapi/mpi/2021.1-beta08

[0] MPI startup(): I_MPI_MPIRUN=mpirun

[0] MPI startup(): I_MPI_HYDRA_RMK=pbs

[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc

[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default

[0] MPI startup(): I_MPI_DEBUG=10

size=1305306368

size=1305306368

Communication succeeded!

Communication succeeded!

Communication succeeded!

Communication succeeded!


Regards

Prasanth


Beginner

Hi Prasanth,

Thanks for looking into this. I already replied to you once yesterday, but I just discovered that the reply does not show up; either I forgot to hit enter or the forum system randomly deletes messages.

Anyway, to answer your questions:

1) The incorrect results start at exactly 805306368 doubles. If I try to send one element less, i.e. 805306367, I get correct results.

2) The OS is CentOS Linux release 7.8.2003 with Linux kernel 3.10.0-1127.19.1.el7.x86_64.

3) I tried installing Intel MPI 2019 Update 5, and with that it worked! I tested with element counts up to the maximum integer, 2^31-1, and it also worked.

I have attached output with debug information for all the different runs in case you want to take a look.

We probably need a workaround, since not all our users will be able to upgrade to 2019u5. Do you have information about the exact message-size limitations of the different versions of Intel MPI? Do they depend only on the Intel MPI version, or is there a hardware or libfabric limitation as well (Ethernet, InfiniBand, Omni-Path)?

Thanks for your help!

Moderator

Hi Filip,


We haven't heard back from you. Is your problem resolved?

Please let us know if you have any issues providing the details we asked for.


Regards

Prasanth


Moderator

Hi Filip,


We are closing this thread assuming your issue is resolved.

Please raise a new thread for any further questions; any further interaction in this thread will be considered community-only.

Regards

Prasanth


Beginner

Hi Prasanth,

Sorry about the long delay. I have enough information to solve my issue.

Thanks for the detailed insight.

Filip

 

Beginner

I have tried to run your script (by the way, it had an error: I_MPI_ADJUST_REDUCE was set to "i" instead of "$i"; I fixed that) on the same cluster as before, with two processes on different nodes (so it cannot use shared-memory transfer), and it fails for all algorithms on Intel MPI 2018. I have attached the full output.

 

Moderator

Hi Filip,

We have observed that you are compiling with mpicc (GCC + MPICH) instead of mpiicc (ICC + Intel MPI), so could you please try compiling with mpiicc once?

 

We have tried in our environment with 2018u5 and did not get any error.

 

Also, for I_MPI_ADJUST_ALLREDUCE there are 25 algorithms available (numbered 0-24), not 12 as stated in the article. You can see them by running:

impi_info -v I_MPI_ADJUST_ALLREDUCE

So could you please use the script below and test the remaining algorithms too?

#!/bin/bash

# Run the test once for each allreduce algorithm (25 algorithms, numbered 0-24).
for i in {0..24}
do
        echo "Currently running $i"
        I_MPI_DEBUG=10 I_MPI_ADJUST_ALLREDUCE=$i mpirun -np 2 -f hostfile ./a.out 905306368
        if [ $? -ne 0 ]
        then
                echo "Failed to run algorithm $i"
        fi
done

Thanks and Regards

Prasanth

Moderator

Hi Filip,


We haven't heard back from you.

Have you tried all the algorithms as we asked?

Let us know if you are still getting errors after compiling with mpiicc.


Regards

Prasanth


Moderator

Hi Filip,


We are closing this thread as we haven't received any response, and we assume your issue has been resolved.

Please raise a new thread for any further questions; any further interaction in this thread will be considered community-only.

Regards

Prasanth

