Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI Fatal error in PMPI_Allgatherv: MPIR_Localcopy, 4028 bytes received but buffer size is 4008

Xiaoqiang
Beginner
1,304 Views

OpenFOAM is the Open Source CFD Toolbox.

I can successfully run the OpenFOAM test case with Intel Parallel Studio 2018u3.

However, a fatal error occurs when I use OneAPI HPCKit_p_2021.1.0.2684.

Why does OneAPI report this error? How can I fix it?

The fatal error is:

 

Abort(740365582) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Allgatherv: Message truncated, error stack:
PMPI_Allgatherv(437)..........................: MPI_Allgatherv(sbuf=0x21a8280, scount=1007, MPI_INT, rbuf=0x21c0820, rcounts=0x265dc00, displs=0x265dc10, datatype=MPI_INT, comm=comm=0x84000003) failed
MPIDI_Allgatherv_intra_composition_alpha(1764):
MPIDI_NM_mpi_allgatherv(394)..................:
MPIR_Allgatherv_intra_recursive_doubling(75)..:
MPIR_Localcopy(42)............................: Message truncated; 4028 bytes received but buffer size is 4008

 

 

icc -v

 

icc version 2021.1 (gcc version 7.3.0 compatibility)

 

 

mpirun --version

 

Intel(R) MPI Library for Linux* OS, Version 2021.1 Build 20201112 (id: b9c9d2fc5)
Copyright 2003-2020, Intel Corporation.

 

 

mpi debug info

 

[0] MPI startup(): Intel(R) MPI Library, Version 2021.1  Build 20201112 (id: b9c9d2fc5)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): Size of shared memory segment (112 MB per rank) * (8 local ranks) = 902 MB total
[0] MPI startup(): libfabric version: 1.11.0-impi
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       1610443  Hnode5     {0,1,2,3,7,8}
[0] MPI startup(): 1       1610444  Hnode5     {12,13,14,18,19,20}
[0] MPI startup(): 2       1610445  Hnode5     {4,5,6,9,10,11}
[0] MPI startup(): 3       1610446  Hnode5     {15,16,17,21,22,23}
[0] MPI startup(): 4       1610447  Hnode5     {24,25,26,27,31,32}
[0] MPI startup(): 5       1610448  Hnode5     {36,37,38,42,43,44}
[0] MPI startup(): 6       1610449  Hnode5     {28,29,30,33,34,35}
[0] MPI startup(): 7       1610450  Hnode5     {39,40,41,45,46,47}
[0] MPI startup(): I_MPI_ROOT=/Oceanfile/kylin/Intel-One-API/mpi/2021.1.1
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=10

 

PrasanthD_intel
Moderator
1,252 Views

Hi Chen,

 

It looks like there is a mismatch between the expected and received buffer sizes.

The receive buffer may have been allocated only 4008 bytes, while the data sent to it amounts to 4028 bytes.

Please check that the receive buffer size matches global_size * count * sizeof(int).

In your call, scount is 1007 and 4028 bytes were received, which is exactly 1007 MPI_INT elements of 4 bytes each, whereas the 4008-byte buffer holds only 1002 elements. If every rank sends 1007 elements, rbuf should be allocated as:

rbuf = (int *)malloc(global_size*1007*sizeof(int));
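
The byte counts in the error message can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes a 4-byte MPI_INT (typical on Linux x86-64, but not stated in the thread), and the variable names are made up for this example.

```python
# Assumption: MPI_INT is a 4-byte int, as is typical on Linux x86-64.
SIZEOF_INT = 4

scount = 1007          # send count taken from the error stack
received_bytes = 4028  # "4028 bytes received"
buffer_bytes = 4008    # "buffer size is 4008"

# Bytes produced by one send buffer of scount ints.
sent_bytes = scount * SIZEOF_INT

# Number of ints the receiving side had room for.
expected_count = buffer_bytes // SIZEOF_INT

# How many ints did not fit.
shortfall = scount - expected_count

print(sent_bytes, expected_count, shortfall)
```

Under that assumption, the 4028 bytes match a single block of 1007 ints, and the receive side had room for 5 fewer elements, so it is worth checking the recvcounts and displs arrays as well as the allocation itself.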

 

Could you please provide us the code snippet involving this mpi_Allgatherv call?

That would help us a lot in debugging the error.

 

Regards

Prasanth

 

PrasanthD_intel
Moderator
1,229 Views

Hi Chen,


We haven't heard back from you.

Please let us know whether the suggested workaround resolved your issue.


Regards

Prasanth


Xiaoqiang
Beginner
1,205 Views

Hi Prasanth,

Sorry for the late reply. I spent a lot of time trying to solve this problem.

The OpenFOAM source code is complex and depends on several third-party libraries.

Fortunately, I found that a third-party library, scotch_6.0.6, causes the problem.

I think the root cause is an overly aggressive optimization applied by mpiicc with the -O3 option.

Here are the results of recompiling scotch_6.0.6 with different settings:

Software stack                   Compiler   Compilation options   Issue occurs?
gcc9.3 and openmpi4.0.3          mpicc      -O3                   no
OneAPI HPCKit_p_2021.1.0.2684    mpicc      -O3                   no
OneAPI HPCKit_p_2021.1.0.2684    mpiicc     -O1                   no
OneAPI HPCKit_p_2021.1.0.2684    mpiicc     -O3                   yes

 

Intel MPI provides both mpiicc and mpicc.

My question is, what is the difference between them?

With the -O3 option, which mpiicc optimization might cause this error?

Thanks for your help.

 

Regards

Chen

PrasanthD_intel
Moderator
1,186 Views

Hi Chen,


So you are facing the issue only when -O3 optimization is enabled.

As you said, the error may be caused by some optimization. We will look into what might be causing it.

The difference between mpicc and mpiicc is that mpiicc uses the Intel compiler (icc) and mpicc uses the GNU compiler (gcc).

Could you also please test with -O2? That would be helpful.


Regards

Prasanth



Xiaoqiang
Beginner
1,160 Views

Hi Prasanth,

 

Unfortunately, using mpiicc with the -O2 option still causes this error.

For better performance, I currently use mpicc to compile programs.

 

Regards

Chen

 

PrasanthD_intel
Moderator
1,070 Views

Hi Chen,


We have tried to reproduce the issue.

I downloaded ThirdParty-7 from the OpenFOAM repository and built scotch_6.0.6.

I replaced Makefile.inc with Makefile.inc.x86-64_pc_linux2.icc.impi and built scotch.

It ran fine, and the optimization level used throughout was -O0.

Could you let me know how to reproduce your error? The error seems to be related to the receive buffer, yet you were able to build and run it with other optimization levels and with GCC+OpenMPI.


Regards

Prasanth


Xiaoqiang
Beginner
1,050 Views

Hi Prasanth,

You can reproduce the problem by performing the following steps:

  • Download the OpenFOAM and third-party library source code

https://altushost-swe.dl.sourceforge.net/project/openfoam/v1906/OpenFOAM-v1906.tgz 

https://altushost-swe.dl.sourceforge.net/project/openfoam/v1906/ThirdParty-v1906.tgz

  • Decompress both source packages into the same directory

Xiaoqiang_1-1617101775047.png

  • Configure the Intel OneAPI environment variables

Xiaoqiang_2-1617101930338.png

  • Configure the OpenFOAM environment variables

 

cd OpenFOAM-v1906
vim etc/bashrc

 

Modify the configuration file to use the Intel OneAPI.

Xiaoqiang_3-1617102589907.png

After the modification, load environment variables.

 

source etc/bashrc

 

When you load the environment variables for the first time, a message says that the software is not installed. You can safely ignore this message.

  • Compile the program using the built-in script.

 

./Allwmake -j 8 -s -k -q

 

Compilation takes a long time, possibly up to 2 hours.

After the compilation is complete, load the environment variables again.

Check whether OpenFOAM was installed successfully.

(scotch_6.0.6 is compiled by mpiicc with the -O3 option.)

Xiaoqiang_4-1617103318570.png

Xiaoqiang_5-1617103372034.png

  • Run test cases to reproduce the problem

 

cd tutorials/incompressible/pisoFoam/LES/motorBike/motorBike/
./Allclean
./Allrun

 

You can see the error information in the log.snappyHexMesh file.

  • Recompile scotch_6.0.6

 

cd ThirdParty-v1906/scotch_6.0.6/src

 

Modify the Makefile.inc file to use mpicc.

Xiaoqiang_6-1617104348173.png

Compile the library files and add the newly generated library to LD_LIBRARY_PATH.

 

make clean && make libptscotch -j
cd libscotch
export LD_LIBRARY_PATH=/home/openfoam/ThirdParty-v1906/scotch_6.0.6/src/libscotch:$LD_LIBRARY_PATH

 

  •  Run test cases

 

cd OpenFOAM-v1906/tutorials/incompressible/pisoFoam/LES/motorBike/motorBike
./Allclean
./Allrun

 

If all goes well, the test cases run properly.

I hope this case is helpful to the Intel OneAPI team.

PrasanthD_intel
Moderator
1,006 Views

Hi Chen,


Thanks for providing the steps. They were very helpful.

We are looking into it and will get back to you.


Regards

Prasanth


Michael_Intel
Moderator
995 Views

Hello,


I can reproduce the issue, but it looks like an issue in the application code of ptscotch.


Please feel free to attach our ITAC message checker tool (export LD_PRELOAD=libVTmc.so:libmpi.so), which will report the following:


[0] WARNING: LOCAL:MEMORY:OVERLAP: warning
[0] WARNING:  Data transfer addresses the same bytes at address 0x195a3f4
[0] WARNING:  in the receive buffer multiple times, which is only
[0] WARNING:  allowed for send buffers.
[0] WARNING:  Control over new buffer is about to be transferred to MPI at:
[0] WARNING:    MPI_Allgatherv(*sendbuf=0x1911b04, sendcount=191, sendtype=MPI_INT, *recvbuf=0x195a3f4, *recvcounts=0x33353f8, *displs=0x33353e0, recvtype=MPI_INT, comm=0xffffffffc4000000 SPLIT COMM_WORLD [0:3])
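
The LOCAL:MEMORY:OVERLAP warning means that some bytes of the receive buffer are written more than once, which happens when the recvcounts and displs arrays passed to MPI_Allgatherv describe intersecting regions. A minimal sketch of such a check follows, written in Python for brevity rather than in the C of ptscotch. The function name and the sample counts and displacements are hypothetical (only the count 191 echoes the log above); none of this is taken from the scotch source.

```python
def find_overlaps(recvcounts, displs):
    """Return pairs of ranks (i, j) whose receive regions intersect.

    Region i covers elements [displs[i], displs[i] + recvcounts[i])
    of the receive buffer, in units of the receive datatype.
    """
    regions = [(d, d + c) for c, d in zip(recvcounts, displs)]
    overlaps = []
    for i in range(len(regions)):
        for j in range(i + 1, len(regions)):
            s1, e1 = regions[i]
            s2, e2 = regions[j]
            # Skip empty regions; they cannot overlap anything.
            if e1 == s1 or e2 == s2:
                continue
            # Half-open intervals intersect if each starts before the other ends.
            if s1 < e2 and s2 < e1:
                overlaps.append((i, j))
    return overlaps


# Hypothetical layout for a 4-rank communicator.
counts = [191, 200, 190, 195]
good_displs = [0, 191, 391, 581]  # each block starts where the previous one ends
bad_displs = [0, 186, 391, 581]   # rank 1 starts 5 elements too early

print(find_overlaps(counts, good_displs))
print(find_overlaps(counts, bad_displs))
```

Running a check like this on the actual recvcounts and displs just before the failing call would show which ranks' blocks collide, assuming those arrays can be printed from the application.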


Best regards,

Michael


Michael_Intel
Moderator
966 Views

This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

