I recently switched a program from the single-threaded Intel MPI Library to the thread-safe library (using the -mpi_mt compiler flag) and have run into an odd error. When I build with the -check_mpi compiler flag, I receive the following:
 ERROR: GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED: error
 ERROR: Mismatch found in local rank  (global rank ),
 ERROR: other processes may also be affected.
 ERROR: Data was corrupted during transmission to local rank  (same as global rank):
 ERROR: mpi_gatherv_(*sendbuf=0x1b54160, sendcount=6667030, sendtype=MPI_INTEGER4, *recvbuf=0x7f6b3687d010, *recvcounts=0x7ffff21397e0, *displs=0x7ffff21397d0, recvtype=MPI_INTEGER4, root=0, comm=0xffffffff84000004 SPLIT CREATE COMM_WORLD [0:2], *ierr=0x1364720)
 ERROR: No problem found in local rank  (same as global rank):
 ERROR: mpi_gatherv_(*sendbuf=0x7fd66307d010, sendcount=1401736, sendtype=MPI_INTEGER4, *recvbuf=NULL, *recvcounts=0x7fffa2c10160, *displs=0x7fffa2c10150, recvtype=MPI_INTEGER4, root=0, comm=0xffffffff84000004 SPLIT CREATE COMM_WORLD [0:2], *ierr=0x1363f60)
 ERROR: No problem found in local rank  (same as global rank):
 ERROR: mpi_gatherv_(*sendbuf=0x100, sendcount=0, sendtype=MPI_INTEGER4, *recvbuf=NULL, *recvcounts=0x7fff8e7f40e0, *displs=0x7fff8e7f40d0, recvtype=MPI_INTEGER4, root=0, comm=0xffffffff84000004 SPLIT CREATE COMM_WORLD [0:2], *ierr=0x1364180)
This was run with three nodes, each with one process. The error occurs at a point before the processes spawn any OMP threads.
A few things make this odd to me:
1. This error does not occur when using the single-threaded library.
2. This error does not occur when using two nodes; it only occurs with three or more nodes.
3. This gatherv occurs in a loop where each iteration executes a gatherv into a different process. The error only occurs for the gatherv into process 0.
More detail on the data in the section of code where this occurs: the send counts in the error message show how much data each process is sending to proc0. I have compared the src arrays to be sent in the single-threaded run and the thread-safe run and found them to be identical. When I look at the dest array after the gatherv in the thread-safe run, all of the elements that proc0 sends to itself are there, but only the first thirty elements that proc1 sends to proc0 are there. The remaining elements of the dest array are not touched by the gatherv call.
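The kind of check described above can be sketched without MPI at all. The following is only an illustration in Python, not the actual program: the names `SENTINEL`, `displacements`, and `check_gathered` are hypothetical, the buffers are toy data, and the layout is assumed to be contiguous (each rank's block starts where the previous one ends). The idea is to pre-fill the dest array with a value that never occurs in the real data, then report the first element of each rank's block that does not match its source.

```python
SENTINEL = -1  # assumed value that never appears in the real data


def displacements(counts):
    """Return the start offset of each rank's block for a contiguous layout."""
    displs, offset = [], 0
    for c in counts:
        displs.append(offset)
        offset += c
    return displs


def check_gathered(dest, sources, counts, displs):
    """Compare each rank's source block with its slot in dest.

    Returns {rank: index of first mismatching element}; empty if all match.
    """
    mismatches = {}
    for rank, (src, count, start) in enumerate(zip(sources, counts, displs)):
        block = dest[start:start + count]
        for i in range(count):
            if i >= len(block) or block[i] != src[i]:
                mismatches[rank] = i
                break
    return mismatches


if __name__ == "__main__":
    # Toy stand-ins for three ranks' send buffers (the last one sends nothing).
    sources = [[10, 11, 12, 13], [20, 21, 22, 23, 24, 25], []]
    counts = [len(s) for s in sources]
    displs = displacements(counts)

    # Simulate the reported symptom: rank 0's block arrives in full,
    # but only the first 3 elements of rank 1's block arrive.
    dest = [SENTINEL] * sum(counts)
    dest[displs[0]:displs[0] + counts[0]] = sources[0]
    dest[displs[1]:displs[1] + 3] = sources[1][:3]

    print(check_gathered(dest, sources, counts, displs))  # prints {1: 3}
```

In the real program the same comparison would run on the root after the gatherv call, using the actual recvcounts and displs arrays; a mismatch index that falls on a suspicious boundary may hint at where the transfer stopped.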
Can anyone offer some ideas of what is going on and/or how to fix this?
The Message Checker library only compares the source buffers with the collected buffer, so it just shows you where the issue is. To understand the cause of the issue, we need to know the version of the Intel MPI Library and, ideally, to get a reproducer. Could you please attach the excerpt from your code containing the Gatherv call? It would be great if you could create a short reproducer.