We have compiled our parallel code with the latest Intel software stack. We use many passive-target RMA one-sided PUT/GET operations together with derived datatypes. We are now experiencing a problem: sometimes our application fails with either a segmentation fault or with the following error message:
 Assertion failed in file ../../segment.c at line 669: cur_elmp->curcount >= 0
 internal ABORT - process 6
Intel Inspector shows a problem inside the Intel MPI library:
libmpi_dbg.so.4!MPID_Segment_blkidx_m2m - segment_packunpack.c:313
libmpi_dbg.so.4!MPID_Segment_manipulate - segment.c:552
libmpi_dbg.so.4!MPID_Segment_unpack - segment_packunpack.c:88
libmpi_dbg.so.4!MPIDI_CH3U_Receive_data_found - ch3u_handle_recv_pkt.c:190
libmpi_dbg.so.4!MPIDI_CH3_PktHandler_GetResp - ch3u_rma_sync.c:3691
libmpi_dbg.so.4!MPID_nem_handle_pkt - ch3_progress.c:1477
libmpi_dbg.so.4!MPIDI_CH3I_Progress - ch3_progress.c:498
libmpi_dbg.so.4!MPIDI_Win_unlock - ch3u_rma_sync.c:1959
libmpi_dbg.so.4!PMPI_Win_unlock - win_unlock.c:119
Does this mean that something is wrong with the derived datatypes? If so, how can I debug the problem? The problem never appears with Open MPI.
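For reference, our access pattern looks roughly like this (a simplified sketch, not our actual code: the real application uses larger buffers, more ranks, and more complex datatypes; all counts below are illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1024;                        /* window size in doubles (illustrative) */
    double *base = malloc(n * sizeof(double));
    MPI_Win win;
    MPI_Win_create(base, n * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Strided derived datatype, similar in spirit to what we use:
       64 blocks of 4 doubles, stride 16 doubles -> 256 doubles total */
    MPI_Datatype vec;
    MPI_Type_vector(64, 4, 16, MPI_DOUBLE, &vec);
    MPI_Type_commit(&vec);

    double buf[256];
    int target = (rank + 1) % nprocs;

    /* Passive-target epoch: lock, GET with the derived type, unlock.
       The crash above is reported from inside MPI_Win_unlock. */
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(buf, 256, MPI_DOUBLE, target, 0, 1, vec, win);
    MPI_Win_unlock(target, win);

    MPI_Type_free(&vec);
    MPI_Win_free(&win);
    free(base);
    MPI_Finalize();
    return 0;
}
```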
The software stack used:
Intel C/Fortran compilers v15.0.0.090
Intel MPI Library v5.0.1.035
Any help will be greatly appreciated!
James Tullos (Intel) wrote:
Can you provide a reproducer code?
I am trying to narrow the code down. However, right now I am facing another problem with derived datatypes. Please find a reproducer code enclosed. Just compile it and run it with the following parameters:
mpicc mpi_tvec2_rma.c -o mpi_tvec2_rma
mpirun -np 8 ./mpi_tvec2_rma 128 40000
When I use Intel MPI v4.1.3.048 (Intel C compiler v15.0.0), it crashes with the following error message:
Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61: win_ptr->at_completion_counter >= 0
internal ABORT - process 0
The MPICH developers claimed that this problem has probably been fixed in the development version of MPICH 3. I will check it out. However, if I switch to Intel MPI v5.0.1.035, it gets even more interesting:
Fatal error in MPI_Win_lock: Other MPI error, error stack:
MPI_Win_lock(165)......................: MPI_Win_lock(lock_type=234, rank=1, assert=0, win=0xa0000000) failed
MPIDI_Win_lock(2702)...................:
MPIDI_CH3I_Acquire_local_lock(3615)....: Detected an error while in progress wait for RMA messages
MPIDI_CH3I_Progress(504)...............:
MPID_nem_handle_pkt(1368)..............:
MPIDI_CH3_PktHandler_EagerSend(748)....: failure occurred while posting a receive for message data (MPIDI_CH3_PKT_EAGER_SEND)
MPIDI_CH3U_Receive_data_unexpected(253): Out of memory (unable to allocate -1703399408 bytes)
It looks to me like an integer overflow somewhere inside Intel MPI: a negative allocation size like -1703399408 suggests a large message size was computed in a 32-bit signed integer and wrapped around. Could you please have a look at it?
With best regards,