
Trace Collector fails on big data

davvad
Hello,

I have the following problem with Trace Collector.

When I run my instrumented application, I get this message:

"
[0] Intel Trace Collector INFO: Writing tracefile Konraz29.run.stf in /home/users/vadim/2_new_Konraz/2_new_Konraz_ITAC
Assertion failed in file ../../dapl_module_poll.c at line 3473: rreq != ((void *)0)
internal ABORT - process 8
[24:node1-128-07] unexpected disconnect completion event from [8:node1-128-05]
Assertion failed in file ../../dapl_module_util.c at line 2682: 0
internal ABORT - process 24
...
"
and so on (the whole error message is rather long, but it mostly repeats the same thing).

I run this application on 196 processes. When I use smaller input (so the application runs for less time) and a smaller number of processes, it seems to work OK.

What could the problem be?


P.S. I use the Intel MPI Library 4.0.1 and the Intel 12.0 compiler.
Dmitry_K_Intel2
Hi Vadim,

This happens because of the "unexpected disconnect completion event". As you may guess, this comes from the DAPL module (communication). The reason why it happens is unclear. Can you try to run this application on a different set of nodes (excluding node1-128-05)?

Also, it's quite important to know how you compile the application and how you run it.
BTW, what version of the Intel Trace Analyzer and Collector do you use?

Could you set I_MPI_FABRICS=shm:dapl and I_MPI_DAPL_UD=on and give it a try?
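For example, in a bash shell it could look like this (just a sketch, assuming an mpirun launch; adapt it to your own job script):

export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_UD=on
mpirun -n 196 ./Konraz29.run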

Regards!
Dmitry
davvad
Hello Dmitry,

I've tried using other nodes, but the same thing happened.

Also, I've tried setting I_MPI_FABRICS=shm:dapl and I_MPI_DAPL_UD=on, but got these error messages:
"
[4] dapl fabric is not available and fallback fabric is not enabled
...
"

This message is repeated for several other numbers, not only [4]. Does the number mean the process rank? If so, the message is not shown for all processes.


Compile commands:
mpicxx -O3 -DUSE_MPI -c Konraz29.cpp
mpicxx Konraz29.o -L$VT_LIB_DIR -lVT $VT_ADD_LIBS -o Konraz29.run


Run command:
sbatch --partition=test -n 256 impi ./Konraz29.run


The version of ITAC is 8.0.1.009.

Dmitry_K_Intel2
Hi Vadim,

Do you have Intel Compiler?
Could you please compile your application in the following way:
mpiicpc -O3 -DUSE_MPI -trace -o Konraz29.run Konraz29.cpp
and run it as usual?
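For example, keeping the same launch line you used before (a sketch; adjust the partition and process count to your setup):

sbatch --partition=test -n 256 impi ./Konraz29.run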

I hope that you are using the Intel MPI Library.

To get additional information you can set I_MPI_DEBUG=5. It's very strange that in your first message I see an error from the DAPL library, but in your previous message I see that the DAPL fabric was not available.
Could you try to run your application with I_MPI_FABRICS=shm:tcp in this case, and after that with I_MPI_FABRICS=shm:dapl? It might be that not all nodes in your cluster have InfiniBand cards.
Setting I_MPI_DEBUG may help you understand what is going on.
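For example (again just a sketch, assuming a bash shell and an mpirun launch):

export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:tcp    # then repeat the run with I_MPI_FABRICS=shm:dapl
mpirun -n 256 ./Konraz29.run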

Also, I hope you understand that running with 256 processes you'll get a huge trace.

Regards!
Dmitry
davvad
Hello Dmitry,

I've tried to compile with

mpiicpc -O3 -DUSE_MPI -trace -o Konraz29.run Konraz29.cpp

But the same thing happened.

After that I used I_MPI_DEBUG=5 and I_MPI_FABRICS=shm:tcp and got this error message:

Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(176)..........................: MPI_Send(buf=0xc175a0, count=65520, MPI_CHAR, dest=0, tag=1045, comm=0x84000000) failed
MPIDI_CH3I_Progress(401)...............:
MPID_nem_tcp_poll(2332)................:
MPID_nem_tcp_connpoll(2582)............:
state_commrdy_handler(2208)............:
MPID_nem_tcp_recv_handler(2098)........:
MPID_nem_tcp_handle_pkt(1821)..........:
MPIDI_CH3_PktHandler_EagerSend(618)....: failure occurred while posting a receive for message data (MPIDI_CH3_PKT_EAGER_SEND)
MPIDI_CH3U_Receive_data_unexpected(250): Out of memory
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(176)................: MPI_Send(buf=0xbd23a0, count=65520, MPI_CHAR, dest=8, tag=1045, comm=0x84000000) failed
MPIDI_CH3I_Progress(401).....:
MPID_nem_tcp_poll(2332)......:
MPID_nem_tcp_connpoll(2504)..:
state_commrdy_handler(2213)..:
MPID_nem_tcp_send_queued(122): writev to socket failed - Connection reset by peer
Fatal error in MPI_Send: Other MPI error, error stack:


Also, I am now using 64 processes.
Dmitry_K_Intel2
I'll contact you through e-mail to get more information.