I have the following problem with Trace Collector.
When I run my instrumented application, I get this message:
"Intel Trace Collector INFO: Writing tracefile Konraz29.run.stf in /home/users/vadim/2_new_Konraz/2_new_Konraz_ITAC
Assertion failed in file ../../dapl_module_poll.c at line 3473: rreq != ((void *)0)
internal ABORT - process 8
[24:node1-128-07] unexpected disconnect completion event from [8:node1-128-05]
Assertion failed in file ../../dapl_module_util.c at line 2682: 0
internal ABORT - process 24
..."
and so on (the whole error message is rather long, but it mostly repeats the same lines).
I run this application on 196 processes. When I use a smaller input (so the application runs for less time) and fewer processes, it seems to work OK.
What could be causing this?
P.S. I use impi-4.0.1 MPI library and intel-12.0 compiler.
This happens because of the "unexpected disconnect completion event". As you may guess, it comes from the DAPL module (communication). Why it happens is unclear. Can you try running the application on a different set of nodes (excluding node1-128-05)?
Also, it's quite important to know how you compile the application and how you run it. BTW, which version of the Intel Trace Collector and Analyzer do you use?
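One way to try the run without the suspect node is to filter it out of the machine file before launching. The snippet below is only a sketch: the temporary host list and the node name node1-128-06 are made up for illustration, and the launch line in the comment assumes your mpirun accepts -machinefile, so please check it against your installation.

```shell
#!/bin/sh
# Hypothetical host list for illustration; use your cluster's real machine file.
hosts=$(mktemp)
filtered=$(mktemp)
printf '%s\n' node1-128-05 node1-128-06 node1-128-07 > "$hosts"

# Drop the node named in the disconnect message.
# -F: fixed string, -x: whole-line match, -v: keep the non-matching lines.
grep -F -x -v 'node1-128-05' "$hosts" > "$filtered"

# Show which nodes remain.
cat "$filtered"

# Then launch on the remaining nodes, for example (assumed syntax, not run here):
#   mpirun -machinefile "$filtered" -n 196 ./Konraz29.run
```

If your machine file lists entries in a form such as "host:count", the whole-line match will not catch them and the pattern needs to be adjusted.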
Could you set I_MPI_FABRICS=shm:dapl and I_MPI_DAPL_UD=on and give it a try?
Do you have the Intel Compiler? If so, could you please compile your application in the following way: mpiicpc -O3 -DUSE_MPI -trace -o Konraz29.run Konraz29.cpp and run it as usual?
I hope that you are using Intel MPI Library.
To get additional information, you can set I_MPI_DEBUG=5. It's very strange that in your first message I see an error from the DAPL library, but in the previous message I see that the DAPL fabric was not available. In that case, could you try running your application with I_MPI_FABRICS=shm:tcp, and after that with I_MPI_FABRICS=shm:dapl? It may be that not all nodes in your cluster have InfiniBand cards. Setting I_MPI_DEBUG may help you understand what is going on.
Although, I hope you understand that by running 256 processes you'll get a huge trace.
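The two test runs suggested above could be scripted roughly like this. It is a dry-run sketch that only prints the commands unless RUN=1 is set. The binary name comes from the compile line earlier in the thread and the process count from the original post; the log file names are made up, and passing the variables with -genv is an assumption about how you launch, so adjust it to your setup.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of launching them.
# Set RUN=1 to really execute (assumes mpirun and the binary are available).
APP=./Konraz29.run   # binary name from the compile line in this thread
NP=196               # process count from the original post

run_with_fabric() {
    fabric=$1
    # Hypothetical log name, e.g. debug_shm_tcp.log
    log="debug_$(printf '%s' "$fabric" | tr ':' '_').log"
    if [ "${RUN:-0}" = 1 ]; then
        mpirun -genv I_MPI_FABRICS "$fabric" -genv I_MPI_DEBUG 5 \
            -n "$NP" "$APP" > "$log" 2>&1
    else
        echo "mpirun -genv I_MPI_FABRICS $fabric -genv I_MPI_DEBUG 5 -n $NP $APP > $log 2>&1"
    fi
}

# First TCP, then DAPL, as suggested above.
plan=$(for f in shm:tcp shm:dapl; do run_with_fabric "$f"; done)
printf '%s\n' "$plan"
```

Comparing the two logs should show whether the shm:tcp run completes and which fabric each run actually selected.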