Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI Error - Assertion failed

Leonardo_Oliveira

Hello everyone,
My name is Leonardo from Brazil and this is my first post.
I'm running an MPI program with Intel implementation (Intel MPI Library for Linux, 64-bit applications, Version 4.0 Update 1 Build 20100910).
The "program" is the well-known k-means algorithm. It is used to identify natural clusters within a set of data points; its input is a set of data points and an integer k, and its output is an assignment of each point to one of k clusters.
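For reference, a minimal single-node sketch of the k-means (Lloyd's) iteration described above. This is illustrative only; the poster's actual implementation is a distributed Haskell/MPI version, and every name here is hypothetical:

```python
import random

def kmeans(points, k, iters=100):
    """Lloyd's algorithm: assign each point to its nearest of k centroids,
    then move each centroid to the mean of its assigned points."""
    centroids = random.sample(points, k)  # pick k distinct points as seeds
    for _ in range(iters):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = []
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                dim = len(members[0])
                new_centroids.append(tuple(
                    sum(m[d] for m in members) / len(members) for d in range(dim)))
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:  # centroids stopped moving: converged
            break
        centroids = new_centroids
    return labels, centroids
```

In a distributed version, the assignment step is done locally on each rank's slice of the points, and the update step requires communication (e.g. a reduction of per-cluster sums and counts), which is where MPI enters the picture.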
When I run it with 160 data points on 3 nodes, everything goes fine. It was fine with 1.6K as well, but when I run it with 160K data points the following error appears:

Assertion failed in file ../../dapl_module_poll.c at line 3608: *p_vc_unsignal_sr_before_read == 0
internal ABORT - process 1
[2:super3] unexpected disconnect completion event from [1:super3]
Assertion failed in file ../../dapl_module_util.c at line 2682: 0
internal ABORT - process 2
[0:super3] unexpected disconnect completion event from [1:super3]
Assertion failed in file ../../dapl_module_util.c at line 2682: 0
internal ABORT - process 0
srun: error: super3: tasks 0-2: Exited with exit code 1
srun: Terminating job step 75987.0


I have no idea what this can be...
Has anyone experienced this before, or any idea what is going on?

Thanks, and sorry about my English.

Obrigado (thank you),
Leonardo Fernandes
James_T_Intel
Moderator
Hi Leonardo,

Additional information will be needed to determine the exact cause of this error message. What program are you using? What command line are you using to run the program? How are you compiling the program? What is your operating system version? What is your hardware configuration (processor, memory, node interconnect method, etc.)?

My immediate guess, based on the behavior you are seeing, is that you are exhausting a system resource somewhere, since the failure only appears at larger numbers of data points.
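As a rough sanity check on resource usage, one can estimate the raw data volume involved. This is a back-of-envelope sketch; the dimensionality and element size are assumptions, not figures from the thread:

```python
def kmeans_data_bytes(n_points, dims, elem_bytes=8):
    """Raw size of the point set alone, ignoring MPI buffers,
    per-rank copies, and algorithm bookkeeping."""
    return n_points * dims * elem_bytes

# 160K double-precision points in, say, 2 dimensions:
print(kmeans_data_bytes(160_000, 2) / 2**20, "MiB")  # about 2.4 MiB of raw point data
```

At this scale the point set itself is tiny, so if a resource limit is being hit, it is more likely something like connection counts, registered (pinned) memory for the DAPL/InfiniBand path, or per-process buffers, rather than the data itself.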

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Leonardo_Oliveira
Hi James, thank you for the reply.
I'll get all this information about our cluster. But let me update...
I changed the interface from Infiniband to Ethernet (I_MPI_FABRICS=tcp), and after that, I could increase the number of data-points (and nodes)... but another problem appeared.

Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).................: MPI_Probe(src=MPI_ANY_SOURCE, tag=0, MPI_COMM_WORLD, status=0x2b99c1574108) failed
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
slurmd[super10]: *** STEP 76081.0 CANCELLED AT 2012-01-31T11:19:04 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: super11: tasks 3-4: Terminated
srun: Terminating job step 76081.0
srun: error: super10: tasks 0-2: Exited with exit code 1

....
James_T_Intel
Moderator
Hi Leonardo,

Have you been able to get the cluster information? Is there a way I can get a copy of this program to run and attempt to reproduce the behavior you are seeing?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Leonardo_Oliveira
Hi James,
Sorry for the late reply (this is a master's thesis project... last month I needed to write instead of hacking).
Well, in my project I'm using a Haskell framework for distributed environments that was implemented over TCP sockets (very similar to Erlang), and I changed its transport layer to use MPI (to summarize briefly).
I can send you the code, but it will take some work on your side to run Haskell code.
-------
About our cluster:
There are 72 Bull NovaScale nodes running GNU/Linux 2.6.18-128. Each one has two quad-core Intel Xeon processors.
The MPI implementation is: Intel MPI Library for Linux, 64-bit applications, Version 4.0 Update 1 Build 20100910
Copyright (C) 2003-2010 Intel Corporation. All rights reserved
-------
Returning to the problem...
I set I_MPI_DEBUG=100 and the output was:
[0] MPI startup(): Intel MPI Library, Version 4.0 Update 1 Build 20100910
[0] MPI startup(): Copyright (C) 2003-2010 Intel Corporation. All rights reserved.
[1] MPI startup(): tcp data transfer mode
[9] MPI startup(): tcp data transfer mode
[2] MPI startup(): tcp data transfer mode
[10] MPI startup(): tcp data transfer mode
[3] MPI startup(): tcp data transfer mode
[11] MPI startup(): tcp data transfer mode
[4] MPI startup(): tcp data transfer mode
[12] MPI startup(): tcp data transfer mode
[5] MPI startup(): tcp data transfer mode
[6] MPI startup(): tcp data transfer mode
[13] MPI startup(): tcp data transfer mode
[7] MPI startup(): tcp data transfer mode
[14] MPI startup(): tcp data transfer mode
[0] MPI startup(): tcp data transfer mode
[8] MPI startup(): tcp data transfer mode
[15] MPI startup(): tcp data transfer mode
[1] MPI startup(): Recognition level=1. Platform code=1. Device=4
[1] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[8] MPI startup(): Recognition level=1. Platform code=1. Device=4
[8] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[2] MPI startup(): Recognition level=1. Platform code=1. Device=4
[2] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[10] MPI startup(): Recognition level=1. Platform code=1. Device=4
[10] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[3] MPI startup(): Recognition level=1. Platform code=1. Device=4
[3] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[9] MPI startup(): Recognition level=1. Platform code=1. Device=4
[9] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[4] MPI startup(): Recognition level=1. Platform code=1. Device=4
[4] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[11] MPI startup(): Recognition level=1. Platform code=1. Device=4
[11] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[5] MPI startup(): Recognition level=1. Platform code=1. Device=4
[5] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[12] MPI startup(): Recognition level=1. Platform code=1. Device=4
[12] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[7] MPI startup(): Recognition level=1. Platform code=1. Device=4
[7] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[13] MPI startup(): Recognition level=1. Platform code=1. Device=4
[13] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[6] MPI startup(): Recognition level=1. Platform code=1. Device=4
[15] MPI startup(): Recognition level=1. Platform code=1. Device=4
[15] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[6] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[14] MPI startup(): Recognition level=1. Platform code=1. Device=4
[0] MPI startup(): Recognition level=1. Platform code=1. Device=4
[0] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)

Device_reset_idx=0
[0] MPI startup(): Allgather: 1: 0-128 & 16-511
[0] MPI startup(): Allgather: 1: 0-16 & 0-2147483647
[0] MPI startup(): Allgather: 4: 17-512 & 0-2147483647
[0] MPI startup(): Allgather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allgatherv: 1: 0-1024 & 0-2147483647
[0] MPI startup(): Allgatherv: 1: 1024-2048 & 32-511
[0] MPI startup(): Allgatherv: 1: 2048-4096 & 32-63
[0] MPI startup(): Allgatherv: 1: 2048-4096 & 256-511
[0] MPI startup(): Allgatherv: 2: 1024-16384 & 512-2147483647
[0] MPI startup(): Allgatherv: 2: 2048-4096 & 64-255
[0] MPI startup(): Allgatherv: 4: 4096-65536 & 256-511
[0] MPI startup(): Allgatherv: 4: 16384-262144 & 512-2147483647
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 0-255 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 256-511 & 0-63
[0] MPI startup(): Allreduce: 1: 256-511 & 256-511
[0] MPI startup(): Allreduce: 2: 512-1048575 & 16-511
[0] MPI startup(): Allreduce: 2: 256-2097151 & 64-255
[0] MPI startup(): Allreduce: 2: 1024-2147483647 & 256-2147483647
[0] MPI startup(): Allreduce: 5: 256-1023 & 512-2147483647
[0] MPI startup(): Allreduce: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoall: 1: 0-16 & 9-2147483647
[0] MPI startup(): Alltoall: 1: 17-256 & 17-2147483647
[0] MPI startup(): Alltoall: 1: 129-512 & 9-64
[0] MPI startup(): Alltoall: 2: 17-128 & 9-16
[0] MPI startup(): Alltoall: 2: 513-1024 & 0-16
[0] MPI startup(): Alltoall: 2: 1025-524288 & 0-8
[0] MPI startup(): Alltoall: 2: 2049-2147483647 & 9-16
[0] MPI startup(): Alltoall: 3: 4097-2147483647 & 33-2147483647
[0] MPI startup(): Alltoall: 3: 4097-16384 & 17-32
[0] MPI startup(): Alltoall: 3: 32769-1048576 & 17-32
[14] MPI startup(): Parent configuration:(intra=6 inter=6 flags=0), (code=1 ppn_idx=3)
[0] MPI startup(): Alltoall: 3: 2097153-2147483647 & 17-2147483647
[0] MPI startup(): Alltoall: 4: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallv: 2: 0-2147483647 & 32-2147483647
[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Barrier: 4: 0-2147483647 & 0-2147483647
[0] MPI startup(): Bcast: 7: 0-2147483647 & 0-2147483647
[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gather: 2: 0-255 & 32-2147483647
[0] MPI startup(): Gather: 2: 256-2048 & 32-127
[0] MPI startup(): Gather: 2: 256-1024 & 128-255
[0] MPI startup(): Gather: 2: 256-511 & 256-511
[0] MPI startup(): Gather: 2: 131072-262143 & 512-2147483647
[0] MPI startup(): Gather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gatherv: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 1: 1024-65536 & 256-2147483647
[0] MPI startup(): Reduce_scatter: 1: 512-32768 & 16-511
[0] MPI startup(): Reduce_scatter: 1: 5-512 & 16-63
[0] MPI startup(): Reduce_scatter: 1: 128-512 & 64-127
[0] MPI startup(): Reduce_scatter: 1: 256-512 & 128-256
[0] MPI startup(): Reduce_scatter: 2: 32768-131072 & 16-31
[0] MPI startup(): Reduce_scatter: 2: 524288-2147483647 & 16-31
[0] MPI startup(): Reduce_scatter: 2: 1048576-2147483647 & 32-63
[0] MPI startup(): Reduce_scatter: 4: 0-4 & 16-511
[0] MPI startup(): Reduce_scatter: 4: 131072-1048576 & 16-63
[0] MPI startup(): Reduce_scatter: 4: 262144-2147483647 & 64-127
[0] MPI startup(): Reduce_scatter: 4: 524288-2097152 & 128-255
[0] MPI startup(): Reduce_scatter: 4: 1048576-2147483647 & 256-2147483647
[0] MPI startup(): Reduce_scatter: 5: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 1: 0-63 & 32-2147483647
[0] MPI startup(): Scatter: 1: 64-127 & 32-511
[0] MPI startup(): Scatter: 1: 128-255 & 32-255
[0] MPI startup(): Scatter: 1: 256-511 & 32-127
[0] MPI startup(): Scatter: 1: 512-1023 & 32-63
[0] MPI startup(): Scatter: 2: 128-255 & 256-511
[0] MPI startup(): Scatter: 2: 256-511 & 128-255
[0] MPI startup(): Scatter: 2: 512-2047 & 64-127
[0] MPI startup(): Scatter: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatterv: 1: 0-2147483647 & 0-2147483647
[0] Rank Pid Node name Pin cpu
[0] 0 21266 super2 n/a
[0] 1 21267 super2 n/a
[0] 2 21268 super2 n/a
[0] 3 21269 super2 n/a
[0] 4 21270 super2 n/a
[0] 5 21271 super2 n/a
[0] 6 21272 super2 n/a
[0] 7 21273 super2 n/a
[0] 8 12678 super3 n/a
[0] 9 12679 super3 n/a
[0] 10 12680 super3 n/a
[0] 11 12681 super3 n/a
[0] 12 12682 super3 n/a
[0] 13 12683 super3 n/a
[0] 14 12684 super3 n/a
[0] 15 12685 super3 n/a
[0] MPI startup(): I_MPI_DEBUG=100
[0] MPI startup(): I_MPI_FABRICS=tcp
Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).................: MPI_Probe(src=MPI_ANY_SOURCE, tag=0, MPI_COMM_WORLD, status=0x2b5b75904468) failed
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).................: MPI_Probe(src=MPI_ANY_SOURCE, tag=4, MPI_COMM_WORLD, status=0x2b1215f664b8) failed
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).................: MPI_Probe(src=MPI_ANY_SOURCE, tag=4, MPI_COMM_WORLD, status=0x2b46d71664b8) failed
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).................: MPI_Probe(src=MPI_ANY_SOURCE, tag=0, MPI_COMM_WORLD, status=0x2b1216204468) failed
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).................: MPI_Probe(src=MPI_ANY_SOURCE, tag=4, MPI_COMM_WORLD, status=0x2ba934a664b8) failed
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed
....
srun: error: super3: tasks 8-15: Exited with exit code 1
srun: Terminating job step 77374.0
srun: error: super2: tasks 1-7: Exited with exit code 1
-------------------------------------------------------------------------------

That run used 16 copies of the program.



James_T_Intel
Moderator
Hi Leonardo,

Unfortunately, we do not officially support Haskell. I will do what I can to help resolve your issue.

It looks like you are having a network connection issue. What network card(s) are in the nodes? What is the output from ifconfig?

Also, how much memory is your program using? How much is available per node? Have you tried running on more nodes, with fewer processes per node?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Leonardo_Oliveira
Hi James,
My Haskell code is just a binding to the MPI C library (Intel's, in the case of our cluster)...
Let me test what you suggested... more nodes with fewer processes per node. But I don't think that is the problem, because we have the same Haskell library using TCP sockets as the low-level communication... and it works fine.
Leonardo_Oliveira
Hi James,
Some problems were resolved, but this one continues,
...and one message left me with doubts.

When I run the example with the variable set to:
export I_MPI_FABRICS=dapl
It shows me the error:
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=dapl
[0] MPI startup(): I_MPI_PLATFORM=auto
[1:super7] unexpected disconnect completion event from [0:super7]
Assertion failed in file ../../dapl_module_util.c at line 2682: 0
internal ABORT - process 1
[12:super8] unexpected disconnect completion event from [0:super7]
Assertion failed in file ../../dapl_module_util.c at line 2682: 0
internal ABORT - process 12
...

When I set:
export I_MPI_DAPL_UD=on
It shows me:
[7] dapl fabric is not available and fallback fabric is not enabled
[8] dapl fabric is not available and fallback fabric is not enabled
[15] dapl fabric is not available and fallback fabric is not enabled
...

When I run with:
export I_MPI_FABRICS=tcp
It shows me:
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=tcp
[0] MPI startup(): I_MPI_PLATFORM=auto
Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(113).................: MPI_Probe(src=MPI_ANY_SOURCE, tag=0, MPI_COMM_WORLD, status=0x2ba149084ef0) failed
MPIDI_CH3I_Progress(401).......:
MPID_nem_tcp_poll(2332)........:
MPID_nem_tcp_connpoll(2582)....:
state_commrdy_handler(2208)....:
MPID_nem_tcp_recv_handler(2081): socket closed

It's difficult to understand these errors...
What can I do to find the root of this problem?
James_T_Intel
Moderator
Hi Leonardo,

The first and third look like network errors. Are the systems having any connectivity or stability issues?

The second could be due to an incorrect configuration. Can you please send the /etc/dat.conf file from the system?

Can you run with -verbose and send the output?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools