AShte
Beginner

MPI error [../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_poll_rc.c:2482]

I have a coarray program that I compile for distributed-memory execution.

I then run it on a single 16-core node with different numbers of processes.

It runs fine with 2, 4 and 8 processes, but gives the following error with 16 processes.

Can I get any clue from the error message?

Thanks

Anton

===> co_back1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 2 ./co_back1.x
188.85user 29.04system 1:55.30elapsed 188%CPU (0avgtext+0avgdata 66640maxresident)k
1624inputs+945432outputs (2major+13770minor)pagefaults 0swaps
===> co_back1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 4 ./co_back1.x
263.14user 94.69system 2:51.91elapsed 208%CPU (0avgtext+0avgdata 71376maxresident)k
0inputs+2791464outputs (0major+22881minor)pagefaults 0swaps
===> co_back1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 8 ./co_back1.x
420.93user 292.96system 2:41.95elapsed 440%CPU (0avgtext+0avgdata 88192maxresident)k
0inputs+8998288outputs (0major+48387minor)pagefaults 0swaps
===> co_back1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 16 ./co_back1.x
application called MPI_Abort(comm=0x84000000, 3) - process 0
[1:node43-038] unexpected disconnect completion event from [0:node43-038]
[1:node43-038][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_poll_rc.c:2482] Intel MPI fatal error
: OpenIB-cma DTO operation posted for [0:node43-038] completed with error. status=0x1. cookie=0x150008000
0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_poll_rc.c at line 2485: 0
internal ABORT - process 1

James_T_Intel
Moderator

This looks like a communication fabric error. Try running the following command:

mpirun -genvall -genv I_MPI_FABRICS shm:dapl -genv I_MPI_HYDRA_DEBUG 1 -n 16 -machinefile ./nodes IMB-MPI1

James.

AShte
Beginner

I get this error:

[proxy:0:0@node43-038] HYDU_create_process (../../utils/launch/launch.c:622): execvp error on file I_MPI_HYDRA_DEBUG (No such file or directory)

Thanks

Anton

Artem_R_Intel1
Employee

Hi Anton,

There's a misprint in James' command, could you please try:

mpirun -genvall -genv I_MPI_FABRICS shm:dapl -genv I_MPI_HYDRA_DEBUG 1 -n 16 -machinefile ./nodes IMB-MPI1

Also could you please provide some details about your environment (Intel MPI version, OS, OFED/DAPL versions).

AShte
Beginner

Thanks, that command worked. The output is long, so I have put it here:

http://eis.bris.ac.uk/~mexas/9lap.o4579303

The other info you asked for:

$ mpirun --version

Intel(R) MPI Library for Linux* OS, Version 5.1.3 Build 20160120 (build id: 14053)

$ uname -a
Linux newblue2 2.6.32-220.23.1.el6.x86_64 #1 SMP Mon Jun 18 09:58:09 CDT 2012 x86_64 x86_64 x86_64 GNU/Linux

Regarding "OFED/DAPL versions" - I'm not sure how to find these.

Is this helpful?

$ cat /etc/dat.conf

OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" ""
OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 1" ""
OpenIB-mthca0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 2" ""
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
OpenIB-ipath0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 1" ""
OpenIB-ipath0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 2" ""
OpenIB-ehca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ehca0 1" ""
OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
OpenIB-cma-roe-eth2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
OpenIB-cma-roe-eth3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth3 0" ""
OpenIB-scm-roe-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
OpenIB-scm-roe-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
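(Side note for anyone reading this later: the second field of each dat.conf line is the DAPL API version, u1.2 vs u2.0, so the 2.0 providers can be picked out with awk. A minimal sketch, with two sample lines inlined instead of reading the real /etc/dat.conf, so it runs anywhere:)

```shell
# Write two sample dat.conf-style lines to a temp file, so this snippet
# does not depend on the real /etc/dat.conf being present.
cat > /tmp/dat.conf.sample <<'EOF'
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
EOF

# Field 2 is the DAPL API version; print the names of the 2.0 providers.
awk '$2 == "u2.0" { print $1 }' /tmp/dat.conf.sample
# prints: ofa-v2-ib0
```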

Thanks

Anton

James_T_Intel
Moderator

I've corrected the command line in my previous post, thanks Artem.

Use the following commands to get OFED and DAPL versions:

ofed_info
rpm -qa | grep dapl

Since the IMB test ran successfully, try running your program directly.

mpirun -genvall -genv I_MPI_DEBUG 5 -genv I_MPI_HYDRA_DEBUG 1 -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 16 ./co_back1.x

You can redirect to a file and directly attach the file to your post here.
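For example, something like this (with an echo stand-in for the actual mpirun invocation, since that is site-specific):

```shell
# Capture stdout and stderr together into one log file that can be
# attached to the post. The subshell with two echoes stands in for the
# real "mpirun ... ./co_back1.x" command.
( echo "I_MPI_DEBUG output"; echo "error output" >&2 ) > run16.log 2>&1
cat run16.log
```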

AShte
Beginner

I'm really sorry, but I cannot reproduce the error anymore.

I can now run with 2, 4, 8, 10, 16, 20, 25 and 40 images (MPI processes) over two 16-core nodes.

Perhaps there was some transient problem.

I apologise for wasting your time, and thank you for the valuable debugging hints.

Anton

James_T_Intel
Moderator

No worries, good to hear that everything is working now. Let us know if it shows up again.
