I have a coarray program which I compile for distributed-memory execution.
I then run it on a single 16-core node with different numbers of processes.
It runs fine with 2, 4 and 8 processes, but gives the following error with 16 processes.
Can I get any clue from the error message?
Thanks
Anton
===> co_back1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 2 ./co_back1.x
188.85user 29.04system 1:55.30elapsed 188%CPU (0avgtext+0avgdata 66640maxresident)k
1624inputs+945432outputs (2major+13770minor)pagefaults 0swaps
===> co_back1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 4 ./co_back1.x
263.14user 94.69system 2:51.91elapsed 208%CPU (0avgtext+0avgdata 71376maxresident)k
0inputs+2791464outputs (0major+22881minor)pagefaults 0swaps
===> co_back1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 8 ./co_back1.x
420.93user 292.96system 2:41.95elapsed 440%CPU (0avgtext+0avgdata 88192maxresident)k
0inputs+8998288outputs (0major+48387minor)pagefaults 0swaps
===> co_back1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 16 ./co_back1.x
application called MPI_Abort(comm=0x84000000, 3) - process 0
[1:node43-038] unexpected disconnect completion event from [0:node43-038]
[1:node43-038][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_poll_rc.c:2482] Intel MPI fatal error: OpenIB-cma DTO operation posted for [0:node43-038] completed with error. status=0x1. cookie=0x1500080000
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_poll_rc.c at line 2485: 0
internal ABORT - process 1
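For reference, the executable is built with Intel Fortran's distributed-memory coarray support. The code below is not co_back1 itself, only a minimal sketch of that kind of program and build; file names, the sync pattern and the config-file name are placeholders, while the mpirun arguments match the ones shown above.

program co_sketch
   ! minimal coarray program: each image reports its index
   implicit none
   integer :: img, nimg
   img  = this_image()
   nimg = num_images()
   sync all
   write (*,*) 'image', img, 'of', nimg
end program co_sketch

Compiled for distributed memory with something like:
ifort -coarray=distributed -coarray-config-file=cafconfig.txt co_sketch.f90 -o co_back1.x
where cafconfig.txt carries the launcher arguments seen in the runs above, e.g.
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 16 ./co_back1.x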
This looks like a communication fabric error. Try running the following command:
mpirun -genvall -genv I_MPI_FABRICS shm:dapl -genv I_MPI_HYDRA_DEBUG 1 -n 16 -machinefile ./nodes IMB-MPI1
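If the full IMB-MPI1 suite takes too long, running only the point-to-point benchmarks should still exercise the shm:dapl path between all 16 ranks; the benchmark selection below is just a suggestion, not part of the original advice:
mpirun -genvall -genv I_MPI_FABRICS shm:dapl -genv I_MPI_HYDRA_DEBUG 1 -n 16 -machinefile ./nodes IMB-MPI1 PingPong Sendrecv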
James.
I get these errors:
[proxy:0:0@node43-038] HYDU_create_process (../../utils/launch/launch.c:622): execvp error on file I_MPI_HYDRA_DEBUG (No such file or directory)
Thanks
Anton
Hi Anton,
There's a misprint in James' command. Could you please try:
mpirun -genvall -genv I_MPI_FABRICS shm:dapl -genv I_MPI_HYDRA_DEBUG 1 -n 16 -machinefile ./nodes IMB-MPI1
Also, could you please provide some details about your environment (Intel MPI version, OS, OFED/DAPL versions)?
Thanks, that command worked. The output is long, so I put it here:
http://eis.bris.ac.uk/~mexas/9lap.o4579303
The other info you asked for:
$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 5.1.3 Build 20160120 (build id: 14053)
$ uname -a
Linux newblue2 2.6.32-220.23.1.el6.x86_64 #1 SMP Mon Jun 18 09:58:09 CDT 2012 x86_64 x86_64 x86_64 GNU/Linux
Regarding "OFED/DAPL versions" - I'm not sure how to find this.
Is this helpful?
$ cat /etc/dat.conf
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0" ""
OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 1" ""
OpenIB-mthca0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mthca0 2" ""
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
OpenIB-ipath0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 1" ""
OpenIB-ipath0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 2" ""
OpenIB-ehca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ehca0 1" ""
OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
OpenIB-cma-roe-eth2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" ""
OpenIB-cma-roe-eth3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth3 0" ""
OpenIB-scm-roe-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
OpenIB-scm-roe-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 2" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ipath0 2" ""
ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mthca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 1" ""
ofa-v2-mthca0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mthca0 2" ""
ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-scm-roe-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
Thanks
Anton
I've corrected the command line in my previous post; thanks, Artem.
Use the following commands to get OFED and DAPL versions:
ofed_info
rpm -qa | grep dapl
Since the IMB test ran successfully, try running your program directly.
mpirun -genvall -genv I_MPI_DEBUG 5 -genv I_MPI_HYDRA_DEBUG 1 -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 16 ./co_back1.x
You can redirect the output to a file and attach the file directly to your post here.
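For example (run16.log is just a placeholder name):
mpirun -genvall -genv I_MPI_DEBUG 5 -genv I_MPI_HYDRA_DEBUG 1 -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 16 ./co_back1.x > run16.log 2>&1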
I'm really sorry, but I cannot reproduce the error anymore.
I can now run with 2, 4, 8, 10, 16, 20, 25 and 40 images (MPI processes)
over two 16-core nodes.
Perhaps there was some transient problem.
I apologise for wasting your time, and thank you for the valuable debugging hints.
Anton
No worries, good to hear that everything is working now. Let us know if it shows up again.