We've successfully used IMPI 3.2.1 and 3.2.2 on our RHEL 4 cluster, but with 4.0.0 I get this:
$ mpirun -n 16 -r ssh -v -env I_MPI_FABRICS shm:dapl ./hello_c
running mpdallexit on n103-ib
LAUNCHED mpd on n103-ib via
RUNNING: mpd on n103-ib
LAUNCHED mpd on n104-ib via n103-ib
RUNNING: mpd on n104-ib
[1] dapl fabric is not available and fallback fabric is not enabled
[0] dapl fabric is not available and fallback fabric is not enabled
[6] dapl fabric is not available and fallback fabric is not enabled
[2] dapl fabric is not available and fallback fabric is not enabled
[5] dapl fabric is not available and fallback fabric is not enabled
[4] dapl fabric is not available and fallback fabric is not enabled
[3] dapl fabric is not available and fallback fabric is not enabled
[7] dapl fabric is not available and fallback fabric is not enabled
[8] dapl fabric is not available and fallback fabric is not enabled
[9] dapl fabric is not available and fallback fabric is not enabled
[12] dapl fabric is not available and fallback fabric is not enabled
[11] dapl fabric is not available and fallback fabric is not enabled
[14] dapl fabric is not available and fallback fabric is not enabled
rank 15 in job 1 n103-ib_35999 caused collective abort of all ranks
exit status of rank 15: killed by signal 9
[15] dapl fabric is not available and fallback fabric is not enabled
[13] dapl fabric is not available and fallback fabric is not enabled
[10] dapl fabric is not available and fallback fabric is not enabled
rank 14 in job 1 n103-ib_35999 caused collective abort of all ranks
exit status of rank 14: return code 254
rank 13 in job 1 n103-ib_35999 caused collective abort of all ranks
exit status of rank 13: return code 254
rank 12 in job 1 n103-ib_35999 caused collective abort of all ranks
exit status of rank 12: killed by signal 9
rank 11 in job 1 n103-ib_35999 caused collective abort of all ranks
exit status of rank 11: killed by signal 9
rank 10 in job 1 n103-ib_35999 caused collective abort of all ranks
exit status of rank 10: return code 254
rank 7 in job 1 n103-ib_35999 caused collective abort of all ranks
exit status of rank 7: killed by signal 9
rank 6 in job 1 n103-ib_35999 caused collective abort of all ranks
exit status of rank 6: killed by signal 9
rank 5 in job 1 n103-ib_35999 caused collective abort of all ranks
exit status of rank 5: killed by signal 9
rank 4 in job 1 n103-ib_35999 caused collective abort of all ranks
exit status of rank 4: killed by signal 9
rank 3 in job 1 n103-ib_35999 caused collective abort of all ranks
exit status of rank 3: killed by signal 9
rank 2 in job 1 n103-ib_35999 caused collective abort of all ranks
exit status of rank 2: return code 254
rank 1 in job 1 n103-ib_35999 caused collective abort of all ranks
exit status of rank 1: return code 254
rank 0 in job 1 n103-ib_35999 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
The cluster uses (quite old) InfiniServ IB drivers.
Hi Jon,
Could you add '-env I_MPI_FALLBACK_DEVICE 0' to your 3.2.2 run?
$ mpirun -r ssh -rr -genv I_MPI_DEVICE rdssm -genv I_MPI_FALLBACK_DEVICE 0 -n 2 ./hello_c
You were probably using the sock device before because of a silent fallback.
Add `-env I_MPI_DEBUG 5` to your 4.0 command line and post the output. Two processes will be enough:
$ mpirun -r ssh -rr -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DEBUG 5 -n 2 ./hello_c
Regards!
Dmitry
Dmitry,
With 3.2.2.006 (with debugging on):
$ mpirun -r ssh -rr -genv I_MPI_DEVICE rdssm -genv I_MPI_FALLBACK_DEVICE 0 -genv I_MPI_DEBUG 5 -n 2 ./hello_c
[0] MPI startup(): DAPL provider InfiniHost0 specified in DAPL configuration file /etc/dat.conf
[1] MPI startup(): DAPL provider InfiniHost0 specified in DAPL configuration file /etc/dat.conf
[0] MPI startup(): RDMA, shared memory, and socket data transfer modes
[1] MPI startup(): RDMA, shared memory, and socket data transfer modes
[0] MPI Startup(): [1] MPI Startup(): process is pinned to CPU01 on node n021
process is pinned to CPU00 on node n021
Hello world from process 1 of 2
[0] Rank Pid Node name Pin cpu
[0] 0 10212 n021 0
[0] 1 10211 n021 1
[0] Init(): I_MPI_DEBUG=5
[0] Init(): I_MPI_DEVICE=rdssm
[0] Init(): I_MPI_FALLBACK_DEVICE=0
Hello world from process 0 of 2
With 4.0.0.025:
$ mpirun -r ssh -rr -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DEBUG 5 -n 2 ./hello_c
[0] dapl fabric is not available and fallback fabric is not enabled
[1] dapl fabric is not available and fallback fabric is not enabled
rank 1 in job 1 n021-ib_55314 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
Jon
Hi Jon,
It's unclear why InfiniHost0 has not been chosen by the 4.0 library. Your DAPL library is probably quite old. Intel MPI Library 4.0 no longer supports DAPL 1.1, so in that case you need to install a newer library.
Could you provide your /etc/dat.conf file?
And it may be useful to take a look at the output of 4.0 with I_MPI_DEBUG set to 100.
Regards!
Dmitry
Dmitry,
Yes, the DAPL library is quite old, and supports only DAPL 1.1. I didn't see any mention of this in the reference manual, though I see that it is mentioned on the website. Looks like it's finally time to upgrade!
In any case, here's the result with I_MPI_DEBUG set to 100:
$ mpirun -r ssh -rr -genv I_MPI_DEVICE rdssm -genv I_MPI_FALLBACK_DEVICE 0 -genv I_MPI_DEBUG 100 -n 2 ./hello_c
[0] MPI startup(): Intel MPI Library, Version 4.0 Build 20100224
[0] MPI startup(): Copyright (C) 2003-2010 Intel Corporation. All rights reserved.
[0] MPI startup(): RDMA, shared memory, and socket data transfer modes
[1] MPI startup(): RDMA, shared memory, and socket data transfer modes
[0] my_dlopen(): trying to dlopen: libdat.so
[1] my_dlopen(): trying to dlopen: libdat.so
[1] dapl fabric is not available and fallback fabric is not enabled
[0] dapl fabric is not available and fallback fabric is not enabled
rank 1 in job 1 n043-ib_50498 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
rank 0 in job 1 n043-ib_50498 caused collective abort of all ranks
exit status of rank 0: return code 254
And my dat.conf is simply
InfiniHost0 u1.1 nonthreadsafe default /usr/lib64/libdapl.so ri.1.1 "InfiniHost0 ib0" " "
Thanks,
Jon
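For anyone hitting the same message, a quick way to see which DAPL API version each provider in dat.conf declares is to read the second field of every entry. The following is only a minimal sketch: the function name `dat_api_versions` is made up for illustration, and it assumes the usual DAT registry layout in which the first field is the provider name and the second is the API version (for example `u1.1` or `u2.0`).

```shell
# Hypothetical helper (not part of Intel MPI or OFED): list each provider
# in a DAT registry file with the API version it declares, and mark the
# entries that declare DAPL 1.1.
dat_api_versions() {
    conf="${1:-/etc/dat.conf}"   # default registry path, as in the thread
    awk '
        NF < 2 || $1 ~ /^#/ { next }   # skip blank lines and comments
        {
            # Assumption: field 2 is the API version, e.g. u1.1, u1.2, u2.0
            note = ($2 == "u1.1") ? "  <-- DAPL 1.1" : ""
            printf "%s %s%s\n", $1, $2, note
        }
    ' "$conf"
}
```

Running `dat_api_versions /etc/dat.conf` against the single-line file above would mark InfiniHost0 as a DAPL 1.1 entry, which matches the diagnosis in this thread. It only reports what the registry file says and does not check the library the entry points to.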
Hi Jon,
Yes, the old DAPL library is the issue. Could you upgrade it by downloading a newer OFED stack from http://openfabrics.org/?
Let me know if the problem still persists.
Regards!
Dmitry
Upgrading to OFED 1.5 solves the problem. Thanks for the help.
You are welcome!