Could anybody help me run Intel MPI over InfiniBand (IB)?
My steps were:
1. Got the 30-day evaluation of Intel MPI 3.0
2. Installed it in a shared directory
3. Configured password-less SSH between nodes
4. Configured IPoIB (for other purposes), confirmed working
5. Compiled the test MPI application that comes with Intel MPI
Now it works over Ethernet, but I can't run it over IB:
$ mpirun -n 4 -r ssh /gpfs/loadl/HPL/prefix/intel/mpi/3.0/test/test
Hello world: rank 0 of 4 running on n1
Hello world: rank 1 of 4 running on n3
Hello world: rank 2 of 4 running on n4
Hello world: rank 3 of 4 running on n2
$ mpirun -n 4 -r ssh -env I_MPI_DEVICE rdssm:OpenIB-cma -env I_MPI_FALLBACK_DEVICE 0 -env I_MPI_DEBUG 5 /gpfs/loadl/HPL/prefix/intel/mpi/3.0/test/test
[0] DAPL provider is not found and fallback device is not enabled
[cli_0]: aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(925): Initialization failed
MPIDD_Init(95).......: channel initialization failed
MPIDI_CH3_Init(144)..: generic failure with errno = -1
(unknown)():
rank 3 in job 1 n1_36568 caused collective abort of all ranks
exit status of rank 3: return code 13
[output from other nodes skipped]
My IB configuration: OFED 1.2.5 from Cisco:
OFED-1.2.5
ofa_kernel-1.2.5:
Git:
git://git.openfabrics.org/ofed_1_2/linux-2.6.git ofed_1_2_c
commit 21ec9ff84cba24ea6e53a268da21a72e6ab190d0
ofa_user-1.2.5:
libibverbs:
git://git.kernel.org/pub/scm/libs/infiniband/libibverbs.git master
commit d5052fa0bf8180be9edf1c4c1c014dde01f8a4dd
libmthca:
git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git master
commit f29c1d8a198a8d7f322c3924205a62770a9862a3
libmlx4:
git://git.kernel.org/pub/scm/libs/infiniband/libmlx4.git master
commit fc9edce51069fd38e33c9e627d9a89bc1e329b67
libehca:
git://git.openfabrics.org/ofed_1_2/libehca.git ofed_1_2
commit 00b26973092c949b11b8372eb027059fda7a8061
libipathverbs:
git://git.openfabrics.org/ofed_1_2/libipathverbs.git ofed_1_2
commit 15f62c3f045295dd2a941ae8d4e0e36035aad5cf
tvflash:
git://git.openfabrics.org/ofed_1_2/tvflash.git ofed_1_2
commit e0a0903b2a998a397ada053554fd678ed7914cc6
libibcm:
git://git.openfabrics.org/ofed_1_2/libibcm.git ofed_1_2
commit 8154d4d57f69789be6d26fdc8f10b552c83a87ec
libsdp:
git://git.openfabrics.org/ofed_1_2/libsdp.git ofed_1_2
commit 9e1c2cce1cbe030bf8fc9c03db4e80a703946af1
mstflint:
git://git.openfabrics.org/~mst/mstflint.git master
commit a9579dfbd259133cb50bf6b12ff247d5a04a9473
perftest:
git://git.openfabrics.org/~mst/perftest.git master
commit 20ea8b29537dda3f0a217b95ac50a0aaa7b24477
srptools:
git://git.openfabrics.org/ofed_1_2/srptools.git ofed_1_2
commit 883a08f0db168f4eb20293552f6416529da982f1
ipoibtools:
git://git.openfabrics.org/ofed_1_2/ipoibtools.git ofed_1_2
commit e29da6049cb725b175423fddc80181980ebfa0b4
librdmacm:
git://git.openfabrics.org/ofed_1_2/librdmacm.git ofed_1_2
commit 87b2be8cf17cca4f2212c32ecfd06c35d7ac7719
dapl:
git://git.openfabrics.org/ofed_1_2/dapl.git ofed_1_2
commit 3654c6ef425f94b9f27a593b0b8c1f3d7cc39029
management:
git://git.openfabrics.org/ofed_1_2/management.git ofed_1_2
commit 46bdba974ee2e1c8a64101effdb7358fd9060c8b
libcxgb3:
git://git.openfabrics.org/ofed_1_2/libcxgb3.git ofed_1_2
commit f97dcedc6d5af5c222542d69755ad4193f2114fc
qlvnictools:
git://git.openfabrics.org/ofed_1_2/qlvnictools.git ofed_1_2
commit bcfd11d4b5369398f2f816d0e1d89b6e98b25961
sdpnetstat:
git://git.openfabrics.org/ofed_1_2/sdpnetstat.git ofed_1_2
commit d726c17c3b54739ad71e2234c521aa3ee81a5905
ofascripts:
git://git.openfabrics.org/~vlad/ofascripts.git ofed_1_2_c
commit 598684991ff6127dd803540c757f56b289872bef
# MPI
mvapich-0.9.9-1458.src.rpm
mvapich2-0.9.8-15.src.rpm
openmpi-1.2.2-1.src.rpm
mpitests-2.0-705.src.rpm
$ ibv_devinfo
hca_id: mthca0
fw_ver: 4.8.917
node_guid: 0005:ad00:000b:b224
sys_image_guid: 0005:ad00:0100:d050
vendor_id: 0x05ad
vendor_part_id: 25208
hw_ver: 0xA0
board_id: HCA.HSDC.A0.Boot
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 2
port_lid: 6
port_lmc: 0x00
port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 512 (2)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
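The ibv_devinfo output shows port 1 as PORT_ACTIVE and port 2 as PORT_DOWN. Since the default OpenIB-cma entry in dat.conf binds to the ib0 netdev, it may be worth confirming on every node that at least one HCA port is active before suspecting MPI itself. A minimal sketch, assuming a POSIX shell with awk; the function name active_ib_ports is hypothetical and not part of OFED:

```shell
# Hypothetical helper (not an OFED tool): print the number of every port
# that ibv_devinfo reports as PORT_ACTIVE. Reads ibv_devinfo output on
# stdin, so it also works on a saved log.
active_ib_ports() {
    awk '
        # Remember the current port number when a "port:" line appears.
        $1 == "port:" { port = $2 }
        # Print that port number when its state line says PORT_ACTIVE.
        $1 == "state:" && $2 == "PORT_ACTIVE" { print port }
    '
}

# Usage on a live node (requires the OFED libibverbs utilities):
#   ibv_devinfo | active_ib_ports
```

Fed the ibv_devinfo output above, it prints 1; an empty result on any node would mean no usable IB port there.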
$ cat /etc/dat.conf
#
# DAT 1.2 configuration file
#
# Each entry should have the following fields:
#
#
#
#
# For the uDAPL cma provder, specify as one of the following:
# network address, network hostname, or netdev name and 0 for port
#
# Simple (OpenIB-cma) default with netdev name provided first on list
# to enable use of same dat.conf version on all nodes
#
# Add examples for multiple interfaces and IPoIB HA fail over, and bonding
#
OpenIB-cma u1.2 nonthreadsafe default /usr/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""
OpenIB-cma-1 u1.2 nonthreadsafe default /usr/lib64/libdaplcma.so dapl.1.2 "ib1 0" ""
OpenIB-cma-2 u1.2 nonthreadsafe default /usr/lib64/libdaplcma.so dapl.1.2 "ib2 0" ""
OpenIB-cma-3 u1.2 nonthreadsafe default /usr/lib64/libdaplcma.so dapl.1.2 "ib3 0" ""
OpenIB-bond u1.2 nonthreadsafe default /usr/lib64/libdaplcma.so dapl.1.2 "bond0 0" ""
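The message "DAPL provider is not found" suggests that Intel MPI could not resolve the provider named after `rdssm:` in I_MPI_DEVICE (here OpenIB-cma) through dat.conf, or could not load the library that entry points to. One thing to check on every node is that the entry exists and that its library path (the fifth field, /usr/lib64/libdaplcma.so above) is actually present. A minimal sketch, assuming the field layout of the dat.conf entries shown above; check_dapl_provider is a hypothetical helper name:

```shell
# Hypothetical helper: look up a DAPL provider name in a dat.conf file and
# check that the library path in its fifth field exists on this node.
# Returns 0 if OK, 1 if the entry is missing, 2 if the library is missing.
check_dapl_provider() {
    conf=$1      # path to dat.conf, e.g. /etc/dat.conf
    name=$2      # provider name, e.g. OpenIB-cma
    # Comment lines start with "#", so they never match a provider name.
    lib=$(awk -v n="$name" '$1 == n { print $5; exit }' "$conf")
    if [ -z "$lib" ]; then
        echo "provider $name: not listed in $conf"
        return 1
    fi
    if [ ! -e "$lib" ]; then
        echo "provider $name: library $lib is missing"
        return 2
    fi
    echo "provider $name: OK ($lib)"
}
```

For example, running `check_dapl_provider /etc/dat.conf OpenIB-cma` on each of n1 to n4 would show whether any single node lacks the entry or the uDAPL library.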
My customers don't get much information from Cisco, so we're not sufficiently in the loop. However, I received the following comment this week:
The current Topspin release, 3.2.0-118, has fixes for uDAPL and Intel MPI. The release notes state:
uDAPL
Fixed uDAPL startup scalability problem when using Intel MPI. (PR CSCse88951)
Thanks for your prompt reply.
I'm not using the old Cisco MPI (as I remember, Cisco got it from Topspin, and it derives from MPICH). Cisco now uses OFED, and I'm trying to run Intel MPI on the newest OFED version.
Were you able to run Intel MPI on the newest OFED version? If you still have problems with runs, the output with a higher I_MPI_DEBUG value could be useful.
After a number of unsuccessful attempts, it now works (don't ask me why; I don't know).
My next question: how do I compile 64-bit MPI applications with Intel MPI on x86_64?
$ mpicc -o osu_acc_latency-intel-mpi osu_acc_latency.c
$ file osu_acc_latency-intel-mpi
osu_acc_latency-intel-mpi: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), not stripped
Please make sure that you have set up the 64-bit MPI environment by sourcing the 64-bit mpivars script.
Best regards,
Andrey
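The `file` output in the question reports a 32-bit Intel 80386 executable, which is consistent with the 32-bit environment still being active when mpicc ran. A minimal sketch of the check to repeat after switching environments; the bin64/mpivars.sh location is an assumption about the installation layout, and elf_class is a hypothetical helper:

```shell
# Hypothetical helper: classify a line of `file` output as 32 or 64 bit.
elf_class() {
    case "$1" in
        *"ELF 64-bit"*) echo 64 ;;
        *"ELF 32-bit"*) echo 32 ;;
        *)              echo unknown ;;
    esac
}

# Typical sequence on the x86_64 nodes. ASSUMPTION: the 64-bit environment
# script is bin64/mpivars.sh under the install prefix used in this thread;
# check your installation for the actual file name and location.
#   . /gpfs/loadl/HPL/prefix/intel/mpi/3.0/bin64/mpivars.sh
#   mpicc -o osu_acc_latency-intel-mpi osu_acc_latency.c
#   elf_class "$(file osu_acc_latency-intel-mpi)"
```

Applied to the `file` output quoted in the question, elf_class prints 32; after a correct 64-bit build it should print 64.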