Hello,
I am trying to start an MPI job with the following setup.
I have two nodes, workstation1 and workstation2.
I can ssh from workstation1 (10.0.0.1) to workstation2 (10.0.0.2) without a password; I have already set up RSA keys.
I can ssh from both workstation1 and workstation2 to themselves without a password.
I can ping from 10.0.0.1 to 10.0.0.2 and from 10.0.0.2 to 10.0.0.1.
workstation1 and workstation2 are connected via Mellanox InfiniBand.
I'm running Intel(R) MPI Library, Version 2017 Update 2 Build 20170125
I've installed MLNX_OFED_LINUX-4.1-1.0.2.0-ubuntu16.04-x86_64
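For reference, this is roughly how I verify on each node that the HCA is up and that the DAPL provider name I pass to Intel MPI actually exists (a minimal sketch; the device name mlx4_0 and the /etc/dat.conf location are assumptions based on my setup):
# Show the Mellanox HCA and its port/link state (utility ships with MLNX_OFED).
ibstat
# Show the verbs device name (e.g. mlx4_0) and port state.
ibv_devinfo
# List the DAPL providers registered on the system; ofa-v2-mlx4_0-1 should be present on both nodes.
grep ofa-v2 /etc/dat.conf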
workstation1 /etc/hosts:
127.0.0.1   localhost
10.0.0.1    workstation1
# The following lines are desirable for IPv6 capable hosts
#::1        ip6-localhost ip6-loopback
#fe00::0    ip6-localnet
#ff00::0    ip6-mcastprefix
#ff02::1    ip6-allnodes
#ff02::2    ip6-allrouters
# mpi nodes
10.0.0.2    workstation2
-------------------------------------------------------------
workstation2 /etc/hosts:
127.0.0.1   localhost
10.0.0.2    workstation2
# The following lines are desirable for IPv6 capable hosts
#::1        ip6-localhost ip6-loopback
#fe00::0    ip6-localnet
#ff00::0    ip6-mcastprefix
#ff02::1    ip6-allnodes
#ff02::2    ip6-allrouters
# mpi nodes
10.0.0.1    workstation1
--------------------------------------------------------------
Here's my application start script (simplified app names and params):
#!/bin/bash
export PATH=$PATH:$PWD:/opt/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$I_MPI_ROOT/intel64/lib:../program1/bin:../program2/bin
export I_MPI_FABRICS=dapl:dapl
export I_MPI_DEBUG=6
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1
# Due to a bug in Intel MPI, the -genv I_MPI_ADJUST_BCAST "9" flag has been added.
# More detailed information is available at: https://software.intel.com/en-us/articles/intel-mpi-library-2017-known-issue-mpi-bcast-hang-on-large-user-defined-datatypes
mpirun -l -genv I_MPI_ADJUST_BCAST "9" -genv I_MPI_PIN_DOMAIN=omp : -n 1 -host 10.0.0.1 ../program1/bin/program1 master stitching stitching \
    : -n 1 -host 10.0.0.2 ../program1/bin/program1 slave dissemination \
    : -n 1 -host 10.0.0.1 ../program1/bin/program2 param1 param2
-------------------------------------------
I can start my application across both nodes with export I_MPI_FABRICS=tcp:tcp, but when I start it with dapl:dapl I get the following error:
OUTPUT :
[0] [0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 2 Build 20170125 (id: 16752)
[0] [0] MPI startup(): Copyright (C) 2003-2017 Intel Corporation. All rights reserved.
[0] [0] MPI startup(): Multi-threaded optimized library
[0] [0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[1] [1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[2] [2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[0] [0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[0] [0] MPI startup(): dapl data transfer mode
[1] [1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[2] [2] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[1] [1] MPI startup(): dapl data transfer mode
[2] [2] MPI startup(): dapl data transfer mode
[0] [0:10.0.0.1] unexpected DAPL event 0x4003
[0] Fatal error in PMPI_Init_thread: Internal MPI error!, error stack:
[0] MPIR_Init_thread(805): fail failed
[0] MPID_Init(1831)......: channel initialization failed
[0] MPIDI_CH3_Init(147)..: fail failed
[0] (unknown)(): Internal MPI error!
[1] [1:10.0.0.2] unexpected DAPL event 0x4003
[1] Fatal error in PMPI_Init_thread: Internal MPI error!, error stack:
[1] MPIR_Init_thread(805): fail failed
[1] MPID_Init(1831)......: channel initialization failed
[1] MPIDI_CH3_Init(147)..: fail failed
[1] (unknown)(): Internal MPI error!
Do you have any idea what the cause could be? By the way, with dapl on a single node I can start my application on each computer separately (meaning -host 10.0.0.1 for every process on workstation1, never attaching the 10.0.0.2 processes).
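If it helps, this is the kind of low-level RDMA check I can also run between the two nodes, independent of MPI (a sketch; ib_write_bw comes from the perftest package bundled with MLNX_OFED, and the device name is an assumption from my setup):
# On workstation2, start the server side of the bandwidth test:
ib_write_bw -d mlx4_0
# On workstation1, connect to workstation2 and run the test over RDMA:
ib_write_bw -d mlx4_0 10.0.0.2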
Hi,
This could be an internal MPI library issue, or something completely different. I'd like to see the output of a few tests to see if I can help you isolate the issue:
1) Run a simple "mpirun -n 1 -host 10.0.0.1 hostname : -n 1 -host 10.0.0.2 hostname"
2) Build the "test.c" example provided with Intel MPI (in the installation directory under the test directory) and run that:
$ mpicc test.c -o impi_test
$ mpirun -n 1 -host 10.0.0.1 ./impi_test : -n 1 -host 10.0.0.2 ./impi_test
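If the dapl run fails, it would also help to run the same binary once over tcp and once over dapl with debug output turned on, so we can compare the startup paths (just a sketch using the standard I_MPI_FABRICS and I_MPI_DEBUG settings):
$ mpirun -genv I_MPI_DEBUG 6 -genv I_MPI_FABRICS tcp:tcp -n 1 -host 10.0.0.1 ./impi_test : -n 1 -host 10.0.0.2 ./impi_test
$ mpirun -genv I_MPI_DEBUG 6 -genv I_MPI_FABRICS dapl:dapl -n 1 -host 10.0.0.1 ./impi_test : -n 1 -host 10.0.0.2 ./impi_test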
This will help me determine whether this is a startup issue, as it appears to be, or something more related to the MPMD setup you seem to be running.
Also, is your system configured for IPv4 or IPv6?
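To check, something along these lines should show which address families are configured on your interfaces (assuming the iproute2 ip tool is available):
$ ip -4 addr show
$ ip -6 addr show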
Regards,
Carlos