sayginify
Beginner

Unexpected DAPL event 0x4003

Hello,

I'm trying to start an MPI job with the following settings.

I have two nodes, workstation1 and workstation2.
I can ssh from workstation1 (10.0.0.1) to workstation2 (10.0.0.2) without a password. I've already set up RSA keys.
I can ssh from both workstation1 and workstation2 to themselves without password.
I can ping from 10.0.0.1 to 10.0.0.2 and from 10.0.0.2 to 10.0.0.1

workstation1 and workstation2 are connected via Mellanox InfiniBand.
I'm running Intel(R) MPI Library, Version 2017 Update 2  Build 20170125
I've installed MLNX_OFED_LINUX-4.1-1.0.2.0-ubuntu16.04-x86_64

workstation1 /etc/hosts :

127.0.0.1    localhost
10.0.0.1    workstation1

# The following lines are desirable for IPv6 capable hosts
#::1     ip6-localhost ip6-loopback
#fe00::0 ip6-localnet
#ff00::0 ip6-mcastprefix
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters

# mpi nodes
10.0.0.2 workstation2

-------------------------------------------------------------
workstation2 /etc/hosts :

127.0.0.1    localhost
10.0.0.2    workstation2

# The following lines are desirable for IPv6 capable hosts
#::1     ip6-localhost ip6-loopback
#fe00::0 ip6-localnet
#ff00::0 ip6-mcastprefix
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters

#mpi nodes
10.0.0.1 workstation1

--------------------------------------------------------------
Here's my application start command, (simplified app names and params)

#!/bin/bash
export PATH=$PATH:$PWD:/opt/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$I_MPI_ROOT/intel64/lib:../program1/bin:../program2/bin
export I_MPI_FABRICS=dapl:dapl
export I_MPI_DEBUG=6
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1

# Due to a bug in Intel MPI, the -genv I_MPI_ADJUST_BCAST "9" flag has been added.
# More detailed information is available at: https://software.intel.com/en-us/articles/intel-mpi-library-2017-known-issue-mpi-bcast-hang-on-large...

mpirun -l -genv I_MPI_ADJUST_BCAST "9" -genv I_MPI_PIN_DOMAIN=omp \
: -n 1 -host 10.0.0.1 ../program1/bin/program1 master stitching stitching \
: -n 1 -host 10.0.0.2 ../program1/bin/program1 slave dissemination \
: -n 1 -host 10.0.0.1 ../program1/bin/program2 param1 param2

-------------------------------------------

I can start my application across both nodes with export I_MPI_FABRICS=tcp:tcp, but when I start it with dapl:dapl it gives the following error:

OUTPUT :

[0] [0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 2  Build 20170125 (id: 16752)
[0] [0] MPI startup(): Copyright (C) 2003-2017 Intel Corporation.  All rights reserved.
[0] [0] MPI startup(): Multi-threaded optimized library
[0] [0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[1] [1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[2] [2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[0] [0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[0] [0] MPI startup(): dapl data transfer mode
[1] [1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[2] [2] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[1] [1] MPI startup(): dapl data transfer mode
[2] [2] MPI startup(): dapl data transfer mode
[0] [0:10.0.0.1] unexpected DAPL event 0x4003
[0] Fatal error in PMPI_Init_thread: Internal MPI error!, error stack:
[0] MPIR_Init_thread(805): fail failed
[0] MPID_Init(1831)......: channel initialization failed
[0] MPIDI_CH3_Init(147)..: fail failed
[0] (unknown)(): Internal MPI error!
[1] [1:10.0.0.2] unexpected DAPL event 0x4003
[1] Fatal error in PMPI_Init_thread: Internal MPI error!, error stack:
[1] MPIR_Init_thread(805): fail failed
[1] MPID_Init(1831)......: channel initialization failed
[1] MPIDI_CH3_Init(147)..: fail failed
[1] (unknown)(): Internal MPI error!

Do you have any idea what could be the cause? By the way, with dapl on a single node, I can start my application on each computer separately (meaning -host 10.0.0.1 for all applications on workstation1, without any of the 10.0.0.2 processes, and likewise on workstation2).
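
Since the single-node DAPL runs succeed but the two-node run fails, one thing worth ruling out is the InfiniBand port state on each node. The sketch below is only a suggestion, not a confirmed diagnosis: it counts active ports in the output of ibv_devinfo, assuming that tool is available from the OFED installation. The helper name active_ports is made up for illustration.

```shell
#!/bin/bash
# Sketch: count active InfiniBand ports in ibv_devinfo output read from stdin.
# active_ports is a hypothetical helper name, not part of OFED or Intel MPI.
active_ports() {
    # Matches lines such as "state:    PORT_ACTIVE (4)".
    # grep -c exits non-zero when the count is 0, so mask that with "|| true".
    grep -c 'state:[[:space:]]*PORT_ACTIVE' || true
}

# Usage on each node (assumes ibv_devinfo is installed):
#   ibv_devinfo | active_ports
```

A result of 0 on either workstation would point at the link or the subnet manager rather than at the MPI library.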

Carlos_R_Intel
Employee

Hi,

This could be an internal MPI library issue, or something completely different. I'd like to see the output for a few tests to see if I can help you isolate the issue:

1) Run a simple "mpirun -n 1 -host 10.0.0.1 hostname : -n 1 -host 10.0.0.2 hostname"

2) Build the "test.c" example provided with Intel MPI (in the installation directory under the test directory) and run that:

$ mpicc test.c -o impi_test

$ mpirun -n 1 -host 10.0.0.1 ./impi_test : -n 1 -host 10.0.0.2 ./impi_test

This will help me determine whether this is a startup issue, as it appears to be, or something more related to the MPMD setup you seem to be running.
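
To make the outputs easy to compare, the two tests above could be run once per fabric. The following is a minimal sketch under assumptions: the function names and the DRY_RUN variable are invented here, and the fabric values are the tcp:tcp and dapl:dapl settings from the original post.

```shell
#!/bin/bash
# Sketch: print (and optionally run) the two-node test once per fabric.
# two_node_cmd, run_both_fabrics and DRY_RUN are hypothetical names.
two_node_cmd() {
    local fabric="$1" exe="$2"
    echo "mpirun -genv I_MPI_FABRICS $fabric -genv I_MPI_DEBUG 6 -n 1 -host 10.0.0.1 $exe : -n 1 -host 10.0.0.2 $exe"
}

run_both_fabrics() {
    local exe="$1" fabric cmd
    for fabric in tcp:tcp dapl:dapl; do
        cmd="$(two_node_cmd "$fabric" "$exe")"
        echo "== $cmd"
        # Execute the command unless DRY_RUN=1 is set.
        [ "${DRY_RUN:-0}" = "1" ] || eval "$cmd"
    done
}

# Usage:
#   DRY_RUN=1 run_both_fabrics hostname    # only print the commands
#   run_both_fabrics ./impi_test           # actually run both variants
```

If the tcp:tcp run passes and the dapl:dapl run fails even for hostname or impi_test, the problem is unlikely to be specific to the MPMD launch line.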

Also, is your system configured for IPv4 or IPv6?
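
One rough way to answer the IPv4/IPv6 question is to count the address families reported by `ip -o addr show` on each node. This is a sketch only; addr_family_summary is a made-up helper name, and it assumes the iproute2 `ip` tool is installed.

```shell
#!/bin/bash
# Sketch: summarize address families from `ip -o addr show` output on stdin.
# addr_family_summary is a hypothetical helper name.
addr_family_summary() {
    # In the one-line format, field 3 is the family: "inet" or "inet6".
    awk '
        $3 == "inet"  { v4++ }
        $3 == "inet6" { v6++ }
        END { printf "ipv4=%d ipv6=%d\n", v4, v6 }
    '
}

# Usage on each node:
#   ip -o addr show | addr_family_summary
```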

Regards,
Carlos
