Beginner

DAPL works but OFA not

Dear Intel colleagues,

I have just set up a new diskless cluster. Running IMB PingPong with -genv I_MPI_FABRICS shm:dapl shows promising performance, but with -genv I_MPI_FABRICS shm:ofa the run fails at MPI startup. The system environment and execution traces are below. Any help would be much appreciated.

# I_MPI_DEBUG 4 
/opt/intel/impi/4.1.1.036/intel64/bin/mpirun -n 2 -host dn01,dn02 -ppn 1 -genv I_MPI_DEBUG 4  -genv I_MPI_FABRICS shm:ofa /opt/intel/impi/4.1.1.036/intel64/bin/IMB-MPI1 PingPong
[0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[1] MPI startup(): ofa fabric is not available and fallback fabric is not enabled

#I_MPI_DEBUG 2
/opt/intel/impi/4.1.1.036/intel64/bin/mpirun -n 2 -host dn01,dn02 -ppn 1 -genv I_MPI_DEBUG 2  -genv I_MPI_FABRICS shm:ofa   /opt/intel/impi/4.1.1.036/intel64/bin/IMB-MPI1 PingPong
[0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[1] MPI startup(): ofa fabric is not available and fallback fabric is not enabled

#I_MPI_DEBUG 100
/opt/intel/impi/4.1.1.036/intel64/bin/mpirun -n 2 -host dn01,dn02 -ppn 1 -genv I_MPI_DEBUG 100  -genv I_MPI_FABRICS shm:ofa /opt/intel/impi/4.1.1.036/intel64/bin/IMB-MPI1 PingPong
[0] MPI startup(): Intel(R) MPI Library, Version 4.1 Update 1  Build 20130522
[0] MPI startup(): Copyright (C) 2003-2013 Intel Corporation.  All rights reserved.
[0] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[1] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[0] MPI startup(): Found 1 IB devices
[1] MPI startup(): Found 1 IB devices
[1] MPI startup(): Open 0 IB device: mlx4_0
[0] MPI startup(): Open 0 IB device: mlx4_0
[1] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled

mpirun -V
Intel(R) MPI Library for Linux* OS, Version 4.1 Update 1 Build 20130522
Copyright (C) 2003-2013, Intel Corporation. All rights reserved.

icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.2.183 Build 20130514
Copyright (C) 1985-2013 Intel Corporation.  All rights reserved.

env | grep I_MPI
I_MPI_ROOT=/opt/intel/impi/4.1.1.036

pdsh -w dn[01-06] ls /usr/lib64/libibverbs.so
dn01: /usr/lib64/libibverbs.so
dn02: /usr/lib64/libibverbs.so
dn05: /usr/lib64/libibverbs.so
dn06: /usr/lib64/libibverbs.so
dn03: /usr/lib64/libibverbs.so
dn04: /usr/lib64/libibverbs.so

ibstat -V
ibstat BUILD VERSION: 1.6.1.MLNX20130822.dfac5dd Build date: Aug 25 2013 11:19:43

uname -a
Linux dn01 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
ssh dn01
Last login: Tue Oct 28 10:56:44 2014 from head.cluster

head -n 20 /etc/dat.conf
# DAT v2.0, v1.2 configuration file
#
# Each entry should have the following fields:
#
# <ia_name> <api_version> <threadsafety> <default> <lib_path> \
#           <provider_version> <ia_params> <platform_params>
#
# For uDAPL cma provder, <ia_params> is one of the following:
#       network address, network hostname, or netdev name and 0 for port
#
# For uDAPL scm provider, <ia_params> is device name and port
# For uDAPL ucm provider, <ia_params> is device name and port
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0
# For uDAPL iWARP provider, <ia_params> is netdev device name and 0
# For uDAPL RoCE provider, <ia_params> is device name and 0
#
#ON THIS CLUSTER, ONLY PORT 2 OF EACH HCA IS ACTIVATED
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0" ""
ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mthca0 1" ""
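
The comment in dat.conf says only port 2 of each HCA is active. A quick way to confirm which ports are actually up on a node is to read the per-port state files in sysfs. This is a minimal sketch, assuming the standard /sys/class/infiniband layout; `list_port_states` is a helper name made up for this example.

```shell
#!/bin/sh
# List the state of every InfiniBand port found under a sysfs root.
# $1 is the directory to scan (defaults to /sys/class/infiniband).
list_port_states() {
    root="${1:-/sys/class/infiniband}"
    for f in "$root"/*/ports/*/state; do
        [ -r "$f" ] || continue           # glob did not match, or file unreadable
        port_dir="${f%/state}"            # .../<dev>/ports/<n>
        port="${port_dir##*/}"            # <n>
        dev_dir="${port_dir%/ports/*}"    # .../<dev>
        dev="${dev_dir##*/}"              # <dev>
        printf '%s port %s: %s\n' "$dev" "$port" "$(cat "$f")"
    done
}

list_port_states
```

Run on each compute node (for example through pdsh), this prints one line per device and port, so a port that is down stands out immediately.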

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1032855
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1032855
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


# If I_MPI_FALLBACK is enabled, the run with I_MPI_FABRICS shm:ofa starts, but it apparently falls back to tcp over 1 Gbit Ethernet
export I_MPI_FALLBACK=1
[root@head run-033]# /opt/intel/impi/4.1.1.036/intel64/bin/mpirun -n 2 -host dn01,dn02 -ppn 1 -genv I_MPI_DEBUG 4  -genv I_MPI_FABRICS shm:ofa /opt/intel/impi/4.1.1.036/intel64/bin/IMB-MPI1 PingPong
[0] MPI startup(): fabric ofa failed: will try use tcp fabric
[1] MPI startup(): fabric ofa failed: will try use tcp fabric
[0] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       30486    dn01       {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
[0] MPI startup(): 1       29284    dn02       {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
 ...(ignored)
      2097152           20     17770.48       112.55
      4194304           10     35445.40       112.85
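
As a sanity check on the "falls back to 1 Gbit Ethernet" reading, the two rows above can be converted back into throughput. This is a sketch under stated assumptions: the column header was cut from the paste, so the columns are assumed to be message size in bytes, repetitions, time in microseconds, and Mbytes/sec, with 1 Mbyte counted as 2^20 bytes. `imb_mbps` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: recompute throughput from one IMB PingPong row.
# Assumes $1 = message size in bytes, $2 = time in microseconds,
# and that the Mbytes/sec column uses 1 Mbyte = 2^20 bytes.
imb_mbps() {
    awk -v bytes="$1" -v usec="$2" \
        'BEGIN { printf "%.2f\n", (bytes / 1048576) / (usec / 1000000) }'
}

imb_mbps 2097152 17770.48
imb_mbps 4194304 35445.40

# Raw ceiling of a 1 Gbit/s link in the same unit, ignoring protocol overhead.
awk 'BEGIN { printf "%.2f\n", 1000000000 / 8 / 1048576 }'
```

Under those assumptions both rows reproduce the reported 112.55 and 112.85, just below the roughly 119 Mbytes/sec that a 1 Gbit/s link can carry, which is consistent with the traffic going over Ethernet rather than InfiniBand.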

 

3 Replies
Moderator

Do you have OFED* installed?  In order to use the ofa fabric, you will need access to the native OFED* verbs.

Beginner

Could you be more specific? I do think I have MLNX OFED installed. Is there any command to test that?
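
One way to check, as a minimal sketch: see which of the usual OFED userspace tools are on the PATH and, where present, ask them what they see. It assumes that `ofed_info`, `ibv_devices`, `ibv_devinfo` and `ibstat` are the tools a MLNX OFED install provides; `have` and `check_tool` are helper names made up for this example.

```shell
#!/bin/sh
# Report which OFED userspace tools are on PATH, then run the ones found.
have() { command -v "$1" >/dev/null 2>&1; }

check_tool() {
    # Prints "found <tool>" or "missing <tool>".
    if have "$1"; then echo "found $1"; else echo "missing $1"; fi
}

for tool in ofed_info ibv_devices ibv_devinfo ibstat; do
    check_tool "$tool"
done

# Only query the stack when the tools are present.
if have ofed_info; then ofed_info -s; fi      # installed OFED release string
if have ibv_devinfo; then ibv_devinfo; fi     # devices and port state as seen through verbs
```

If `ibv_devinfo` lists the HCA with a port in PORT_ACTIVE state, the verbs layer that the ofa fabric relies on is at least visible from that node.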

Beginner

Hello,

I followed this thread to try to solve my issue, but unfortunately I was not able to resolve it.

Neither DAPL nor OFA works for me.

Software Versions:

  • MLNX_OFED_LINUX-2.3-1.0.1-rhel6.5-x86_64
  • Intel parallel cluster 2015
  • Intel MPSS 3.4.3
  • Mellanox InfiniBand ConnectX-3 adapter

With OFA:

export I_MPI_MIC=1
export I_MPI_FABRICS=shm:ofa
export I_MPI_DEVICE=rdssm
export I_MPI_OFA_ADAPTER_NAME=mlx4_0
export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u ,ofa-v2-scif0
export I_MPI_PIN_MODE=pm
export I_MPI_PIN_DOMAIN=auto
 

Error Messages: [export I_MPI_DEBUG=2]

[42] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[19] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[43] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[26] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[27] MPI startup(): ofa fabric is not available and fallback fabric is not enabled

Error Messages: [export I_MPI_DEBUG=100]

[0] MPI startup(): Intel(R) MPI Library, Version 5.0 Update 1  Build 20140709
[0] MPI startup(): Copyright (C) 2003-2014 Intel Corporation.  All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[1] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[2] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[5] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[9] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[10] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[3] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[4] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[6] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[7] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[8] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[11] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[9] MPI startup(): Found 2 IB devices
[10] MPI startup(): Found 2 IB devices
[6] MPI startup(): Found 2 IB devices
[8] MPI startup(): Found 2 IB devices
[7] MPI startup(): Found 2 IB devices
[11] MPI startup(): Found 2 IB devices
[0] MPI startup(): Found 2 IB devices
[1] MPI startup(): Found 2 IB devices
[3] MPI startup(): Found 2 IB devices
[2] MPI startup(): Found 2 IB devices
[4] MPI startup(): Found 2 IB devices
[5] MPI startup(): Found 2 IB devices
[10] MPI startup(): Open 0 IB device: mlx4_0
[6] MPI startup(): Open 0 IB device: mlx4_0
[9] MPI startup(): Open 0 IB device: mlx4_0
[8] MPI startup(): Open 0 IB device: mlx4_0
[5] MPI startup(): Open 0 IB device: mlx4_0
[7] MPI startup(): Open 0 IB device: mlx4_0
[3] MPI startup(): Open 0 IB device: mlx4_0
[1] MPI startup(): Open 0 IB device: mlx4_0
[4] MPI startup(): Open 0 IB device: mlx4_0
[0] MPI startup(): Open 0 IB device: mlx4_0
[11] MPI startup(): Open 0 IB device: mlx4_0
[42] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[2] MPI startup(): Open 0 IB device: mlx4_0
[36] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[31] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[37] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[37] MPI startup(): Found 0 IB devices
[31] MPI startup(): Found 0 IB devices
[38] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[38] MPI startup(): Found 0 IB devices
[40] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[40] MPI startup(): Found 0 IB devices
[20] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[33] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[33] MPI startup(): Found 0 IB devices
[20] MPI startup(): Found 0 IB devices
[17] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[43] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[43] MPI startup(): Found 0 IB devices
[13] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[25] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[25] MPI startup(): Found 0 IB devices
[27] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[30] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[30] MPI startup(): Found 0 IB devices
[17] MPI startup(): Found 0 IB devices
[23] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[23] MPI startup(): Found 0 IB devices
[27] MPI startup(): Found 0 IB devices
[12] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[12] MPI startup(): Found 0 IB devices
[13] MPI startup(): Found 0 IB devices
[29] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[29] MPI startup(): Found 0 IB devices
[15] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[15] MPI startup(): Found 0 IB devices
[35] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[35] MPI startup(): Found 0 IB devices
[36] MPI startup(): Found 0 IB devices
[39] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[39] MPI startup(): Found 0 IB devices
[22] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[22] MPI startup(): Found 0 IB devices
[41] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[41] MPI startup(): Found 0 IB devices
[42] MPI startup(): Found 0 IB devices
[24] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[24] MPI startup(): Found 0 IB devices
[26] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[26] MPI startup(): Found 0 IB devices
[14] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[14] MPI startup(): Found 0 IB devices
[16] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[16] MPI startup(): Found 0 IB devices
[18] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[18] MPI startup(): Found 0 IB devices
[28] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[28] MPI startup(): Found 0 IB devices
[32] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[32] MPI startup(): Found 0 IB devices
[19] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[19] MPI startup(): Found 0 IB devices
[36] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[34] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[34] MPI startup(): Found 0 IB devices
[31] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[37] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[38] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[21] MPI startup(): MPIDI_CH3I_RDMA_Process.boot_cq_hndl=(nil)
[21] MPI startup(): Found 0 IB devices
[30] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[20] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[39] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[13] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[33] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[25] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[40] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[17] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[28] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[23] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[41] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[29] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[42] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[12] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[24] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[43] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[32] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[27] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[14] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[15] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[34] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[21] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[35] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[16] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[22] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[26] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[18] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[19] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[10] MPI startup(): Start 1 ports per adapter
[11] MPI startup(): Start 1 ports per adapter
[0] MPI startup(): Start 1 ports per adapter
[2] MPI startup(): Start 1 ports per adapter
[5] MPI startup(): Start 1 ports per adapter
[3] MPI startup(): Start 1 ports per adapter
[1] MPI startup(): Start 1 ports per adapter
[7] MPI startup(): Start 1 ports per adapter
[8] MPI startup(): Start 1 ports per adapter
[6] MPI startup(): Start 1 ports per adapter
[9] MPI startup(): Start 1 ports per adapter
[4] MPI startup(): Start 1 ports per adapter
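
The interleaved log above is easier to read when grouped by how many IB devices each rank found: ranks 0-11 report 2 devices, while ranks 12-43 report 0 and are the ones that fail. A sketch of that grouping, assuming the I_MPI_DEBUG output is fed in on stdin; `summarize_ib_devices` is a helper name made up for this example.

```shell
#!/bin/sh
# Summarize "Found N IB devices" lines from I_MPI_DEBUG output read on stdin.
# Prints one line per distinct N: "<N> devices: <count> ranks (<lowest>-<highest>)".
summarize_ib_devices() {
    awk '
        /MPI startup\(\): Found [0-9]+ IB devices/ {
            rank = $1; gsub(/[^0-9]/, "", rank); rank += 0   # "[37]" -> 37
            n = $(NF - 2)                                    # the device count
            count[n]++
            if (!(n in lo) || rank < lo[n]) lo[n] = rank
            if (!(n in hi) || rank > hi[n]) hi[n] = rank
        }
        END {
            for (n in count)
                printf "%s devices: %d ranks (%d-%d)\n", n, count[n], lo[n], hi[n]
        }
    ' | sort -n
}

# Example with a few lines taken from the log above.
summarize_ib_devices <<'EOF'
[9] MPI startup(): Found 2 IB devices
[0] MPI startup(): Found 2 IB devices
[37] MPI startup(): Found 0 IB devices
[12] MPI startup(): Found 0 IB devices
[36] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
EOF
```

The printed range is only the lowest and highest rank in each group, so it does not prove the group is contiguous; it is meant as a quick way to see which set of ranks cannot see any IB device.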

  • While installing MPSS and starting the openibd service, I noticed that the "Setting up InfiniBand network interfaces" step does not report OK:

[root@tbx-node07 MLNX_OFED_LINUX-2.3-1.0.1-rhel6.5-x86_64]# service openibd start
Loading HCA driver and Access Layer:                       [  OK  ]
Setting up InfiniBand network interfaces:
No configuration found for ib0
Setting up service network . . .                           [  done  ]

[On host]

[root@node07 ~]# ibv_devinfo
Failed to query device props
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.32.5100
        node_guid:                      f452:1403:006a:9050
        sys_image_guid:                 f452:1403:006a:9053
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x1
        board_id:                       MT_1100120019
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               3
                        port_lmc:               0x00
                        link_layer:             InfiniBand

[On one of the MICs]

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.32.5100
        node_guid:                      f452:1403:006a:9050
        sys_image_guid:                 f452:1403:006a:9053
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x1
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               3
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: scif0
        transport:                      SCIF (2)
        fw_ver:                         0.0.1
        node_guid:                      4c79:baff:fe57:02a8
        sys_image_guid:                 4c79:baff:fe57:02a8
        vendor_id:                      0x8086
        vendor_part_id:                 0
        hw_ver:                         0x1
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1001
                        port_lmc:               0x00
                        link_layer:             SCI

  • On host

[root@node07 ~]# ls /sys/class/infiniband
mlx4_0  scif0

  • On MIC

[root@node07 ~]# ssh mic0 ls /sys/class/infiniband
mlx4_0
scif0

  • I_MPI_ROOT is set to /opt/intel/impi/5.0.1.035.

 

My setup has 4 MIC cards in a server with 2 processors. Could you please help me get OFA and DAPL working with the Intel MICs?

Please let me know if you need any additional information.
