William_Howell
Beginner
105 Views

Host to mic communication failing with DAPL provider ofa-v2-scif0

We are experiencing a problem while trying to use the DAPL provider ofa-v2-scif0. Everything appears to work between mics on the same host, but jobs fail between host and mic. On the surface, scif appears to be configured correctly.

 

  • On the host we see the device with the desired iWARP transport
[host ~]# ibv_devinfo 
hca_id: scif0
        transport:                      iWARP (1)
        fw_ver:                         0.0.1
        node_guid:                      4c79:baff:fe29:0385
        sys_image_guid:                 4c79:baff:fe29:0385
        vendor_id:                      0x8086
        vendor_part_id:                 0
        hw_ver:                         0x1
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1000
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.35.5100
        node_guid:                      0002:c903:0019:3e50
        sys_image_guid:                 0002:c903:0019:3e53
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       MT_1100120019
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               24
                        port_lmc:               0x00
                        link_layer:             InfiniBand
  • On the mics we see the same
[host-mic0 ~]# ibv_devinfo 
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.35.5100
        node_guid:                      0002:c903:0019:3e50
        sys_image_guid:                 0002:c903:0019:3e53
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               24
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: scif0
        transport:                      iWARP (1)
        fw_ver:                         0.0.1
        node_guid:                      4c79:baff:fe29:0384
        sys_image_guid:                 4c79:baff:fe29:0384
        vendor_id:                      0x8086
        vendor_part_id:                 0
        hw_ver:                         0x1
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1001
                        port_lmc:               0x00
                        link_layer:             Ethernet
  • Running a hello world example with I_MPI_DEBUG set to 5 shows in the host+mic case:
####
#### Host + 1 MIC
####
[0] MPI startup(): Multi-threaded optimized library
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[0] MPI startup(): DAPL provider ofa-v2-scif0
[1] MPI startup(): DAPL provider ofa-v2-scif0
[2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[3] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[2] MPI startup(): DAPL provider ofa-v2-scif0
[3] MPI startup(): DAPL provider ofa-v2-scif0
[0] MPI startup(): shm and dapl data transfer modes
[1] MPI startup(): shm and dapl data transfer modes
[2] MPI startup(): shm and dapl data transfer modes
[3] MPI startup(): shm and dapl data transfer modes
beacon047:SCM:2237:b2a76700: 122181 us(122181 us):  modify_qp_state: ERR type 2 qpn 0x5 gid 0xbf4a8c (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
beacon047:SCM:2237:b2a76700: 122211 us(30 us):  DAPL ERR modify_qp_state Network is unreachable
beacon047:SCM:2237:b2a76700: 122219 us(8 us):  CONN_RTU: QPS_RTR ERR Network is unreachable (2,1,9,5,3e9) -> 10.39.20.243 1960
beacon047:SCM:2237:b2a76700: 122827 us(608 us):  modify_qp_state: ERR type 2 qpn 0x6 gid 0xbf51ec (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
beacon047:SCM:2237:b2a76700: 122840 us(13 us):  DAPL ERR modify_qp_state Network is unreachable
beacon047:SCM:2237:b2a76700: 122846 us(6 us):  CONN_RTU: QPS_RTR ERR Network is unreachable (2,1,e,6,3e9) -> 10.39.20.243 1961
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
beacon047-mic0:SCM:1578:cc7a700: 98856 us(98856 us):  ACCEPT_RTU: rcv ERR, rcnt=0 op=1 <- 10.39.20.242
beacon047:SCM:2238:842f6700: 123490 us(123490 us):  modify_qp_state: ERR type 2 qpn 0x7 gid 0x1e529ac (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
beacon047:SCM:2238:842f6700: 123517 us(27 us):  DAPL ERR modify_qp_state Network is unreachable
beacon047:SCM:2238:842f6700: 123522 us(5 us):  CONN_RTU: QPS_RTR ERR Network is unreachable (2,1,d,7,3e9) -> 10.39.20.243 1960
[2:beacon047-mic0] unexpected DAPL event 0x4003
beacon047-mic0:SCM:1579:215ae700: 95960 us(95960 us):  ACCEPT_RTU: rcv ERR, rcnt=0 op=1 <- 10.39.20.242
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(784): 
MPID_Init(1326)......: channel initialization failed
MPIDI_CH3_Init(141)..: 
(unknown)(): Internal MPI error!
beacon047-mic0:SCM:1578:cc7a700: 100130 us(1274 us):  ACCEPT_RTU: rcv ERR, rcnt=0 op=1 <- 10.39.20.242
beacon047:SCM:2238:842f6700: 125038 us(1516 us):  modify_qp_state: ERR type 2 qpn 0x8 gid 0x1e5310c (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
beacon047:SCM:2238:842f6700: 125052 us(14 us):  DAPL ERR modify_qp_state Network is unreachable
beacon047:SCM:2238:842f6700: 125057 us(5 us):  CONN_RTU: QPS_RTR ERR Network is unreachable (2,1,f,8,3e9) -> 10.39.20.243 1961
[1] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
beacon047:SCM:2237:b0efb400: 125145 us(2299 us):  DAPL ERR ibv_send Transport endpoint is not connected
beacon047:SCM:2238:8277b400: 125230 us(173 us):  DAPL ERR ibv_send Transport endpoint is not connected
[0:beacon047][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_send_rc.c:2234] error(0x40000): ofa-v2-scif0: Could not post RDMA_Write: DAT_INTERNAL_ERROR()
[1:beacon047][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_send_rc.c:2234] error(0x40000): ofa-v2-scif0: Could not post RDMA_Write: DAT_INTERNAL_ERROR()
[3:beacon047-mic0] unexpected DAPL event 0x4003
beacon047-mic0:SCM:1579:215ae700: 98073 us(2113 us):  ACCEPT_RTU: rcv ERR, rcnt=0 op=1 <- 10.39.20.242
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(784): 
MPID_Init(1326)......: channel initialization failed
MPIDI_CH3_Init(141)..: 
(unknown)(): Internal MPI error!
  • Running a dtest using ofa-v2-scif0 fails similarly between host and mic, but succeeds between multiple mics
  • The system configuration is 
OS: CentOS release 6.6 (Final)
Kernel: 2.6.32-504.30.3.el6.x86_64
MPSS: 3.6.1
OFED: OFED-3.18-1

Using Intel compilers and MPI from Parallel Studio 2016 Update 1

Compilers: 2016.1.056
MPI: 5.1.2.150

 

  • MPSS modules and OFED were rebuilt for the kernel 
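
One detail that may help when reading the log above: the modify_qp_state lines print the lid in hex, while ibv_devinfo prints port_lid in decimal. The sketch below is a hypothetical helper (not part of DAPL or Intel MPI) that pulls the lid out of one of the failing lines and converts it. The value 0x3e9 is 1001, which is the port_lid of scif0 on host-mic0, so the failing QP transitions on the host appear to be the ones aimed at the mic's scif0 port.

```python
import re

# One line copied from the I_MPI_DEBUG output above.
LOG_LINE = (
    "beacon047:SCM:2237:b2a76700: 122181 us(122181 us):  modify_qp_state: "
    "ERR type 2 qpn 0x5 gid 0xbf4a8c (1) lid 0x3e9 port 1 state 1 "
    "mtu 4 rd 4 rnr 12 sl 0"
)


def lid_from_log(line):
    """Return the lid from a modify_qp_state line as a decimal int, or None."""
    match = re.search(r"\blid (0x[0-9a-fA-F]+)", line)
    return int(match.group(1), 16) if match else None


print(lid_from_log(LOG_LINE))  # prints 1001
```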
William_Howell
Beginner
105 Views

The output in my previous post actually shows something odd. On both the host and mic ibv_devinfo lists the scif link_layer as Ethernet. Judging from the information at https://software.intel.com/en-us/blogs/2014/05/20/troubleshooting-ofed-issues I would expect this to be IB.

hca_id: scif0
        transport:                      iWARP (1)
...
                        link_layer:             Ethernet

 

Does anyone have any suggestions on debugging this?
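
For comparing these fields across host and mic without reading the full listings, a minimal sketch along these lines may be useful. It assumes the ibv_devinfo text layout shown in this thread; parse_devinfo is a hypothetical helper name, and the sample text is abbreviated from the output above.

```python
def parse_devinfo(text):
    """Map each hca_id to its transport and link_layer strings."""
    devices = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("hca_id:"):
            # A new device section starts here.
            current = line.split(":", 1)[1].strip()
            devices[current] = {}
        elif current and line.startswith(("transport:", "link_layer:")):
            key, value = line.split(":", 1)
            devices[current][key] = value.strip()
    return devices


# Abbreviated from the ibv_devinfo output posted above.
SAMPLE = """\
hca_id: scif0
        transport:                      iWARP (1)
                        link_layer:             Ethernet

hca_id: mlx4_0
        transport:                      InfiniBand (0)
                        link_layer:             InfiniBand
"""

for name, fields in parse_devinfo(SAMPLE).items():
    print(name, fields["transport"], fields["link_layer"])
```

On a live system the text could come from `subprocess.run(["ibv_devinfo"], capture_output=True, text=True).stdout`, run once on the host and once on each mic.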

William_Howell
Beginner
105 Views

In case it helps anyone else, we were upgrading our system including changing

OFED: OFED-3.5-2-MIC-rc3   to   OFED-3.18-1

and 

MPSS: 3.4.3   to 3.6.1

Along the way the default setting in /etc/modprobe.d/ibscif.conf for option 'new_ib_type' seems to have changed from 0 to 1.
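
To see which value a given ibscif.conf actually sets, a small check like the following can help. It is only a sketch: it assumes the standard modprobe.d syntax `options <module> <param>=<value>`, ibscif_option is a hypothetical helper name, and the sample file contents are made up for illustration.

```python
def ibscif_option(conf_text, name="new_ib_type"):
    """Return the last value set for an ibscif module option, or None."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "options" and parts[1] == "ibscif":
            for item in parts[2:]:
                key, sep, val = item.partition("=")
                if sep and key == name:
                    value = val  # a later line overrides an earlier one
    return value


# Illustrative contents only; the real file may carry other options.
SAMPLE = "# example ibscif.conf (assumed)\noptions ibscif new_ib_type=1\n"
print(ibscif_option(SAMPLE))  # prints 1
```

To check the real file, pass in `open("/etc/modprobe.d/ibscif.conf").read()`. Module options are read when the module loads, so a changed value only takes effect after ibscif is reloaded.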

 

 According to the MPSS User Guide:

Note: If your host OS kernel version is older than v3.10 it is required to modify the /etc/modprobe.d/ibscif.conf file so that it contains a line “options ibscif new_ib_type=1”.

Since this applies to the kernel version we run, ensuring that new_ib_type is set to 1 has resolved the scif communication issue. However, ibv_devinfo on the host now shows strange output:

[root@host ~]# ibv_devinfo 
hca_id: scif0
        transport:                      invalid transport (-1)
        fw_ver:                         0.0.1
        node_guid:                      4c79:baff:fe29:0385
        sys_image_guid:                 4c79:baff:fe29:0385
        vendor_id:                      0x8086
        vendor_part_id:                 0
        hw_ver:                         0x1
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1000
                        port_lmc:               0x00
                        link_layer:             Unknown

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.36.5000
        node_guid:                      0002:c903:0019:3e50
        sys_image_guid:                 0002:c903:0019:3e53
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       MT_1100120019
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               24
                        port_lmc:               0x00
                        link_layer:             InfiniBand

 

I'm quite confused by this output, given that scif appears to be operating correctly with these settings. Can someone please explain?

Loc_N_Intel
Employee
105 Views

Hi William,

This known issue with ibv_devinfo (found in recent MPSS) is being tracked internally. However, it doesn't affect functionality. Thank you for sharing.

William_Howell
Beginner
105 Views

Hi Loc,

Thank you for the update.

 

 
