We are experiencing a problem while trying to use the DAPL provider ofa-v2-scif0. Everything appears to work between mics on the same host; however, jobs fail between host and mic. On the surface, the scif appears to be configured correctly.
- On the host we see the device with the desired iWARP transport
[host ~]# ibv_devinfo
hca_id: scif0
    transport: iWARP (1)
    fw_ver: 0.0.1
    node_guid: 4c79:baff:fe29:0385
    sys_image_guid: 4c79:baff:fe29:0385
    vendor_id: 0x8086
    vendor_part_id: 0
    hw_ver: 0x1
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 1000
            port_lmc: 0x00
            link_layer: Ethernet

hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.35.5100
    node_guid: 0002:c903:0019:3e50
    sys_image_guid: 0002:c903:0019:3e53
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x0
    board_id: MT_1100120019
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 24
            port_lmc: 0x00
            link_layer: InfiniBand
- On the mics we see the same
[host-mic0 ~]# ibv_devinfo
hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.35.5100
    node_guid: 0002:c903:0019:3e50
    sys_image_guid: 0002:c903:0019:3e53
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x0
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 24
            port_lmc: 0x00
            link_layer: InfiniBand

hca_id: scif0
    transport: iWARP (1)
    fw_ver: 0.0.1
    node_guid: 4c79:baff:fe29:0384
    sys_image_guid: 4c79:baff:fe29:0384
    vendor_id: 0x8086
    vendor_part_id: 0
    hw_ver: 0x1
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 1001
            port_lmc: 0x00
            link_layer: Ethernet
- Running a hello world example with I_MPI_DEBUG set to 5 shows in the host+mic case:
#### #### Host + 1 MIC ####
[0] MPI startup(): Multi-threaded optimized library
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[0] MPI startup(): DAPL provider ofa-v2-scif0
[1] MPI startup(): DAPL provider ofa-v2-scif0
[2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[3] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[2] MPI startup(): DAPL provider ofa-v2-scif0
[3] MPI startup(): DAPL provider ofa-v2-scif0
[0] MPI startup(): shm and dapl data transfer modes
[1] MPI startup(): shm and dapl data transfer modes
[2] MPI startup(): shm and dapl data transfer modes
[3] MPI startup(): shm and dapl data transfer modes
beacon047:SCM:2237:b2a76700: 122181 us(122181 us): modify_qp_state: ERR type 2 qpn 0x5 gid 0xbf4a8c (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
beacon047:SCM:2237:b2a76700: 122211 us(30 us): DAPL ERR modify_qp_state Network is unreachable
beacon047:SCM:2237:b2a76700: 122219 us(8 us): CONN_RTU: QPS_RTR ERR Network is unreachable (2,1,9,5,3e9) -> 10.39.20.243 1960
beacon047:SCM:2237:b2a76700: 122827 us(608 us): modify_qp_state: ERR type 2 qpn 0x6 gid 0xbf51ec (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
beacon047:SCM:2237:b2a76700: 122840 us(13 us): DAPL ERR modify_qp_state Network is unreachable
beacon047:SCM:2237:b2a76700: 122846 us(6 us): CONN_RTU: QPS_RTR ERR Network is unreachable (2,1,e,6,3e9) -> 10.39.20.243 1961
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
beacon047-mic0:SCM:1578:cc7a700: 98856 us(98856 us): ACCEPT_RTU: rcv ERR, rcnt=0 op=1 <- 10.39.20.242
beacon047:SCM:2238:842f6700: 123490 us(123490 us): modify_qp_state: ERR type 2 qpn 0x7 gid 0x1e529ac (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
beacon047:SCM:2238:842f6700: 123517 us(27 us): DAPL ERR modify_qp_state Network is unreachable
beacon047:SCM:2238:842f6700: 123522 us(5 us): CONN_RTU: QPS_RTR ERR Network is unreachable (2,1,d,7,3e9) -> 10.39.20.243 1960
[2:beacon047-mic0] unexpected DAPL event 0x4003
beacon047-mic0:SCM:1579:215ae700: 95960 us(95960 us): ACCEPT_RTU: rcv ERR, rcnt=0 op=1 <- 10.39.20.242
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(784):
MPID_Init(1326)......: channel initialization failed
MPIDI_CH3_Init(141)..: (unknown)(): Internal MPI error!
beacon047-mic0:SCM:1578:cc7a700: 100130 us(1274 us): ACCEPT_RTU: rcv ERR, rcnt=0 op=1 <- 10.39.20.242
beacon047:SCM:2238:842f6700: 125038 us(1516 us): modify_qp_state: ERR type 2 qpn 0x8 gid 0x1e5310c (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
beacon047:SCM:2238:842f6700: 125052 us(14 us): DAPL ERR modify_qp_state Network is unreachable
beacon047:SCM:2238:842f6700: 125057 us(5 us): CONN_RTU: QPS_RTR ERR Network is unreachable (2,1,f,8,3e9) -> 10.39.20.243 1961
[1] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[1] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
beacon047:SCM:2237:b0efb400: 125145 us(2299 us): DAPL ERR ibv_send Transport endpoint is not connected
beacon047:SCM:2238:8277b400: 125230 us(173 us): DAPL ERR ibv_send Transport endpoint is not connected
[0:beacon047][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_send_rc.c:2234] error(0x40000): ofa-v2-scif0: Could not post RDMA_Write: DAT_INTERNAL_ERROR()
[1:beacon047][../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapl_send_rc.c:2234] error(0x40000): ofa-v2-scif0: Could not post RDMA_Write: DAT_INTERNAL_ERROR()
[3:beacon047-mic0] unexpected DAPL event 0x4003
beacon047-mic0:SCM:1579:215ae700: 98073 us(2113 us): ACCEPT_RTU: rcv ERR, rcnt=0 op=1 <- 10.39.20.242
Fatal error in MPI_Init: Internal MPI error!, error stack:
MPIR_Init_thread(784):
MPID_Init(1326)......: channel initialization failed
MPIDI_CH3_Init(141)..: (unknown)(): Internal MPI error!
- Running a dtest using ofa-v2-scif0 fails similarly between host and mic, but succeeds between multiple mics
- The system configuration is
OS: CentOS release 6.6 (Final)
Kernel: 2.6.32-504.30.3.el6.x86_64
MPSS: 3.6.1
OFED: OFED-3.18-1
Using Intel compilers and MPI from Parallel Studio 2016 update 1
Compilers: 2016.1.056
MPI: 5.1.2.150
- MPSS modules and OFED were rebuilt for the kernel
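For anyone trying to narrow down a similar failure, the debug output above can be summarised by counting the connection errors per node and peer. The following is only a rough sketch: the regular expression is derived solely from the `CONN_RTU: QPS_RTR ERR` lines shown in this post, and the names `CONN_ERR` and `failed_connections` are made up for illustration, not part of DAPL or Intel MPI.

```python
import re
from collections import Counter

# Pattern inferred from the log lines in this post (an assumption, not a
# documented DAPL log format): "<node>:SCM:<pid>:<tid>: <t> us(<dt> us):
# CONN_RTU: QPS_RTR ERR <reason> (<fields>) -> <peer ip> <port>"
CONN_ERR = re.compile(
    r"(?P<node>[\w.-]+):SCM:\d+:[0-9a-f]+:\s*\d+ us\(\d+ us\):\s*"
    r"CONN_RTU: QPS_RTR ERR (?P<reason>.+?) \([^)]*\) -> "
    r"(?P<peer>\d+\.\d+\.\d+\.\d+)"
)


def failed_connections(log_text):
    """Count failed connection attempts as (node, peer, reason) tuples."""
    return Counter(
        (m.group("node"), m.group("peer"), m.group("reason"))
        for m in CONN_ERR.finditer(log_text)
    )


# Two lines copied from the output above.
sample = (
    "beacon047:SCM:2237:b2a76700: 122219 us(8 us): CONN_RTU: QPS_RTR ERR "
    "Network is unreachable (2,1,9,5,3e9) -> 10.39.20.243 1960\n"
    "beacon047:SCM:2237:b2a76700: 122846 us(6 us): CONN_RTU: QPS_RTR ERR "
    "Network is unreachable (2,1,e,6,3e9) -> 10.39.20.243 1961\n"
)

for (node, peer, reason), count in failed_connections(sample).items():
    print(f"{node} -> {peer}: {reason} ({count}x)")
```

Because the pattern is applied with `finditer` over the whole text, it should also cope with output that has been pasted onto a single line.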
The output in my previous post actually shows something odd. On both the host and the mic, ibv_devinfo lists the scif link_layer as Ethernet. Judging from the information at https://software.intel.com/en-us/blogs/2014/05/20/troubleshooting-ofed-issues, I would expect this to be IB.
hca_id: scif0 transport: iWARP (1) ... link_layer: Ethernet
Does anyone have any suggestions on debugging this?
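To compare what each node reports, a small script can pull the transport and link_layer of every device out of saved ibv_devinfo output. This is just a sketch under the assumption that the output keeps the `hca_id:` / `transport:` / `link_layer:` labels seen in this thread; `parse_ibv_devinfo` is a hypothetical helper name, not an OFED tool.

```python
import re


def parse_ibv_devinfo(text):
    """Map each hca_id to its reported transport and link_layer.

    Works on normal multi-line output as well as output flattened onto
    one line, since it only relies on the field labels.
    """
    devices = {}
    # Each device section runs from one "hca_id:" label to the next.
    for section in re.finditer(
        r"hca_id:\s*(\S+)(.*?)(?=hca_id:|\Z)", text, re.S
    ):
        name, body = section.group(1), section.group(2)
        transport = re.search(r"transport:\s*(.+?)\s*\(-?\d+\)", body)
        link_layer = re.search(r"link_layer:\s*(\S+)", body)
        devices[name] = {
            "transport": transport.group(1) if transport else None,
            "link_layer": link_layer.group(1) if link_layer else None,
        }
    return devices


# Abbreviated from the host output in the first post.
sample = """hca_id: scif0
    transport: iWARP (1)
    link_layer: Ethernet
hca_id: mlx4_0
    transport: InfiniBand (0)
    link_layer: InfiniBand
"""

for hca, info in parse_ibv_devinfo(sample).items():
    print(hca, info["transport"], info["link_layer"])
```

Running it on the output from both the host and the mic makes it easier to spot a device whose reported link_layer differs from what is expected.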
In case it helps anyone else: we were upgrading our system, including changing OFED from OFED-3.5-2-MIC-rc3 to OFED-3.18-1 and MPSS from 3.4.3 to 3.6.1.
Along the way the default setting in /etc/modprobe.d/ibscif.conf for option 'new_ib_type' seems to have changed from 0 to 1.
According to the MPSS User Guide:
Note: If your host OS kernel version is older than v3.10 it is required to modify the /etc/modprobe.d/ibscif.conf file so that it contains a line "options ibscif new_ib_type=1".
Since this applies to the kernel version we run, ensuring that new_ib_type is set to 1 has resolved the scif communication issue. However, ibv_devinfo on the host now gives strange output:
[root@host ~]# ibv_devinfo
hca_id: scif0
    transport: invalid transport (-1)
    fw_ver: 0.0.1
    node_guid: 4c79:baff:fe29:0385
    sys_image_guid: 4c79:baff:fe29:0385
    vendor_id: 0x8086
    vendor_part_id: 0
    hw_ver: 0x1
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 1000
            port_lmc: 0x00
            link_layer: Unknown

hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.36.5000
    node_guid: 0002:c903:0019:3e50
    sys_image_guid: 0002:c903:0019:3e53
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x0
    board_id: MT_1100120019
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 24
            port_lmc: 0x00
            link_layer: InfiniBand
I'm quite confused by the output given that with these settings the scif appears to be operating correctly. Can someone please explain?
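For others hitting the same upgrade issue, the kernel-version rule from the MPSS User Guide note quoted above can be checked with a few lines of Python. This is a sketch only: it assumes the file uses the plain `options ibscif key=value` syntax shown in the note, the function names are invented for illustration, and it only inspects the file text, not what the loaded module is actually using.

```python
import re


def kernel_older_than(release, major, minor):
    """True if a release string like '2.6.32-504.30.3.el6.x86_64' is older
    than major.minor."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if not m:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(m.group(1)), int(m.group(2))) < (major, minor)


def new_ib_type_setting(conf_text):
    """Return the last new_ib_type value on an 'options ibscif' line, or None."""
    value = None
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        m = re.match(r"options\s+ibscif\b(.*)", line)
        if m:
            opt = re.search(r"\bnew_ib_type=(\S+)", m.group(1))
            if opt:
                value = opt.group(1)
    return value


def ibscif_conf_ok(release, conf_text):
    """Apply the quoted note: kernels older than v3.10 need new_ib_type=1.

    The note says nothing about newer kernels, so this sketch simply
    accepts any setting for them (an assumption on my part).
    """
    if not kernel_older_than(release, 3, 10):
        return True
    return new_ib_type_setting(conf_text) == "1"


# Kernel release taken from the system configuration in the first post.
print(ibscif_conf_ok("2.6.32-504.30.3.el6.x86_64", "options ibscif new_ib_type=1\n"))
```

On a real host, the release string could come from `platform.release()` and the text from reading /etc/modprobe.d/ibscif.conf.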
Hi William,
This is a known issue with ibv_devinfo (found in recent MPSS releases) and is being tracked internally. However, it doesn't affect functionality. Thank you for sharing.
Hi Loc,
Thank you for the update.