
CentOS 7 + MPSS 3.4.x + OFED 3.1x: Bug in ibp_server?

Peter_G_3

Hi,

I'm currently setting up the OS for a diskless cluster with two Xeon Phi cards per host.

I'm working with CentOS 7.0, MPSS 3.4.3, OFED 3.12-1, and Lustre 2.7.0.

Installing and booting the host and both Xeon Phis works fine so far. However, as soon as I try to load Lustre (using o2ib) on the second Xeon Phi, the complete system crashes due to an error within the ibp_server module (log files are included below). With only one Xeon Phi, Lustre works fine, including mounting over InfiniBand.

Does anybody have experience with setting up Lustre on a similar system?

I have already tried different versions of MPSS (3.4.x and 3.3.x), OFED (3.12-1, 3.18-rc1), and Lustre (2.6.0, 2.7.0).

To install Lustre on the Xeon Phi, I followed the instructions posted here: https://software.intel.com/de-de/blogs/2014/11/06/lustre-on-intel-xeon-phi#Compiling_Lustre_Client

Any help is highly appreciated.

Host + micx log files (MPSS 3.4.3, OFED 3.18-rc1, Lustre 2.7.0):

Mar 30 13:15:15 mac-node-015 systemd: Starting Intel(R) MPSS control service...
Mar 30 13:15:40 mac-node-015 kernel: mic0: Transition from state ready to booting
Mar 30 13:15:40 mac-node-015 kernel: mic image: /usr/share/mpss/boot/bzImage-knightscorner
Mar 30 13:15:40 mac-node-015 kernel: MIC 0 Booting
Mar 30 13:15:40 mac-node-015 kernel: mic1: Transition from state ready to booting
Mar 30 13:15:40 mac-node-015 kernel: mic image: /usr/share/mpss/boot/bzImage-knightscorner
Mar 30 13:15:40 mac-node-015 kernel: MIC 1 Booting
Mar 30 13:15:45 mac-node-015 kernel: Waiting for MIC 0 boot 5
Mar 30 13:15:45 mac-node-015 kernel: Waiting for MIC 1 boot 5
Mar 30 13:15:50 mac-node-015 kernel: Waiting for MIC 0 boot 10
Mar 30 13:15:50 mac-node-015 kernel: Waiting for MIC 1 boot 10
Mar 30 13:15:55 mac-node-015 kernel: Waiting for MIC 0 boot 15
Mar 30 13:15:55 mac-node-015 kernel: Waiting for MIC 1 boot 15
Mar 30 13:16:00 mac-node-015 kernel: Waiting for MIC 0 boot 20
Mar 30 13:16:00 mac-node-015 kernel: Waiting for MIC 1 boot 20
Mar 30 13:16:01 mac-node-015 kernel: MIC 0 Network link is up
Mar 30 13:16:01 mac-node-015 kernel: MIC 1 Network link is up
Mar 30 13:16:03 mac-node-015 kernel: mic0: Transition from state booting to online
Mar 30 13:16:03 mac-node-015 kernel: mic1: Transition from state booting to online
Mar 30 13:16:04 mac-node-015 mpss: Starting Intel(R) MPSS: [  OK  ]
Mar 30 13:16:04 mac-node-015 mpss: mic0: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
Mar 30 13:16:04 mac-node-015 mpss: mic1: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
Mar 30 13:16:04 mac-node-015 systemd: Started Intel(R) MPSS control service.
Mar 30 13:16:04 mac-node-015 kernel: device mic0 entered promiscuous mode
Mar 30 13:16:04 mac-node-015 kernel: br0: port 2(mic0) entered forwarding state
Mar 30 13:16:04 mac-node-015 kernel: br0: port 2(mic0) entered forwarding state
Mar 30 13:16:04 mac-node-015 kernel: device mic1 entered promiscuous mode
Mar 30 13:16:04 mac-node-015 kernel: br0: port 3(mic1) entered forwarding state
Mar 30 13:16:04 mac-node-015 kernel: br0: port 3(mic1) entered forwarding state
Mar 30 13:16:07 mac-node-015 systemd: Starting LSB: Start ofed layer on top of mpss...
Mar 30 13:16:07 mac-node-015 ofed-mic: Starting OFED Stack:
Mar 30 13:16:07 mac-node-015 kernel: CCL Direct Server v1.0
Mar 30 13:16:07 Copyright (c) 2011-2013 Intel Corporation
Mar 30 13:16:07 mac-node-015 kernel: CCL Direct CM Server v1.0
Mar 30 13:16:07 Copyright (c) 2011-2013 Intel Corporation
Mar 30 13:16:07 mac-node-015 kernel: CCL Direct SA Server v1.0
Mar 30 13:16:07 Copyright (c) 2011-2013 Intel Corporation
Mar 30 13:16:08 mac-node-015 kernel: ibscif: OpenFabrics IBSCIF Driver v0.1 built Mar 30 2015 10:38:18
Mar 30 13:16:08 mac-node-015 kernel: ibscif: max_pinned=50, window_size=40, blocking_send=0, blocking_recv=1, fast_rdma=1, host_proxy=0, rma_threshold=1024, scif_loopback=1, new_ib_type=1, verbose=0, check_grh=1
Mar 30 13:16:08 mac-node-015 kernel: ibscif: ibscif_add_one: my node_id is 0
Mar 30 13:16:08 mac-node-015 rdma-ndd[1345]: Device event: infiniband, scif0, add
Mar 30 13:16:08 mac-node-015 rdma-ndd[1345]: scif0: change (OpenFabrics IBSCIF Driver v0.1) -> (mac-node-015 scif0)
Mar 30 13:16:08 mac-node-015 rdma-ndd[1345]: scif0: change (OpenFabrics IBSCIF Driver v0.1) -> (mac-node-015 scif0)
Mar 30 13:16:08 mac-node-015 rdma-ndd[1345]: mlx4_0: change (mac-node-015 HCA-1) -> (mac-node-015 mlx4_0)
Mar 30 13:16:08 mac-node-015 ofed-mic: host[  OK  ]
Mar 30 13:16:08 mac-node-015 ntpd[654]: Listen normally on 9 mic0 fe80::4e79:baff:fe24:f79 UDP 123
Mar 30 13:16:08 mac-node-015 ntpd[654]: Listen normally on 10 mic1 fe80::4e79:baff:fe24:e59 UDP 123
Mar 30 13:16:09 mac-node-015 ibpd: pid 1682 /dev/ibp1 started 4 threads
Mar 30 13:16:13 mac-node-015 ofed-mic: mic0 : ib0 [  OK  ]
Mar 30 13:16:13 mac-node-015 ibpd: pid 1709 /dev/ibp2 started 4 threads
Mar 30 13:16:17 mac-node-015 ofed-mic: mic1 ib0 [  OK  ]
Mar 30 13:16:17 mac-node-015 systemd: Started LSB: Start ofed layer on top of mpss.
Mar 30 13:16:19 mac-node-015 ntpd[654]: Listen normally on 11 mic0:ib 192.0.2.100 UDP 123
Mar 30 13:19:57 mac-node-015 kernel: ibp_server: ibp_cmd_reg_user_mr(2670) ib_reg_user_mr returned -12
Mar 30 13:19:57 mac-node-015 kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
Mar 30 13:19:57 mac-node-015 kernel: IP: [<ffffffff812d0399>] __list_del_entry+0x29/0xd0
Mar 30 13:19:57 mac-node-015 kernel: PGD 4627b2067 PUD 457c0e067 PMD 0 
Mar 30 13:19:57 mac-node-015 kernel: Oops: 0000 [#1] SMP
Mar 30 13:19:26 mac-node-015-mic0 kernel: [  218.607348] Module libcfs loaded at 0xffffffffa0164000
Mar 30 13:19:27 mac-node-015-mic0 kernel: [  218.731611] LNet: HW CPU cores: 228, npartitions: 12
Mar 30 13:19:27 mac-node-015-mic0 kernel: [  218.739217] Module crc32c loaded at 0xffffffffa01c6000
Mar 30 13:19:27 mac-node-015-mic0 kernel: [  218.742417] alg: No test for adler32 (adler32-zlib)
Mar 30 13:19:27 mac-node-015-mic0 kernel: [  218.742855] alg: No test for crc32 (crc32-table)
Mar 30 13:19:32 mac-node-015-mic0 kernel: [  223.826335] Module lnet loaded at 0xffffffffa01cc000
Mar 30 13:19:32 mac-node-015-mic0 kernel: [  223.922522] Module obdclass loaded at 0xffffffffa0226000
Mar 30 13:19:32 mac-node-015-mic0 kernel: [  224.203083] Lustre: Lustre: Build Version: v2_7_0_0--PRISTINE-2.6.38.8+mpss3.4.3
Mar 30 13:19:32 mac-node-015-mic0 kernel: [  224.271101] Module ptlrpc loaded at 0xffffffffa030d000
Mar 30 13:19:32 mac-node-015-mic0 kernel: [  224.699319] Module ko2iblnd loaded at 0xffffffffa042f000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.097243] LNet: Added LNI 10.100.22.15@o2ib [8/768/0/180]
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.196620] Module fld loaded at 0xffffffffa046b000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.228555] Module lmv loaded at 0xffffffffa047b000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.264301] Module fid loaded at 0xffffffffa04b2000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.296241] Module mdc loaded at 0xffffffffa04bf000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.416045] Module lov loaded at 0xffffffffa04f2000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.534093] Module lustre loaded at 0xffffffffa0541000
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.688858] modprobe used greatest stack depth: 4656 bytes left
Mar 30 13:19:50 mac-node-015-mic1 kernel: [  241.947416] Module libcfs loaded at 0xffffffffa0164000
Mar 30 13:19:50 mac-node-015-mic1 kernel: [  242.071345] LNet: HW CPU cores: 228, npartitions: 12
Mar 30 13:19:50 mac-node-015-mic1 kernel: [  242.079014] Module crc32c loaded at 0xffffffffa01c6000
Mar 30 13:19:50 mac-node-015-mic1 kernel: [  242.082280] alg: No test for adler32 (adler32-zlib)
Mar 30 13:19:50 mac-node-015-mic1 kernel: [  242.082724] alg: No test for crc32 (crc32-table)
Mar 30 13:19:55 mac-node-015-mic1 kernel: [  247.164349] Module lnet loaded at 0xffffffffa01cc000
Mar 30 13:19:55 mac-node-015-mic1 kernel: [  247.259874] Module obdclass loaded at 0xffffffffa0226000
Mar 30 13:19:55 mac-node-015-mic1 kernel: [  247.540370] Lustre: Lustre: Build Version: v2_7_0_0--PRISTINE-2.6.38.8+mpss3.4.3
Mar 30 13:19:55 mac-node-015-mic1 kernel: [  247.608764] Module ptlrpc loaded at 0xffffffffa030d000
Mar 30 13:19:56 mac-node-015-mic1 kernel: [  248.036393] Module ko2iblnd loaded at 0xffffffffa042f000
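
A note on reading the logs above: the host, mic0, and mic1 logs are pasted one after another, so the timestamps are not in order. The following is a minimal Python sketch, not part of the original setup, that merges such lines into one timeline and decodes the `returned -12` value from the ibp_server message (on Linux, errno 12 is ENOMEM). The sample lines are copied from the logs; the year 2015 is an assumption taken from the ibscif build date, and the helper names `parse`, `timeline`, and `describe_errno` are made up for illustration.

```python
import errno
import os
import re
from datetime import datetime

# A few lines copied from the logs above, in the order they were posted
# (host first, then mic0, then mic1).
POSTED = """\
Mar 30 13:19:57 mac-node-015 kernel: ibp_server: ibp_cmd_reg_user_mr(2670) ib_reg_user_mr returned -12
Mar 30 13:19:57 mac-node-015 kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
Mar 30 13:19:34 mac-node-015-mic0 kernel: [  226.097243] LNet: Added LNI 10.100.22.15@o2ib [8/768/0/180]
Mar 30 13:19:56 mac-node-015-mic1 kernel: [  248.036393] Module ko2iblnd loaded at 0xffffffffa042f000
"""

# Syslog lines carry no year. 2015 is an assumption based on the
# "built Mar 30 2015" string in the ibscif line of the host log.
YEAR = 2015

# "<Mon> <day> <HH:MM:SS> <hostname> <rest of the message>"
LINE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) (.*)$")


def parse(line):
    """Split one syslog line into (timestamp, hostname, message), or None."""
    match = LINE.match(line)
    if match is None:
        return None
    stamp, host, message = match.groups()
    when = datetime.strptime(f"{YEAR} {stamp}", "%Y %b %d %H:%M:%S")
    return when, host, message


def timeline(text):
    """Return all parsable lines sorted by timestamp.

    sorted() is stable, so lines with equal timestamps keep their posted order.
    """
    entries = [entry for entry in map(parse, text.splitlines()) if entry]
    return sorted(entries, key=lambda entry: entry[0])


def describe_errno(message):
    """Decode a 'returned -N' kernel message into (errno name, description)."""
    match = re.search(r"returned -(\d+)", message)
    if match is None:
        return None
    code = int(match.group(1))
    return errno.errorcode.get(code, "unknown"), os.strerror(code)


for when, host, message in timeline(POSTED):
    decoded = describe_errno(message)
    suffix = f"   <-- {decoded[0]}: {decoded[1]}" if decoded else ""
    print(f"{when:%H:%M:%S} {host:<18} {message}{suffix}")
```

Read this way, the host-side ibp_server error and the kernel oops are logged at 13:19:57, one second after mic1 reports loading ko2iblnd at 13:19:56. This assumes the host and card clocks are reasonably in sync, which the logs alone do not prove.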

 

Michael_H_Intel1
Employee
  1. Do not use rdma_cm/ipoib/Lustre natively on the card with MPSS versions before 3.4! There was a bug very similar to the one you described.
  2. I'm successfully running a similar combo (actually up to 8 Xeon Phi) using RH 6.4 (+ Ksplice to patch all security holes), OFED 3.5.2-MIC, Intel Enterprise Lustre 2.0, and MPSS 3.4/3.4.3.
  3. CentOS 7 might be a problem. Considering the impact of systemd, I would not recommend it for HPC; my recommendation would be CentOS 6.6.

 

Peter_G_3

Thanks for the reply.

1. Thanks for the information. I'll stay with 3.4.3 and won't do any further tests with 3.3.x.

2. Glad to hear it should actually work. I might try OFED 3.5.2-MIC.

3. CentOS 6.6 (or any other distribution) is not an option. So far systemd has had no real negative impact; everything is working as expected. Still, the bug might be specific to the Linux kernel in use (3.10.0-123).

 

 

 

 
