ssh mic0 timeout and micctrl --initdefaults fails

Hi Xeon Phi experts:

When I try to connect to the Phi (7120P) with ssh, it hangs until timeout, and 'ping -c 3 mic0' shows 100% packet loss.

The only problem I had during setup was when executing:

~# systemctl stop mpssd
~# micctrl --initdefaults -vvv
   [Info] mic0: Using existing /etc/mpss/default.conf
   [Info] mic0: Using existing /etc/mpss/mic0.conf
   [Info] mic0: File System Base /usr/share/mpss/boot/initramfs-knightscorner.cpio.gz
   [Info] mic0: MIC Family x100
   [Info] mic0: MPSSVersion 3.x
   [Info] mic0: Common files at /var/mpss/common
   [Info] mic0: Unique files at /var/mpss/mic0
   [Info] mic0: Hostname CTHULHU-mic0
[Network] mic0: ifdown mic0
  [Error] Failed to rename temporary file /etc/netctl/interfaces
[Filesys] mic0: Created /etc/netctl/static-mic0
[Network] mic0: ifup mic0
[Filesys] mic0: Update /var/mpss/mic0/etc/network/interfaces
   [Info] mic0: Removing conflicting existing /etc/hosts entry: 172.31.1.1	CTHULHU-mic0 mic0 #Generated-by-micctrl
[Filesys] mic0: Update /etc/hosts with 172.31.1.1 CTHULHU-mic0
   [Info] mic0: Verbose mode Disabled
   [Info] mic0: Linux OS image /usr/share/mpss/boot/bzImage-knightscorner
                System Map /usr/share/mpss/boot/bzImage-knightscorner
   [Info] mic0: Boot On Start Enabled
   [Info] mic0: Shutdown Timeout 300
   [Info] mic0: MIC Crash Dump at /var/crash/mic size 16
  [Error] mic0: Create failed for /etc/ssh/ rsa1 keys: Unknown error 255
   [Info] mic0: ExtraCommandLine 'highres=off noautogroup'
   [Info] mic0: RootDevice RamFS /var/mpss/mic0.image.gz
   [Info] mic0: Console hvc0
   [Info] mic0: PowerManagement cpufreq_on;corec6_on;pc3_on;pc6_on
   [Info] mic0: Cgroup memory=disabled
   [Info] mic0: [Parse] /etc/mpss/mic0.conf
   [Info] mic0: [Parse] Configuration version 1.1
   [Info] mic0: [Parse] /etc/mpss/default.conf
[Filesys] mic0: Update /var/mpss/mic0/etc/hosts

I have tried deleting the file, but it doesn't make a difference.

Everything passes in miccheck:

~$ python2 miccheck.py
MicCheck 3.6.1-r1
Copyright (c) 2015, Intel Corporation.

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass
  Test 7 (mic0): Check running SMC firmware version is correct ... pass

Status: OK

And lspci -vvv has the expected output (LinkSta: Width x8, because my current CPU only has 28 PCIe lanes):

# lspci -s 04:00.0 -vvv
04:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev 20)
	Subsystem: Intel Corporation Xeon Phi coprocessor SE10/7120 series
	Physical Slot: 4-1
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 52
	NUMA node: 0
	Region 0: [virtual] Memory at 13800000000 (64-bit, prefetchable) [size=16G]
	Region 4: Memory at fb400000 (64-bit, non-prefetchable) [size=128K]
	Capabilities: [44] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [4c] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <4us, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s (ok), Width x8 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [98] MSI-X: Enable+ Count=16 Masked-
		Vector table: BAR=4 offset=00017000
		PBA: BAR=4 offset=00018000
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Kernel driver in use: mic
	Kernel modules: mic_host

ip link shows mic0 as "DOWN", which looks a bit suspicious:

2: mic0: <BROADCAST> mtu 64512 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 4c:79:ba:44:04:83 brd ff:ff:ff:ff:ff:ff
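
A host-side interface in state DOWN with no address would explain both the ssh hang and the 100% packet loss. As a minimal sketch of a manual workaround (not the MPSS-supported way of configuring the network), the function below assigns the host address and brings the link up. The function name mic_host_if_up is hypothetical, the host address 172.31.1.254 is taken from the "Network Static Pair" line in the CentOS log further down the thread, and the /24 prefix length is an assumption.

```shell
# Hypothetical helper: give the host end of the mic0 link an address and
# bring it up. Must be run as root.
mic_host_if_up() {
    local dev="${1:-mic0}"
    # Assumption: /24 prefix; the thread only shows the bare addresses.
    local addr="${2:-172.31.1.254/24}"

    # "replace" adds the address, or updates it if it is already present.
    ip addr replace "$addr" dev "$dev" || return 1
    ip link set "$dev" up || return 1

    # Show the resulting state for a quick visual check.
    ip addr show dev "$dev"
}

# Usage (as root):
#   mic_host_if_up mic0 172.31.1.254/24
#   ping -c 3 172.31.1.1
```

If the ping succeeds after this, the card side is probably fine and only the host-side network setup generated by micctrl needs attention.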

And the beginning of dmesg | grep mic:

[    1.842992] mic: loading out-of-tree module taints kernel.
[    1.847430] mic 0000:04:00.0: enabling device (0100 -> 0102)
[    1.847654] mic0: Transition from state ready to resetting
[   11.848714] mic_probe 4:0:0 as board #0
[   11.848731] mic: number of devices detected 1 
[   12.874044] mic0: Resetting (Post Code 12)
[   12.874047] mic0: Transition from state resetting to ready
[   16.261422] mic0: Transition from state ready to booting
[   16.261443] mic image: /usr/share/mpss/boot/bzImage-knightscorner
[   34.112744] mic0: Transition from state booting to online
[ 1151.785868] mic0: Transition from state online to shutdown
[ 1173.888858] mic0: Transition from state shutdown to resetting
[ 1175.915456] mic0: Resetting (Post Code 3C)
[ 1176.928771] mic0: Resetting (Post Code 3d)
[ 1177.942096] mic0: Resetting (Post Code 3d)
[ 1178.955412] mic0: Resetting (Post Code 3d)
[ 1179.968696] mic0: Resetting (Post Code 3d)
[ 1180.982047] mic0: Resetting (Post Code 3E)
[ 1181.995368] mic0: Resetting (Post Code 3E)
[ 1183.008682] mic0: Resetting (Post Code 3E)
[ 1184.022000] mic0: Resetting (Post Code 09)
[ 1185.035322] mic0: Resetting (Post Code 09)
[ 1186.048633] mic0: Resetting (Post Code 10)
[ 1187.061958] mic0: Resetting (Post Code 12)
[ 1187.061961] mic0: Transition from state resetting to ready
[ 1523.197850] mic0: Transition from state ready to booting
[ 1523.197871] mic image: /usr/share/mpss/boot/bzImage-knightscorner
[ 1557.373556] mic0: Transition from state booting to online

I can't really figure out the problem. Is it my network configuration? And if so what should I do?

I use Arch Linux (Manjaro) with Linux 4.9, with mpss from the AUR.
MB: AsRock Taichi x99
CPU: i7 6800K

0 Kudos
2 Replies

The lines about not being able to read/update the directory /etc/ssh are worrisome.

Unfortunately, Arch Linux is not supported. Here's what I get on a CentOS 6 system:

micctrl --initdefaults -vvv
[Filesys] mic0: Created directory /etc/mpss
[Filesys] mic0: Created /etc/mpss/default.conf
[Filesys] mic0: Created /etc/mpss/mic0.conf version 1.1
   [Info] mic0: File System Base /usr/share/mpss/boot/initramfs-knightscorner.cpio.gz
   [Info] mic0: MIC Family x100
   [Info] mic0: MPSSVersion 3.x
   [Info] mic0: Common files at /var/mpss/common
   [Info] mic0: Unique files at /var/mpss/mic0
   [Info] mic0: Hostname pleedo-mic0.nikhef.nl
[Filesys] mic0: Update MacAddrs in /etc/mpss/mic0.conf
   [Info] mic0: Network Static Pair MIC 172.31.1.1 Host 172.31.1.254
[Filesys] mic0: Updated /etc/sysconfig/network-scripts/ifcfg-mic0
[Network] mic0: ifup mic0
[Filesys] Update file /etc/resolv.conf
[Filesys] mic0: Update /var/mpss/mic0/etc/network/interfaces
   [Info] mic0: Using existing /etc/hosts entry: 172.31.1.1	pleedo-mic0.nikhef.nl mic0
[Filesys] mic0: Update Network in /etc/mpss/mic0.conf
   [Info] mic0: Verbose mode Disabled
   [Info] mic0: Linux OS image /usr/share/mpss/boot/bzImage-knightscorner
                System Map /usr/share/mpss/boot/bzImage-knightscorner
   [Info] mic0: Boot On Start Enabled
   [Info] mic0: Shutdown Timeout 300
   [Info] mic0: MIC Crash Dump at /var/crash/mic size 16
   [Info] mic0: ExtraCommandLine 'highres=off noautogroup'
[Filesys] mic0: Update RootDevice in /etc/mpss/mic0.conf
   [Info] mic0: RootDevice RAMFS /var/mpss/mic0.image.gz
   [Info] mic0: Console hvc0
   [Info] mic0: PowerManagement cpufreq_on;corec6_off;pc3_on;pc6_off
   [Info] mic0: Cgroup memory=disabled
   [Info] mic0: [Parse] /etc/mpss/mic0.conf
   [Info] mic0: [Parse] Configuration version 1.1
   [Info] mic0: [Parse] /etc/mpss/default.conf
[Filesys] mic0: Update /var/mpss/mic0/etc/hosts

 

Here's what I would do:

  • start the MPSS daemon
  • check for the device /dev/ttyMIC0
  • use a tool like minicom to connect to the console of the MIC and log in
  • check the network settings on the MIC

Good luck!

0 Kudos

Thanks for the suggestion!

I have just solved the first problem, the one about "/etc/netctl/interfaces".
It turned out that every invocation of --initdefaults created a file in "/etc/netctl" named "inter" followed by random characters, e.g. "inter0NmyNk".
I manually renamed the latest one to "interfaces" and deleted the rest, and the error disappeared.

Maybe something similar is happening for "/etc/ssh". I'll investigate and report back if I am successful.
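
For anyone hitting the same error, the manual cleanup can be sketched as a small function. The name fix_netctl_interfaces is hypothetical, and the glob assumes the leftover files are always "inter" plus exactly six random characters, as in the example above (so it never matches "interfaces" itself).

```shell
# Hypothetical helper: promote the newest leftover temporary file to
# "interfaces" and delete the older leftovers. Run as root for /etc/netctl.
fix_netctl_interfaces() {
    local dir="${1:-/etc/netctl}"
    local newest="" f

    # Find the most recently modified temporary file.
    for f in "$dir"/inter??????; do
        [ -f "$f" ] || continue
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done

    if [ -z "$newest" ]; then
        echo "no temporary files found in $dir" >&2
        return 1
    fi

    mv -- "$newest" "$dir/interfaces" || return 1
    # Remove whatever leftovers remain.
    rm -f -- "$dir"/inter??????
}

# Usage (as root, with mpssd stopped):
#   fix_netctl_interfaces /etc/netctl
```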
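
One possible cause worth ruling out, which is an assumption and not confirmed anywhere in this thread: recent OpenSSH releases no longer support SSH protocol 1, so generating an rsa1 host key fails regardless of the state of /etc/ssh. A quick probe, using a hypothetical helper named ssh_keygen_supports (the KEYGEN variable is also only an illustration hook for substituting another binary):

```shell
# Hypothetical helper: return 0 if the local ssh-keygen can generate a key
# of the given type, non-zero otherwise. Writes only to a temp directory.
ssh_keygen_supports() {
    local type="$1" tmp rc

    tmp=$(mktemp -d) || return 2
    "${KEYGEN:-ssh-keygen}" -q -t "$type" -N '' -f "$tmp/probe_key" >/dev/null 2>&1
    rc=$?
    rm -rf "$tmp"
    return "$rc"
}

# Usage:
#   for t in rsa ecdsa rsa1; do
#       if ssh_keygen_supports "$t"; then echo "$t: ok"; else echo "$t: unsupported"; fi
#   done
```

If only rsa1 is reported as unsupported, the micctrl error is likely harmless for ssh access, since the other host key types can still be created.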

P.S. I know Arch Linux is not supported and, technically, neither is the motherboard. But there is always a chance that someone has seen it before; the Unix(-like) OSes are far more alike than different.

0 Kudos