Intel Xeon Phi 5110P - device is not online

Petr_P_
Beginner
1,783 Views

Dear friends, can you help us with the following problem?

We have an Intel Xeon Phi 5110P installed on an Asus P8Z77 WS motherboard.

The OS is CentOS 6 with the required kernel version.

We installed MPSS 3.3.2 and tried to switch the Xeon Phi to online mode and to update its flash, but got "reset failed" or timeout messages.

Is it broken?

Here is the log to show details:

1. ifconfig shows the mic0 interface:

[root@171202-1 openflow]# ifconfig 
eth4      Link encap:Ethernet  HWaddr 40:16:7E:34:E8:08  
          inet addr:192.168.0.66  Bcast:192.168.0.255  Mask:255.255.255.0 
          UP BROADCAST MULTICAST  MTU:1500  Metric:1 
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0 
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b) 
          Interrupt:18 Memory:f0700000-f0720000 

eth5      Link encap:Ethernet  HWaddr 82:50:FD:BC:9A:C7  
          inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0 
          inet6 addr: fe80::8050:fdff:febc:9ac7/64 Scope:Link 
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 
          RX packets:11 errors:0 dropped:0 overruns:0 frame:0 
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0 
          collisions:0 txqueuelen:1000 
          RX bytes:660 (660.0 b)  TX bytes:288 (288.0 b) 
          Interrupt:17 Memory:f0800000-f0820000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0 
          inet6 addr: ::1/128 Scope:Host 
          UP LOOPBACK RUNNING  MTU:16436  Metric:1 
          RX packets:116 errors:0 dropped:0 overruns:0 frame:0 
          TX packets:116 errors:0 dropped:0 overruns:0 carrier:0 
          collisions:0 txqueuelen:0 
          RX bytes:9828 (9.5 KiB)  TX bytes:9828 (9.5 KiB) 

mic0      Link encap:Ethernet  HWaddr 82:50:FD:BC:9A:C7  
          inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0 
          inet6 addr: fe80::8050:fdff:febc:9ac7/64 Scope:Link 
          UP BROADCAST RUNNING  MTU:64512  Metric:1 
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0 
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b) 

2. After booting Linux, mpssd was not running (as miccheck shows), so we started it:

[root@171202-1 openflow]# miccheck 
MicCheck 3.4-r1 
Copyright 2013 Intel Corporation All Rights Reserved 

Executing default tests for host 
  Test 0: Check number of devices the OS sees in the system ... pass 
  Test 1: Check mic driver is loaded ... pass 
  Test 2: Check number of devices driver sees in the system ... pass 
  Test 3: Check mpssd daemon is running ... fail 
    mpssd daemon not running 

Status: FAIL 
Failure: mpssd daemon not running 
...
[root@171202-1 openflow]# service mpss start 
Starting Intel(R) MPSS:                                    [FAILED] 
[root@171202-1 openflow]# mpssd & 
[1] 3578 
[root@171202-1 openflow]# Error aquiring lockfile /var/lock/mpss: File exists 

[root@171202-1 openflow]# ps -A | grep mpss 
 3566 pts/0    00:00:00 mpssd 
 3578 pts/0    00:00:00 mpssd 
 3579 pts/0    00:00:00 mpssd <defunct>

3. miccheck now fails on test 4, "Check device is in online state and its postcode is FF":

[root@171202-1 openflow]# miccheck 
MicCheck 3.4-r1 
Copyright 2013 Intel Corporation All Rights Reserved 

Executing default tests for host 
  Test 0: Check number of devices the OS sees in the system ... pass 
  Test 1: Check mic driver is loaded ... pass 
  Test 2: Check number of devices driver sees in the system ... pass 
  Test 3: Check mpssd daemon is running ... pass 
Executing default tests for device: 0 
  Test 4 (mic0): Check device is in online state and its postcode is FF ... fail 
    device is not online: reset failed 

Status: FAIL 
Failure: A device test failed 

4. We dumped the output of lspci -vvv (the complete file, lspci_dump.txt, is attached):

03:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 5100 series (rev 11)
	Subsystem: Intel Corporation Device 2500
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at e00000000 (64-bit, prefetchable) [size=8G]
	Region 4: Memory at f0400000 (64-bit, non-prefetchable) [size=128K]
	Capabilities: [44] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [4c] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <4us, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [98] MSI-X: Enable+ Count=16 Masked-
		Vector table: BAR=4 offset=00017000
		PBA: BAR=4 offset=00018000
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Kernel driver in use: mic 

5. We tried to reset the Xeon Phi with micctrl:

[root@171202-1 openflow]# micctrl -s 
mic0: reset failed 
[root@171202-1 openflow]# micctrl -rw 
          mic0: resetting 
  [Error] Timeout booting MIC, check your installation

6. During the reset, the Linux log file "messages" (the complete file is attached) shows something like this:

....
Oct 25 13:42:16 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:17 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:18 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:19 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:20 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:21 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:22 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:23 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:24 171202-1 kernel: mic0: Resetting (Post Code F2)
Oct 25 13:42:24 171202-1 kernel: Reattempting reset after F2/F4 failure
Oct 25 13:42:24 171202-1 kernel: mic0: Transition from state resetting to resetting
Oct 25 13:42:26 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:27 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:28 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:29 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:30 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:31 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:32 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:33 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:34 171202-1 kernel: mic0: Resetting (Post Code F2)
Oct 25 13:42:34 171202-1 kernel: Reattempting reset after F2/F4 failure
Oct 25 13:42:34 171202-1 kernel: mic0: Transition from state resetting to resetting
Oct 25 13:42:36 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:37 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:38 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:38 171202-1 kernel: mic0: Transition from state resetting to reset failed
Oct 25 13:42:38 171202-1 kernel: MIC 0 RESETFAIL postcode 3d 25651 

7. We used minicom to connect to /dev/tty/MIC0, but got only "Initialization modem".

8. micinfo results

MicInfo Utility Log
Created Sat Oct 25 13:53:07 2014


	System Info
		HOST OS			: Linux
		OS Version		: 2.6.32-431.el6.x86_64
		Driver Version		: 3.4-1
		MPSS Version		: 3.4
		Host Physical Memory	: 32555 MB

Device No: 0, Device Name: mic0

	Version
		Flash Version 		 : NotAvailable
		SMC Firmware Version	 : NotAvailable
		SMC Boot Loader Version	 : NotAvailable
		uOS Version 		 : NotAvailable
		Device Serial Number 	 : NotAvailable

	Board
		Vendor ID 		 : 0x8086
		Device ID 		 : 0x2250
		Subsystem ID 		 : 0x2500
		Coprocessor Stepping ID	 : 3
		PCIe Width 		 : x16
		PCIe Speed 		 : 5 GT/s
		PCIe Max payload size	 : 128 bytes
		PCIe Max read req size	 : 512 bytes
		Coprocessor Model	 : 0x01
		Coprocessor Model Ext	 : 0x00
		Coprocessor Type	 : 0x00
		Coprocessor Family	 : 0x0b
		Coprocessor Family Ext	 : 0x00
		Coprocessor Stepping 	 : B1
		Board SKU 		 : B1PRQ-5110P/5120D
		ECC Mode 		 : NotAvailable
		SMC HW Revision 	 : NotAvailable

	Cores
		Total No of Active Cores : NotAvailable
		Voltage 		 : NotAvailable
		Frequency 		 : NotAvailable

	Thermal
		Fan Speed Control 	 : NotAvailable
		Fan RPM 		 : NotAvailable
		Fan PWM 		 : NotAvailable
		Die Temp		 : NotAvailable

	GDDR
		GDDR Vendor		 : NotAvailable
		GDDR Version		 : NotAvailable
		GDDR Density		 : NotAvailable
		GDDR Size		 : NotAvailable
		GDDR Technology		 : NotAvailable
		GDDR Speed		 : NotAvailable
		GDDR Frequency		 : NotAvailable
		GDDR Voltage		 : NotAvailable 

We also tried the Xeon Phi with Red Hat* Enterprise Linux* 64-bit 7.0 (kernel 3.10.0-123) and got the same result. In addition, we tried it with Microsoft Windows Server 2012 R2 (64-bit); in that case the MPSS installation fails and rolls back, and the installation log shows that it cannot reset the Xeon Phi either.

5 Replies
JJK
New Contributor III

So mic0 was available when you started?

I also see that in your first ifconfig output eth5 == mic0 (same HWaddr and same IP address)?!
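The eth5/mic0 overlap can also be checked mechanically. The following is only a minimal Python sketch, assuming the old net-tools ifconfig layout shown in the first post; the function names are made up for illustration, and the sample text is an abbreviated copy of that output.

```python
import re
from collections import defaultdict

# Abbreviated copy of the ifconfig output from the first post.
IFCONFIG_OUTPUT = """\
eth4      Link encap:Ethernet  HWaddr 40:16:7E:34:E8:08
          inet addr:192.168.0.66  Bcast:192.168.0.255  Mask:255.255.255.0
eth5      Link encap:Ethernet  HWaddr 82:50:FD:BC:9A:C7
          inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
mic0      Link encap:Ethernet  HWaddr 82:50:FD:BC:9A:C7
          inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
"""

# Interface header lines start in column 0 and carry the HWaddr.
HEADER = re.compile(r"^(\S+)\s+Link encap:\S+\s+HWaddr\s+([0-9A-Fa-f:]{17})")


def interfaces_by_mac(text):
    """Map each HWaddr to the list of interface names that report it."""
    by_mac = defaultdict(list)
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            by_mac[m.group(2).upper()].append(m.group(1))
    return dict(by_mac)


def shared_macs(text):
    """Return only the HWaddrs that more than one interface reports."""
    return {mac: names
            for mac, names in interfaces_by_mac(text).items()
            if len(names) > 1}


if __name__ == "__main__":
    for mac, names in shared_macs(IFCONFIG_OUTPUT).items():
        print(mac, "is reported by", ", ".join(names))
```

The sketch only reports the duplication; it says nothing about why two interface names end up with the same address.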

As far as I understand the mpss stack, if you have 'mic0' then you're nearly good to go - try pinging the MIC at 172.31.1.1

Another thing to try is to power down the system and then turn it back on. I've had some heating problems with a 5110P that were caused by a motherboard fan not turning on properly. What happens if you power down the box, wait 5 minutes, power it back on, and then check the kernel messages and/or /var/log/dmesg for anything related to the MIC? You can query the temperature of the card using 'micsmc-gui' or 'micsmc -t':

# micsmc  -t    

mic0 (temp):
   Cpu Temp: ................ 40.00 C
   Memory Temp: ............. 26.00 C
   Fan-In Temp: ............. 24.00 C
   Fan-Out Temp: ............ 26.00 C
   Core Rail Temp: .......... 25.00 C
   Uncore Rail Temp: ........ 26.00 C
   Memory Rail Temp: ........ 26.00 C

mic1 (temp):
   Cpu Temp: ................ 39.00 C
   Memory Temp: ............. 27.00 C
   Fan-In Temp: ............. 25.00 C
   Fan-Out Temp: ............ 29.00 C
   Core Rail Temp: .......... 30.00 C
   Uncore Rail Temp: ........ 30.00 C
   Memory Rail Temp: ........ 30.00 C

'micctrl -s' and 'micinfo' should report the card even if it's not online. Post the output of those commands; it should tell us a lot more.

 

Petr_P_
Beginner

Thank you for your reply.

Yes, there is a mic0 interface. And you are right, obviously mic0 == eth5, but I don't know why.

We tried to ping MIC at 172.31.1.1, but got no response.

Previously, we tried shutting down, waiting, and restarting the computer, with the same result. I don't think it is a temperature problem, because we have a good fan blowing through the coprocessor, and the Xeon Phi is cold to the touch.

micsmc  -t gives the following result:

[root@171202-1 openflow]# micsmc  -t
Warning: mic0: cannot access device information: device is not available
[root@171202-1 openflow]# micctrl -s
mic0: reset failed

/var/log/dmesg contains the following lines (the complete version is attached):

mic 0000:04:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
mic 0000:04:00.0: setting latency timer to 64
mic 0000:04:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
  alloc irq_desc for 48 on node -1
  alloc kstat_irqs on node -1
mic 0000:04:00.0: irq 48 for MSI/MSI-X
mic0: Transition from state ready to resetting
BUG: soft lockup - CPU#2 stuck for 67s! [modprobe:1511]
Modules linked in: mic(+)(U) iTCO_wdt iTCO_vendor_support microcode serio_raw i2c_i801 e1000e ptp pps_core sg lpc_ich mfd_core snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc shpchp ext4 jbd2 mbcache sd_mod crc_t10dif firewire_ohci firewire_core crc_itu_t ahci xhci_hcd wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
CPU 2 
Modules linked in: mic(+)(U) iTCO_wdt iTCO_vendor_support microcode serio_raw i2c_i801 e1000e ptp pps_core sg lpc_ich mfd_core snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc shpchp ext4 jbd2 mbcache sd_mod crc_t10dif firewire_ohci firewire_core crc_itu_t ahci xhci_hcd wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 1511, comm: modprobe Not tainted 2.6.32-431.el6.x86_64 #1 System manufacturer System Product Name/P8Z77 WS
RIP: 0010:[<ffffffff8128d7e0>]  [<ffffffff8128d7e0>] delay_tsc+0x30/0x80
RSP: 0018:ffff880812fcbb78  EFLAGS: 00000212
RAX: 00000000578913d6 RBX: ffff880812fcbb98 RCX: 00000000578913d6
RDX: 0000000000226cac RSI: 00000000002de95d RDI: 00000000002de974
RBP: ffffffff8100bb8e R08: ffff88080f970668 R09: 0000000000000000
R10: 0000000000000024 R11: 0000000000000000 R12: ffff880812fcbb68
R13: ffffffff8100bb8e R14: 0000000000000246 R15: ffff880812fcbb18
FS:  00007f22a8e6f700(0000) GS:ffff88002c300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f6306286000 CR3: 0000000810401000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 1511, threadinfo ffff880812fca000, task ffff88080ea0eaa0)
Stack:
 000000000000000c 0000000000001b24 0000000008c435a0 0000000008c43180
<d> ffff880812fcbba8 ffffffff8128d7a6 ffff880812fcbc38 ffffffffa037a2ee
<d> ffffffffa03e3238 ffff880812efc840 0000000000000246 0000000000000000
Call Trace:
 [<ffffffff8128d7a6>] ? __const_udelay+0x46/0x50
 [<ffffffffa037a2ee>] ? calculate_etc_compensation+0xce/0x2c0 [mic]
 [<ffffffffa037a811>] ? adapter_init_device+0x331/0x470 [mic]
 [<ffffffffa036c8ca>] ? mic_probe+0x19a/0x560 [mic]
 [<ffffffff8128460a>] ? kobject_get+0x1a/0x30
 [<ffffffff812a4db7>] ? local_pci_probe+0x17/0x20
 [<ffffffff812a5fa1>] ? pci_device_probe+0x101/0x120
 [<ffffffff8136d712>] ? driver_sysfs_add+0x62/0x90
 [<ffffffff8136d8b0>] ? driver_probe_device+0xa0/0x2a0
 [<ffffffff8136db5b>] ? __driver_attach+0xab/0xb0
 [<ffffffff8136dab0>] ? __driver_attach+0x0/0xb0
 [<ffffffff8136ce64>] ? bus_for_each_dev+0x64/0x90
 [<ffffffff8136d64e>] ? driver_attach+0x1e/0x20
 [<ffffffff8136c698>] ? bus_add_driver+0x1e8/0x2b0
 [<ffffffff8136dea6>] ? driver_register+0x76/0x140
 [<ffffffff81204496>] ? sysfs_create_file+0x26/0x30
 [<ffffffff812a6206>] ? __pci_register_driver+0x56/0xd0
 [<ffffffffa03f780a>] ? micveth_init+0x42/0x47 [mic]
 [<ffffffffa03f7000>] ? mic_init+0x0/0x44f [mic]
 [<ffffffffa03f723e>] ? mic_init+0x23e/0x44f [mic]
 [<ffffffff8152d515>] ? notifier_call_chain+0x55/0x80
 [<ffffffffa03f7000>] ? mic_init+0x0/0x44f [mic]
 [<ffffffff8100204c>] ? do_one_initcall+0x3c/0x1d0
 [<ffffffff810bc531>] ? sys_init_module+0xe1/0x250
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Code: 56 41 55 41 54 53 0f 1f 44 00 00 65 44 8b 2c 25 d8 e0 00 00 49 89 fe 66 66 90 0f ae e8 e8 d9 7e d8 ff 66 90 4c 63 e0 eb 11 66 90 <f3> 90 65 8b 1c 25 d8 e0 00 00 44 39 eb 75 23 66 66 90 0f ae e8 
Call Trace:
 [<ffffffff8128d7fa>] ? delay_tsc+0x4a/0x80
 [<ffffffff8128d7a6>] ? __const_udelay+0x46/0x50
 [<ffffffffa037a2ee>] ? calculate_etc_compensation+0xce/0x2c0 [mic]
 [<ffffffffa037a811>] ? adapter_init_device+0x331/0x470 [mic]
 [<ffffffffa036c8ca>] ? mic_probe+0x19a/0x560 [mic]
 [<ffffffff8128460a>] ? kobject_get+0x1a/0x30
 [<ffffffff812a4db7>] ? local_pci_probe+0x17/0x20
 [<ffffffff812a5fa1>] ? pci_device_probe+0x101/0x120
 [<ffffffff8136d712>] ? driver_sysfs_add+0x62/0x90
 [<ffffffff8136d8b0>] ? driver_probe_device+0xa0/0x2a0
 [<ffffffff8136db5b>] ? __driver_attach+0xab/0xb0
 [<ffffffff8136dab0>] ? __driver_attach+0x0/0xb0
 [<ffffffff8136ce64>] ? bus_for_each_dev+0x64/0x90
 [<ffffffff8136d64e>] ? driver_attach+0x1e/0x20
 [<ffffffff8136c698>] ? bus_add_driver+0x1e8/0x2b0
 [<ffffffff8136dea6>] ? driver_register+0x76/0x140
 [<ffffffff81204496>] ? sysfs_create_file+0x26/0x30
 [<ffffffff812a6206>] ? __pci_register_driver+0x56/0xd0
 [<ffffffffa03f780a>] ? micveth_init+0x42/0x47 [mic]
 [<ffffffffa03f7000>] ? mic_init+0x0/0x44f [mic]
 [<ffffffffa03f723e>] ? mic_init+0x23e/0x44f [mic]
 [<ffffffff8152d515>] ? notifier_call_chain+0x55/0x80
 [<ffffffffa03f7000>] ? mic_init+0x0/0x44f [mic]
 [<ffffffff8100204c>] ? do_one_initcall+0x3c/0x1d0
 [<ffffffff810bc531>] ? sys_init_module+0xe1/0x250
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
ETC timer compensation(2946ppm) is much higherthan expected
mic_probe 4:0:0 as board #0
mic: number of devices detected 1 
mic0: Resetting (Post Code F2)
Reattempting reset after F2/F4 failure
mic0: Transition from state resetting to resetting
mic0: Resetting (Post Code 3C)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3E)
mic0: Resetting (Post Code 3E)
mic0: Resetting (Post Code 3E)
mic0: Resetting (Post Code F2)
Reattempting reset after F2/F4 failure
mic0: Transition from state resetting to resetting
mic0: Resetting (Post Code 3C)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3E)
mic0: Resetting (Post Code 3E)
mic0: Resetting (Post Code 3E)
mic0: Resetting (Post Code F2)
Reattempting reset after F2/F4 failure 
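
The repeating pattern in these kernel messages is easier to see once the post codes are grouped per reset attempt. Below is a minimal Python sketch, assuming the "mic0: Resetting (Post Code XX)" and "Reattempting reset" line formats shown in the excerpt; the function name is hypothetical, and the sample is a shortened copy of the log.

```python
import re

# A few lines copied (and shortened) from the dmesg excerpt above.
DMESG_LINES = """\
mic0: Resetting (Post Code 3C)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3E)
mic0: Resetting (Post Code F2)
Reattempting reset after F2/F4 failure
mic0: Transition from state resetting to resetting
mic0: Resetting (Post Code 3C)
""".splitlines()

POST_CODE = re.compile(r"mic\d+: Resetting \(Post Code ([0-9A-Fa-f]{2})\)")


def reset_attempts(lines):
    """Split the post codes into one list per reset attempt.

    A new attempt starts at each "Reattempting reset" line; consecutive
    repeats of the same code are collapsed into one entry.
    """
    attempts, current = [], []
    for line in lines:
        if "Reattempting reset" in line:
            if current:
                attempts.append(current)
            current = []
            continue
        m = POST_CODE.search(line)
        if m:
            code = m.group(1).upper()
            if not current or current[-1] != code:
                current.append(code)
    if current:
        attempts.append(current)
    return attempts


if __name__ == "__main__":
    for number, codes in enumerate(reset_attempts(DMESG_LINES), start=1):
        print("attempt", number, "->", " ".join(codes))
```

Run against the full log, this would show whether every attempt walks through the same codes before the driver retries; it does not interpret the codes themselves.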

Is it possible that the Xeon Phi's microcontroller answers some requests over PCIe, but the Xeon Phi chip itself is dead?

Petr_P_
Beginner

micinfo results are at the end of the first post
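
In that micinfo output, the values that come from PCI configuration (Vendor ID, Device ID, PCIe width and speed) are filled in, while the fields that have to be read from the card itself are NotAvailable. A minimal Python sketch for splitting the two groups, assuming "name : value" field lines as in the log; the function name is hypothetical, and the sample contains only a few field lines copied from the first post.

```python
# A few field lines copied from the micinfo output in the first post.
MICINFO_OUTPUT = """\
    Version
        Flash Version            : NotAvailable
        uOS Version              : NotAvailable
    Board
        Vendor ID                : 0x8086
        Device ID                : 0x2250
        PCIe Width               : x16
        ECC Mode                 : NotAvailable
"""


def split_fields(micinfo_text):
    """Return (available, unavailable): a dict of populated fields and a
    list of field names whose value is NotAvailable."""
    available, unavailable = {}, []
    for line in micinfo_text.splitlines():
        if ":" not in line:
            continue  # section headings such as "Board" carry no value
        name, value = (part.strip() for part in line.split(":", 1))
        if value == "NotAvailable":
            unavailable.append(name)
        else:
            available[name] = value
    return available, unavailable


if __name__ == "__main__":
    available, unavailable = split_fields(MICINFO_OUTPUT)
    print("populated:", ", ".join(sorted(available)))
    print("NotAvailable:", ", ".join(unavailable))
```

Header lines of the real log that contain a colon (the timestamp, "Device No: 0, ...") would need extra handling; the sketch skips that on purpose.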

Frances_R_Intel
Employee

Did you install a new MPSS without doing a 'service mpss unload'? Based on the first messages file, here is what I think happened.

The first mpssd is from before you installed the new MPSS. When you did the install, it didn't kill the mpssd (it isn't supposed to) and it didn't unload the mic kernel module (again, it isn't supposed to). The directions ask you to do this yourself. (I suspect this is to make things easier for cluster administrators who may be installing on a different root image from the one that is running.) However, it appears to have removed the lock file.

When you ran the first miccheck, it found no lock file and told you there was no mpssd. So you restarted the mpss service, creating the second mpssd and the lock file. You can check that it is this second mpssd that is using the lock file by doing 'fuser /var/lock/mpss'. However, now you had two mpssd daemons that thought they were talking to the card, and the mic kernel module was probably a little confused as well. You tried starting the mpssd daemon again, but this time it found a lock file and failed. This is probably the defunct mpssd in your ps output.

The second mpssd tried to pass a reset command to the card, but when it tried to reset the memory on the card, it failed. That is what the post code F2 means. (For the curious, you can find the post codes in the MPSS Users Guide that comes with each MPSS release.)
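
To spot the "more than one mpssd" situation quickly, the ps listing can be counted by script. This is only a minimal Python sketch, assuming the four-column `ps -A` layout (PID, TTY, TIME, CMD) from the first post; the function name is hypothetical, and the sample is that post's `ps -A | grep mpss` output.

```python
# Output of `ps -A | grep mpss` from the first post.
PS_OUTPUT = """\
 3566 pts/0    00:00:00 mpssd
 3578 pts/0    00:00:00 mpssd
 3579 pts/0    00:00:00 mpssd <defunct>
"""


def mpssd_processes(ps_text):
    """Return (live_pids, defunct_pids) for mpssd entries in `ps -A` output."""
    live, defunct = [], []
    for line in ps_text.splitlines():
        fields = line.split()
        # Expected columns: PID TTY TIME CMD [<defunct>]
        if len(fields) < 4 or fields[3] != "mpssd":
            continue
        pid = int(fields[0])
        if "<defunct>" in fields[4:]:
            defunct.append(pid)
        else:
            live.append(pid)
    return live, defunct


if __name__ == "__main__":
    live, defunct = mpssd_processes(PS_OUTPUT)
    print("live mpssd PIDs:", live)
    print("defunct mpssd PIDs:", defunct)
    if len(live) > 1:
        print("more than one live mpssd; check the lock file holder "
              "with 'fuser /var/lock/mpss'")
```

The sketch only counts processes; which PID actually holds /var/lock/mpss still has to be confirmed with fuser on the host.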

The fix? Well, if I am right, the reboot of the host should have fixed things, but the coprocessor might be very confused. Can you try powering down the host? Actually pull the plug and wait a few seconds? This will completely reset the card. Let me know what happens.

Khalaj__Mohammad_Ebr

Hi,

I installed two Xeon Phi 7120P coprocessors on an Asus Z270 WS motherboard. The motherboard supports the Above 4G option, so that is not a problem. My operating system is CentOS 6.9. Both Xeon Phis are blinking when the system is on, but "lspci -vv" shows only one of them. I physically removed the coprocessor that the system recognized and kept the one that was missing. After rebooting, the system recognized the remaining card, which had not been recognized while the other coprocessor was installed. But when both are installed, only one of them is recognized by the system.

My bigger problem is that the one that is recognized suffers from thermal issues and becomes very hot. After 5 minutes, "lspci -vv" shows the coprocessor like this:

05:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: mic

I did some research on "Unknown header type 7f" and found that it can be a thermal issue. When I checked the coprocessors physically, they were very hot. My case is a full tower with 4 big fans. I also put the case in a server room, but the same thing happens. Besides the Xeon Phi coprocessors, there are two GPUs in the case.
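
To catch the moment the card changes to this state, the lspci output can be scanned for the two markers seen above, "(rev ff)" and "Unknown header type 7f". The following is a minimal Python sketch with a hypothetical function name. It rests on an assumption: such all-ones values usually mean the device has stopped answering configuration reads, which is consistent with overheating but does not prove it.

```python
import re

# The lspci -vv entry quoted above.
LSPCI_OUTPUT = """\
05:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: mic
"""

# Device entries start in column 0 with a bus:device.function address.
DEVICE_LINE = re.compile(r"^([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s+(.*)$")


def unresponsive_devices(lspci_text):
    """Return PCI addresses whose entry shows '(rev ff)' or
    'Unknown header type 7f'."""
    flagged, current = [], None
    for line in lspci_text.splitlines():
        m = DEVICE_LINE.match(line)
        if m:
            current = m.group(1)
            suspicious = "(rev ff)" in m.group(2)
        elif current is not None:
            suspicious = "Unknown header type 7f" in line
        else:
            continue
        if suspicious and current not in flagged:
            flagged.append(current)
    return flagged


if __name__ == "__main__":
    print("devices that look unresponsive:", unresponsive_devices(LSPCI_OUTPUT))
```

Running such a check periodically after boot, next to a timestamp, would show how long the card stays visible before it drops off.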

How can I avoid this problem? Is it because of the case? I don't think it is normal for the coprocessors to be that hot, since there is no load on them.

Any help would be appreciated.

Thanks
