Dear friends, can you help us with the following problem?
We have an Intel Xeon Phi 5110P installed on an ASUS P8Z77 WS motherboard.
OS: CentOS 6 with the required kernel version.
We installed MPSS 3.3.2 and tried to switch the Xeon Phi to online mode and to update its flash, but got "reset failed" or timeout messages.
Is it broken?
Here is the log with the details:
1. ifconfig shows the mic0 interface
[root@171202-1 openflow]# ifconfig
eth4      Link encap:Ethernet HWaddr 40:16:7E:34:E8:08
          inet addr:192.168.0.66 Bcast:192.168.0.255 Mask:255.255.255.0
          UP BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
          Interrupt:18 Memory:f0700000-f0720000

eth5      Link encap:Ethernet HWaddr 82:50:FD:BC:9A:C7
          inet addr:172.31.1.254 Bcast:172.31.1.255 Mask:255.255.255.0
          inet6 addr: fe80::8050:fdff:febc:9ac7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:11 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:660 (660.0 b) TX bytes:288 (288.0 b)
          Interrupt:17 Memory:f0800000-f0820000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:116 errors:0 dropped:0 overruns:0 frame:0
          TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:9828 (9.5 KiB) TX bytes:9828 (9.5 KiB)

mic0      Link encap:Ethernet HWaddr 82:50:FD:BC:9A:C7
          inet addr:172.31.1.254 Bcast:172.31.1.255 Mask:255.255.255.0
          inet6 addr: fe80::8050:fdff:febc:9ac7/64 Scope:Link
          UP BROADCAST RUNNING MTU:64512 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
2. After booting Linux, mpssd was not running (miccheck shows this), so we started it.
[root@171202-1 openflow]# miccheck
MicCheck 3.4-r1
Copyright 2013 Intel Corporation All Rights Reserved
Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... fail
    mpssd daemon not running
Status: FAIL
Failure: mpssd daemon not running
...
[root@171202-1 openflow]# service mpss start
Starting Intel(R) MPSS: [FAILED]
[root@171202-1 openflow]# mpssd &
[1] 3578
[root@171202-1 openflow]# Error aquiring lockfile /var/lock/mpss: File exists
[root@171202-1 openflow]# ps -A | grep mpss
 3566 pts/0 00:00:00 mpssd
 3578 pts/0 00:00:00 mpssd
 3579 pts/0 00:00:00 mpssd <defunct>
3. miccheck fails on test 4: "Check device is in online state and its postcode is FF"
[root@171202-1 openflow]# miccheck
MicCheck 3.4-r1
Copyright 2013 Intel Corporation All Rights Reserved
Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... fail
    device is not online: reset failed
Status: FAIL
Failure: A device test failed
4. We created a dump of the lspci -vvv output (the complete file lspci_dump.txt is attached).
03:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 5100 series (rev 11) Subsystem: Intel Corporation Device 2500 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 16 Region 0: Memory at e00000000 (64-bit, prefetchable) [size=8G] Region 4: Memory at f0400000 (64-bit, non-prefetchable) [size=128K] Capabilities: [44] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [4c] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <4us, L1 unlimited ClockPM- Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1- EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest- Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: 
[98] MSI-X: Enable+ Count=16 Masked- Vector table: BAR=4 offset=00017000 PBA: BAR=4 offset=00018000 Capabilities: [100 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn- Kernel driver in use: mic
5. We tried to reset the Xeon Phi with micctrl.
[root@171202-1 openflow]# micctrl -s
mic0: reset failed
[root@171202-1 openflow]# micctrl -rw
mic0: resetting
[Error] Timeout booting MIC, check your installation
6. During the reset, the Linux log file "messages" (the complete file is attached) shows something like this:
....
Oct 25 13:42:16 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:17 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:18 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:19 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:20 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:21 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:22 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:23 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:24 171202-1 kernel: mic0: Resetting (Post Code F2)
Oct 25 13:42:24 171202-1 kernel: Reattempting reset after F2/F4 failure
Oct 25 13:42:24 171202-1 kernel: mic0: Transition from state resetting to resetting
Oct 25 13:42:26 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:27 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:28 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:29 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:30 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:31 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:32 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:33 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:34 171202-1 kernel: mic0: Resetting (Post Code F2)
Oct 25 13:42:34 171202-1 kernel: Reattempting reset after F2/F4 failure
Oct 25 13:42:34 171202-1 kernel: mic0: Transition from state resetting to resetting
Oct 25 13:42:36 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:37 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:38 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:38 171202-1 kernel: mic0: Transition from state resetting to reset failed
Oct 25 13:42:38 171202-1 kernel: MIC 0 RESETFAIL postcode 3d 25651
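To make the repeating pattern in this log easier to see, here is a minimal Python sketch that collapses the "Resetting (Post Code XX)" lines into one sequence per reset attempt. The function name `summarize_reset_attempts` and the shortened sample are made up for illustration; the sketch only assumes the kernel message wording shown above and is not part of MPSS.

```python
import re

# Matches either a post-code line or the driver's retry marker, so the
# parser works whether the log has one entry per line or is run together.
TOKEN_RE = re.compile(
    r"Resetting \(Post Code ([0-9A-Fa-f]{2})\)|(Reattempting reset)"
)


def summarize_reset_attempts(log_text):
    """Return one list of distinct consecutive post codes per reset attempt."""
    attempts = [[]]
    for match in TOKEN_RE.finditer(log_text):
        if match.group(2):
            # The driver announced a retry: start a new attempt.
            attempts.append([])
            continue
        code = match.group(1).upper()
        # Skip repeats of the same code within one attempt.
        if not attempts[-1] or attempts[-1][-1] != code:
            attempts[-1].append(code)
    return [codes for codes in attempts if codes]


# Shortened sample in the same format as the log above (hypothetical excerpt).
sample = """\
Oct 25 13:42:16 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:17 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:18 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:21 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:24 171202-1 kernel: mic0: Resetting (Post Code F2)
Oct 25 13:42:24 171202-1 kernel: Reattempting reset after F2/F4 failure
Oct 25 13:42:26 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:27 171202-1 kernel: mic0: Resetting (Post Code 3d)
"""

for number, codes in enumerate(summarize_reset_attempts(sample), start=1):
    print(f"attempt {number}: {' -> '.join(codes)}")
```

Run against the full messages file, this should show whether every attempt walks through the same codes before the driver gives up.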
7. We used minicom to connect to /dev/tty/MIC0, but we get only "Initialization modem".
8. micinfo results:
MicInfo Utility Log
Created Sat Oct 25 13:53:07 2014

System Info
    HOST OS : Linux
    OS Version : 2.6.32-431.el6.x86_64
    Driver Version : 3.4-1
    MPSS Version : 3.4
    Host Physical Memory : 32555 MB

Device No: 0, Device Name: mic0

Version
    Flash Version : NotAvailable
    SMC Firmware Version : NotAvailable
    SMC Boot Loader Version : NotAvailable
    uOS Version : NotAvailable
    Device Serial Number : NotAvailable

Board
    Vendor ID : 0x8086
    Device ID : 0x2250
    Subsystem ID : 0x2500
    Coprocessor Stepping ID : 3
    PCIe Width : x16
    PCIe Speed : 5 GT/s
    PCIe Max payload size : 128 bytes
    PCIe Max read req size : 512 bytes
    Coprocessor Model : 0x01
    Coprocessor Model Ext : 0x00
    Coprocessor Type : 0x00
    Coprocessor Family : 0x0b
    Coprocessor Family Ext : 0x00
    Coprocessor Stepping : B1
    Board SKU : B1PRQ-5110P/5120D
    ECC Mode : NotAvailable
    SMC HW Revision : NotAvailable

Cores
    Total No of Active Cores : NotAvailable
    Voltage : NotAvailable
    Frequency : NotAvailable

Thermal
    Fan Speed Control : NotAvailable
    Fan RPM : NotAvailable
    Fan PWM : NotAvailable
    Die Temp : NotAvailable

GDDR
    GDDR Vendor : NotAvailable
    GDDR Version : NotAvailable
    GDDR Density : NotAvailable
    GDDR Size : NotAvailable
    GDDR Technology : NotAvailable
    GDDR Speed : NotAvailable
    GDDR Frequency : NotAvailable
    GDDR Voltage : NotAvailable
We also tried the Xeon Phi with Red Hat Enterprise Linux 7.0 64-bit (kernel 3.10.0-123) and got the same result. In addition, we tried Microsoft Windows Server 2012 R2 (64-bit); in that case the MPSS installation fails and rolls back, and the installation log shows that it cannot reset the Xeon Phi either.
So when you started, mic0 was available?
I also see that your first ifconfig output shows eth5 == mic0 ?!?!?
As far as I understand the mpss stack, if you have 'mic0' then you're nearly good to go - try pinging the MIC at 172.31.1.1
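The eth5 == mic0 observation can be checked mechanically. Below is a small Python sketch that looks for interfaces sharing one hardware address in ifconfig output; `interfaces_sharing_mac` is a made-up helper name, and the regular expression assumes the older net-tools "Link encap:Ethernet HWaddr ..." layout shown in the first post.

```python
import re
from collections import defaultdict

# Interface name followed by its hardware address, in the old net-tools
# ifconfig layout ("eth5 Link encap:Ethernet HWaddr 82:50:...").
IFACE_RE = re.compile(
    r"(\S+)\s+Link encap:Ethernet\s+HWaddr\s+([0-9A-Fa-f:]{17})"
)


def interfaces_sharing_mac(ifconfig_text):
    """Return {mac: [interface names]} for every MAC used more than once."""
    by_mac = defaultdict(list)
    for name, mac in IFACE_RE.findall(ifconfig_text):
        by_mac[mac.upper()].append(name)
    return {mac: names for mac, names in by_mac.items() if len(names) > 1}


# First line of each interface block, taken from the ifconfig output above.
sample = """\
eth4      Link encap:Ethernet  HWaddr 40:16:7E:34:E8:08
eth5      Link encap:Ethernet  HWaddr 82:50:FD:BC:9A:C7
lo        Link encap:Local Loopback
mic0      Link encap:Ethernet  HWaddr 82:50:FD:BC:9A:C7
"""

print(interfaces_sharing_mac(sample))
```

A non-empty result only says that two names carry the same hardware address; it does not explain why the system ended up that way.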
Another thing to try is to power down the system and then turn it back on. I've had some heating problems with a 5110P that were caused by a motherboard fan not turning on properly. What happens if you power down the box, wait 5 minutes, power it back on, and then check the kernel messages and/or /var/log/dmesg for anything related to the MIC? You can query the temperature of the card using 'micsmc-gui' or 'micsmc -t':
# micsmc -t
mic0 (temp):
   Cpu Temp: ................ 40.00 C
   Memory Temp: ............. 26.00 C
   Fan-In Temp: ............. 24.00 C
   Fan-Out Temp: ............ 26.00 C
   Core Rail Temp: .......... 25.00 C
   Uncore Rail Temp: ........ 26.00 C
   Memory Rail Temp: ........ 26.00 C
mic1 (temp):
   Cpu Temp: ................ 39.00 C
   Memory Temp: ............. 27.00 C
   Fan-In Temp: ............. 25.00 C
   Fan-Out Temp: ............ 29.00 C
   Core Rail Temp: .......... 30.00 C
   Uncore Rail Temp: ........ 30.00 C
   Memory Rail Temp: ........ 30.00 C
'micctrl -s' and 'micinfo' should report the card even if it's not online. Post the output of those commands and it should tell us a lot more.
Thank you for your reply.
Yes, there is a mic0 interface. And you are right, obviously mic0 == eth5, but I don't know why.
We tried to ping MIC at 172.31.1.1, but got no response.
Previously, we tried shutting down, waiting, and restarting the computer, with the same result. I don't think it's a temperature problem, because we have a good fan blowing through the coprocessor, and the Xeon Phi is cold to the touch.
micsmc -t gives the following result:
[root@171202-1 openflow]# micsmc -t
Warning: mic0: cannot access device information: device is not available
[root@171202-1 openflow]# micctrl -s
mic0: reset failed
/var/log/dmesg contains the following lines (the complete version is attached):
mic 0000:04:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16 mic 0000:04:00.0: setting latency timer to 64 mic 0000:04:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16 alloc irq_desc for 48 on node -1 alloc kstat_irqs on node -1 mic 0000:04:00.0: irq 48 for MSI/MSI-X mic0: Transition from state ready to resetting BUG: soft lockup - CPU#2 stuck for 67s! [modprobe:1511] Modules linked in: mic(+)(U) iTCO_wdt iTCO_vendor_support microcode serio_raw i2c_i801 e1000e ptp pps_core sg lpc_ich mfd_core snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc shpchp ext4 jbd2 mbcache sd_mod crc_t10dif firewire_ohci firewire_core crc_itu_t ahci xhci_hcd wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] CPU 2 Modules linked in: mic(+)(U) iTCO_wdt iTCO_vendor_support microcode serio_raw i2c_i801 e1000e ptp pps_core sg lpc_ich mfd_core snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc shpchp ext4 jbd2 mbcache sd_mod crc_t10dif firewire_ohci firewire_core crc_itu_t ahci xhci_hcd wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 1511, comm: modprobe Not tainted 2.6.32-431.el6.x86_64 #1 System manufacturer System Product Name/P8Z77 WS RIP: 0010:[<ffffffff8128d7e0>] [<ffffffff8128d7e0>] delay_tsc+0x30/0x80 RSP: 0018:ffff880812fcbb78 EFLAGS: 00000212 RAX: 00000000578913d6 RBX: ffff880812fcbb98 RCX: 00000000578913d6 RDX: 0000000000226cac RSI: 00000000002de95d RDI: 00000000002de974 RBP: ffffffff8100bb8e R08: ffff88080f970668 R09: 0000000000000000 R10: 0000000000000024 R11: 0000000000000000 R12: ffff880812fcbb68 R13: ffffffff8100bb8e R14: 0000000000000246 R15: ffff880812fcbb18 FS: 00007f22a8e6f700(0000) GS:ffff88002c300000(0000) 
knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f6306286000 CR3: 0000000810401000 CR4: 00000000000407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process modprobe (pid: 1511, threadinfo ffff880812fca000, task ffff88080ea0eaa0) Stack: 000000000000000c 0000000000001b24 0000000008c435a0 0000000008c43180 <d> ffff880812fcbba8 ffffffff8128d7a6 ffff880812fcbc38 ffffffffa037a2ee <d> ffffffffa03e3238 ffff880812efc840 0000000000000246 0000000000000000 Call Trace: [<ffffffff8128d7a6>] ? __const_udelay+0x46/0x50 [<ffffffffa037a2ee>] ? calculate_etc_compensation+0xce/0x2c0 [mic] [<ffffffffa037a811>] ? adapter_init_device+0x331/0x470 [mic] [<ffffffffa036c8ca>] ? mic_probe+0x19a/0x560 [mic] [<ffffffff8128460a>] ? kobject_get+0x1a/0x30 [<ffffffff812a4db7>] ? local_pci_probe+0x17/0x20 [<ffffffff812a5fa1>] ? pci_device_probe+0x101/0x120 [<ffffffff8136d712>] ? driver_sysfs_add+0x62/0x90 [<ffffffff8136d8b0>] ? driver_probe_device+0xa0/0x2a0 [<ffffffff8136db5b>] ? __driver_attach+0xab/0xb0 [<ffffffff8136dab0>] ? __driver_attach+0x0/0xb0 [<ffffffff8136ce64>] ? bus_for_each_dev+0x64/0x90 [<ffffffff8136d64e>] ? driver_attach+0x1e/0x20 [<ffffffff8136c698>] ? bus_add_driver+0x1e8/0x2b0 [<ffffffff8136dea6>] ? driver_register+0x76/0x140 [<ffffffff81204496>] ? sysfs_create_file+0x26/0x30 [<ffffffff812a6206>] ? __pci_register_driver+0x56/0xd0 [<ffffffffa03f780a>] ? micveth_init+0x42/0x47 [mic] [<ffffffffa03f7000>] ? mic_init+0x0/0x44f [mic] [<ffffffffa03f723e>] ? mic_init+0x23e/0x44f [mic] [<ffffffff8152d515>] ? notifier_call_chain+0x55/0x80 [<ffffffffa03f7000>] ? mic_init+0x0/0x44f [mic] [<ffffffff8100204c>] ? do_one_initcall+0x3c/0x1d0 [<ffffffff810bc531>] ? sys_init_module+0xe1/0x250 [<ffffffff8100b072>] ? 
system_call_fastpath+0x16/0x1b Code: 56 41 55 41 54 53 0f 1f 44 00 00 65 44 8b 2c 25 d8 e0 00 00 49 89 fe 66 66 90 0f ae e8 e8 d9 7e d8 ff 66 90 4c 63 e0 eb 11 66 90 <f3> 90 65 8b 1c 25 d8 e0 00 00 44 39 eb 75 23 66 66 90 0f ae e8 Call Trace: [<ffffffff8128d7fa>] ? delay_tsc+0x4a/0x80 [<ffffffff8128d7a6>] ? __const_udelay+0x46/0x50 [<ffffffffa037a2ee>] ? calculate_etc_compensation+0xce/0x2c0 [mic] [<ffffffffa037a811>] ? adapter_init_device+0x331/0x470 [mic] [<ffffffffa036c8ca>] ? mic_probe+0x19a/0x560 [mic] [<ffffffff8128460a>] ? kobject_get+0x1a/0x30 [<ffffffff812a4db7>] ? local_pci_probe+0x17/0x20 [<ffffffff812a5fa1>] ? pci_device_probe+0x101/0x120 [<ffffffff8136d712>] ? driver_sysfs_add+0x62/0x90 [<ffffffff8136d8b0>] ? driver_probe_device+0xa0/0x2a0 [<ffffffff8136db5b>] ? __driver_attach+0xab/0xb0 [<ffffffff8136dab0>] ? __driver_attach+0x0/0xb0 [<ffffffff8136ce64>] ? bus_for_each_dev+0x64/0x90 [<ffffffff8136d64e>] ? driver_attach+0x1e/0x20 [<ffffffff8136c698>] ? bus_add_driver+0x1e8/0x2b0 [<ffffffff8136dea6>] ? driver_register+0x76/0x140 [<ffffffff81204496>] ? sysfs_create_file+0x26/0x30 [<ffffffff812a6206>] ? __pci_register_driver+0x56/0xd0 [<ffffffffa03f780a>] ? micveth_init+0x42/0x47 [mic] [<ffffffffa03f7000>] ? mic_init+0x0/0x44f [mic] [<ffffffffa03f723e>] ? mic_init+0x23e/0x44f [mic] [<ffffffff8152d515>] ? notifier_call_chain+0x55/0x80 [<ffffffffa03f7000>] ? mic_init+0x0/0x44f [mic] [<ffffffff8100204c>] ? do_one_initcall+0x3c/0x1d0 [<ffffffff810bc531>] ? sys_init_module+0xe1/0x250 [<ffffffff8100b072>] ? 
system_call_fastpath+0x16/0x1b ETC timer compensation(2946ppm) is much higherthan expected mic_probe 4:0:0 as board #0 mic: number of devices detected 1 mic0: Resetting (Post Code F2) Reattempting reset after F2/F4 failure mic0: Transition from state resetting to resetting mic0: Resetting (Post Code 3C) mic0: Resetting (Post Code 3d) mic0: Resetting (Post Code 3d) mic0: Resetting (Post Code 3d) mic0: Resetting (Post Code 3d) mic0: Resetting (Post Code 3E) mic0: Resetting (Post Code 3E) mic0: Resetting (Post Code 3E) mic0: Resetting (Post Code F2) Reattempting reset after F2/F4 failure mic0: Transition from state resetting to resetting mic0: Resetting (Post Code 3C) mic0: Resetting (Post Code 3d) mic0: Resetting (Post Code 3d) mic0: Resetting (Post Code 3d) mic0: Resetting (Post Code 3d) mic0: Resetting (Post Code 3E) mic0: Resetting (Post Code 3E) mic0: Resetting (Post Code 3E) mic0: Resetting (Post Code F2) Reattempting reset after F2/F4 failure
Is it possible that the Xeon Phi's microcontroller answers some requests over PCIe while the Xeon Phi chip itself is dead?
The micinfo results are at the end of the first post.
Did you install a new MPSS without doing a 'service mpss unload'? Based on the first messages file, here is what I think happened. The first mpssd is from before you installed the new MPSS. When you did the install, it didn't kill the mpssd (it isn't supposed to) and it didn't unload the mic kernel module (again, it isn't supposed to). The directions ask you to do this yourself. (I suspect this is to make things easier for cluster administrators who may be installing on a different root image from the one that is running.) However, it appears to have removed the lock file. When you ran the first miccheck, it found no lock file and told you there was no mpssd. So you restarted the mpss service, creating the second mpssd and the lock file. You can check that it is this second mpssd that is using the lock file by doing 'fuser /var/lock/mpss'. However, now you had two mpssds that thought they were talking to the card - and the mic kernel module was probably a little confused as well. You tried starting the mpssd daemon again but this time it found a lock file and failed. This is probably the defunct mpssd in your ps output. The second mpssd tried to pass a reset command to the card but when it tried to reset the memory on the card, it failed. That is what the post code F2 means. (For the curious, you can find the post codes in the MPSS Users Guide that comes with each MPSS release.)
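As a quick way to spot the "two mpssds" situation described above, here is a minimal Python sketch that scans `ps -A` output for mpssd entries and separates live ones from defunct ones. The helper name `find_mpssd_processes` is made up for illustration, and the column layout (PID, TTY, TIME, CMD) is assumed from the ps output in the first post.

```python
def find_mpssd_processes(ps_text):
    """Return (pid, is_defunct) for each mpssd entry in `ps -A` output."""
    found = []
    for line in ps_text.splitlines():
        fields = line.split()
        # Assumed ps -A columns: PID TTY TIME CMD [<defunct>]
        if len(fields) >= 4 and fields[3] == "mpssd":
            found.append((int(fields[0]), "<defunct>" in fields[4:]))
    return found


# The ps output from the first post.
sample = """\
 3566 pts/0    00:00:00 mpssd
 3578 pts/0    00:00:00 mpssd
 3579 pts/0    00:00:00 mpssd <defunct>
"""

processes = find_mpssd_processes(sample)
live = [pid for pid, defunct in processes if not defunct]
if len(live) > 1:
    print(f"more than one live mpssd: {live}")
```

Comparing the live PIDs with what 'fuser /var/lock/mpss' reports would then show which daemon actually holds the lock file.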
The fix? Well, if I am right, the reboot of the host should have fixed things, but the coprocessor might be very confused. Can you try powering down the host? Actually pull the plug and wait a few seconds? This will completely reset the card. Let me know what happens.
Hi,
I installed two Xeon Phi 7120P co-processors on an ASUS Z270 WS motherboard. The motherboard supports the Above 4G option, so that is not a problem. My operating system is CentOS 6.9. Both Xeon Phis are blinking when the system is on, but "lspci -vv" shows only one of them. I physically removed the co-processor that was recognized and kept the one that was missing. After rebooting, the system recognized the remaining one, which had not been recognized while the other co-processor was installed. But when both are installed, only one of them is recognized by the system.
My bigger problem is that the one that is recognized suffers from thermal issues and becomes very hot. After 5 minutes, "lspci -vv" shows the co-processor like this:
05:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: mic
I did some research on "Unknown header type 7f" and found that it can be a thermal issue. When I checked the co-processors physically, they were very hot. My case is a full tower with 4 big fans. I also put the case in a server room, but the same thing happens. The case also holds two GPUs in addition to the Xeon Phi co-processors.
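To catch the moment the card drops into this state, a small Python sketch like the following could be run against saved "lspci -vv" output. It flags any device whose entry shows "(rev ff)" or "Unknown header type 7f", which is what lspci prints when configuration-space reads come back as all ones. The function name `unreachable_devices` and the GPU line in the sample are made up for illustration, and the pattern assumes the short bus:device.function address form used above.

```python
import re

# Device header line in the short form lspci prints by default,
# for example "05:00.0 Co-processor: ...".
DEVICE_RE = re.compile(r"^([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]) (.*)$")


def unreachable_devices(lspci_text):
    """Return PCI addresses whose entry looks like an all-ones config read."""
    flagged = []
    current = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            if "(rev ff)" in header.group(2):
                flagged.append(current)
        elif current and "Unknown header type 7f" in line:
            if current not in flagged:
                flagged.append(current)
    return flagged


# The first device is a hypothetical healthy entry; the second is the
# Xeon Phi entry quoted above.
sample = """\
03:00.0 VGA compatible controller: Example GPU (rev a1)
\tKernel driver in use: example
05:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev ff) (prog-if ff)
\t!!! Unknown header type 7f
\tKernel driver in use: mic
"""

print(unreachable_devices(sample))
```

Running it periodically after boot would give a rough time at which the card stops answering, which could then be compared with temperature readings taken while the card is still reachable.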
How can I avoid this problem? Is it because of the case? I don't think it is normal for the co-processors to get that hot, since there is no load on them.
Any help would be appreciated.
Thanks