I am currently managing a CentOS host system with several Xeon Phi 5100P coprocessors. One of the coprocessors (mic0) is exhibiting issues with accessing the SMC buffers, making it difficult to (1) perform/verify firmware updates via micflash, (2) verify coprocessor operations via miccheck, and (3) get power and thermal information via micsmc. The other coprocessors in the system do not exhibit these issues.
Output from "micflash -update -device 0"
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_B1_0390-02.rom.smc
mic0: Flash update started
mic0: Flash update done
mic0: SMC update started
micflash: mic0: SMC update failed: SMC buffer size exceeded (0x1)
mic0: Transitioning to ready state
Please restart host for flash changes to take effect
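The log above reports two separate stages, the flash update and the SMC update, and only the second one fails. A minimal Python sketch for pulling the last reported state of each stage out of a captured micflash log is shown here. It only looks at text you have already saved, the function name `micflash_stage_status` is hypothetical, and the assumption that a successful SMC stage prints "SMC update done" (by analogy with the flash stage) is not confirmed by this thread.

```python
import re

# Matches the stage messages seen in the log above. "SMC update done" is an
# assumed message, modelled on "Flash update done"; it does not appear in the thread.
STAGE_RE = re.compile(r"(Flash|SMC) update (started|done|failed)")

# The micflash output quoted in this post.
MICFLASH_LOG = """\
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_B1_0390-02.rom.smc
mic0: Flash update started
mic0: Flash update done
mic0: SMC update started
micflash: mic0: SMC update failed: SMC buffer size exceeded (0x1)
mic0: Transitioning to ready state
Please restart host for flash changes to take effect
"""


def micflash_stage_status(log_text):
    """Map each update stage ('Flash', 'SMC') to the last state reported for it."""
    status = {}
    for stage, state in STAGE_RE.findall(log_text):
        status[stage] = state  # later messages overwrite earlier ones
    return status


print(micflash_stage_status(MICFLASH_LOG))
# {'Flash': 'done', 'SMC': 'failed'}
```

Run against a saved log after each flash attempt, this makes it easy to see whether a retry got past the SMC stage.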
Output from "miccheck -d 0"
MicCheck 3.3-r1
Copyright 2013 Intel Corporation All Rights Reserved
Executing default tests for host
Test 0: Check number of devices the OS sees in the system ... pass
Test 1: Check mic driver is loaded ... pass
Test 2: Check number of devices driver sees in the system ... pass
Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
Test 5 (mic0): Check ras daemon is available in device ... pass
Test 6 (mic0): Check running flash version is correct ... pass
Test 7 (mic0): Check running SMC firmware version is correct ... fail
failed to get thermal information
Status: FAIL
Failure: failed to get thermal information
This failure appears to be caused by the missing thermal information rather than by a firmware version mismatch. The output from "micsmc" and "micflash -getversion" confirms this when compared against mic1:
Output from "micsmc -a mic0"
mic0 (info):
   Device Series: ........... Intel(R) Xeon Phi(TM) coprocessor x100 family
   Device ID: ............... 0x2250
   Number of Cores: ......... 60
   OS Version: .............. 2.6.38.8+mpss3.3
   Flash Version: ........... 2.1.02.0390
   Driver Version: .......... 3.3-1 (<hostname omitted>)
   Stepping: ................ 0x3
   Substepping: ............. 0x0
Error: mic0: while accessing device temperature data: thermal info: RAS: cmd 0x25: Error 0x7: SMC communication error
Error: mic0: while accessing device frequency data: power limits info: RAS: cmd 0x2a: Error 0x7: SMC communication error
mic0 (mem):
   Free Memory: ............. 7404.34 MB
   Total Memory: ............ 7697.61 MB
   Memory Usage: ............ 293.27 MB
mic0 (cores):
   Device Utilization: User: 0.00%, System: 0.01%, Idle: 99.99%
   Per Core Utilization (60 cores in use)
   <output omitted: mic0 (cores) is okay>
Output from "micsmc -a mic1"
mic1 (info):
   Device Series: ........... Intel(R) Xeon Phi(TM) coprocessor x100 family
   Device ID: ............... 0x2250
   Number of Cores: ......... 60
   OS Version: .............. 2.6.38.8+mpss3.3
   Flash Version: ........... 2.1.02.0390
   Driver Version: .......... 3.3-1 (<hostname omitted>)
   Stepping: ................ 0x3
   Substepping: ............. 0x0
mic1 (temp):
   Cpu Temp: ................ 48.00 C
   Memory Temp: ............. 39.00 C
   Fan-In Temp: ............. 31.00 C
   Fan-Out Temp: ............ 39.00 C
   Core Rail Temp: .......... 36.00 C
   Uncore Rail Temp: ........ 38.00 C
   Memory Rail Temp: ........ 38.00 C
mic1 (freq):
   Core Frequency: .......... 1.05 GHz
   Total Power: ............. 103.00 Watts
   Low Power Limit: ......... 257.00 Watts
   High Power Limit: ........ 306.00 Watts
   Physical Power Limit: .... 326.00 Watts
mic1 (mem):
   Free Memory: ............. 7372.31 MB
   Total Memory: ............ 7697.61 MB
   Memory Usage: ............ 325.30 MB
mic1 (cores):
   Device Utilization: User: 0.00%, System: 0.04%, Idle: 99.96%
   Per Core Utilization (60 cores in use)
   <output omitted>
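The difference between the two micsmc outputs is that mic0 has no (temp) or (freq) sections and reports two RAS errors in their place. A rough Python sketch of that comparison, working on saved output text rather than calling micsmc itself, might look like the following. The function names and the abbreviated sample strings are illustrative only, and the section header pattern is inferred from the output shown in this post.

```python
import re

# Section headers look like "mic0 (info):"; error lines carry a RAS command
# code and an error code. Both patterns are inferred from the output above.
SECTION_RE = re.compile(r"\bmic\d+ \((\w+)\):")
RAS_ERROR_RE = re.compile(r"RAS: cmd (0x[0-9a-fA-F]+): Error (0x[0-9a-fA-F]+)")

# Abbreviated excerpts of the outputs quoted in this post.
MIC0_SAMPLE = (
    "mic0 (info):\n"
    "   Flash Version: ........... 2.1.02.0390\n"
    "Error: mic0: while accessing device temperature data: thermal info: "
    "RAS: cmd 0x25: Error 0x7: SMC communication error\n"
    "Error: mic0: while accessing device frequency data: power limits info: "
    "RAS: cmd 0x2a: Error 0x7: SMC communication error\n"
    "mic0 (mem):\n"
    "mic0 (cores):\n"
)
MIC1_SAMPLE = "mic1 (info):\nmic1 (temp):\nmic1 (freq):\nmic1 (mem):\nmic1 (cores):\n"


def summarize_micsmc(text):
    """Return the section names and (cmd, error) code pairs found in the text."""
    return {
        "sections": SECTION_RE.findall(text),
        "ras_errors": RAS_ERROR_RE.findall(text),
    }


def missing_sections(suspect_text, reference_text):
    """List sections the reference device reports but the suspect device does not."""
    suspect = set(summarize_micsmc(suspect_text)["sections"])
    reference = summarize_micsmc(reference_text)["sections"]
    return [name for name in reference if name not in suspect]


print(missing_sections(MIC0_SAMPLE, MIC1_SAMPLE))
# ['temp', 'freq']
print(summarize_micsmc(MIC0_SAMPLE)["ras_errors"])
# [('0x25', '0x7'), ('0x2a', '0x7')]
```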
Output of "micinfo -d 0"
MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.
Created Wed Sep 24 21:01:13 2014
System Info
HOST OS : Linux
OS Version : 2.6.32-431.23.3.el6.x86_64
Driver Version : 3.3-1
MPSS Version : 3.3
Host Physical Memory : 32846 MB
Device No: 0, Device Name: mic0
micinfo: Failed to get thermal info: RAS: cmd 0x25: Error 0x7: SMC communication error: Success
micinfo: version info failed: RAS: cmd 0x25: Error 0x7: SMC communication error: Success
Output of "micinfo -d 1"
MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.
Created Wed Sep 24 20:59:51 2014
System Info
HOST OS : Linux
OS Version : 2.6.32-431.23.3.el6.x86_64
Driver Version : 3.3-1
MPSS Version : 3.3
Host Physical Memory : 32846 MB
Device No: 1, Device Name: mic1
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.3
Device Serial Number : ADKC32601544
Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS
Cores
Total No of Active Cores : 60
Voltage : 934000 uV
Frequency : 1052631 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 47 C
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
miccheck on mic1 passes. The firmware and SMC boot loader on mic1 are up to date, so mic0 should report the same values, assuming micflash succeeded on mic0 with both the firmware update (verified via micsmc above and via micflash -getversion -device 0) and the boot loader update (not verified; I don't know how to check it other than with micinfo).
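One way to make this comparison explicit is to extract the version fields from saved micinfo output for both cards and put them side by side. The Python sketch below does that under a few assumptions: the output uses "Key : Value" lines as shown above, the helper names `parse_micinfo` and `compare_versions` are made up for illustration, and a field that micinfo could not read is simply reported as missing (`None`).

```python
# Version fields of interest, named as they appear in the mic1 output above.
VERSION_KEYS = ("Flash Version", "SMC Firmware Version", "SMC Boot Loader Version")

# Abbreviated excerpts of the micinfo outputs quoted in this post.
MIC0_EXCERPT = """\
Device No: 0, Device Name: mic0
micinfo: Failed to get thermal info: RAS: cmd 0x25: Error 0x7: SMC communication error: Success
micinfo: version info failed: RAS: cmd 0x25: Error 0x7: SMC communication error: Success
"""
MIC1_EXCERPT = """\
Device No: 1, Device Name: mic1
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.3
"""


def parse_micinfo(text):
    """Collect 'Key : Value' lines into a dict; other lines are ignored."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # only lines that contain the ' : ' separator
            fields[key.strip()] = value.strip()
    return fields


def compare_versions(suspect_text, reference_text):
    """Pair each version field as (suspect value, reference value)."""
    suspect = parse_micinfo(suspect_text)
    reference = parse_micinfo(reference_text)
    return {key: (suspect.get(key), reference.get(key)) for key in VERSION_KEYS}


for key, (mic0_value, mic1_value) in compare_versions(MIC0_EXCERPT, MIC1_EXCERPT).items():
    print(f"{key}: mic0={mic0_value} mic1={mic1_value}")
```

With the outputs from this post, every mic0 value comes back as `None`, which reflects the problem described above: the boot loader version on mic0 cannot be confirmed while micinfo fails with the SMC communication error.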
I used these references, but they were of minimal help:
- Flash issues and remedies: https://software.intel.com/en-us/forums/topic/494772
- Flash version too old? https://software.intel.com/en-us/forums/topic/402175
- Cannot monitor MICs with micsmc: https://software.intel.com/en-us/forums/topic/402397
I hope I don't have to get a replacement for mic0, but it looks like that might be necessary if I want power and thermal readings from it.
The symptom "SMC buffer size exceeded" when flashing a coprocessor is a known issue for the 5120D SKU (I assume mic0 and mic1 are the same SKU). Here is what you can try:
- Flash mic0 again and watch the SMC's blue LED on the back of the card. Make sure the blue LED is blinking while the card is being flashed.
- If you still have the same problem, please upgrade to the latest MPSS 3.4 and try again.