I am currently managing a CentOS host system with several Xeon Phi 5100P coprocessors. One of the coprocessors (mic0) is exhibiting issues with accessing the SMC buffers, making it difficult to (1) perform/verify firmware updates via micflash, (2) verify coprocessor operations via miccheck, and (3) get power and thermal information via micsmc. The other coprocessors in the system do not exhibit these issues.
Output from "micflash -update -device 0"
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_B1_0390-02.rom.smc
mic0: Flash update started
mic0: Flash update done
mic0: SMC update started
micflash: mic0: SMC update failed: SMC buffer size exceeded (0x1)
mic0: Transitioning to ready state
Please restart host for flash changes to take effect
Output from "miccheck -d 0"
MicCheck 3.3-r1
Copyright 2013 Intel Corporation All Rights Reserved
Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass
  Test 7 (mic0): Check running SMC firmware version is correct ... fail
    failed to get thermal information
Status: FAIL
Failure: failed to get thermal information
This failure appears to be caused only by the missing thermal information, not by the firmware version itself. The output from "micsmc" and "micflash -getversion", checked against mic1, supports this:
Output from "micsmc -a mic0"
mic0 (info):
        Device Series: ........... Intel(R) Xeon Phi(TM) coprocessor x100 family
        Device ID: ............... 0x2250
        Number of Cores: ......... 60
        OS Version: .............. 2.6.38.8+mpss3.3
        Flash Version: ........... 2.1.02.0390
        Driver Version: .......... 3.3-1 (<hostname omitted>)
        Stepping: ................ 0x3
        Substepping: ............. 0x0
Error: mic0: while accessing device temperature data: thermal info: RAS: cmd 0x25: Error 0x7: SMC communication error
Error: mic0: while accessing device frequency data: power limits info: RAS: cmd 0x2a: Error 0x7: SMC communication error
mic0 (mem):
        Free Memory: ............. 7404.34 MB
        Total Memory: ............ 7697.61 MB
        Memory Usage: ............ 293.27 MB
mic0 (cores):
        Device Utilization: User: 0.00%, System: 0.01%, Idle: 99.99%
        Per Core Utilization (60 cores in use)
        <output omitted: mic0 (cores) is okay>
Output from "micsmc -a mic1"
mic1 (info):
        Device Series: ........... Intel(R) Xeon Phi(TM) coprocessor x100 family
        Device ID: ............... 0x2250
        Number of Cores: ......... 60
        OS Version: .............. 2.6.38.8+mpss3.3
        Flash Version: ........... 2.1.02.0390
        Driver Version: .......... 3.3-1 (<hostname omitted>)
        Stepping: ................ 0x3
        Substepping: ............. 0x0
mic1 (temp):
        Cpu Temp: ................ 48.00 C
        Memory Temp: ............. 39.00 C
        Fan-In Temp: ............. 31.00 C
        Fan-Out Temp: ............ 39.00 C
        Core Rail Temp: .......... 36.00 C
        Uncore Rail Temp: ........ 38.00 C
        Memory Rail Temp: ........ 38.00 C
mic1 (freq):
        Core Frequency: .......... 1.05 GHz
        Total Power: ............. 103.00 Watts
        Low Power Limit: ......... 257.00 Watts
        High Power Limit: ........ 306.00 Watts
        Physical Power Limit: .... 326.00 Watts
mic1 (mem):
        Free Memory: ............. 7372.31 MB
        Total Memory: ............ 7697.61 MB
        Memory Usage: ............ 325.30 MB
mic1 (cores):
        Device Utilization: User: 0.00%, System: 0.04%, Idle: 99.96%
        Per Core Utilization (60 cores in use)
        <output omitted>
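To spot which readings are affected without eyeballing the whole dump, the RAS error lines can be pulled out of saved tool output. The following is only a rough sketch in Python: it assumes the output was captured to text with one message per line, and the pattern and the function name find_ras_errors are illustrative and not part of MPSS. It does not call any MPSS tool itself.

```python
import re

# Pattern modeled on the error lines in the micsmc output above, e.g.
# "Error: mic0: while accessing device temperature data: thermal info: "
# "RAS: cmd 0x25: Error 0x7: SMC communication error"
RAS_ERROR = re.compile(
    r"Error: (?P<device>mic\d+): while accessing (?P<target>[^:]+): "
    r".*?RAS: cmd (?P<cmd>0x[0-9a-fA-F]+): Error (?P<code>0x[0-9a-fA-F]+): "
    r"(?P<message>.+)"
)


def find_ras_errors(text):
    """Return one dict per RAS error line found in captured tool output."""
    errors = []
    for line in text.splitlines():
        match = RAS_ERROR.search(line)
        if match:
            fields = match.groupdict()
            fields["message"] = fields["message"].strip()
            errors.append(fields)
    return errors


# Sample text copied from the mic0 output in this post (shortened).
sample = """\
mic0 (info):
        Flash Version: ........... 2.1.02.0390
Error: mic0: while accessing device temperature data: thermal info: RAS: cmd 0x25: Error 0x7: SMC communication error
Error: mic0: while accessing device frequency data: power limits info: RAS: cmd 0x2a: Error 0x7: SMC communication error
mic0 (mem):
        Free Memory: ............. 7404.34 MB
"""

for err in find_ras_errors(sample):
    print(err["device"], err["target"], err["cmd"], err["code"], err["message"])
```

Run against the mic1 output, the same function should return an empty list, since that output contains no "Error:" lines.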
Output of "micinfo -d 0"
MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.
Created Wed Sep 24 21:01:13 2014
        System Info
                HOST OS                 : Linux
                OS Version              : 2.6.32-431.23.3.el6.x86_64
                Driver Version          : 3.3-1
                MPSS Version            : 3.3
                Host Physical Memory    : 32846 MB
Device No: 0, Device Name: mic0
micinfo: Failed to get thermal info: RAS: cmd 0x25: Error 0x7: SMC communication error: Success
micinfo: version info failed: RAS: cmd 0x25: Error 0x7: SMC communication error: Success
Output of "micinfo -d 1"
MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.
Created Wed Sep 24 20:59:51 2014
        System Info
                HOST OS                 : Linux
                OS Version              : 2.6.32-431.23.3.el6.x86_64
                Driver Version          : 3.3-1
                MPSS Version            : 3.3
                Host Physical Memory    : 32846 MB
Device No: 1, Device Name: mic1
        Version
                Flash Version            : 2.1.02.0390
                SMC Firmware Version     : 1.16.5078
                SMC Boot Loader Version  : 1.8.4326
                uOS Version              : 2.6.38.8+mpss3.3
                Device Serial Number     : ADKC32601544
        Board
                Vendor ID                : 0x8086
                Device ID                : 0x2250
                Subsystem ID             : 0x2500
                Coprocessor Stepping ID  : 3
                PCIe Width               : x16
                PCIe Speed               : 5 GT/s
                PCIe Max payload size    : 256 bytes
                PCIe Max read req size   : 512 bytes
                Coprocessor Model        : 0x01
                Coprocessor Model Ext    : 0x00
                Coprocessor Type         : 0x00
                Coprocessor Family       : 0x0b
                Coprocessor Family Ext   : 0x00
                Coprocessor Stepping     : B1
                Board SKU                : B1PRQ-5110P/5120D
                ECC Mode                 : Enabled
                SMC HW Revision          : Product 225W Passive CS
        Cores
                Total No of Active Cores : 60
                Voltage                  : 934000 uV
                Frequency                : 1052631 kHz
        Thermal
                Fan Speed Control        : N/A
                Fan RPM                  : N/A
                Fan PWM                  : N/A
                Die Temp                 : 47 C
        GDDR
                GDDR Vendor              : Elpida
                GDDR Version             : 0x1
                GDDR Density             : 2048 Mb
                GDDR Size                : 7936 MB
                GDDR Technology          : GDDR5
                GDDR Speed               : 5.000000 GT/s
                GDDR Frequency           : 2500000 kHz
                GDDR Voltage             : 1501000 uV
miccheck on mic1 passes. The firmware and SMC bootloader on mic1 are up to date, so mic0 should report the same values, assuming micflash did its job on mic0 with both the firmware update (verified via micsmc above and via micflash -getversion -device 0) and the bootloader update (not verified; I don't know how to check it other than with micinfo).
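The comparison I have in mind can be sketched as follows. This is a minimal Python illustration that works on saved micinfo text only; the helper names parse_micinfo and compare_versions are made up for this post, and the parsing assumes the "Key : Value" layout shown in the micinfo output above.

```python
VERSION_KEYS = ["Flash Version", "SMC Firmware Version", "SMC Boot Loader Version"]


def parse_micinfo(text):
    """Collect 'Key : Value' lines from micinfo-style output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # lines without ' : ' (headers, error messages) are skipped
            fields[key.strip()] = value.strip()
    return fields


def compare_versions(reference, target):
    """Classify each version field of target against a known-good reference."""
    report = {}
    for key in VERSION_KEYS:
        ref, got = reference.get(key), target.get(key)
        if got is None:
            report[key] = "unavailable"
        elif got == ref:
            report[key] = "match"
        else:
            report[key] = "differs"
    return report


# Text copied from the micinfo output in this post (shortened).
mic1_text = """\
Device No: 1, Device Name: mic1
        Version
                Flash Version            : 2.1.02.0390
                SMC Firmware Version     : 1.16.5078
                SMC Boot Loader Version  : 1.8.4326
"""
mic0_text = """\
Device No: 0, Device Name: mic0
micinfo: Failed to get thermal info: RAS: cmd 0x25: Error 0x7: SMC communication error: Success
micinfo: version info failed: RAS: cmd 0x25: Error 0x7: SMC communication error: Success
"""

report = compare_versions(parse_micinfo(mic1_text), parse_micinfo(mic0_text))
for key, status in report.items():
    print(f"{key}: {status}")
```

With the mic0 text above, every version field comes back as unavailable, which is exactly the problem: micinfo cannot confirm the SMC firmware or bootloader version on mic0 while the SMC communication error persists.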
I used these references, but they were of minimal help:
- Flash issues and remedies: https://software.intel.com/en-us/forums/topic/494772
- Flash version too old? https://software.intel.com/en-us/forums/topic/402175
- Cannot monitor MICs with micsmc: https://software.intel.com/en-us/forums/topic/402397
I hope I don't have to get a replacement for mic0, but it looks like that might be necessary if I want power and thermal readings from it.
The symptom "SMC buffer size exceeded" when flashing a coprocessor is a known issue for the 5120D SKU (I assume mic0 and mic1 are the same SKU). Here is what you can try:
	- Flash mic0 again and watch the SMC's blue LED on the back of the card. Make sure the blue LED is blinking while the card is being flashed.
	- If you still have the same problem, please upgrade to the latest MPSS (3.4) and try again.