Good Afternoon,
My MIC card isn't working with the newly installed MPSS 3.8.2 driver. I followed the instructions in the readme to install MPSS on a new RHEL 6.9 system. Below are some of the errors that I continue to receive:
[mpss-3.8.2]# micctrl -s
mic0: reset failed
[mpss-3.8.2]# micctrl -rw
mic0: resetting
[Error] Timeout booting MIC, check your installation
I also cannot update the firmware/smc upon installing the MPSS driver and rpms:
[mpss-3.8.2]# /usr/bin/micflash -update -device all -smcbootloader
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_C0_0391-02.rom.smc
micflash: mic0: No compatible SMC boot-loader image found
[mpss-3.8.2]# /usr/bin/micflash -update -device all
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_C0_0391-02.rom.smc
micflash: micflash: mic0: Failed to switch to maintenance mode: write: /sys/class/mic/mic0/state: Input/output error
My hardware is a Supermicro workstation:
[mpss-3.8.2]# dmidecode -s bios-version
3.0a
[mpss-3.8.2]# dmidecode -s system-product-name
X9DRG-QF
Thank you for any insight or input on this issue. I've been beating my head into the ground over this.
Is the device visible to the OS? Try
# lspci | grep copro
# lspci -v -s `lspci | grep copro | awk '{ print $1 }'`
which should list something like
02:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 5100 series (rev 11)
Subsystem: Intel Corporation Device 2500
Flags: bus master, fast devsel, latency 0, IRQ 32
Memory at 21c00000000 (64-bit, prefetchable) [size=8G]
Memory at cb900000 (64-bit, non-prefetchable) [size=128K]
Capabilities: [44] Power Management version 3
Capabilities: [4c] Express Endpoint, MSI 00
Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
Capabilities: [98] MSI-X: Enable+ Count=16 Masked-
Capabilities: [100] Advanced Error Reporting
Kernel driver in use: mic
If not, then your card is not functioning normally - this usually points at a cooling problem.
If you do see the card, then we'll proceed to debugging the mic driver itself.
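The check above can be wrapped in a small helper if you need to run it on several hosts. This is only a sketch: the function name check_mic_devices is made up for illustration, and it assumes the `lspci -v` output looks like the listing above (a "Co-processor" header line followed by a "Kernel driver in use: mic" line).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MPSS): reads `lspci -v` output on stdin,
# prints one line per coprocessor, and returns non-zero if no coprocessor
# is visible or if one is not bound to the mic kernel driver.
check_mic_devices() {
    awk '
        # A new device header (bus:dev.fn at line start) ends the previous device.
        /^[0-9a-fA-F:.]+ / { dev = "" }
        # Remember each coprocessor in the order it appears.
        /Co-processor/ { dev = $1; n++; order[n] = dev; bound[dev] = "no" }
        # Only count the driver line if it belongs to a coprocessor.
        dev != "" && /Kernel driver in use: mic/ { bound[dev] = "yes" }
        END {
            if (n == 0) { print "no coprocessor visible"; exit 1 }
            rc = 0
            for (i = 1; i <= n; i++) {
                print order[i] ": driver bound = " bound[order[i]]
                if (bound[order[i]] == "no") rc = 1
            }
            exit rc
        }
    '
}
```

Typical use would be `lspci -v | check_mic_devices`, run as root so that lspci shows the driver line.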
Hello JJK,
Yes, I'm able to verify that the coprocessor card is visible on the system:
[~]# lspci | grep copro
84:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 3120 series (rev 20)
[~]# lspci -v -s `lspci | grep copro | awk '{ print $1 }'`
84:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 3120 series (rev 20)
Subsystem: Intel Corporation Device 3608
Flags: bus master, fast devsel, latency 0, IRQ 64
Memory at 383c00000000 (64-bit, prefetchable) [size=8G]
Memory at fba00000 (64-bit, non-prefetchable) [size=128K]
Capabilities: [44] Power Management version 3
Capabilities: [4c] Express Endpoint, MSI 00
Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
Capabilities: [98] MSI-X: Enable+ Count=16 Masked-
Capabilities: [100] Advanced Error Reporting
Kernel driver in use: mic
Good, that means it is most likely a software installation issue and not a hardware issue. I've got a very similar setup over here. Next, run 'micinfo' as root:
# service micras stop
# service mpss stop
# micinfo
MicInfo Utility Log
Created Mon Jul 31 10:49:25 2017
System Info
HOST OS : Linux
OS Version : 2.6.32-696.3.1.el6.x86_64
Driver Version : 3.8.2-1
MPSS Version : 3.8.2
Host Physical Memory : 64388 MB
Device No: 0, Device Name: mic0
Version
Flash Version : NotAvailable
SMC Firmware Version : NotAvailable
SMC Boot Loader Version : NotAvailable
Coprocessor OS Version : NotAvailable
Device Serial Number : NotAvailable
Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : NotAvailable
SMC HW Revision : NotAvailable
Cores
Total No of Active Cores : NotAvailable
Voltage : NotAvailable
Frequency : NotAvailable
Thermal
Fan Speed Control : NotAvailable
Fan RPM : NotAvailable
Fan PWM : NotAvailable
Die Temp : NotAvailable
GDDR
GDDR Vendor : NotAvailable
GDDR Version : NotAvailable
GDDR Density : NotAvailable
GDDR Size : NotAvailable
GDDR Technology : NotAvailable
GDDR Speed : NotAvailable
GDDR Frequency : NotAvailable
GDDR Voltage : NotAvailable
Now, start the mpss daemon and check the status:
# service mpss start
# micctrl -s
mic0: ready
and check /var/log/messages and dmesg for any errors or warnings. You can also run 'micdebug.sh' and post the output file here - that will tell the Intel support people a lot more.
Attached is the output from the 'micdebug.sh' script. I didn't see any errors in the messages file, but can confirm the following error message under 'dmesg':
[~]# dmesg |grep -i error
ERST: Error Record Serialization Table (ERST) support is initialized.
Error! Card not in offline/ready state. Cannot change mode
Error! Card not in offline/ready state. Cannot change mode
Error! Card not in offline/ready state. Cannot change mode
Below are the results from the other commands. Should I also restart the "micras" service?
[~]# service micras status
Intel(R) micras is stopped
[~]# service mpss status
mpss is stopped
[~]# micinfo
MicInfo Utility Log
Created Mon Jul 31 05:27:55 2017
System Info
HOST OS : Linux
OS Version : 2.6.32-696.3.1.el6.x86_64
Driver Version : 3.8.2-1
MPSS Version : 3.8.2
Host Physical Memory : 516840 MB
Device No: 0, Device Name: mic0
Version
Flash Version : NotAvailable
SMC Firmware Version : NotAvailable
SMC Boot Loader Version : NotAvailable
Coprocessor OS Version : NotAvailable
Device Serial Number : NotAvailable
Board
Vendor ID : 0x8086
Device ID : 0x225d
Subsystem ID : 0x3608
Coprocessor Stepping ID : 2
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : C0
Board SKU : C0PRQ-3120/3140 P/A
ECC Mode : NotAvailable
SMC HW Revision : NotAvailable
Cores
Total No of Active Cores : NotAvailable
Voltage : NotAvailable
Frequency : NotAvailable
Thermal
Fan Speed Control : NotAvailable
Fan RPM : NotAvailable
Fan PWM : NotAvailable
Die Temp : NotAvailable
GDDR
GDDR Vendor : NotAvailable
GDDR Version : NotAvailable
GDDR Density : NotAvailable
GDDR Size : NotAvailable
GDDR Technology : NotAvailable
GDDR Speed : NotAvailable
GDDR Frequency : NotAvailable
GDDR Voltage : NotAvailable
[~]# service mpss start
Loading MIC module: [ OK ]
Starting Intel(R) MPSS: [FAILED]
[~]# service mpss status
mpss is running
[~]# service micras status
Intel(R) micras is stopped
[~]# micctrl -s
mic0: reset failed
The micdebug files show (in host_dmesg.txt) a continuous cycle of:
mic0: Resetting (Post Code F2)
Reattempting reset after F2/F4 failure
mic0: Transition from state resetting to resetting
mic0: Resetting (Post Code 3C)
mic0: Resetting (Post Code 3C)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3E)
mic0: Resetting (Post Code 3E)
In this post it is suggested to power down the box, unplug the cable, then power it up again:
https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/535257
Also, make sure that only a single version of the mpss stack is running (but since you're installing on a new host, I doubt that this is causing the problem).
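A quick way to spot mixed MPSS versions among the installed packages is sketched below. The function name distinct_mpss_versions is hypothetical, and it only looks at packages with "mpss" in the name, so related packages such as libscif are not covered.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MPSS): reads "NAME VERSION" pairs on stdin,
# one package per line, and prints each distinct version seen among packages
# whose name contains "mpss". More than one output line means mixed versions.
distinct_mpss_versions() {
    awk '$1 ~ /mpss/ { print $2 }' | sort -u
}
```

It can be fed from rpm with `rpm -qa --qf '%{NAME} %{VERSION}\n' | distinct_mpss_versions`; on a clean 3.8.2 install this should print a single line.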
I don't see any other versions currently running on this system; please see below for the currently installed MPSS rpms. (FYI - I had to reinstall MPSS twice on this system to test a script that would automate this install on other systems in the environment.)
I will power down the system, unplug both power cords and power back on; then reply back with my results. Thanks.
[~]# ps -ef|grep -i mpss|grep -v grep
root 17451 1 0 05:29 pts/0 00:00:00 /usr/sbin/mpssd
[~]# rpm -qa | egrep "intel|intel-mic|libscif|glibc2.12pkg|netperf|mpss"|sort -u
glibc2.12pkg-libmicaccesssdk0-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libmicaccesssdk-dev-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libmicmgmt0-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libmicmgmt-dev-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libmicmgmt-doc-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libodmdebug0-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libodmdebug-dev-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libsettings0-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libsettings-dev-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-flash-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-memdiag-kernel-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-rasmm-kernel-3.8.2-1.glibc2.12.x86_64
intel-composerxe-compat-k1om-3.8.2-1.x86_64
libscif0-3.8.2-1.glibc2.12.x86_64
libscif-dev-3.8.2-1.glibc2.12.x86_64
libscif-doc-3.8.2-1.glibc2.12.x86_64
mpss-boot-files-3.8.2-1.glibc2.12.x86_64
mpss-coi-3.8.2-1.glibc2.12.x86_64
mpss-coi-dev-3.8.2-1.glibc2.12.x86_64
mpss-coi-doc-3.8.2-1.glibc2.12.x86_64
mpss-core-3.8.2-1.glibc2.12.x86_64
mpss-core-dev-3.8.2-1.glibc2.12.x86_64
mpss-daemon-3.8.2-1.glibc2.12.x86_64
mpss-daemon-dev-3.8.2-1.glibc2.12.x86_64
mpss-eclipse-cdt-mpm-3.8.2-1.glibc2.12.x86_64
mpss-license-3.8.2-1.glibc2.12.x86_64
mpss-miccheck-3.8.2-1.glibc2.12.x86_64
mpss-miccheck-bin-3.8.2-1.glibc2.12.x86_64
mpss-micmgmt-3.8.2-1.glibc2.12.x86_64
mpss-micmgmt-doc-3.8.2-1.glibc2.12.x86_64
mpss-micmgmt-python-3.8.2-1.glibc2.12.x86_64
mpss-micsmc-gui-3.8.2-1.glibc2.12.x86_64
mpss-modules-2.6.32-696.3.1.el6.x86_64-3.8.2-1.x86_64
mpss-modules-dev-2.6.32-696.3.1.el6.x86_64-3.8.2-1.x86_64
mpss-modules-headers-3.8.2-1.glibc2.12.x86_64
mpss-mpm-3.8.2-1.glibc2.12.x86_64
mpss-mpm-doc-3.8.2-1.glibc2.12.x86_64
mpss-myo-3.8.2-1.glibc2.12.x86_64
mpss-myo-dev-3.8.2-1.glibc2.12.x86_64
mpss-myo-doc-3.8.2-1.glibc2.12.x86_64
mpss-offload-3.8.2-1.glibc2.12.x86_64
mpss-offload-dev-3.8.2-1.glibc2.12.x86_64
mpss-sciftutorials-3.8.2-1.glibc2.12.x86_64
mpss-sciftutorials-doc-3.8.2-1.glibc2.12.x86_64
mpss-sdk-k1om-3.8.2-1.x86_64
mpss-sysmgmt-micdiagnostic-3.8.2-1.glibc2.12.x86_64
mpss-sysmgmt-micras-3.8.2-1.glibc2.12.x86_64
mpss-sysmgmt-python-3.8.2-1.glibc2.12.x86_64
netperf-2.6.0-r0.glibc2.12.x86_64
netperf-doc-2.6.0-r0.glibc2.12.x86_64
I have powered off my system and left both power cords unplugged for ~15 minutes before plugging both power cords back in and booting back up. I've attached a new micdebug output for reference. The device still seems to be in the same 'reset failed' state.
[~]# dmesg |grep -i mic0|tail -4
mic0: Resetting (Post Code 3C)
mic0: Resetting (Post Code 3C)
mic0: Resetting (Post Code 3d)
mic0: Transition from state resetting to reset failed
[~]# micctrl -s
mic0: reset failed
From the MPSS Userguide, section "Troubleshooting and Debugging" (I.2):
The POST codes are defined as follows:
"3C" Begin GDDR read training with CDR enabled
"3d" Begin GDDR read training with CDR disabled
"F2" GDDR failed memory training
"F4" Memory preservation failure
This suggests that there is a problem with the GDDR memory on the Phi board. I'd recommend talking to your sales rep, or perhaps someone from Intel support can help out here - unfortunately this DOES look like a hardware problem.
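For anyone hitting the same reset loop, the POST codes in the driver log can be pulled out and matched against the four definitions quoted above with something like the following. It is a rough sketch: both function names are made up, and any code not quoted in this thread (for example 3E) is reported as unknown rather than guessed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps a POST code to the description quoted in this
# thread from the MPSS User Guide. Matching is case-insensitive.
describe_post_code() {
    case "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')" in
        3C) echo "Begin GDDR read training with CDR enabled" ;;
        3D) echo "Begin GDDR read training with CDR disabled" ;;
        F2) echo "GDDR failed memory training" ;;
        F4) echo "Memory preservation failure" ;;
        *)  echo "unknown (not listed in this thread)" ;;
    esac
}

# Hypothetical helper: reads log lines on stdin and prints each distinct
# POST code once, in order of first appearance, with its description.
summarize_post_codes() {
    sed -n 's/.*(Post Code \([0-9A-Fa-f]\{2\}\)).*/\1/p' \
        | tr '[:lower:]' '[:upper:]' \
        | awk '!seen[$0]++' \
        | while read -r code; do
              printf '%s: %s\n' "$code" "$(describe_post_code "$code")"
          done
}
```

Usage would be along the lines of `dmesg | grep -i mic0 | summarize_post_codes`. Seeing F2 in the output points at failed GDDR training, as in this case.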
Just to follow up on this thread after some time: it was indeed a hardware issue with the coprocessor. After swapping it out for a newer coprocessor and updating the Supermicro BIOS, the card is now functional and the 'mic0' device is showing as ready:
[~]# micctrl -s
mic0: ready
