Good Afternoon,
My MIC card isn't working with the newly installed MPSS 3.8.2 driver. I followed the instructions in the readme to install MPSS on a new RHEL 6.9 system. Below are some of the errors that I continue to receive:
[mpss-3.8.2]# micctrl -s
mic0: reset failed
[mpss-3.8.2]# micctrl -rw
mic0: resetting
[Error] Timeout booting MIC, check your installation
I also cannot update the firmware/smc upon installing the MPSS driver and rpms:
[mpss-3.8.2]# /usr/bin/micflash -update -device all -smcbootloader
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_C0_0391-02.rom.smc
micflash: mic0: No compatible SMC boot-loader image found
[mpss-3.8.2]# /usr/bin/micflash -update -device all
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_C0_0391-02.rom.smc
micflash: micflash: mic0: Failed to switch to maintenance mode: write: /sys/class/mic/mic0/state: Input/output error
My hardware is a Supermicro workstation:
[mpss-3.8.2]# dmidecode -s bios-version
3.0a
[mpss-3.8.2]# dmidecode -s system-product-name
X9DRG-QF
Thank you for any insight or input on this issue. I've been beating my head into the ground over this.
is the device visible to the OS? try
# lspci | grep copro
# lspci -v -s `lspci | grep copro | awk '{ print $1 }'`
which should list something like
02:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 5100 series (rev 11)
Subsystem: Intel Corporation Device 2500
Flags: bus master, fast devsel, latency 0, IRQ 32
Memory at 21c00000000 (64-bit, prefetchable) [size=8G]
Memory at cb900000 (64-bit, non-prefetchable) [size=128K]
Capabilities: [44] Power Management version 3
Capabilities: [4c] Express Endpoint, MSI 00
Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
Capabilities: [98] MSI-X: Enable+ Count=16 Masked-
Capabilities: [100] Advanced Error Reporting
Kernel driver in use: mic
If not, then your card is not functioning normally - this usually points at a cooling problem.
If you do see the card, then we'll proceed to debugging the mic driver itself.
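The visibility check can also be scripted. What follows is only a sketch, assuming `lspci -v` output in the format shown in this thread; `mic_pci_check` is a hypothetical helper name and not part of MPSS.

```shell
# mic_pci_check: hypothetical helper, not part of MPSS.
# Reads `lspci -v` style output on stdin and reports whether a
# co-processor is listed and whether the "mic" kernel driver is in use.
# Exit status: 0 = visible and bound, 1 = visible but not bound, 2 = not visible.
mic_pci_check() {
    awk '
        /Co-processor/                    { found = 1 }
        /Kernel driver in use: mic[ \t]*$/ { bound = 1 }
        END {
            if (!found) {
                print "no coprocessor visible"
                exit 2
            } else if (!bound) {
                print "coprocessor visible, mic driver not bound"
                exit 1
            } else {
                print "coprocessor visible, mic driver bound"
            }
        }'
}
```

On a live host it would be fed from `lspci -v | mic_pci_check`. The sketch does not tie the driver line to a specific PCI device, so it is a quick sanity check rather than a full diagnosis.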
Hello JJK,
Yes, I'm able to verify that the coprocessor card is visible on the system:
[~]# lspci | grep copro
84:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 3120 series (rev 20)
[~]# lspci -v -s `lspci | grep copro | awk '{ print $1 }'`
84:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 3120 series (rev 20)
Subsystem: Intel Corporation Device 3608
Flags: bus master, fast devsel, latency 0, IRQ 64
Memory at 383c00000000 (64-bit, prefetchable) [size=8G]
Memory at fba00000 (64-bit, non-prefetchable) [size=128K]
Capabilities: [44] Power Management version 3
Capabilities: [4c] Express Endpoint, MSI 00
Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
Capabilities: [98] MSI-X: Enable+ Count=16 Masked-
Capabilities: [100] Advanced Error Reporting
Kernel driver in use: mic
Good, that means it is most likely a software installation issue and not a hardware issue. I've got a very similar setup over here. Next, run 'micinfo' as root:
# service micras stop
# service mpss stop
# micinfo
MicInfo Utility Log
Created Mon Jul 31 10:49:25 2017
System Info
HOST OS : Linux
OS Version : 2.6.32-696.3.1.el6.x86_64
Driver Version : 3.8.2-1
MPSS Version : 3.8.2
Host Physical Memory : 64388 MB
Device No: 0, Device Name: mic0
Version
Flash Version : NotAvailable
SMC Firmware Version : NotAvailable
SMC Boot Loader Version : NotAvailable
Coprocessor OS Version : NotAvailable
Device Serial Number : NotAvailable
Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : NotAvailable
SMC HW Revision : NotAvailable
Cores
Total No of Active Cores : NotAvailable
Voltage : NotAvailable
Frequency : NotAvailable
Thermal
Fan Speed Control : NotAvailable
Fan RPM : NotAvailable
Fan PWM : NotAvailable
Die Temp : NotAvailable
GDDR
GDDR Vendor : NotAvailable
GDDR Version : NotAvailable
GDDR Density : NotAvailable
GDDR Size : NotAvailable
GDDR Technology : NotAvailable
GDDR Speed : NotAvailable
GDDR Frequency : NotAvailable
GDDR Voltage : NotAvailable
Now, start the mpss daemon and check the status:
# service mpss start
# micctrl -s
mic0: ready
and check /var/log/messages and dmesg for any errors or warnings. You can also run 'micdebug.sh' and post the output file here - that will tell the Intel support people a lot more.
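If you want to script the status check, here is a rough sketch. It assumes `micctrl -s` prints one `micN: state` line per card, as in the outputs quoted in this thread; `mic_state` is a hypothetical helper name, not an MPSS tool.

```shell
# mic_state: hypothetical helper, not part of MPSS.
# Reads `micctrl -s` output on stdin and prints the state of the named
# card (default: mic0). Returns non-zero if the card is not listed.
mic_state() {
    card="${1:-mic0}"
    awk -v card="$card" -F': ' '
        $1 == card { print $2; found = 1 }
        END        { exit !found }'
}
```

After `service mpss start`, something like `micctrl -s | mic_state mic0` would then print `ready` on a healthy card, or `reset failed` in the failing case reported at the top of this thread.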
Attached is the output from the 'micdebug.sh' script. I didn't see any errors in the messages file, but can confirm the following error message under 'dmesg':
[~]# dmesg |grep -i error
ERST: Error Record Serialization Table (ERST) support is initialized.
Error! Card not in offline/ready state. Cannot change mode
Error! Card not in offline/ready state. Cannot change mode
Error! Card not in offline/ready state. Cannot change mode
Below are the results from the other commands. Should I also restart the "micras" service?
[~]# service micras status
Intel(R) micras is stopped
[~]# service mpss status
mpss is stopped
[~]# micinfo
MicInfo Utility Log
Created Mon Jul 31 05:27:55 2017
System Info
HOST OS : Linux
OS Version : 2.6.32-696.3.1.el6.x86_64
Driver Version : 3.8.2-1
MPSS Version : 3.8.2
Host Physical Memory : 516840 MB
Device No: 0, Device Name: mic0
Version
Flash Version : NotAvailable
SMC Firmware Version : NotAvailable
SMC Boot Loader Version : NotAvailable
Coprocessor OS Version : NotAvailable
Device Serial Number : NotAvailable
Board
Vendor ID : 0x8086
Device ID : 0x225d
Subsystem ID : 0x3608
Coprocessor Stepping ID : 2
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : C0
Board SKU : C0PRQ-3120/3140 P/A
ECC Mode : NotAvailable
SMC HW Revision : NotAvailable
Cores
Total No of Active Cores : NotAvailable
Voltage : NotAvailable
Frequency : NotAvailable
Thermal
Fan Speed Control : NotAvailable
Fan RPM : NotAvailable
Fan PWM : NotAvailable
Die Temp : NotAvailable
GDDR
GDDR Vendor : NotAvailable
GDDR Version : NotAvailable
GDDR Density : NotAvailable
GDDR Size : NotAvailable
GDDR Technology : NotAvailable
GDDR Speed : NotAvailable
GDDR Frequency : NotAvailable
GDDR Voltage : NotAvailable
[~]# service mpss start
Loading MIC module: [ OK ]
Starting Intel(R) MPSS: [FAILED]
[~]# service mpss status
mpss is running
[~]# service micras status
Intel(R) micras is stopped
[~]# micctrl -s
mic0: reset failed
the micdebug files show (in host_dmesg.txt) a continuous cycle of:
1776 mic0: Resetting (Post Code F2)
1777 Reattempting reset after F2/F4 failure
1778 mic0: Transition from state resetting to resetting
1779 mic0: Resetting (Post Code 3C)
1780 mic0: Resetting (Post Code 3C)
1781 mic0: Resetting (Post Code 3d)
1782 mic0: Resetting (Post Code 3d)
1783 mic0: Resetting (Post Code 3d)
1784 mic0: Resetting (Post Code 3E)
1785 mic0: Resetting (Post Code 3E)
in this post it is suggested to power down the box, unplug the cable, then power it up again:
https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/535257
Also, make sure that only a single version of the mpss stack is running (but since you're installing on a new host, I doubt that this is causing the problem).
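To see the reset cycle at a glance, the POST codes can be pulled out of the kernel log. This is a minimal sketch, assuming the `Resetting (Post Code XX)` message format shown in the dmesg excerpt above; `post_code_sequence` is a hypothetical helper name.

```shell
# post_code_sequence: hypothetical helper, not part of MPSS.
# Reads kernel log lines on stdin, extracts the value from every
# "(Post Code XX)" message, and collapses consecutive repeats so the
# order of distinct codes is easy to read.
post_code_sequence() {
    sed -n 's/.*(Post Code \([0-9A-Za-z]*\)).*/\1/p' | uniq
}
```

It could be run as `dmesg | post_code_sequence`, or against the host_dmesg.txt file from the micdebug archive. Lines without a POST code, such as the "Reattempting reset" message, are skipped.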
I don't see any other versions currently running on this system; please see below for the currently installed MPSS rpms. (FYI - I had to reinstall MPSS twice on this system to test a script that would automate this install on other systems in the environment.)
I will power down the system, unplug both power cords and power back on; then reply back with my results. Thanks.
[~]# ps -ef|grep -i mpss|grep -v grep
root 17451 1 0 05:29 pts/0 00:00:00 /usr/sbin/mpssd
[~]# rpm -qa | egrep "intel|intel-mic|libscif|glibc2.12pkg|netperf|mpss"|sort -u
glibc2.12pkg-libmicaccesssdk0-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libmicaccesssdk-dev-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libmicmgmt0-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libmicmgmt-dev-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libmicmgmt-doc-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libodmdebug0-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libodmdebug-dev-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libsettings0-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libsettings-dev-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-flash-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-memdiag-kernel-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-rasmm-kernel-3.8.2-1.glibc2.12.x86_64
intel-composerxe-compat-k1om-3.8.2-1.x86_64
libscif0-3.8.2-1.glibc2.12.x86_64
libscif-dev-3.8.2-1.glibc2.12.x86_64
libscif-doc-3.8.2-1.glibc2.12.x86_64
mpss-boot-files-3.8.2-1.glibc2.12.x86_64
mpss-coi-3.8.2-1.glibc2.12.x86_64
mpss-coi-dev-3.8.2-1.glibc2.12.x86_64
mpss-coi-doc-3.8.2-1.glibc2.12.x86_64
mpss-core-3.8.2-1.glibc2.12.x86_64
mpss-core-dev-3.8.2-1.glibc2.12.x86_64
mpss-daemon-3.8.2-1.glibc2.12.x86_64
mpss-daemon-dev-3.8.2-1.glibc2.12.x86_64
mpss-eclipse-cdt-mpm-3.8.2-1.glibc2.12.x86_64
mpss-license-3.8.2-1.glibc2.12.x86_64
mpss-miccheck-3.8.2-1.glibc2.12.x86_64
mpss-miccheck-bin-3.8.2-1.glibc2.12.x86_64
mpss-micmgmt-3.8.2-1.glibc2.12.x86_64
mpss-micmgmt-doc-3.8.2-1.glibc2.12.x86_64
mpss-micmgmt-python-3.8.2-1.glibc2.12.x86_64
mpss-micsmc-gui-3.8.2-1.glibc2.12.x86_64
mpss-modules-2.6.32-696.3.1.el6.x86_64-3.8.2-1.x86_64
mpss-modules-dev-2.6.32-696.3.1.el6.x86_64-3.8.2-1.x86_64
mpss-modules-headers-3.8.2-1.glibc2.12.x86_64
mpss-mpm-3.8.2-1.glibc2.12.x86_64
mpss-mpm-doc-3.8.2-1.glibc2.12.x86_64
mpss-myo-3.8.2-1.glibc2.12.x86_64
mpss-myo-dev-3.8.2-1.glibc2.12.x86_64
mpss-myo-doc-3.8.2-1.glibc2.12.x86_64
mpss-offload-3.8.2-1.glibc2.12.x86_64
mpss-offload-dev-3.8.2-1.glibc2.12.x86_64
mpss-sciftutorials-3.8.2-1.glibc2.12.x86_64
mpss-sciftutorials-doc-3.8.2-1.glibc2.12.x86_64
mpss-sdk-k1om-3.8.2-1.x86_64
mpss-sysmgmt-micdiagnostic-3.8.2-1.glibc2.12.x86_64
mpss-sysmgmt-micras-3.8.2-1.glibc2.12.x86_64
mpss-sysmgmt-python-3.8.2-1.glibc2.12.x86_64
netperf-2.6.0-r0.glibc2.12.x86_64
netperf-doc-2.6.0-r0.glibc2.12.x86_64
I powered off my system and left both power cords unplugged for ~15 minutes before plugging them back in and booting back up. I've attached a new micdebug output for reference. The device still seems to be in the same 'reset failed' state.
[~]# dmesg |grep -i mic0|tail -4
mic0: Resetting (Post Code 3C)
mic0: Resetting (Post Code 3C)
mic0: Resetting (Post Code 3d)
mic0: Transition from state resetting to reset failed
[~]# micctrl -s
mic0: reset failed
From the MPSS Userguide, section "Troubleshooting and Debugging" (I.2):
The POST codes are defined as follows:
"3C" Begin GDDR read training with CDR enabled
"3d" Begin GDDR read training with CDR disabled
"F2" GDDR failed memory training
"F4" Memory preservation failure
This suggests that there is a problem with the GDDR memory on the Phi board. I'd recommend talking to your sales rep, or perhaps someone from Intel support can help out here - unfortunately this DOES look like a hardware problem.
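For quick reference while reading logs, the four codes quoted from the user guide can be put into a small lookup. This is just a sketch: `describe_post_code` is a hypothetical helper, and it only knows the four codes listed above, so anything else (such as the 3E seen in the earlier dmesg output) is reported as unknown.

```shell
# describe_post_code: hypothetical helper, not part of MPSS.
# Maps the four POST codes quoted from the MPSS user guide in this thread
# to their descriptions. Any other code is reported as unknown and the
# function returns non-zero.
describe_post_code() {
    case "$1" in
        3C|3c) echo "Begin GDDR read training with CDR enabled" ;;
        3D|3d) echo "Begin GDDR read training with CDR disabled" ;;
        F2|f2) echo "GDDR failed memory training" ;;
        F4|f4) echo "Memory preservation failure" ;;
        *)     echo "unknown POST code: $1"; return 1 ;;
    esac
}
```

For example, `describe_post_code F2` prints the "GDDR failed memory training" description that points at the memory on the board.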
Just to follow up on this thread after some time: it was indeed a hardware issue with the coprocessor. After swapping it out for a newer coprocessor and updating the Supermicro BIOS, the card is now functional and the 'mic0' device is showing as ready:
[~]# micctrl -s
mic0: ready