Charles_S_1
Beginner
227 Views

MIC0 device fails after installing MPSS 3.8.2

Good Afternoon,

My MIC card isn't working with the newly installed MPSS 3.8.2 driver. I followed the instructions in the readme to install MPSS on a new RHEL 6.9 system. Below are some of the errors that I continue to receive:

[mpss-3.8.2]# micctrl -s
mic0: reset failed
[mpss-3.8.2]# micctrl -rw
          mic0: resetting
  [Error] Timeout booting MIC, check your installation

I also cannot update the firmware or the SMC after installing the MPSS driver and RPMs:

[mpss-3.8.2]# /usr/bin/micflash -update -device all -smcbootloader
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_C0_0391-02.rom.smc
micflash: mic0: No compatible SMC boot-loader image found

[mpss-3.8.2]# /usr/bin/micflash -update -device all
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_C0_0391-02.rom.smc
micflash: micflash: mic0: Failed to switch to maintenance mode: write: /sys/class/mic/mic0/state: Input/output error

My hardware is a Supermicro workstation:

[mpss-3.8.2]# dmidecode -s bios-version
3.0a
[mpss-3.8.2]# dmidecode -s system-product-name
X9DRG-QF

Thank you for any insight or input on this issue. I've been beating my head into the ground over this.

 

 

0 Kudos
11 Replies
JJK
New Contributor III
227 Views

Is the device visible to the OS? Try:

# lspci | grep copro
# lspci -v -s `lspci | grep copro | awk '{ print $1 }'`

which should list something like

02:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 5100 series (rev 11)
	Subsystem: Intel Corporation Device 2500
	Flags: bus master, fast devsel, latency 0, IRQ 32
	Memory at 21c00000000 (64-bit, prefetchable) [size=8G]
	Memory at cb900000 (64-bit, non-prefetchable) [size=128K]
	Capabilities: [44] Power Management version 3
	Capabilities: [4c] Express Endpoint, MSI 00
	Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
	Capabilities: [98] MSI-X: Enable+ Count=16 Masked-
	Capabilities: [100] Advanced Error Reporting
	Kernel driver in use: mic

 

If not, then your card is not functioning normally; this usually points to a cooling problem.

If you do see the card, then we'll proceed to debugging the mic driver itself.
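
For scripting, the visibility check above can be wrapped in a small helper. This is only a sketch: the function name find_coprocessors is made up for illustration, and it assumes the lspci description contains the word "coprocessor", as in the sample output in this thread.

```shell
# Hypothetical helper (not part of MPSS): read `lspci` output on stdin,
# print the PCI address of every line that mentions a coprocessor, and
# return non-zero when no such line is found.
find_coprocessors() {
    awk 'tolower($0) ~ /coprocessor/ { print $1; found = 1 }
         END { exit (found ? 0 : 1) }'
}
```

It would be used as `lspci | find_coprocessors`; a non-zero exit status means the OS does not see the card at all.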

Charles_S_1
Beginner
227 Views

Hello JJK,

 

Yes, I'm able to verify that the coprocessor card is visible on the system:

[~]# lspci | grep copro
84:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 3120 series (rev 20)
[~]# lspci -v -s `lspci | grep copro | awk '{ print $1 }'`
84:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 3120 series (rev 20)
        Subsystem: Intel Corporation Device 3608
        Flags: bus master, fast devsel, latency 0, IRQ 64
        Memory at 383c00000000 (64-bit, prefetchable) [size=8G]
        Memory at fba00000 (64-bit, non-prefetchable) [size=128K]
        Capabilities: [44] Power Management version 3
        Capabilities: [4c] Express Endpoint, MSI 00
        Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
        Capabilities: [98] MSI-X: Enable+ Count=16 Masked-
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: mic

 

JJK
New Contributor III
227 Views

Good, that means it is most likely a software installation issue and not a hardware issue. I've got a very similar setup over here. Next, run 'micinfo' as root:

# service micras stop
# service mpss stop
# micinfo
MicInfo Utility Log
Created Mon Jul 31 10:49:25 2017


	System Info
		HOST OS			: Linux
		OS Version		: 2.6.32-696.3.1.el6.x86_64
		Driver Version		: 3.8.2-1
		MPSS Version		: 3.8.2
		Host Physical Memory	: 64388 MB

Device No: 0, Device Name: mic0

	Version
		Flash Version 		 : NotAvailable
		SMC Firmware Version	 : NotAvailable
		SMC Boot Loader Version	 : NotAvailable
		Coprocessor OS Version 	 : NotAvailable
		Device Serial Number 	 : NotAvailable

	Board
		Vendor ID 		 : 0x8086
		Device ID 		 : 0x2250
		Subsystem ID 		 : 0x2500
		Coprocessor Stepping ID	 : 3
		PCIe Width 		 : x16
		PCIe Speed 		 : 5 GT/s
		PCIe Max payload size	 : 256 bytes
		PCIe Max read req size	 : 512 bytes
		Coprocessor Model	 : 0x01
		Coprocessor Model Ext	 : 0x00
		Coprocessor Type	 : 0x00
		Coprocessor Family	 : 0x0b
		Coprocessor Family Ext	 : 0x00
		Coprocessor Stepping 	 : B1
		Board SKU 		 : B1PRQ-5110P/5120D
		ECC Mode 		 : NotAvailable
		SMC HW Revision 	 : NotAvailable

	Cores
		Total No of Active Cores : NotAvailable
		Voltage 		 : NotAvailable
		Frequency 		 : NotAvailable

	Thermal
		Fan Speed Control 	 : NotAvailable
		Fan RPM 		 : NotAvailable
		Fan PWM 		 : NotAvailable
		Die Temp		 : NotAvailable

	GDDR
		GDDR Vendor		 : NotAvailable
		GDDR Version		 : NotAvailable
		GDDR Density		 : NotAvailable
		GDDR Size		 : NotAvailable
		GDDR Technology		 : NotAvailable
		GDDR Speed		 : NotAvailable
		GDDR Frequency		 : NotAvailable
		GDDR Voltage		 : NotAvailable

Now, start the mpss daemon and check the status:

# service mpss start
# micctrl -s
mic0: ready

Then check /var/log/messages and dmesg for any errors or warnings. You can also run 'micdebug.sh' and post the output file here; that will tell the Intel support people a lot more.
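
If you want to script the status check, something like the following could flag cards that did not come up. It is a rough sketch, not an MPSS tool: the function name cards_not_ready is invented, it only looks at lines of the form "micN: state", and treating "online" as healthy alongside "ready" is my assumption (this thread only shows "ready").

```shell
# Hypothetical helper: read `micctrl -s` output on stdin and print every
# card line whose state does not start with "ready" or "online".
# Returns non-zero if at least one such card was printed.
cards_not_ready() {
    awk '/^[ \t]*mic[0-9]+:/ {
             state = $0
             sub(/^[ \t]*mic[0-9]+:[ \t]*/, "", state)   # keep only the state text
             if (state !~ /^(ready|online)/) { print; bad = 1 }
         }
         END { exit (bad ? 1 : 0) }'
}
```

Usage would be `micctrl -s | cards_not_ready`; with the output from the first post it prints "mic0: reset failed" and returns a non-zero status.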

 

Charles_S_1
Beginner
227 Views

Attached is the output from the 'micdebug.sh' script. I didn't see any errors in the messages file, but I can confirm the following error messages in the 'dmesg' output:

[~]# dmesg |grep -i error
ERST: Error Record Serialization Table (ERST) support is initialized.
Error! Card not in offline/ready state. Cannot change mode
Error! Card not in offline/ready state. Cannot change mode
Error! Card not in offline/ready state. Cannot change mode

Below are the results from the other commands. Should I also restart the "micras" service?

[~]# service micras status
Intel(R) micras is stopped
[~]# service mpss status
mpss is stopped
[~]# micinfo
MicInfo Utility Log
Created Mon Jul 31 05:27:55 2017


        System Info
                HOST OS                 : Linux
                OS Version              : 2.6.32-696.3.1.el6.x86_64
                Driver Version          : 3.8.2-1
                MPSS Version            : 3.8.2
                Host Physical Memory    : 516840 MB

Device No: 0, Device Name: mic0

        Version
                Flash Version            : NotAvailable
                SMC Firmware Version     : NotAvailable
                SMC Boot Loader Version  : NotAvailable
                Coprocessor OS Version   : NotAvailable
                Device Serial Number     : NotAvailable

        Board
                Vendor ID                : 0x8086
                Device ID                : 0x225d
                Subsystem ID             : 0x3608
                Coprocessor Stepping ID  : 2
                PCIe Width               : x16
                PCIe Speed               : 5 GT/s
                PCIe Max payload size    : 256 bytes
                PCIe Max read req size   : 512 bytes
                Coprocessor Model        : 0x01
                Coprocessor Model Ext    : 0x00
                Coprocessor Type         : 0x00
                Coprocessor Family       : 0x0b
                Coprocessor Family Ext   : 0x00
                Coprocessor Stepping     : C0
                Board SKU                : C0PRQ-3120/3140 P/A
                ECC Mode                 : NotAvailable
                SMC HW Revision          : NotAvailable

        Cores
                Total No of Active Cores : NotAvailable
                Voltage                  : NotAvailable
                Frequency                : NotAvailable

        Thermal
                Fan Speed Control        : NotAvailable
                Fan RPM                  : NotAvailable
                Fan PWM                  : NotAvailable
                Die Temp                 : NotAvailable

        GDDR
                GDDR Vendor              : NotAvailable
                GDDR Version             : NotAvailable
                GDDR Density             : NotAvailable
                GDDR Size                : NotAvailable
                GDDR Technology          : NotAvailable
                GDDR Speed               : NotAvailable
                GDDR Frequency           : NotAvailable
                GDDR Voltage             : NotAvailable

[~]# service mpss start
Loading MIC module:                                        [  OK  ]
Starting Intel(R) MPSS:                                    [FAILED]
[~]# service mpss status
mpss is running
[~]# service micras status
Intel(R) micras is stopped
[~]# micctrl -s
mic0: reset failed


 

Charles_S_1
Beginner
227 Views

MICDEBUG.SH output attachment

JJK
New Contributor III
227 Views

The micdebug files show (in host_dmesg.txt) a continuous cycle of:

mic0: Resetting (Post Code F2)
Reattempting reset after F2/F4 failure
mic0: Transition from state resetting to resetting
mic0: Resetting (Post Code 3C)
mic0: Resetting (Post Code 3C)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3d)
mic0: Resetting (Post Code 3E)
mic0: Resetting (Post Code 3E)

This post suggests powering down the box, unplugging the cable, and then powering it up again:

 https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/535257

Also, make sure that only a single version of the mpss stack is running (but since you're installing on a new host, I doubt that this is causing the problem).
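
To see at a glance which POST codes the card keeps cycling through, the dmesg text can be summarized with a few standard tools. This is a sketch under one assumption: that the driver messages look like "micN: Resetting (Post Code XX)", as in the excerpts in this thread. The function name post_code_counts is made up.

```shell
# Hypothetical helper: read dmesg text on stdin and print one line per
# POST code in the form "CODE COUNT", with codes upper-cased and sorted.
post_code_counts() {
    sed -n 's/.*mic[0-9]*: Resetting (Post Code \([0-9A-Fa-f]*\)).*/\1/p' |
        tr 'a-f' 'A-F' |
        LC_ALL=C sort |
        uniq -c |
        awk '{ print $2, $1 }'
}
```

For example, `dmesg | post_code_counts` or `post_code_counts < host_dmesg.txt`. A card that is stuck in a reset loop should show the same few codes with large counts.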

 

Charles_S_1
Beginner
227 Views

I don't see any other versions currently running on this system; please see below for the currently installed MPSS RPMs. (FYI: I had to reinstall MPSS twice on this system to test a script that would automate this install on other systems in the environment.)

I will power down the system, unplug both power cords and power back on; then reply back with my results. Thanks. 

 [~]# ps -ef|grep -i mpss|grep -v grep
root     17451     1  0 05:29 pts/0    00:00:00 /usr/sbin/mpssd

[~]# rpm -qa | egrep "intel|intel-mic|libscif|glibc2.12pkg|netperf|mpss"|sort -u
glibc2.12pkg-libmicaccesssdk0-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libmicaccesssdk-dev-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libmicmgmt0-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libmicmgmt-dev-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libmicmgmt-doc-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libodmdebug0-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libodmdebug-dev-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libsettings0-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-libsettings-dev-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-flash-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-memdiag-kernel-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-rasmm-kernel-3.8.2-1.glibc2.12.x86_64
intel-composerxe-compat-k1om-3.8.2-1.x86_64
libscif0-3.8.2-1.glibc2.12.x86_64
libscif-dev-3.8.2-1.glibc2.12.x86_64
libscif-doc-3.8.2-1.glibc2.12.x86_64
mpss-boot-files-3.8.2-1.glibc2.12.x86_64
mpss-coi-3.8.2-1.glibc2.12.x86_64
mpss-coi-dev-3.8.2-1.glibc2.12.x86_64
mpss-coi-doc-3.8.2-1.glibc2.12.x86_64
mpss-core-3.8.2-1.glibc2.12.x86_64
mpss-core-dev-3.8.2-1.glibc2.12.x86_64
mpss-daemon-3.8.2-1.glibc2.12.x86_64
mpss-daemon-dev-3.8.2-1.glibc2.12.x86_64
mpss-eclipse-cdt-mpm-3.8.2-1.glibc2.12.x86_64
mpss-license-3.8.2-1.glibc2.12.x86_64
mpss-miccheck-3.8.2-1.glibc2.12.x86_64
mpss-miccheck-bin-3.8.2-1.glibc2.12.x86_64
mpss-micmgmt-3.8.2-1.glibc2.12.x86_64
mpss-micmgmt-doc-3.8.2-1.glibc2.12.x86_64
mpss-micmgmt-python-3.8.2-1.glibc2.12.x86_64
mpss-micsmc-gui-3.8.2-1.glibc2.12.x86_64
mpss-modules-2.6.32-696.3.1.el6.x86_64-3.8.2-1.x86_64
mpss-modules-dev-2.6.32-696.3.1.el6.x86_64-3.8.2-1.x86_64
mpss-modules-headers-3.8.2-1.glibc2.12.x86_64
mpss-mpm-3.8.2-1.glibc2.12.x86_64
mpss-mpm-doc-3.8.2-1.glibc2.12.x86_64
mpss-myo-3.8.2-1.glibc2.12.x86_64
mpss-myo-dev-3.8.2-1.glibc2.12.x86_64
mpss-myo-doc-3.8.2-1.glibc2.12.x86_64
mpss-offload-3.8.2-1.glibc2.12.x86_64
mpss-offload-dev-3.8.2-1.glibc2.12.x86_64
mpss-sciftutorials-3.8.2-1.glibc2.12.x86_64
mpss-sciftutorials-doc-3.8.2-1.glibc2.12.x86_64
mpss-sdk-k1om-3.8.2-1.x86_64
mpss-sysmgmt-micdiagnostic-3.8.2-1.glibc2.12.x86_64
mpss-sysmgmt-micras-3.8.2-1.glibc2.12.x86_64
mpss-sysmgmt-python-3.8.2-1.glibc2.12.x86_64
netperf-2.6.0-r0.glibc2.12.x86_64
netperf-doc-2.6.0-r0.glibc2.12.x86_64
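
For the automated install script, a version-consistency check along these lines might be useful. It is a sketch only: the function name mpss_version_check is invented, it expects "NAME VERSION" pairs on stdin (for example from rpm's --qf query format option), and matching package names on "mpss" or "libscif" is an assumption based on the list above.

```shell
# Hypothetical helper: read "NAME VERSION" lines on stdin, print each
# distinct version found among packages whose name contains "mpss" or
# "libscif", and return zero only if exactly one version was seen.
mpss_version_check() {
    awk '$1 ~ /mpss|libscif/ && !($2 in seen) { seen[$2] = 1; n++; print $2 }
         END { exit (n == 1 ? 0 : 1) }'
}
```

It could be fed with `rpm -qa --qf '%{NAME} %{VERSION}\n' | mpss_version_check`. On a system with the package list above, it should print only 3.8.2.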

Charles_S_1
Beginner
227 Views

I powered off my system and left both power cords unplugged for ~15 minutes before plugging them back in and booting back up. I've attached a new micdebug output for reference. The device still seems to be in the same 'reset failed' state.

[~]# dmesg |grep -i mic0|tail -4
mic0: Resetting (Post Code 3C)
mic0: Resetting (Post Code 3C)
mic0: Resetting (Post Code 3d)
mic0: Transition from state resetting to reset failed

[~]# micctrl -s
mic0: reset failed

 

JJK
New Contributor III
227 Views

From the MPSS Userguide, section "Troubleshooting and Debugging" (I.2):

The POST codes are defined as follows:
"3C" Begin GDDR read training with CDR enabled
"3d" Begin GDDR read training with CDR disabled
"F2" GDDR failed memory training
"F4" Memory preservation failure

 

This suggests that there is a problem with the GDDR memory on the Phi board. I'd recommend talking to your sales rep, or perhaps someone from Intel support can help out here. Unfortunately, this DOES look like a hardware problem.
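
If it helps when reading long dmesg logs, the four POST codes quoted above can be put into a small lookup. This is just an illustrative sketch: the function name describe_post_code is made up, and it covers only the codes listed in this post, not the full table from the user guide.

```shell
# Hypothetical helper: print the meaning of a POST code, using only the
# four codes quoted in this thread. Matching is case-insensitive.
# Returns non-zero for any code that is not in that short list.
describe_post_code() {
    case "$(printf '%s' "$1" | tr 'a-f' 'A-F')" in
        3C) echo "Begin GDDR read training with CDR enabled" ;;
        3D) echo "Begin GDDR read training with CDR disabled" ;;
        F2) echo "GDDR failed memory training" ;;
        F4) echo "Memory preservation failure" ;;
        *)  echo "not listed in this thread"; return 1 ;;
    esac
}
```

For instance, `describe_post_code 3d` prints "Begin GDDR read training with CDR disabled", while a code such as 3E, which is not quoted here, gives a non-zero status.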

Charles_S_1
Beginner
227 Views

Just to follow up on this thread after some time: it was indeed a hardware issue with the coprocessor. After swapping it out for a newer coprocessor and updating the Supermicro BIOS, the card is now functional and the 'mic0' device is showing as ready:

[~]# micctrl -s
mic0: ready
 

 

Charles_S_1
Beginner
227 Views

So I spoke too soon. I just rebooted my system and now the mic0 device is missing:

[~]# dmesg|grep -i mic0

Should I uninstall and reinstall MPSS 3.8.2 to resolve the issue? Attached is my micdebug output. Thanks.

 
