
Phi cards crash and won't reset shortly after booting.

Gary_M_3
Beginner
942 Views

I have a host with two MIC cards that boot properly when the host boots. Within 10-15 minutes of the mpss service starting, both cards go offline and fail to reset. This occurs even with an initconfig setup. From the messages file:

Jul 22 13:29:49 gb49 kernel: micscif_handle_lostnode 1387 node 2
Jul 22 13:29:49 gb49 kernel: Warning: Core image elf header not found
Jul 22 13:29:49 gb49 kernel: Kdump: vmcore not initialized
Jul 22 13:29:49 gb49 kernel: micscif_handle_lostnode 1399 node 2 crash dump failed status -22
Jul 22 13:29:49 gb49 kernel: mic1: Transition from state online to lost
Jul 22 13:29:49 gb49 kernel: micscif_handle_lostnode 1414 stopping node 2 to recover lost node!
Jul 22 13:29:53 gb49 kernel: micvnet_execute_stop: timeout waiting for link down message response
Jul 22 13:30:28 gb49 kernel: dma_mark_wait 1080 TO chan 0x0
Jul 22 13:30:28 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:33 gb49 kernel: dma_mark_wait 1080 TO chan 0x0
Jul 22 13:30:33 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:38 gb49 kernel: dma_mark_wait 1080 TO chan 0x1
Jul 22 13:30:38 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:43 gb49 kernel: dma_mark_wait 1080 TO chan 0x2
Jul 22 13:30:43 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:48 gb49 kernel: dma_mark_wait 1080 TO chan 0x3
Jul 22 13:30:48 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:53 gb49 kernel: mic1: Transition from state lost to resetting
Jul 22 13:30:55 gb49 kernel: mic1: Resetting (Post Code ??)
Jul 22 13:30:55 gb49 kernel: mic1: Transition from state resetting to reset failed
Jul 22 13:30:55 gb49 kernel: MIC 1 RESETFAIL postcode ?? -1

Any suggestions?

From micinfo (after boot):

Created Mon Jul 22 13:20:11 2013


System Info
HOST OS : Linux
OS Version : 2.6.32-358.6.2.el6.x86_64
Driver Version : 6720-15
MPSS Version : 2.1.6720-15
Host Physical Memory : 32851 MB

Device No: 0, Device Name: mic0

Version
Flash Version : 2.1.02.0386
SMC Firmware Version : 1.14.4616
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8-g2593b11
Device Serial Number : ADKC25202031

Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS

Cores
Total No of Active Cores : 60
Voltage : 1019000 uV
Frequency : 1052631 kHz

Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 85 C

GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV

mic1 is identical (except for the serial number).

8 Replies
Frances_R_Intel
Employee

From the version numbers, it appears you are running the latest version of the MPSS. Was this an update, or is this a new system you are installing? What is the OS release you are running on the host? Are you running anything when the system goes down, or is it idle?

Gary_M_3
Beginner

This is a new system, and the issue has occurred since we purchased it about 1 month ago. I installed the latest version in an attempt to correct the issue.

The host OS is CentOS 6.4 running kernel 2.6.32-358.6.2.el6.x86_64. In all occurrences, the host and MIC cards were idle.

Gary_M_3
Beginner

Just to be sure, I removed the current mpss install. Then I booted into the 2.6.32-358.el6.x86_64 kernel and reran the complete install.

Installation worked as expected, and the mic cards booted.  I noticed that the Die temps seemed very high.

After 3 minutes, at 138C, I lost mic0. A few minutes later, I lost mic1 at 136C.

If this is an overheating issue, what are the expected normal die temps for the MICs?

Charles_C_Intel1
Employee

They should be much cooler than that - I would consider anything over 80C too hot.  Do your cards have fans on them?  If not, they are passively cooled and require a specially designed chassis to properly remove the heat from them (like you tend to find in rack-mount systems), which you may not have if you are hitting over 100C!
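
As a rough illustration of the check described above, here is a minimal Python sketch that pulls the "Die Temp" line for each device out of micinfo-style text and flags anything over 80C. The 80C limit is only the rule of thumb from this reply, not an official specification. The function names and the sample text are made up for the example: the mic0 value of 85 C matches the report in the original post, while the mic1 value is hypothetical.

```python
import re

# Rule-of-thumb limit taken from the reply above; not an official spec value.
TOO_HOT_C = 80


def die_temps(micinfo_text):
    """Return {device_name: die_temp_in_C} parsed from micinfo-style text."""
    temps = {}
    device = None
    for line in micinfo_text.splitlines():
        # A line such as "Device No: 0, Device Name: mic0" starts a new device.
        m = re.search(r"Device Name:\s*(\S+)", line)
        if m:
            device = m.group(1)
            continue
        # A line such as "Die Temp : 85 C" belongs to the current device.
        m = re.match(r"\s*Die Temp\s*:\s*(\d+)\s*C", line)
        if m and device is not None:
            temps[device] = int(m.group(1))
    return temps


def too_hot(temps, limit=TOO_HOT_C):
    """Return the sorted device names whose die temperature exceeds limit."""
    return sorted(name for name, t in temps.items() if t > limit)


# Illustrative sample: mic0 is from the original post, mic1 is hypothetical.
sample = """\
Device No: 0, Device Name: mic0
Thermal
Die Temp : 85 C
Device No: 1, Device Name: mic1
Thermal
Die Temp : 60 C
"""

temps = die_temps(sample)
print(temps)
print(too_hot(temps))
```

In practice the text would come from saving the micinfo output to a file and reading it in, rather than from an inline string.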

Bernard
Valued Contributor I

This could be an overheating issue. What is the maximum threshold temperature?

Gary_M_3
Beginner

Just to follow up, we flashed the firmware on the system, and die temps are now a reasonable 45C and 68C.

Frances_R_Intel
Employee

I'm glad the system is behaving itself now. If you get a chance, could you rerun micinfo and post the current system information? I am curious whether the Flash or SMC versions changed in any way. I thought what you had was already the latest version.

Gary_M_3
Beginner

System has been very stable since the upgrade.  Here's the latest micinfo output:

MicInfo Utility Log

Created Tue Aug 13 14:43:32 2013


System Info
HOST OS : Linux
OS Version : 2.6.32-358.el6.x86_64
Driver Version : 6720-15
MPSS Version : 2.1.6720-15
Host Physical Memory : 32844 MB

Device No: 0, Device Name: mic0

Version
Flash Version : 2.1.03.0386
SMC Firmware Version : 1.15.4830
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8-g2593b11
Device Serial Number : ADKC25202031

Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS

Cores
Total No of Active Cores : 60
Voltage : 1028000 uV
Frequency : 1052631 kHz

Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 40 C

GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV

Device No: 1, Device Name: mic1

Version
Flash Version : 2.1.03.0386
SMC Firmware Version : 1.15.4830
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8-g2593b11
Device Serial Number : ADKC25202064

Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS

Cores
Total No of Active Cores : 60
Voltage : 1019000 uV
Frequency : 1052631 kHz

Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 48 C

GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
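
To see which fields changed between the two micinfo reports in this thread, the "Key : Value" lines can be compared directly. The following is a minimal Python sketch of that comparison, not a tool that ships with MPSS. The function names are hypothetical, and the inline samples contain only a few mic0 lines copied from the July 22 and August 13 reports.

```python
def parse_fields(micinfo_text):
    """Parse 'Key : Value' lines from micinfo-style text into a dict."""
    fields = {}
    for line in micinfo_text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skip lines that are not 'Key : Value' pairs
            fields[key.strip()] = value.strip()
    return fields


def changed_fields(before, after):
    """Return {key: (old, new)} for keys in both dicts whose values differ."""
    return {
        key: (before[key], after[key])
        for key in before
        if key in after and before[key] != after[key]
    }


# A few mic0 lines from the first report (before the firmware flash).
before = parse_fields("""\
Flash Version : 2.1.02.0386
SMC Firmware Version : 1.14.4616
SMC Boot Loader Version : 1.8.4326
Die Temp : 85 C
""")

# The same lines from the second report (after the firmware flash).
after = parse_fields("""\
Flash Version : 2.1.03.0386
SMC Firmware Version : 1.15.4830
SMC Boot Loader Version : 1.8.4326
Die Temp : 40 C
""")

for key, (old, new) in changed_fields(before, after).items():
    print(f"{key}: {old} -> {new}")
```

This simple version assumes one device per input. A report covering both cards would have to be split at each "Device No:" line first, since later devices would otherwise overwrite earlier ones in the dict.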
