I have a host with two mic cards that boot properly when the host boots. Within 10-15 minutes of the mpss service starting, both mic cards go offline and fail to reset. This occurs even with an initconfig setup. From the messages file:
Jul 22 13:29:49 gb49 kernel: micscif_handle_lostnode 1387 node 2
Jul 22 13:29:49 gb49 kernel: Warning: Core image elf header not found
Jul 22 13:29:49 gb49 kernel: Kdump: vmcore not initialized
Jul 22 13:29:49 gb49 kernel: micscif_handle_lostnode 1399 node 2 crash dump failed status -22
Jul 22 13:29:49 gb49 kernel: mic1: Transition from state online to lost
Jul 22 13:29:49 gb49 kernel: micscif_handle_lostnode 1414 stopping node 2 to recover lost node!
Jul 22 13:29:53 gb49 kernel: micvnet_execute_stop: timeout waiting for link down message response
Jul 22 13:30:28 gb49 kernel: dma_mark_wait 1080 TO chan 0x0
Jul 22 13:30:28 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:33 gb49 kernel: dma_mark_wait 1080 TO chan 0x0
Jul 22 13:30:33 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:38 gb49 kernel: dma_mark_wait 1080 TO chan 0x1
Jul 22 13:30:38 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:43 gb49 kernel: dma_mark_wait 1080 TO chan 0x2
Jul 22 13:30:43 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:48 gb49 kernel: dma_mark_wait 1080 TO chan 0x3
Jul 22 13:30:48 gb49 kernel: drain_dma_intr 1151 err -16
Jul 22 13:30:53 gb49 kernel: mic1: Transition from state lost to resetting
Jul 22 13:30:55 gb49 kernel: mic1: Resetting (Post Code ??)
Jul 22 13:30:55 gb49 kernel: mic1: Transition from state resetting to reset failed
Jul 22 13:30:55 gb49 kernel: MIC 1 RESETFAIL postcode ?? -1
Any suggestions?
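For anyone triaging a similar log, the state changes are easier to follow once the "Transition from state" lines are pulled out on their own. Below is a minimal Python sketch, assuming the syslog line format shown above; the function name `mic_transitions` and the idea of feeding it lines from the messages file are illustrative assumptions, not part of MPSS.

```python
import re

# Matches lines such as "... kernel: mic1: Transition from state online to lost".
# The pattern is based only on the log excerpt in this thread.
TRANSITION = re.compile(r"kernel: (mic\d+): Transition from state (.+?) to (.+)$")


def mic_transitions(lines):
    """Return (timestamp, card, old_state, new_state) for each transition line."""
    events = []
    for line in lines:
        line = line.rstrip()
        match = TRANSITION.search(line)
        if match:
            timestamp = line[:15]  # syslog prefix, e.g. "Jul 22 13:29:49"
            events.append((timestamp, match.group(1), match.group(2), match.group(3)))
    return events


# A few lines copied from the log above.
sample = [
    "Jul 22 13:29:49 gb49 kernel: mic1: Transition from state online to lost",
    "Jul 22 13:30:28 gb49 kernel: dma_mark_wait 1080 TO chan 0x0",
    "Jul 22 13:30:53 gb49 kernel: mic1: Transition from state lost to resetting",
    "Jul 22 13:30:55 gb49 kernel: mic1: Transition from state resetting to reset failed",
]

for event in mic_transitions(sample):
    print(event)
```

In practice the same function could be passed an open file handle for the host's messages file, since iterating a file yields lines.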
From micinfo (after boot):
Created Mon Jul 22 13:20:11 2013
System Info
HOST OS : Linux
OS Version : 2.6.32-358.6.2.el6.x86_64
Driver Version : 6720-15
MPSS Version : 2.1.6720-15
Host Physical Memory : 32851 MB
Device No: 0, Device Name: mic0
Version
Flash Version : 2.1.02.0386
SMC Firmware Version : 1.14.4616
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8-g2593b11
Device Serial Number : ADKC25202031
Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS
Cores
Total No of Active Cores : 60
Voltage : 1019000 uV
Frequency : 1052631 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 85 C
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
mic1 is identical (except serial number).
From the version numbers, it appears you are running the latest version of the MPSS. Was this an update or is this a new system you are installing? What is the OS release you are running on the host? Are you running anything when the system goes down or is it idle?
This is a new system and the issue has occurred since purchase about 1 month ago. I installed the latest version in an attempt to correct the issue.
The host OS is CentOS 6.4 running kernel 2.6.32-358.6.2.el6.x86_64. In all occurrences, the host and mic cards are idle.
Just to be sure, I removed the current mpss install, booted into the 2.6.32-358.el6.x86_64 kernel, and reran the complete install.
Installation worked as expected, and the mic cards booted. I noticed that the Die temps seemed very high.
After 3 minutes, at 138C, I lost mic0. A few minutes later, I lost mic1 at 136C.
If this is an overheating issue, what are the expected normal die temps for the mics?
They should be much cooler than that - I would consider anything over 80C too hot. Do your cards have fans on them? If not, they are passively cooled and require a specially designed chassis to properly remove the heat from them (like you tend to find in rack-mount systems), which you may not have if you are hitting over 100C!
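As a rough way to keep an eye on this, the "Die Temp" lines in saved micinfo output can be checked against a limit. The Python sketch below assumes the micinfo text layout posted in this thread; the 80C default is only the rule of thumb given above, not an official specification, and the function name `hot_cards` is hypothetical.

```python
import re

# Patterns follow the micinfo layout shown in this thread.
DEVICE = re.compile(r"Device No:\s*\d+,\s*Device Name:\s*(\w+)")
DIE_TEMP = re.compile(r"Die Temp\s*:\s*(\d+)\s*C")


def hot_cards(micinfo_text, limit_c=80):
    """Return {card: die_temp_c} for cards whose die temperature exceeds limit_c."""
    hot = {}
    device = None
    for line in micinfo_text.splitlines():
        device_match = DEVICE.search(line)
        if device_match:
            device = device_match.group(1)
            continue
        temp_match = DIE_TEMP.search(line)
        if temp_match and device is not None:
            temp_c = int(temp_match.group(1))
            if temp_c > limit_c:  # limit is an assumption, adjust as needed
                hot[device] = temp_c
    return hot


# Trimmed excerpt in the same format as the micinfo output posted earlier.
sample_micinfo = "\n".join([
    "Device No: 0, Device Name: mic0",
    "Thermal",
    "Fan RPM : N/A",
    "Die Temp : 85 C",
])

print(hot_cards(sample_micinfo))
```

The text could come from a saved micinfo log or from capturing the tool's output; either way the parsing only depends on the two line formats matched above.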
This could be an overheating issue. What is the maximum threshold temperature?
Just to follow up, we flashed the firmware on the system and die temps are now a reasonable 45C and 68C.
I'm glad the system is behaving itself now. If you get a chance, could you rerun micinfo and post the current system information? I am curious whether the Flash or SMC versions changed in any way. I thought what you had was already the latest version.
System has been very stable since the upgrade. Here's the latest micinfo output:
MicInfo Utility Log
Created Tue Aug 13 14:43:32 2013
System Info
HOST OS : Linux
OS Version : 2.6.32-358.el6.x86_64
Driver Version : 6720-15
MPSS Version : 2.1.6720-15
Host Physical Memory : 32844 MB
Device No: 0, Device Name: mic0
Version
Flash Version : 2.1.03.0386
SMC Firmware Version : 1.15.4830
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8-g2593b11
Device Serial Number : ADKC25202031
Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS
Cores
Total No of Active Cores : 60
Voltage : 1028000 uV
Frequency : 1052631 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 40 C
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
Device No: 1, Device Name: mic1
Version
Flash Version : 2.1.03.0386
SMC Firmware Version : 1.15.4830
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8-g2593b11
Device Serial Number : ADKC25202064
Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS
Cores
Total No of Active Cores : 60
Voltage : 1019000 uV
Frequency : 1052631 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 48 C
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
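To see which fields changed between the two micinfo dumps in this thread, the outputs can be compared field by field. The following is a minimal Python sketch, assuming the "Name : value" layout and "Device No:" headers shown above; `parse_micinfo` and `changed_fields` are hypothetical helper names, not part of any Intel tool.

```python
def parse_micinfo(text):
    """Map (device, field name) to value for every 'Name : value' line under a device."""
    fields = {}
    device = None
    for line in text.splitlines():
        if line.startswith("Device No:"):
            device = line.split("Device Name:")[1].strip()
        elif " : " in line and device is not None:
            key, value = line.split(" : ", 1)
            fields[(device, key.strip())] = value.strip()
    return fields


def changed_fields(before, after):
    """Return {(device, field): (old, new)} for fields present in both dumps that differ."""
    old, new = parse_micinfo(before), parse_micinfo(after)
    return {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}


# Version lines for mic0 copied from the two dumps in this thread.
before = "\n".join([
    "Device No: 0, Device Name: mic0",
    "Flash Version : 2.1.02.0386",
    "SMC Firmware Version : 1.14.4616",
    "SMC Boot Loader Version : 1.8.4326",
])
after = "\n".join([
    "Device No: 0, Device Name: mic0",
    "Flash Version : 2.1.03.0386",
    "SMC Firmware Version : 1.15.4830",
    "SMC Boot Loader Version : 1.8.4326",
])

for (device, field), (old, new) in changed_fields(before, after).items():
    print(f"{device} {field}: {old} -> {new}")
```

On the version lines above, this reports the Flash Version and SMC Firmware Version as changed, while the SMC Boot Loader Version is the same in both dumps.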
