We have an embedded board with an E3845 processor that randomly kernel panics and reboots. It's running RHEL 7.6 with microcode 0x90d. Each time it panics and reboots, it logs one or more Machine Check Exceptions, which I've decoded using mcelog.
For example,
mce: [Hardware Error]: CPU 2: Machine Check Exception: 4 Bank 0: 9000000020000003
mce: [Hardware Error]: TSC 2b2522a1b4
mce: [Hardware Error]: PROCESSOR 0:30679 TIME 1388556142 SOCKET 0 APIC 4 microcode 90d
mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 2: b20000040002010a
mce: [Hardware Error]: TSC 2b2522a141
mce: [Hardware Error]: PROCESSOR 0:30679 TIME 1388556142 SOCKET 0 APIC 0 microcode 90d
mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 5: f40000400090000f
mce: [Hardware Error]: TSC 2b2522a1b4 ADDR 70f01c80
mce: [Hardware Error]: PROCESSOR 0:30679 TIME 1388556142 SOCKET 0 APIC 0 microcode 90d
mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 0: b600000013080810
mce: [Hardware Error]: TSC 2b2522a141 ADDR 70f01c80
mce: [Hardware Error]: PROCESSOR 0:30679 TIME 1388556142 SOCKET 0 APIC 0 microcode 90d
...which decodes to:
Hardware event. This is not a software error.
CPU 2 BANK 0 TSC 2b2522a1b4
TIME 1388556142 Wed Jan 1 00:02:22 2014
MCG status:MCIP
MCi status:
Corrected error
Error enabled
MCA: External error
STATUS 9000000020000003 MCGSTATUS 4
CPUID Vendor Intel Family 6 Model 55 Step 9
SOCKET 0 APIC 4 microcode 90d
--
Hardware event. This is not a software error.
CPU 0 BANK 2 TSC 2b2522a141
TIME 1388556142 Wed Jan 1 00:02:22 2014
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Generic CACHE Level-2 Generic Error
STATUS b20000040002010a MCGSTATUS 4
CPUID Vendor Intel Family 6 Model 55 Step 9
SOCKET 0 APIC 0 microcode 90d
--
Hardware event. This is not a software error.
CPU 0 BANK 5 TSC 2b2522a1b4
ADDR 70f01c80
TIME 1388556142 Wed Jan 1 00:02:22 2014
MCG status:MCIP
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_ADDR register valid
MCA: Level-3 Generic cache hierarchy error
STATUS f40000400090000f MCGSTATUS 4
CPUID Vendor Intel Family 6 Model 55 Step 9
SOCKET 0 APIC 0 microcode 90d
--
Hardware event. This is not a software error.
CPU 0 BANK 0 TSC 2b2522a141
ADDR 70f01c80
TIME 1388556142 Wed Jan 1 00:02:22 2014
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_ADDR register valid
Processor context corrupt
MCA: BUS error: -1 0 Level-0 Local-CPU-originated-request Read Memory-access Request-did-not-timeout
STATUS b600000013080810 MCGSTATUS 4
CPUID Vendor Intel Family 6 Model 55 Step 9
SOCKET 0 APIC 0 microcode 90d
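For anyone who wants to cross-check the mcelog output above by hand, here is a minimal Python sketch that pulls the architectural flag bits out of each STATUS value and splits the PROCESSOR signature (0x30679) into family, model and stepping. The bit positions follow the generic MCi_STATUS and CPUID layouts in the Intel SDM; the function names `decode_status` and `decode_cpuid` are my own, and the sketch does not try to interpret the model-specific bits or the meaning of the MCA error code.

```python
# Architectural flag bits in the upper part of IA32_MCi_STATUS (Intel SDM layout).
STATUS_FLAGS = [
    (63, "VAL"),    # register contains a valid error
    (62, "OVER"),   # error overflow: an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]


def decode_status(status):
    """Return the set flag names plus the two 16-bit error-code fields."""
    return {
        "flags": [name for bit, name in STATUS_FLAGS if (status >> bit) & 1],
        "mca_code": status & 0xFFFF,                     # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


def decode_cpuid(signature):
    """Split a CPUID leaf 1 EAX signature into (family, model, stepping)."""
    base_family = (signature >> 8) & 0xF
    model = (signature >> 4) & 0xF
    stepping = signature & 0xF
    family = base_family
    if base_family == 0xF:
        family += (signature >> 20) & 0xFF          # extended family
    if base_family in (0x6, 0xF):
        model |= ((signature >> 16) & 0xF) << 4     # extended model
    return family, model, stepping


if __name__ == "__main__":
    # STATUS values copied from the log lines in this post.
    for raw in ("9000000020000003", "b20000040002010a",
                "f40000400090000f", "b600000013080810"):
        info = decode_status(int(raw, 16))
        print(raw, info["flags"], hex(info["mca_code"]))
    print(decode_cpuid(0x30679))
```

The flags this produces line up with the mcelog text: for example, the Bank 5 record shows OVER, UC, EN and ADDRV set, and the signature 0x30679 gives Family 6 Model 55 Step 9.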
This is a headless system that is PXE booted, so capturing this state is very difficult, especially since we don't know how to trigger the failure. The card vendor has provided the latest version of the BIOS for this board but has been unable to identify a root cause.
Has anyone had an issue like this with this part? Is there a later microcode version that would address this issue? Thank you!