JSmit74
Beginner
6,241 Views

Linux Machine Check Exception: Is it the CPU?

Hello,

On my laptop, Windows often showed a BSOD after a few minutes of use, so we contacted Dell and provided them with the dump files. They replaced the motherboard.

Now I am running Linux, but random kernel panics occur, sometimes after minutes, sometimes after days.

I configured kdump-tools on my Linux distribution to start a crash kernel when a panic occurs, so that the memory is dumped along with the dmesg output for post-mortem analysis.

This is what dmesg says when the panic occurs:

[ 3933.364173] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000200135

[ 3933.364177] mce: [Hardware Error]: RIP !INEXACT! 10: {_raw_spin_lock+0x12/0x50}

[ 3933.364182] mce: [Hardware Error]: TSC a0255fbd7f7 ADDR 42dd14480 MISC d62285

[ 3933.364185] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1398357146 SOCKET 0 APIC 1 microcode 15

[ 3933.364186] mce: [Hardware Error]: Run the above through 'mcelog --ascii'

[ 3933.364188] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 3: be00000000200135

[ 3933.364190] mce: [Hardware Error]: RIP !INEXACT! 33:<0000045a7992c1b5>

[ 3933.364191] mce: [Hardware Error]: TSC a0255fbd7f0 ADDR 42dd14480 MISC d62285

[ 3933.364194] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1398357146 SOCKET 0 APIC 0 microcode 15

[ 3933.364195] mce: [Hardware Error]: Run the above through 'mcelog --ascii'

[ 3933.364196] mce: [Hardware Error]: Machine check: Processor context corrupt

[ 3933.364197] Kernel panic - not syncing: Fatal Machine check
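The `PROCESSOR 0:306a9` field in the lines above is the CPU vendor followed by the CPUID signature. As a minimal sketch, assuming the standard Intel CPUID leaf 1 EAX layout (stepping, model, family, extended model, extended family), the signature can be unpacked as follows. The function name `decode_cpuid_signature` is made up for this illustration.

```python
def decode_cpuid_signature(sig: int) -> dict:
    """Unpack a CPUID leaf 1 EAX signature into family/model/stepping.

    Assumes the standard Intel field layout; check the Intel SDM if in doubt.
    """
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}


if __name__ == "__main__":
    # Signature taken from the "PROCESSOR 0:306a9" field in the dmesg output.
    print(decode_cpuid_signature(0x306A9))
```

For 0x306a9 this gives family 6, model 58, stepping 9, which agrees with the "Family 6 Model 58" line that mcelog prints.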

Analyzing the memory dump file with crash (running crash /usr/lib/debug/boot/vmlinux /path/to/crashdump/file and then typing "bt") gives me the following backtrace:

PID: 0 TASK: ffff8804177617f0 CPU: 6 COMMAND: "swapper/6"

# 0 [ffff88042dd89ca0] machine_kexec at ffffffff8104a732

# 1 [ffff88042dd89cf0] crash_kexec at ffffffff810e6ab3

# 2 [ffff88042dd89db8] panic at ffffffff8170ec6c

# 3 [ffff88042dd89e30] mce_panic at ffffffff8103687a

# 4 [ffff88042dd89e70] do_machine_check at ffffffff81038684

# 5 [ffff88042dd89f50] machine_check at ffffffff8171e25f

[exception RIP: intel_idle+216]

RIP: ffffffff813dfd78 RSP: ffff88041775de28 RFLAGS: 00000046

RAX: 0000000000000001 RBX: 0000000000000002 RCX: 0000000000000001

RDX: 0000000000000000 RSI: ffffffff81c93220 RDI: 0000000000000006

RBP: ffff88041775de50 R8: ffff88042dd912d0 R9: 000000000000001c

R10: 0000000000000320 R11: 0000000000000249 R12: 0000000000000002

R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff81c932e8

ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018

--- ---

# 6 [ffff88041775de28] intel_idle at ffffffff813dfd78

# 7 [ffff88041775de58] cpuidle_enter_state at ffffffff815c9570

# 8 [ffff88041775de90] cpuidle_idle_call at ffffffff815c96a9

# 9 [ffff88041775ded0] arch_cpu_idle at ffffffff8101ceae

# 10 [ffff88041775dee0] cpu_startup_entry at ffffffff810beb85

# 11 [ffff88041775df30] start_secondary at ffffffff81040fc8

Diagnosing the dmesg output with mcelog gives me the following:

Hardware event. This is not a software error.

CPU 4 BANK 3 TSC a0255fbd7f7

RIP !INEXACT! 10:ffffffff8171d9c2

MISC d62285 ADDR 42dd14480

TIME 1398357146 Thu Apr 24 18:32:26 2014

MCG status:RIPV MCIP

MCi status:

Uncorrected error

Error enabled

MCi_MISC register valid

MCi_ADDR register valid

Processor context corrupt

MCA: Data CACHE Level-1 Data-Read Error

STATUS be00000000200135 MCGSTATUS 5

CPUID Vendor Intel Family 6 Model 58

RIP: _raw_spin_lock+0x12/0x50}

SOCKET 0 APIC 1 microcode 15

and

Hardware event. This is not a software error.

CPU 0 BANK 3 TSC a0255fbd7f0

RIP !INEXACT! 33:45a7992c1b5

MISC d62285 ADDR 42dd14480

TIME 1398357146 Thu Apr 24 18:32:26 2014

MCG status:RIPV MCIP

MCi status:

Uncorrected error

Error enabled

MCi_MISC register valid

MCi_ADDR register valid

Processor context corrupt

MCA: Data CACHE Level-1 Data-Read Error

STATUS be00000000200135 MCGSTATUS 5

CPUID Vendor Intel Family 6 Model 58

SOCKET 0 APIC 0 microcode 15
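To cross-check what mcelog reports, the raw `STATUS be00000000200135` and `MCGSTATUS 5` values can be decoded by hand. The sketch below assumes the architectural IA32_MCi_STATUS and IA32_MCG_STATUS bit positions and the compound cache-hierarchy error code format (000F 0001 RRRR TTLL) as described in the machine-check chapter of the Intel Software Developer's Manual. The helper names are hypothetical, and only the cache-hierarchy error class is handled.

```python
# Bit positions are assumed from the Intel SDM machine-check architecture chapter.
MCI_STATUS_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                    59: "MISCV", 58: "ADDRV", 57: "PCC"}
MCG_STATUS_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}

REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "I", 1: "D", 2: "G"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def set_flags(value: int, table: dict) -> list:
    """Return the names of all bits from `table` that are set, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_cache_error(code: int):
    """Decode a 16-bit MCA error code of the form 000F 0001 RRRR TTLL.

    Returns None when the code is not a cache-hierarchy error.
    """
    if code & 0xEF00 != 0x0100:  # ignore bit 12 (the F bit) when matching
        return None
    return {
        "request": REQUEST.get((code >> 4) & 0xF, "reserved"),
        "transaction": TRANSACTION.get((code >> 2) & 0x3, "reserved"),
        "level": LEVEL[code & 0x3],
    }


if __name__ == "__main__":
    status = 0xBE00000000200135   # STATUS from the mcelog output
    mcg_status = 0x5              # MCGSTATUS from the mcelog output
    print(set_flags(status, MCI_STATUS_FLAGS))
    print(decode_cache_error(status & 0xFFFF))
    print(set_flags(mcg_status, MCG_STATUS_FLAGS))
```

Under these assumptions the set bits are VAL, UC, EN, MISCV, ADDRV and PCC, and the low 16 bits (0x0135) decode to a data read (DRD) of data (D) at level L1. That is consistent with mcelog's "Uncorrected error", "Processor context corrupt" and "Data CACHE Level-1 Data-Read Error" lines. The decode only restates what the hardware logged; on its own it does not say which component is at fault.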

I have also run many passes of memtest86+ and it found no errors, so the memory seems to be fine. Given that the motherboard has already been replaced, it is very likely that the CPU is bad, right? Does anything in the output support that view?

5 Replies
Jose_H_Intel1
Employee
1,439 Views

Hello josmith, first, could you tell us the exact Intel® processor model? Second, I would suggest resetting the BIOS values to default.

Having done that, you may run the Linux* bootable version of the Intel® Processor Diagnostic Tool* (http://www.intel.com/support/processors/sb/CS-031726.htm).

I hope this helps.

Jose_H_Intel1
Employee
1,439 Views

Even though you ran a memory test, we also advise testing the memory with only one stick installed at a time (or replacing all sticks if possible) and checking whether the issue recurs.

JSmit74
Beginner
1,439 Views

First of all, thank you very much for the answer!

The BIOS settings were the defaults when the crash happened; I never changed anything.

The CPU is an Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz.

The Fedora live image at http://www.tcsscreening.com/files/users/IPDT_LiveUSB/index.html is pretty old, and the sources for self-compilation are equally outdated. Where could I get a newer version? Given that the CPU was introduced at about the time Fedora 15 was released, it might not work, or am I wrong here?

Output of lshw: http://pastebin.com/KvNr8J6z

Output of dmidecode: http://pastebin.com/DL0G9fNx

Jose_H_Intel1
Employee
1,439 Views

All recent Intel® Core processors should work with that version of Fedora*; please give it a try and also consider testing the memory as suggested above.

RCerv
Beginner
1,439 Views

I had a similar MCE on a Core i5 CPU, and the cause turned out to be a worn-out PSU. Removing and replacing the PSU stopped the errors. I am using Ubuntu.
