NUC6i3SYK - Machine check exception - no boot

RKnal
Beginner
6,344 Views

Hi!

For the past eight weeks, I have been happily running a NUC6i3SYK with Xubuntu 15.10 as a media player/HTPC. Memory modules are 2 x 4GB Crucial CT4G4SFS8213 DDR4-2133 and SSD is a Transcend TS128GMTS800. BIOS version is 0028.

Everything has been fine, with no software or hardware problems whatsoever. The machine has been running 24/7 for about the last four weeks.

Recently, the machine hung, and on reboot it enters the GRUB menu. Trying to boot normally into Linux from there results in a reboot that ends in the GRUB menu once more. The same happens when booting in recovery mode, except that I get a bunch of messages prior to the reboot, the final ones of which read as follows (timecode removed):

mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 6: be0000000008117a

mce: [Hardware Error]: RIP !INEXACT! 10: {intel_idle+0xbe/0x120}

mce: [Hardware Error]: TSC 5f6c9367e8 ADDR 2677b5580 MISC 372e02c086

mce: [Hardware Error]: PROCESSOR 0:406e3 TIME 1455929507 SOCKET 0 APIC 3 microcode 57

mce: [Hardware Error]: Run the above through 'mcelog --ascii'

mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 6: be0000000008117a

mce: [Hardware Error]: RIP !INEXACT! 10: {intel_idle+0xbe/0x120}

mce: [Hardware Error]: TSC 5f6c9367dc ADDR 2677b5580 MISC 372e02c086

mce: [Hardware Error]: PROCESSOR 0:406e3 TIME 1455929507 SOCKET 0 APIC 2 microcode 57

mce: [Hardware Error]: Run the above through 'mcelog --ascii'

mce: [Hardware Error]: CPU 2: Machine Check Exception: 5 Bank 6: be0000000008117a

mce: [Hardware Error]: RIP !INEXACT! 10: {intel_idle+0xbe/0x120}

mce: [Hardware Error]: TSC 5f6c936860 ADDR 2677b5580 MISC 372e02c086

mce: [Hardware Error]: PROCESSOR 0:406e3 TIME 1455929507 SOCKET 0 APIC 1 microcode 57

mce: [Hardware Error]: Run the above through 'mcelog --ascii'

mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 6: be0000000008117a

mce: [Hardware Error]: RIP !INEXACT! 10: {intel_idle+0xbe/0x120}

mce: [Hardware Error]: TSC 5f6c936870 ADDR 2677b5580 MISC 372e02c086

mce: [Hardware Error]: PROCESSOR 0:406e3 TIME 1455929507 SOCKET 0 APIC 0 microcode 57

mce: [Hardware Error]: Run the above through 'mcelog --ascii'

mce: [Hardware Error]: Machine check: Processor context corrupt

Kernel panic - not syncing: Fatal Machine check

Kernel Offset: disabled
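
Two fields in the log above can be decoded by hand. The sketch below assumes that the value after the colon in "PROCESSOR 0:406e3" is the CPUID signature in hexadecimal and that TIME is a Unix timestamp in seconds; both are assumptions about the kernel's log format, not something stated in this thread. The function name is made up for illustration.

```python
from datetime import datetime, timezone

def decode_cpuid_signature(sig: int) -> dict:
    """Split a CPUID signature (EAX of CPUID leaf 1) into family/model/stepping."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}

# Values copied from the log lines above.
sig = decode_cpuid_signature(0x406E3)
when = datetime.fromtimestamp(1455929507, tz=timezone.utc)

print(sig)               # {'family': 6, 'model': 78, 'stepping': 3}
print(when.isoformat())  # 2016-02-20T00:51:47+00:00
```

All four CPUs report the same timestamp, status and address, so this looks like a single event broadcast to every logical CPU rather than four independent faults.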

What I did so far is:

1. Ran IPDT for UEFI, which showed no errors but gave a warning to the effect that the clock speed is too high (3 GHz). I did not do any overclocking or the like.

2. Ran Memtest86 on each memory module separately, four full passes each, showing no errors.

3. Tried to boot into various live images (Knoppix 7.0, Xubuntu 15.10, IPDT under Fedora, SystemRescueCD), but the system hangs at some point during boot.

The BIOS had been reset to default values prior to the above.

What I will do next is put the SSD into my PC and run some checks on it. However, since I cannot boot any live images, I do not think it is an SSD/file system-related error.

So, does somebody have any ideas what might be the problem and what I might do to solve it (short of getting a replacement unit)?

Thanks, Robert

0 Kudos
26 Replies
JSpig
Beginner
2,147 Views

I had the same thing on a NUCi5SYK. I may have a partial solution. In the BIOS, starting from the default settings, I disabled Performance > Intel Turbo Boost Technology.

I can now boot the system. Trouble is, I'm still getting kernel panics every few hours which I'm still troubleshooting...

0 Kudos
RKnal
Beginner
2,147 Views

Thanks! I will try that and report back.

0 Kudos
RKnal
Beginner
2,147 Views

OK, it seems the i3 doesn't even have Turbo Boost. However, I tried setting all processor-related settings in the BIOS to basic functions, to no avail...

0 Kudos
SKell7
New Contributor III
2,147 Views

I did what the log said and ran it through mcelog --ascii:

Hardware event. This is not a software error.

CPU 0 BANK 6 TSC 5f6c936870

RIP !INEXACT! 10:ffffffff814559fe

MISC 372e02c086 ADDR 2677b5580

TIME 1455929507 Sat Feb 20 01:51:47 2016

MCG status:RIPV MCIP

MCi status:

Uncorrected error

Error enabled

MCi_MISC register valid

MCi_ADDR register valid

Processor context corrupt

MCA: corrected filtering (some unreported errors in same region)

Generic CACHE Level-2 Eviction Error

STATUS be0000000008117a MCGSTATUS 5

CPUID Vendor Intel Family 6 Model 78

RIP: intel_idle+0xbe/0x120}

SOCKET 0 APIC 0 microcode 57

Seems like a hardware error of the 2nd level cache (or the 3rd level cache or RAM as it is an eviction error). But who knows. Maybe intel_idle does something to the P-State / C-State which the CPU doesn't like. I have never had MCEs due to software, though. I think it more likely that the CPU is defective and you should RMA it.
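
The mcelog interpretation above can be cross-checked by decoding the STATUS value by hand. This is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout and the compound "cache hierarchy" error-code format (000F 0001 RRRR TTLL) from the Intel Software Developer's Manual; the table and function names are made up for illustration, and model-specific bits are not interpreted.

```python
# High status bits of IA32_MCi_STATUS (architectural layout, assumed).
STATUS_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                59: "MISCV", 58: "ADDRV", 57: "PCC"}
# Sub-fields of the cache hierarchy error code: 000F 0001 RRRR TTLL.
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "Instruction", 1: "Data", 2: "Generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "Generic"}
# Bits of the global IA32_MCG_STATUS register.
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}

def decode_status(status: int) -> dict:
    """Decode the flag bits and, if applicable, the cache error code."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "mca_code": code,
              "model_specific": (status >> 16) & 0xFFFF}
    # Cache hierarchy errors: bits 15..13 clear, bits 11..8 equal to 0001.
    if (code & 0xEF00) == 0x0100:
        result["cache_error"] = {
            "filtered": bool(code & 0x1000),  # F bit: corrected-error filtering
            "request": REQUEST.get((code >> 4) & 0xF, "unknown"),
            "transaction": TRANSACTION.get((code >> 2) & 0x3, "reserved"),
            "level": LEVEL[code & 0x3],
        }
    return result

def decode_mcg_status(value: int) -> list:
    """List the flags set in the global machine-check status."""
    return [name for bit, name in MCG_FLAGS.items() if (value >> bit) & 1]

decoded = decode_status(0xBE0000000008117A)
print(decoded["flags"])        # ['VAL', 'UC', 'EN', 'MISCV', 'ADDRV', 'PCC']
print(decoded["cache_error"])  # filtered, EVICT, Generic, L2
print(decode_mcg_status(5))    # ['RIPV', 'MCIP']
```

The decoded fields line up with the mcelog output: an uncorrected, enabled error with valid MISC and ADDR registers, processor context corrupt, and a generic level-2 cache eviction error with corrected-error filtering.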

0 Kudos
JSpig
Beginner
2,147 Views

I guess my solution above was off the mark - my NUC has started throwing the Fatal Machine Check errors again... I have even tried to boot from a USB key with the Intel Processor Diagnostic Tool Live OS, but the error comes up before I get the chance to run the tool.

Might have to try getting an RMA.

0 Kudos
RKnal
Beginner
2,147 Views

I also couldn't boot into the Fedora-based IPDT; however, there is a UEFI version that I was able to boot.

0 Kudos
n_scott_pearson
Super User Retired Employee
2,147 Views

This sounds more like an issue with the memory. Have you tried running with a different DIMM installed? I would do so...

...S

0 Kudos
RKnal
Beginner
2,147 Views

I haven't tried different DIMMs, but as I said, Memtest reports no errors for either module. Further, the problem persists with only one module installed, whichever of the two it is.

Therefore, I think it is rather unlikely that it is a memory issue...

0 Kudos
idata
Employee
2,147 Views

Did you end up having to RMA the unit, rk75muc? I'm experiencing similar errors: memtest runs fine, but I can't boot into nearly any operating system without a kernel panic and a reboot.

0 Kudos
RKnal
Beginner
2,147 Views

Yes, I had to RMA the unit, which went surprisingly fast and easy. The new unit has worked perfectly so far with the old memory modules and the old SSD. To be on the safe side, I set the cooling option in the BIOS to "cool".

0 Kudos
Amy_C_Intel
Employee
2,147 Views

Hello, All:

Thank you very much for your feedback.

rk75muc, so the unit works fine now?

Regards,

0 Kudos
RRako
New Contributor I
2,147 Views

"I put cooling in BIOS to "cool"" - rk75muc, don't play with the devil!!!

0 Kudos
RKnal
Beginner
2,147 Views

Amy: Yes, the unit works fine for now.

rado: Well, I figure "cool" is the most hardware-friendly setting, right?

0 Kudos
RRako
New Contributor I
2,147 Views

Yes, but nobody knows yet whether changing the fan cooling settings can cause the fan-stop issue.

0 Kudos
Amy_C_Intel
Employee
2,147 Views

rk75muc, thank you for sharing that. Let me know if the unit continues to work fine.

Regards,

0 Kudos
idata
Employee
2,147 Views

Hello,

I'm sorry, but I can confirm the problems with the NUC.

My NUC6i5SYH is equipped with 2 x 4 GB RAM and a 250 GB SSD and runs 64-bit Windows 10.

It worked fine for about 4 weeks - 3D CAD and some office jobs.

But yesterday it stopped with the Windows error message WHEA_UNCORRECTABLE_ERROR ... and rebooted ... and error ... and reboot ... and error ...

Trying to boot a Linux system, I got the same results as rk75muc reported ... :-(

Klaus.

0 Kudos
RRako
New Contributor I
2,147 Views

And which version of BIOS do you have?

0 Kudos
idata
Employee
2,147 Views

BIOS Setup still runs, so I can get the BIOS version:

SYSKLi35.86A.0028.2015.1112.1822

Default settings have been used.

0 Kudos
RRako
New Contributor I
2,147 Views

So that's why you got this WHEA error: you are on BIOS 0028 instead of the newest, 0042. Your hardware is now physically damaged.

0 Kudos
idata
Employee
1,799 Views

Thanks for the quick diagnosis. It is very helpful and saves me further investigation.

But I'm disappointed that there was no information on how to avoid the NUC's short lifetime of only 4 weeks.

In mid-March I bought the NUC, which came with a BIOS that is known to turn the NUC into a device with built-in obsolescence.

The box contains a quick setup guide and regulatory and safety information. But there was no hint or label saying to update the BIOS before use.

What is the recommended way to get the damaged NUC replaced?

0 Kudos