For about a year or so, my Intel NUC6I5SYK (with the Iris Graphics 540) has been experiencing random crashes. Well, they are not exactly 'random'; they mostly occur when watching YouTube videos in Chrome or during Zoom meetings.
What do I mean by "crash"? The picture freezes and about 5 seconds later, the NUC reboots.
What I have found out so far:
- for simple tasks (idling, browsing, text editing, etc.) the NUC runs stable
- prime95 runs fine, no crash (20min test)
- furmark makes the NUC crash after about 2 minutes
- temperature readings are ok (90°C max) and I have checked and cleaned the fan
- RAM and SSD seem to be fine (no errors found with Memtest or file system scans)
- crash occurs on Windows and Linux (therefore most likely not caused by the OS)
- BIOS is on the latest version
It looks like the issue is somehow related to the GPU (Iris 540) or the resulting higher power consumption when the GPU is active.
In the Windows event log I found the following:
- Critical Error, Kernel-Power: "The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly."
If the system locks up and then, after a short time, resets spontaneously, this is an indication that the Watchdog Timer reset the PC to recover from a lockup. At this level, this is most often the result of a memory bus lockup. A memory bus lockup is most often caused by noise levels on the bus reaching thresholds where data cannot be distinguished from noise. While this can be the result of failures on the motherboard (something creating new noise (which happens with age) or bus support components not suppressing the noise they should be (because they are failing)) or failures occurring in the processor's memory controllers, the most common issue is bad or failing memory. This is what you have to look at first. Yes, I know you said MemTest didn't see anything. If you are talking about the Microsoft utility, well, it is completely worthless AFAIK. It can find failed memory, but is not very good at finding failing memory. In fact, even the higher-rated programs, like MemTest86 or MemTest86+, will still not find all cases of failing memory. Bottom line, find somewhere to borrow some other memory to try.
Hope this helps,
P.S. That Event Log entry is simply a signal that the system was started up without having recorded that a shutdown took place. It is thus totally useless as any kind of indicator for what caused the failure.
Thanks for your answer. I will try to organize a replacement RAM.
However, I find it quite strange that the RAM should be the culprit. I mean, why does the NUC not crash with prime95 (close to 100% memory usage), and why does it run totally stable when performing tasks that are not GPU-intensive?
And yes I used MemTest86.
I am having a tough time coming up with an explanation that doesn't take me hours to craft. Suffice it to say that it may not be bad memory -- specific memory cells that cannot be read and written reliably -- because if that were the case, MemTest86 should catch it. There are so many components that could play a role in bus lockups occurring, and there could even be some particular sequence of events that has to occur in order to cause it. The problem is, how do you figure out where the issue lies? Is it in the processor? Is it some component on the motherboard? We hope it's none of these, since they are integral. I thus look for other components that *can* be replaced and start with them first. Memory can be replaced -- but, since we don't know that this is the culprit, we don't want to spend a fortune replacing it just yet. Try some from another system. Think borrow before purchase.
I agree it makes sense to test the memory properly since this component can be easily replaced.
I was able to get some replacement RAM DIMMs for testing, but unfortunately the NUC still crashes. I also reset the BIOS to the default values to make sure there is no bad setting.
Hm, maybe my NUC is toast. Warranty is gone so not sure what to do with it.
Well, if you have tested with both Windows and Linux and with two different sets of SODIMMs, we can conclude that the problem isn't in the memory itself (though it could still be in the memory interface) and that, given the crash occurs under multiple operating systems, it is difficult (though not impossible) to blame the graphics drivers.
Have you tried doing a clean install of the NUC-validated graphics driver? Have you tried with the latest Beta release of the graphics driver?
Yes, I have tried a clean re-install of the latest graphics driver (18.104.22.16881). The older driver cannot be installed on Windows anymore for some reason, so I couldn't try that. I don't know anything about beta drivers, where can I get those?
I also played around a bit with the BIOS settings and found an option to set the max. sustained and the boost power consumption. When I lower these values to 15 and 23W respectively, Furmark runs a bit longer (around 10 minutes instead of 2 minutes). However, GPU and CPU clock speeds are throttled heavily to achieve the lower power limits. At the time the NUC locked up, the GPU temperature reading was ~80°C.
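Since the crash also happens on Linux, one way to see how close the package actually gets to those power limits is to sample the RAPL energy counter. The following is only a rough sketch, assuming the kernel exposes the powercap interface at `/sys/class/powercap/intel-rapl:0` (the path, the helper names and the sampling interval are assumptions, not something verified on this NUC), and reading `energy_uj` may require root:

```python
import time
from pathlib import Path

# Assumed sysfs location of the package-0 RAPL domain (Linux powercap driver).
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def average_watts(e_start_uj, e_end_uj, seconds, max_range_uj):
    """Average power in watts between two energy readings given in microjoules."""
    delta = e_end_uj - e_start_uj
    if delta < 0:
        # The counter wrapped around between the two readings.
        delta += max_range_uj
    return delta / 1e6 / seconds


def read_int(path):
    """Read a single integer value from a sysfs file."""
    return int(Path(path).read_text().strip())


def monitor(rapl_dir=RAPL_DIR, interval=0.5, samples=20):
    """Print the average package power for each sampling interval."""
    max_range = read_int(rapl_dir / "max_energy_range_uj")
    prev_e, prev_t = read_int(rapl_dir / "energy_uj"), time.monotonic()
    for _ in range(samples):
        time.sleep(interval)
        e, t = read_int(rapl_dir / "energy_uj"), time.monotonic()
        print(f"{average_watts(prev_e, e, t - prev_t, max_range):6.2f} W")
        prev_e, prev_t = e, t
```

Calling `monitor()` while a GPU stress test runs in another window would show whether the lockups line up with the package hitting the sustained or boost limit, or happen well below it.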
Here is the link to the download page for the latest Beta release: https://downloadcenter.intel.com/download/30522/Intel-Graphics-BETA-Windows-10-DCH-Drivers. BTW, I got to this page by going to https://downloadcenter.intel.com, searching for 'DCH' and selecting the resulting Beta release. This is build 9667, whereas the latest production build is 9466.
Hhmmm, I wonder if you are having a power supply issue. Have you tried using a different power supply?
Thanks for the link, will try it later.
I also suspected the power supply at first, but my tests with other supplies showed the same symptoms. I did not have an original NUC supply to test with but a different one (20V, 65W).
Just did some more testing: selected "balanced power" in the BIOS instead of "max. power", which sets the limits to I think 20 and 25W for sustained / boost. In addition, I adjusted the fan speed curve such that the fan spins at max. rpm when temperature reaches ~85°C (SYS temp). Interestingly, this time the Furmark test did not make the NUC crash (tested for 20min).
This makes me wonder if it could be some kind of thermal shutdown. Is there a way to find out? Could it be that one of the temperature sensors sporadically hits a threshold and triggers a reset? Are these thresholds known / documented somewhere?
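On the Linux side, one possible way to look into this is to read the thermal zones and their trip points from sysfs and to log the temperatures several times per second, forcing each sample to disk so the last reading before a freeze survives the reset. This is a minimal sketch, assuming the kernel exposes the usual `/sys/class/thermal/thermal_zone*` layout with values in millidegrees Celsius; the function names, the log file name and the 0.2 s interval are made up for illustration:

```python
import glob
import os
import time

# Assumed Linux sysfs layout: one directory per thermal zone, values in millidegrees C.
THERMAL_GLOB = "/sys/class/thermal/thermal_zone*"


def read_zones(pattern=THERMAL_GLOB):
    """Return {"zoneN:type": temperature in degrees C} for every readable zone."""
    temps = {}
    for zone in sorted(glob.glob(pattern)):
        try:
            with open(os.path.join(zone, "type")) as f:
                name = f.read().strip()
            with open(os.path.join(zone, "temp")) as f:
                value = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # some zones cannot be read or report nothing useful
        temps[f"{os.path.basename(zone)}:{name}"] = value
    return temps


def read_trip_points(pattern=THERMAL_GLOB):
    """Return {"zoneN": [(trip type, threshold in degrees C), ...]}."""
    trips = {}
    for zone in sorted(glob.glob(pattern)):
        for temp_file in sorted(glob.glob(os.path.join(zone, "trip_point_*_temp"))):
            type_file = temp_file[: -len("temp")] + "type"
            try:
                with open(type_file) as f:
                    kind = f.read().strip()
                with open(temp_file) as f:
                    limit = int(f.read().strip()) / 1000.0
            except (OSError, ValueError):
                continue
            trips.setdefault(os.path.basename(zone), []).append((kind, limit))
    return trips


def log_temperatures(path="temps.csv", interval=0.2, samples=1000):
    """Append one timestamped line per sample and sync it to disk immediately."""
    with open(path, "a") as out:
        for _ in range(samples):
            fields = [f"{time.time():.3f}"]
            fields += [f"{k}={v:.1f}" for k, v in read_zones().items()]
            out.write(",".join(fields) + "\n")
            out.flush()
            os.fsync(out.fileno())  # make sure the sample is on disk before a freeze
            time.sleep(interval)
```

Printing `read_trip_points()` once shows whatever thresholds the firmware reports to the kernel, and the last lines of the log after a crash show how far below them the readings were. Whether these zones cover the sensor that matters on this board is an open question.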
Thank you for posting on the Intel® communities. We hope that the assistance provided by the community has been helpful.
Also, we would like to inform you that because the Intel® NUC Kit NUC6i5SYK has been discontinued, Intel Customer Service no longer supports inquiries for it. Perhaps fellow community members have the knowledge to jump in and help. You may also find the Discontinued Products website helpful to address your request. Thank you for your understanding.
Please keep in mind that this thread will no longer be monitored by Intel.
Intel Customer Support Technician
A thermal shutdown is exactly that; the system immediately powers off completely (no windows shutdown, no restart, no nothing). And, when you do power back on (manually, not automatically), the BIOS will inform you that a thermal shutdown had occurred. No, while it is possible that heat could play a role in your failure, it isn't anything simple.
Are you monitoring temperatures with something that can record temperatures? As an example, you could download and run AIDA64 in trial mode; it can do this and lots of other things.
Hope this helps,
I tried AIDA64 but it isn't really useful (logging rate is way too low). The BIOS does not show any thermal shutdown warning or anything, so I assume this didn't happen.
I also don't think it is a thermal-only problem, because sometimes the NUC freezes even though the CPU fan is idling (which usually means low / normal temperatures). Sometimes I'm not even doing much, just watching a video on YouTube, and the NUC locks up. It's really getting annoying.
I'm now thinking about replacing it. Problem is, NUC11 is currently not available where I live.
Anyway, I appreciate your help, thanks!
Yea, it didn't feel like a thermal issue. Even your original temperatures were nowhere near the thermal shutdown point.
Nothing is available. Shortages are affecting the whole industry. These are the longer-term effects of a global pandemic...