I have a large number of Intel NUC7i5BNK systems running Debian Stretch. Each has a Transcend TS64GMTS800 M.2 SSD and a Kingston KVR24S17S6/4 SODIMM.
I am seeing an issue where, very intermittently (perhaps once every 100 reboots), the system freezes on the GRUB screen as follows:
Loading Linux 4.9.0-7-amd64 ...
Loading initial ramdisk ...
Hard lockup, cannot switch to another console, have to power down. On the next boot it happily loads the OS properly.
Because we have multiple systems in the field, all rebooting daily, we see this problem somewhat regularly. I am able to reproduce it by starting a cronjob as follows:
@reboot sleep 60 && /sbin/reboot
After 7-20 hours of rebooting it will hit this lockup, fail to load the kernel, and thus fail to run another cron reboot.
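To put a number on "7-20 hours", the reboot loop can keep a count of boots that actually reached userspace. This is only a minimal sketch: the function name record_boot and the script path in the comment are hypothetical, not part of my actual setup.

```shell
#!/bin/sh
# Hypothetical helper: append one line per successful boot to a log file
# and print the running boot count. The caller chooses the log path.
record_boot() {
    log_file="$1"
    # One line per boot: UTC timestamp plus the running kernel version.
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -r)" >> "$log_file"
    # The number of lines equals the number of boots that reached userspace.
    wc -l < "$log_file" | tr -d ' '
}

# Possible crontab entry, assuming the function is saved in a script
# (hypothetical path) that calls record_boot with a log file:
# @reboot /usr/local/sbin/record-boot.sh; sleep 60 && /sbin/reboot
```

After a lockup, the last line of the log gives the time and count of the final good boot, which makes it easier to compare machines or configurations.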
Things I have tried:
memtest86, pro version, 50 passes (took over 24 hours), zero errors.
kernel option: acpi=off, no effect
kernel option: nomodeset, no effect
I have run apt-get update; apt-get upgrade; apt-get dist-upgrade to bring the kernel to 4.9.0-11 with no effect.
The UEFI/BIOS versions are different on these two machines, and the Intel changelog suggests nothing useful. It's possible both BIOS versions suffer from the same bug, but I'd love to know if there is any other diagnosis I can perform first.
In an attempt to get more information, I removed the "quiet" flag from the kernel options, but that didn't change anything. I then added "verbose", which also didn't change anything. Since this lockup seems to happen right as the kernel loads, nothing pertaining to the lockup is logged in /var/log/messages, syslog, or kern.log, and there is no screen output after "Loading initial ramdisk", no matter which logging options I choose.
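When changing kernel options like these, it is worth confirming on a successful boot that the change really reached the kernel command line. A small sketch follows; has_param is a hypothetical helper name, and it only does a whole-word match against whatever string it is given.

```shell
#!/bin/sh
# has_param CMDLINE PARAM
# Succeeds if PARAM appears as a whole space-separated word in CMDLINE.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a running system, check the options the kernel was actually booted with:
# has_param "$(cat /proc/cmdline)" nomodeset && echo "nomodeset is active"
# has_param "$(cat /proc/cmdline)" quiet || echo "quiet is not set"
```

This only shows what the kernel received on boots that succeed; it cannot say anything about the boot that locked up.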
There are a thousand causes for this problem when I search the internet and very few of them pertain to intermittent symptoms.
I know it seems like hardware is the most likely culprit, but the fact that this is happening on at least 10 different machines in the field suggests a systemic hardware bug in this entire platform, if it is indeed hardware.
I would love to know if anyone else can suggest something I can test.
I put nine identical machines on the test bench and had them reboot every 60 seconds. By the next morning, all of them had crashed.
In the NUC BIOS (all 9 are updated to 0081) the RAM timings are 'Automatic', there is no custom profile.
I did notice the RAM I'm using is not on Intel's official list of tested RAM, so I bought two modules from this list, installed them into 2 of the 9 test machines, and put them all on the bench. They still crashed, even with the new RAM.
Intel has basically told me that they don't support Debian Stretch so there's nothing they can do, which is frustrating but understandable.
I am going to install Ubuntu LTS on 4 machines, Debian Testing on 5 machines, and aggressively reboot. Maybe it is just a fundamental Debian Stretch incompatibility with this hardware.
We offer limited support for Linux* but Ubuntu* is still a bit more in our range of action than Debian*, so please try Ubuntu and let us know if you still need assistance.
One thing I don't like to ask, but which may be applicable in this case in order to rule out any basic hardware-related issue, is to try Windows 10*. You can use PassMark* Rebooter (not an Intel* application) to run a quick test and compare the results vs. Debian*.