I updated the NUC7CJYH BIOS from 0027 to 0043 today and got the same MCE errors I have seen with every version after 0027, so I tried to revert to 0027, but the downgrade failed. It said the BIOS version does not match and aborted. I cannot revert to any version before 0043 after this update, so I am now locked to 0043.
I am running Arch Linux with kernel version 4.17.11-6-ck-silvermont.
[ 0.090006] mce: [Hardware Error]: Machine check events logged
[ 0.090009] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: a600000000020408
[ 0.090016] mce: [Hardware Error]: TSC 0 ADDR fef4c9e0
[ 0.090023] mce: [Hardware Error]: PROCESSOR 0:706a1 TIME 1535229970 SOCKET 0 APIC 0 microcode 28
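For anyone who wants to look at the status word in the log above, here is a minimal Python sketch that splits it into the architectural fields of an IA32_MCi_STATUS register as laid out in the Intel SDM. The function name `decode_mci_status` is made up for this illustration, and the sketch only separates the bits; it does not explain what the bank 4 error means on this particular board.

```python
# Architectural flag bits of an IA32_MCi_STATUS register (Intel SDM, Vol. 3B).
FLAG_BITS = {
    "VAL": 63,    # register contains valid error information
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting for this error was enabled
    "MISCV": 59,  # the MISC register holds extra information
    "ADDRV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupted
}


def decode_mci_status(status_hex):
    """Split a hex status word (as printed by dmesg) into its basic fields."""
    status = int(status_hex, 16)
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


decoded = decode_mci_status("a600000000020408")
print(decoded["flags"])
print(hex(decoded["mca_error_code"]), hex(decoded["model_specific_code"]))
```

For the value in the log this reports VAL, UC, ADDRV and PCC as set, with EN clear. Whether that points to a real fault or to something left behind by the firmware during boot is a question the decoded bits alone cannot answer.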
I have the same issue here. The BIOS won't revert either.
Despite the message above, Arch Linux does continue to boot, without the GUI. Using NoMachine remote access works perfectly.
After noticing that the NIC activity lights were blinking randomly, I realized that the display had an issue and successfully logged in. The problem was with using a TV as the display. Using a computer monitor works normally.
There is a newer BIOS release that you may want to try, version 44: https://downloadcenter.intel.com/download/28106/BIOS-Update-JYGLKCPX-86A-?product=126135 (BIOS Update [JYGLKCPX.86A])
Now, BIOS 43 includes an updated CPU microcode (Security Advisory-00115). I don't know whether this update is generating the MCE errors. Could you update your kernel from 4.17.11 to 4.18.5 (https://www.kernel.org/)? What Linux distribution are you using? Did you get this error right after the BIOS was updated to BIOS 43?
On the other hand, when trying to downgrade the BIOS (which I don't recommend), did you try the [F7] process? Which process and which file did you use, and at what point do you get the error?
I updated to version 44 and the MCE error remains the same on both 4.17.11 and 4.18.5. I still cannot revert to any previous version.
[~]$ dmesg | grep -i error
[ 0.053361] mce: [Hardware Error]: Machine check events logged
[ 0.053364] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: a600000000020408
[ 0.053372] mce: [Hardware Error]: TSC 0 ADDR fef4c9e0
[ 0.053378] mce: [Hardware Error]: PROCESSOR 0:706a1 TIME 1535503652 SOCKET 0 APIC 0 microcode 28
[ 0.575717] RAS: Correctable Errors collector initialized.
[ 2.460149] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
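If it helps when comparing boots across BIOS versions, here is a small Python sketch that pulls the bank, status, address and microcode revision out of `dmesg` output like the above. The regular expressions are written against the line format shown in this thread only, and `summarize_mce` is a made-up helper name, so treat it as a starting point and not a general parser.

```python
import re

# Patterns matched against the mce lines exactly as they appear in this thread.
MCE_BANK = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)")
MCE_ADDR = re.compile(r"ADDR ([0-9a-f]+)")
MCE_UCODE = re.compile(r"microcode ([0-9a-f]+)")


def summarize_mce(dmesg_text):
    """Collect the fields of one machine check record from dmesg text."""
    summary = {}
    for line in dmesg_text.splitlines():
        if "mce: [Hardware Error]" not in line:
            continue  # skip unrelated lines such as the regulatory.db one
        bank = MCE_BANK.search(line)
        if bank:
            summary["cpu"] = int(bank.group(1))
            summary["bank"] = int(bank.group(2))
            summary["status"] = bank.group(3)
        addr = MCE_ADDR.search(line)
        if addr:
            summary["addr"] = addr.group(1)
        ucode = MCE_UCODE.search(line)
        if ucode:
            summary["microcode"] = ucode.group(1)  # kept as the hex string printed
    return summary


SAMPLE = """\
[ 0.053361] mce: [Hardware Error]: Machine check events logged
[ 0.053364] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: a600000000020408
[ 0.053372] mce: [Hardware Error]: TSC 0 ADDR fef4c9e0
[ 0.053378] mce: [Hardware Error]: PROCESSOR 0:706a1 TIME 1535503652 SOCKET 0 APIC 0 microcode 28
[ 2.460149] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
"""

print(summarize_mce(SAMPLE))
```

Running it on the output of each boot makes it easy to confirm that the bank, status and address stay identical across BIOS 43 and 44, as they do in the two logs posted here.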
I am running Arch Linux. I get MCE errors on every version after 0027, which is why I try each new BIOS once it is released and revert to 0027 if I encounter MCE errors.
I used the [F7] process to upgrade/downgrade the BIOS. I loaded multiple old BIOS versions onto a USB drive and tried them one by one, but I could not downgrade to any previous version.
The error message when downgrading the BIOS is "Incompatible BIOS version, Update Aborted."
I just checked on this again. BIOS version 43 includes an updated CPU microcode (Security Advisory-00115), and once the BIOS is updated to this version you cannot go back to previous versions for security reasons.
That detail is missing from the documentation and I have already requested that it be added. The BIOS cannot be downgraded through any regular method: BIOS update, [F7] or the BIOS recovery jumper.
I apologize for the inconvenience.
I would need to read the logs. For that I used to use mcelog, but it seems it is not part of kernel 4.15.0, which is the one I am running, and I am reading that mcelog was removed from that kernel version.
I updated from 4.15.0 to 4.18.5 and there is still no mcelog. I tried to add it manually with no success.
Do you have a way to read the logs and perhaps post a screenshot?
The mcelog package is deprecated and has been replaced by rasdaemon. You could give it a shot.
I tried to install rasdaemon to read the MCE error information and failed. It seems I don't have several of the required kernel options enabled, such as the EDAC-related ones.
I hope rasdaemon helps you debug this.
Thanks for the information. I guess my Linux knowledge is a bit outdated. I will try rasdaemon, although I have never used it before and I don't know if it is going to work out for me.
On the other hand, besides the MCE error message, is the system exhibiting any other issue or problem? The system I have is for testing only and I am not really running any tasks on it, so everything I see is normal.
To be honest, I haven't experienced any obvious system issues or crashes yet besides the mce error from the dmesg log.
I remember someone once told me that MCE errors can cause random system reboots or application instability. Is that true in this case? I would love to see a fix in a future BIOS update.
I am really going to need your help to debug this issue.
I installed rasdaemon:
$ sudo apt install rasdaemon
I have it installed, but when I run it I get an error message (see the attached screenshot). The message is "can't locate a mounted debugfs", although I believe I had already mounted it (see the previous commands in the screenshot).
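One way to double-check the debugfs part independently of rasdaemon is to look at what the kernel itself lists in /proc/mounts. Below is a minimal Python sketch of that check. `debugfs_mount_points` is a made-up helper, the sample text is illustrative and not taken from the affected machine, and the sketch only tells you whether a debugfs mount is visible; it does not reproduce whatever test rasdaemon performs internally.

```python
def debugfs_mount_points(mounts_text):
    """Return the mount points whose filesystem type is debugfs.

    mounts_text is the content of /proc/mounts, where each line reads
    "device mountpoint fstype options dump pass".
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "debugfs":
            points.append(fields[1])
    return points


# Illustrative sample only; on a real system read /proc/mounts instead.
SAMPLE_MOUNTS = """\
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
"""

print(debugfs_mount_points(SAMPLE_MOUNTS))
```

On the machine itself the call would be `debugfs_mount_points(open("/proc/mounts").read())`. An empty list means no debugfs is mounted, in which case `mount -t debugfs none /sys/kernel/debug` as root is the usual way to mount it.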
If possible, I would need your help to read the logs and interpret the MCE message, which is what I am trying to do to keep the investigation of this report going. Secondly, I need to understand whether this error message is connected to the hardware issues you have had.
Please keep in mind that we provide very limited support for Linux related issues so the more you can help me the better.
Thanks for reporting back on the progress.
How did you enable the rasdaemon service? I assume Ubuntu is using systemd now, so maybe you should try enabling the rasdaemon service with the systemctl enable command. Please use this for reference: https://wiki.archlinux.org/index.php/Machine-check_exception
Also, I found the rasdaemon GitHub repo, and a paragraph in its README should be helpful. Here is the link: https://github.com/sujithshankar/rasdaemon (cloned from http://git.infradead.org/users/mchehab/rasdaemon.git/). Basically you need to rebuild the Ubuntu generic kernel with the following options enabled, which I doubt are enabled by default. Then the MCE log should be recorded in journald. Hope this helps.
A script is provided under /contrib, in order to test the daemon EDAC
handler. While the daemon is running, just run:
The script requires a Kernel compiled with CONFIG_EDAC_DEBUG and a running
MCE error handling can use the MCE inject:
For it to work, Kernel mce-inject module should be compiled and loaded.
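Before rebuilding a kernel it may be worth checking which of these options the running kernel already has. Here is a minimal Python sketch that reads a kernel config file and reports the state of a list of options. `config_states` is a made-up helper and the sample config text is illustrative. `CONFIG_EDAC_DEBUG` comes from the README excerpt above, while `CONFIG_X86_MCE_INJECT` is my assumption for the option behind the mce-inject module.

```python
def config_states(config_text, options):
    """Report each option as its configured value, "n", or "missing"."""
    states = {name: "missing" for name in options}
    suffix = " is not set"
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("# ") and line.endswith(suffix):
            # Disabled options appear as "# CONFIG_FOO is not set".
            name = line[2:-len(suffix)]
            if name in states:
                states[name] = "n"
        elif "=" in line and not line.startswith("#"):
            # Enabled options appear as "CONFIG_FOO=y" or "CONFIG_FOO=m".
            name, value = line.split("=", 1)
            if name in states:
                states[name] = value
    return states


# Illustrative sample only, not the config of any kernel in this thread.
SAMPLE_CONFIG = """\
CONFIG_EDAC=y
# CONFIG_EDAC_DEBUG is not set
CONFIG_X86_MCE_INJECT=m
"""

wanted = ["CONFIG_EDAC_DEBUG", "CONFIG_X86_MCE_INJECT", "CONFIG_RAS"]
print(config_states(SAMPLE_CONFIG, wanted))
```

On many distributions the config of the running kernel is available as /boot/config-<kernel release>, and on kernels built with that feature as /proc/config.gz, which needs to be decompressed first.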
azuresong never reported any hardware issues, but the MCE logs on boot are concerning/annoying. I had a previous NUC that had to be returned for hardware errors, and the other MCE logs were very confusing to debug.
Here's a reddit thread of others who are experiencing the same mce errors on boot on all BIOS >037
You should be able to compile and install mcelog from source according to https://www.mcelog.org/installation.html and get it running. I haven't been able to see it decode any of the MCE errors I'm seeing into /var/log/mcelog yet, though.
Also, it seems like mcelog is having trouble giving detailed output in this case (though I've seen it give proper output for the real hardware errors I had with a previous board): https://github.com/andikleen/mcelog/issues/70 (NUC7PJYH (J5005) - mce: [Hardware Error])
I can definitively say, though, that this issue was introduced in the BIOS update immediately following 037. rguevara, you mentioned that the BIOS cannot be downgraded by any regular method. Does that mean there is a non-regular method that can be used to downgrade the BIOS in the meantime?
There is a newer BIOS to be released very soon that addresses this issue. See screenshot attached.
With regard to not being able to downgrade the BIOS via the "regular" methods, I was referring to any method we have publicly available.
I hope this helps.