
HW Error i9-10850K - Is it CPU or RAM ?

erstrauss

Hi All,

My system:
CPU: i9-10850K
MB: ROG STRIX Z490-E GAMING
RAM: Corsair 32GB, 2 X 16GB
OS: Fedora Linux 33

I see the following hardware machine check events:

Apr 24 21:59:27 localhost.localdomain mcelog[941]: Hardware event. This is not a software error.
Apr 24 21:59:27 localhost.localdomain mcelog[941]: MCE 0
Apr 24 21:59:27 localhost.localdomain mcelog[941]: CPU 1 BANK 0 TSC 1307f2e3d5ac6
Apr 24 21:59:27 localhost.localdomain mcelog[941]: TIME 1619315967 Sat Apr 24 21:59:27 2021
Apr 24 21:59:27 localhost.localdomain mcelog[941]: MCG status:
Apr 24 21:59:27 localhost.localdomain mcelog[941]: MCi status:
Apr 24 21:59:27 localhost.localdomain mcelog[941]: Error overflow
Apr 24 21:59:27 localhost.localdomain mcelog[941]: Corrected error
Apr 24 21:59:27 localhost.localdomain mcelog[941]: Error enabled
Apr 24 21:59:27 localhost.localdomain mcelog[941]: MCA: Internal parity error
Apr 24 21:59:27 localhost.localdomain mcelog[941]: STATUS d000004000010005 MCGSTATUS 0
Apr 24 21:59:27 localhost.localdomain mcelog[941]: MCGCAP c10 APICID 2 SOCKETID 0
Apr 24 21:59:27 localhost.localdomain mcelog[941]: MICROCODE e2
Apr 24 21:59:27 localhost.localdomain mcelog[941]: CPUID Vendor Intel Family 6 Model 165 Step 5
Apr 24 21:59:27 localhost.localdomain mcelog[941]: mcelog: warning: 8 bytes ignored in each record
Apr 24 21:59:27 localhost.localdomain mcelog[941]: mcelog: consider an update
...
Apr 24 21:59:58 localhost.localdomain mcelog[941]: Hardware event. This is not a software error.
Apr 24 21:59:58 localhost.localdomain mcelog[941]: MCE 0
Apr 24 21:59:58 localhost.localdomain mcelog[941]: CPU 5 BANK 0 TSC 13098a1328846
Apr 24 21:59:58 localhost.localdomain mcelog[941]: TIME 1619315998 Sat Apr 24 21:59:58 2021
Apr 24 21:59:58 localhost.localdomain mcelog[941]: MCG status:
Apr 24 21:59:58 localhost.localdomain mcelog[941]: MCi status:
Apr 24 21:59:58 localhost.localdomain mcelog[941]: Corrected error
Apr 24 21:59:58 localhost.localdomain mcelog[941]: Error enabled
Apr 24 21:59:58 localhost.localdomain mcelog[941]: MCA: Internal parity error
Apr 24 21:59:58 localhost.localdomain mcelog[941]: STATUS 9000004000010005 MCGSTATUS 0
Apr 24 21:59:58 localhost.localdomain mcelog[941]: MCGCAP c10 APICID a SOCKETID 0
Apr 24 21:59:58 localhost.localdomain mcelog[941]: MICROCODE e2
Apr 24 21:59:58 localhost.localdomain mcelog[941]: CPUID Vendor Intel Family 6 Model 165 Step 5
Apr 24 21:59:58 localhost.localdomain mcelog[941]: mcelog: warning: 8 bytes ignored in each record
Apr 24 21:59:58 localhost.localdomain mcelog[941]: mcelog: consider an update
Apr 24 22:00:27 localhost.localdomain mcelog[941]: Hardware event. This is not a software error.
Apr 24 22:00:27 localhost.localdomain mcelog[941]: MCE 0
Apr 24 22:00:27 localhost.localdomain mcelog[941]: CPU 7 BANK 0 TSC 130b13c272bb0
Apr 24 22:00:27 localhost.localdomain mcelog[941]: TIME 1619316027 Sat Apr 24 22:00:27 2021
Apr 24 22:00:27 localhost.localdomain mcelog[941]: MCG status:
Apr 24 22:00:27 localhost.localdomain mcelog[941]: MCi status:
Apr 24 22:00:27 localhost.localdomain mcelog[941]: Corrected error
Apr 24 22:00:27 localhost.localdomain mcelog[941]: Error enabled
Apr 24 22:00:27 localhost.localdomain mcelog[941]: MCA: Internal parity error
Apr 24 22:00:27 localhost.localdomain mcelog[941]: STATUS 9000004000010005 MCGSTATUS 0
Apr 24 22:00:27 localhost.localdomain mcelog[941]: MCGCAP c10 APICID e SOCKETID 0
Apr 24 22:00:27 localhost.localdomain mcelog[941]: MICROCODE e2
Apr 24 22:00:27 localhost.localdomain mcelog[941]: CPUID Vendor Intel Family 6 Model 165 Step 5
Apr 24 22:00:27 localhost.localdomain mcelog[941]: mcelog: warning: 8 bytes ignored in each record
Apr 24 22:00:27 localhost.localdomain mcelog[941]: mcelog: consider an update
Apr 24 22:15:25 localhost.localdomain mcelog[941]: Hardware event. This is not a software error.
Apr 24 22:15:25 localhost.localdomain mcelog[941]: MCE 0
Apr 24 22:15:25 localhost.localdomain mcelog[941]: CPU 1 BANK 0 TSC 133a192c3c238
Apr 24 22:15:25 localhost.localdomain mcelog[941]: TIME 1619316925 Sat Apr 24 22:15:25 2021
Apr 24 22:15:25 localhost.localdomain mcelog[941]: MCG status:
Apr 24 22:15:25 localhost.localdomain mcelog[941]: MCi status:
Apr 24 22:15:25 localhost.localdomain mcelog[941]: Corrected error
Apr 24 22:15:25 localhost.localdomain mcelog[941]: Error enabled
Apr 24 22:15:25 localhost.localdomain mcelog[941]: MCA: Internal parity error
Apr 24 22:15:25 localhost.localdomain mcelog[941]: STATUS 9000004000010005 MCGSTATUS 0
Apr 24 22:15:25 localhost.localdomain mcelog[941]: MCGCAP c10 APICID 2 SOCKETID 0
Apr 24 22:15:25 localhost.localdomain mcelog[941]: MICROCODE e2
Apr 24 22:15:25 localhost.localdomain mcelog[941]: CPUID Vendor Intel Family 6 Model 165 Step 5
Apr 24 22:15:25 localhost.localdomain mcelog[941]: mcelog: warning: 8 bytes ignored in each record
Apr 24 22:15:25 localhost.localdomain mcelog[941]: mcelog: consider an update
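
The STATUS values in the log above can also be decoded by hand. What follows is a minimal Python sketch, assuming the architectural IA32_MCi_STATUS bit layout described in the Intel Software Developer's Manual. The function name `decode_mci_status` and the flag table are illustrative and are not part of mcelog. The corrected-error count field is only meaningful on CPUs that report corrected errors this way, so treat it as a hint.

```python
# Sketch: decode the architectural fields of an IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM machine-check chapter.
FLAG_BITS = [
    (63, "VAL"),    # register contains a valid error
    (62, "OVER"),   # an earlier error was overwritten (overflow)
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register is valid
    (58, "ADDRV"),  # MCi_ADDR register is valid
    (57, "PCC"),    # processor context corrupt
]


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its main fields."""
    flags = [name for bit, name in FLAG_BITS if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "corrected_error_count": (status >> 38) & 0x7FFF,  # bits 52:38
    }


if __name__ == "__main__":
    for value in (0xD000004000010005, 0x9000004000010005):
        print(hex(value), decode_mci_status(value))
```

Under these assumptions, the two STATUS values in the log differ only in the OVER bit. Both have UC clear, which matches the "Corrected error" lines, and both carry MCA error code 0x0005, which mcelog prints as "Internal parity error".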

1. I'm running with default settings.
2. The CPU is not getting hot; it stays at or below 55C.
3. The dmesg command output includes:

[93947.269384] mce_notify_irq: 2 callbacks suppressed
[93947.269387] mce: [Hardware Error]: Machine check events logged

My questions:

1. Do the above errors indicate a CPU issue or a RAM issue?
2. What are the next steps to isolate and fix the issue?
3. What diagnostic tools are available to isolate the root cause of the issue?

I'd appreciate your help.

Thank you.

 

DeividA_Intel

Hello erstrauss, 



Thanks for the patience, I will share with you some steps to take and I will ask questions to confirm the information:


1. Before setting the RAM frequency at 2933 MHz, what frequency did you have?


2. Noticed that the CPU speed was higher than the base frequency, were you aware of it?


3. Based on my research, these errors appear when the CPU is used out of specification. This can, of course, damage the computer; you can read more about it here:

- https://askubuntu.com/questions/605369/mce-hardware-error-machine-check-events-logged-appears-in-syslog-what-sho


4. As mentioned in the last post, since you tried different RAM and CPU and the errors persisted, it would be better to contact the motherboard manufacturer for further troubleshooting, or to ask in the Linux forums.



Links to third-party sites and references to third-party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, Intel® is not responsible for the contents of such links, and no third-party endorsement of Intel or any of its products is implied.  




Regards,  


Deivid A.  

Intel Customer Support Technician  


DeividA_Intel



Hello erstrauss, 


  

Were you able to check the previous post and get the information requested? Please let me know if you need more assistance.   


  


Regards,  


Deivid A.  

Intel Customer Support Technician  


erstrauss

Hi Deivid,

 

I contacted Asus; it will take a while to get more help or information from them.

I tried a different approach: instead of configuring the BIOS to the 'safe, in spec' mode, I loaded the 'BIOS optimized' profile, which takes the system to the best-performing configuration.

To my surprise, the system now works very well: it passed 3 hours of stress tests with the same test programs and delivers about 15% better performance. I'll continue testing it over the coming weeks.

 

Are there tools that can report all voltages and frequencies of the different system components?
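
For the frequency part of this question, the Linux kernel already exposes per-core clock rates through sysfs. Below is a minimal Python sketch, assuming the cpufreq interface is present at `/sys/devices/system/cpu` and reports `scaling_cur_freq` in kHz. The helper name `read_cpu_freqs_mhz` is hypothetical. Voltages are not covered here; they typically depend on board-specific sensor drivers.

```python
# Sketch: read the current clock of every logical CPU from sysfs.
# Assumes the cpufreq interface exists and reports values in kHz.
from pathlib import Path


def read_cpu_freqs_mhz(root="/sys/devices/system/cpu"):
    """Return {"cpuN": MHz} for every CPU that exposes scaling_cur_freq."""
    freqs = {}
    for freq_file in Path(root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        cpu_name = freq_file.parent.parent.name  # e.g. "cpu3"
        try:
            freqs[cpu_name] = int(freq_file.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # skip CPUs that are offline or unreadable
    return freqs


if __name__ == "__main__":
    readings = read_cpu_freqs_mhz()
    for cpu_name in sorted(readings, key=lambda name: int(name[3:])):
        print(f"{cpu_name}: {readings[cpu_name]:.0f} MHz")
```

Sampling this in a loop during a stress test gives a rough picture of how far above base frequency the cores actually run under each BIOS profile.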

 

 

1. Before setting the RAM frequency at 2933 MHz, what frequency did you have?

I don't have this information.

If I run the lshw command, now with the working config, I get:

[root@localhost]# lshw -short -C memory
H/W path Device Class Description
=============================================================
/0/0 memory 64KiB BIOS
/0/47 memory 32GiB System Memory
/0/47/0 memory 16GiB DIMM DDR4 Synchronous 2133 MHz (0.5 ns)
/0/47/1 memory [empty]
/0/47/2 memory 16GiB DIMM DDR4 Synchronous 2133 MHz (0.5 ns)
/0/47/3 memory [empty]
/0/54 memory 640KiB L1 cache
/0/55 memory 2560KiB L2 cache
/0/56 memory 20MiB L3 cache
/0/100/14.2 memory RAM memory

Is that reported 2.133 GHz the current RAM speed that you are referring to? (Why did the BIOS configure it to such a low number?)

 

 

2. Noticed that the CPU speed was higher than the base frequency, were you aware of it?

According to: https://ark.intel.com/content/www/us/en/ark/products/205904/intel-core-i9-10850k-processor-20m-cache-up-to-5-20-ghz.html

this CPU may run at a clock rate of up to 5.2 GHz. Is that correct?

I was able to reproduce the hardware event using Intel's Clear Linux OS. Do you have tools in the OS to set the CPU clock rate to the best values?

Is there a tool that can tell me what has changed between the two BIOS configurations and hardware settings: the one that causes issues under certain stress, and the one that works?
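
One low-tech way to approach this comparison is to save the text output of a reporting tool (for example `dmidecode` or `lshw`) once under each BIOS profile and diff the two dumps. The sketch below uses Python's standard `difflib`; the function name `changed_lines` and the sample strings are hypothetical, and such dumps capture only the settings that the firmware exposes to the OS, not the full BIOS configuration.

```python
# Sketch: list the lines that differ between two saved system reports.
# The report contents used in the example are made up for illustration.
import difflib


def changed_lines(before: str, after: str):
    """Return only the removed ('-') and added ('+') lines of a diff."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="safe_in_spec",
        tofile="bios_optimized",
        lineterm="",
        n=0,  # no context lines, only the changes
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


if __name__ == "__main__":
    # Hypothetical excerpts of two dumps taken under different profiles.
    safe = "Speed: 2933 MT/s\nVoltage: 1.2 V"
    optimized = "Speed: 2133 MT/s\nVoltage: 1.2 V"
    for line in changed_lines(safe, optimized):
        print(line)
```

In practice the two strings would be read from files saved after booting with each profile.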

 

Thanks,

ErStrauss

 

erstrauss

Hi Deivid,

 

I tried a different approach: instead of setting the BIOS to 'safe, in spec' mode, I set it to 'BIOS optimized' mode.
To my surprise, in this mode I don't see the problem, and the application runs more than 10% faster.

I'll explore the differences between this mode and the original mode, and report back if I find the exact setting that makes this work.

 

1. Before setting the RAM frequency at 2933 MHz, what frequency did you have?
- I don't have that information.
- According to dmidecode, the DIMM speed in the current, working setup is 2133 MHz; that is below the DIMM's spec, but it works and is faster than before.

 

2. Noticed that the CPU speed was higher than the base frequency, were you aware of it?
- According to: https://ark.intel.com/content/www/us/en/ark/products/205904/intel-core-i9-10850k-processor-20m-cache-up-to-5-20-ghz.html

 this CPU supports up to 5.2 GHz.
- Yes, it is running above 3600 MHz; it is supposed to be that way, right?
- In the current stable, working setup the CPU is running at 4900 MHz.
- The test was failing on Intel's Clear Linux OS; was there some misconfiguration in the OS that ran the CPU out of spec?

 

I contacted the Asus team; it will take a few more days to get detailed information about this case.

 

- Is the 'Intel Extreme Tuning Utility' available for Linux?


Thanks,
ErStrauss

 

DeividA_Intel

Hello erstrauss, 



Thanks for the update, I will try to answer your questions:



1. The BIOS modes have different settings/profiles; for the 'BIOS optimized' profile, Asus should set 2133 MHz as the default. These settings are chosen by the motherboard manufacturer and may differ between brands.


2. The CPU base frequency (3.60 GHz) is the one recommended by Intel. 5.20 GHz is the maximum speed that the CPU can reach with Turbo Boost and/or overclocking. Any speed above the base frequency can damage the CPU, which is why we do not recommend it.


3. Unfortunately, there is no Intel® Extreme Tuning Utility (Intel® XTU) for Linux.



At this point, I would like to confirm whether the errors disappeared after you changed the BIOS mode, or whether you still see the same behavior.



Do not hesitate to let me know if you need further assistance, but as mentioned (and as you noticed) the solution may be related to the motherboard.




Best regards, 


Deivid A.  

Intel Customer Support Technician 


DeividA_Intel

Hello erstrauss,  


  

Were you able to check the previous post? Please let me know if you need more assistance.   

  


Regards,    


Deivid A.  

Intel Customer Support Technician  


erstrauss

Hi Deivid,

 

I got the system to a stable state, but I'm not sure it delivers the best possible performance.

 

Enabling Asus XMP I left the system in a non-bootable state; a manual configuration with the memory at 2133 MHz works.

 

The system is stable at 4800 MHz at a temperature of 60-65C, which is OK, while the CPU draws 120-135 W.
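
A package power figure like the one above can be cross-checked on Linux from the RAPL energy counters. This is a minimal Python sketch, assuming the powercap interface exposes `energy_uj` and `max_energy_range_uj` under `/sys/class/powercap/intel-rapl:0` and that the file is readable (recent kernels may require root). The function names are hypothetical, and the wraparound handling is approximate.

```python
# Sketch: estimate average CPU package power from RAPL energy counters.
# Assumes the intel-rapl powercap zone 0 is the CPU package.
import time
from pathlib import Path

RAPL_ZONE = Path("/sys/class/powercap/intel-rapl:0")


def average_power_watts(start_uj, end_uj, seconds, max_range_uj):
    """Convert two microjoule counter readings into average watts."""
    delta_uj = end_uj - start_uj
    if delta_uj < 0:
        # The counter wrapped around between the two readings.
        delta_uj += max_range_uj
    return delta_uj / 1e6 / seconds


def sample_package_power(interval=1.0, zone=RAPL_ZONE):
    """Read the counter twice, `interval` seconds apart."""
    max_range_uj = int((zone / "max_energy_range_uj").read_text())
    start_uj = int((zone / "energy_uj").read_text())
    time.sleep(interval)
    end_uj = int((zone / "energy_uj").read_text())
    return average_power_watts(start_uj, end_uj, interval, max_range_uj)


if __name__ == "__main__":
    print(f"package power: {sample_package_power():.1f} W")
```

Logging this alongside temperature during a stress run makes it easier to compare BIOS profiles on more than pass/fail.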

 

I found Intel's instructions for overclocking the CPU.

 

You mentioned a risk to the CPU and the RAM. Is it due only to high temperature, or are there other factors that might damage the hardware

(for example, more than 1.35 V for the RAM)?

 

2. CPU base frequency (3.60 GHz) is the one recommended by Intel. The 5.20 GHz is the maximum speed that the CPU could reach with turbo boost and/or overclocking. Any speed above the base frequency can damage the CPU and that is why we do not recommend doing it.

Again, is the damage due to temperature or to other issues? The CPU is from the K line of products, which should be unlocked for overclocking.

 

3. Unfortunately, there is not an Intel® Extreme Tuning Utility (Intel® XTU) for Linux devices.

Are there any other tools in Intel's Clear Linux to measure system performance and tune it?

 

At this point, I would like to confirm if the errors disappear after you changed the BIOS mode or if you still have the same behavior/issue?

Yes, changing the BIOS setup removed the issue, but it leaves a lot to be desired in terms of capturing the state of the system configuration. I'm not yet 100% sure the system will not show instability later on with a different type of load.

 

Thanks,

ErStrauss

 

 

DeividA_Intel

Hello erstrauss, 



Thanks for the response and, just to finish, I would like to add:



1. Even though your CPU can be overclocked, that does not mean overclocking is harmless to the device. The process places the CPU under pressure and stress that can damage it or the motherboard. Overclocking should be needed only if the task at hand requires more "power" and a boost in performance.


2. For the moment, we do not have a tool for Linux besides the Intel® System Support Utility for the Linux Operating System, which performs a detailed scan and produces a report of the computer's system information to assist with customer support troubleshooting.


- https://downloadcenter.intel.com/download/26735/Intel-System-Support-Utility-for-the-Linux-Operating-System


3. Bear in mind that Intel® Extreme Memory Profile (Intel® XMP) takes your RAM to its maximum speed, and the constant use of this tool can damage the memory controller hub which is the connection between the CPU and the RAM (the memory controller hub is part of the CPU hardware). This action can be harmful if the RAM speed exceeds the frequency supported by the CPU (2933 MHz).



In conclusion, keep your system at the base frequency unless more is needed to perform a specific task. Also, check with the motherboard manufacturer for instructions on Intel® XMP and overclocking configuration, and on the replacement process if needed.



Please keep in mind that this thread will no longer be monitored by Intel.  


  


  

Regards,  


Deivid A.  

Intel Customer Support Technician  


DoctorDan

Hello David,

I've got the same error message:

To start with, at the moment the machine has DDR4-2133 RAM. I gather from your comments that this may be an issue.

Nevertheless this is what I have done so far:

1 - I have tried Ubuntu 20.04 and 21.10, Fedora 34, and Manjaro. I have tried Linux kernels 5.11 and 5.13. They all report the same error.

2 - The processor diagnostic run on Windows 10 claims everything is OK.

3 - I have replaced the Gigabyte z580-A Master with an ASUS z590-Prime. Same error.

4 - I have tried both 1 and 2 sticks of memory. Same error.

5 - I have tried a different brand of DDR4-2133. Same error.

Like erstrauss, since I don't really understand where these errors come from or what they mean, I wonder about the implications and the potential for long-term problems if the errors can't be eliminated.

I doubt that running all of the logs that you requested of erstrauss would show anything different than what he reported.

In my mind, it's difficult to eliminate the processor as being part or all of the problem, despite the fact that the diagnostic passed. Does it really test everything that could cause this particular problem?

I would appreciate your help in resolving this and in helping me to understand the origin of the messages.

Thank you,

Dan Essin
