I recently purchased a refurbished HP Z820 workstation with dual Xeon E5-2670 processors (2 dies, 8 cores / die --> 16 cores, 32 threads).
Shortly after receiving the machine I experienced a Windows 10 crash (Windows is on a hard drive) and a Red Hat Enterprise Linux 7.3 crash (Linux is on an SSD). The Windows crash had the following message: "929-Fatal MCA error. MLC error detected CPU 0. Internal timeout error - watchdog timer (3 strike)", while the Linux crash contained a phrase like 'CPU0 ... TSC dead'. (I don't have the exact text of the error available right now, but "TSC dead" was definitely there.)
The Wikipedia pages for the Time Stamp Counter (TSC) and the watchdog timer (https://en.wikipedia.org/wiki/Time_Stamp_Counter, https://en.wikipedia.org/wiki/Watchdog_timer) sound related. Could the crashes have been caused by the same hardware error?
I have stress tested the processor with Intel's "Processor Diagnostic Tool" (IPDT) and Prime95. IPDT ran with zero errors, and its stress test resulted in reasonable temperatures of 60 degrees C per core. Prime95 pushes the machine much harder: all 16 cores are above 80 degrees C, and one core reached as high as 89 degrees C (the average across all cores is about 84 C). However, Prime95 is not reporting any errors. Although I am concerned about the high temperatures when running Prime95, strictly speaking they are below my processor's Tj-Max (100 C).
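To put numbers on that margin, here is a minimal Python sketch that computes the headroom below Tj-Max for the readings quoted above. The names `TJ_MAX_C`, `readings_c`, and `headroom` are illustrative only, and the sketch does nothing beyond subtraction on the figures from this post.

```python
# Temperatures reported in this post, in degrees C.
# TJ_MAX_C is the Tj-Max value quoted above for this processor.
TJ_MAX_C = 100

readings_c = {
    "IPDT stress test": 60,
    "Prime95 average": 84,
    "Prime95 hottest core": 89,
}


def headroom(temp_c, tj_max_c=TJ_MAX_C):
    """Return how many degrees C remain before Tj-Max is reached."""
    return tj_max_c - temp_c


for label, temp in readings_c.items():
    print(f"{label}: {temp} C, {headroom(temp)} C below Tj-Max")
```

So the hottest core under Prime95 still had 11 C of headroom, and the average had 16 C.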
Is there any tool I could use to test for hardware failures related to the TSC or a "watchdog timer"? If there is an error, it doesn't seem like IPDT or Prime95 is going to find it.
And in general, could the high core temperatures be contributing to instability? Note that the situations when it crashed were not under extremely high load (maybe high levels of I/O, but not numerical computation like Prime95 does).
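For anyone wanting to look at the Linux side in the meantime, the following is a rough Python sketch, not a hardware diagnostic. It reads the kernel's active clocksource from the standard sysfs file and filters saved log lines for the terms that appeared in the two crash messages above. The function names and the keyword list are my own assumptions, and a keyword match only flags a line for a closer look.

```python
from pathlib import Path

# Standard Linux sysfs location of the active clocksource.
CLOCKSOURCE = Path(
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"
)

# Keywords taken from the crash messages quoted above.
# Illustrative only; this list is an assumption, not an exhaustive filter.
KEYWORDS = ("tsc", "watchdog", "mca", "machine check")


def current_clocksource(path=CLOCKSOURCE):
    """Return the active clocksource name, or None if the file is unavailable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def find_suspect_lines(lines, keywords=KEYWORDS):
    """Return the lines that mention any keyword, case-insensitively."""
    return [ln for ln in lines if any(k in ln.lower() for k in keywords)]


if __name__ == "__main__":
    print("active clocksource:", current_clocksource())
    sample = [
        "929-Fatal MCA error. MLC error detected CPU 0. "
        "Internal timeout error - watchdog timer (3 strike)",
        "CPU0 ... TSC dead",
    ]
    for line in find_suspect_lines(sample):
        print("suspect:", line)
```

If the active clocksource is something other than `tsc` after a crash, that may mean the kernel stopped trusting the TSC, though that alone would not show whether the hardware is at fault.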
rileymurray: Thank you very much for joining the Intel® Processors communities.
Regarding your inquiries: the tool we recommend for testing the processor is the Intel® PDT. It is a very reliable tool, and if the processor passed that test, it should be working fine.
Prime95 runs an intense stress test, so high temperatures are expected. And yes, high temperatures can make the system unstable. When the PC is overheating you will notice several symptoms: the PC will start to throttle, then freeze, and eventually shut itself down, since the processor has a feature that turns off the system when it gets too hot in order to avoid damage to the rest of the components.
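On the Linux installation, one way to see whether throttling has actually occurred is to read the per-CPU throttle counters that the kernel exposes under sysfs on Intel systems. The sketch below is only an illustration under that assumption: the helper name `read_throttle_counts` is hypothetical, and the counters may be absent on some kernels or configurations, in which case it returns an empty result.

```python
from pathlib import Path


def read_throttle_counts(base=Path("/sys/devices/system/cpu")):
    """Map each CPU directory name to its core_throttle_count, where exposed."""
    counts = {}
    if not base.is_dir():
        return counts
    for f in base.glob("cpu[0-9]*/thermal_throttle/core_throttle_count"):
        try:
            # f.parent.parent is the cpuN directory that owns this counter.
            counts[f.parent.parent.name] = int(f.read_text().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or malformed entries
    return counts


if __name__ == "__main__":
    throttled = {cpu: n for cpu, n in read_throttle_counts().items() if n > 0}
    print(throttled or "no throttle events recorded (or counters not exposed)")
```

Comparing the counts before and after a Prime95 run would indicate whether the cores were throttled during the test.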
The crashes may well be related to the information on the Wikipedia pages you linked. We also have reports that this error can be related to the RAM, so it would be a good idea to check that the memory is fully compatible. The supported memory for your system is DDR3 800/1066/1333/1600:
To try to fix this problem, we recommend updating the BIOS. The best thing to do right now is to contact the manufacturer of the PC directly for instructions on how to update the BIOS and to check whether they have further suggestions on this matter. If the PC is still under warranty and a hardware failure is found, they should be able to help you with warranty assistance as well:
Any further questions, please let me know.
Thank you for your detailed response, Alberto!
Could you elaborate on your remark that "we also have reports that it could be related to RAM"? The system has 256GB of DDR3 RAM at 800 MHz. I have already run a couple of memory scans, but as you can imagine it takes a very long time (days...) to scan that much RAM. It would be helpful to know in what way the RAM might be a problem so that I can run more targeted tests.
Also, although the RAM is DDR3 @ 800 MHz, notably it does not support Serial Presence Detect (SPD). Is this normal? Are there things about the RAM that I might want to investigate *because* the RAM does not support SPD?
rileymurray: You are welcome. Regarding your inquiry, the link below has a few details about this error message appearing in connection with a memory error. It was reported for Windows® 8, but the same problem could happen with Windows® 10:
It is not certain that the memory is causing the problem, but just in case, if you are able to run Memtest and it shows no errors, then the memory should be fine. Another thing to try is testing the PC with just one memory stick at a time. Note that since this is an OEM (Original Equipment Manufacturer) product, doing so might void the warranty.
On the following link, you might find additional details about this error:
NOTE: These links are being offered for your convenience and should not be viewed as an endorsement by Intel of the content, products, or services offered there.
Any questions, please let me know.
rileymurray: I just wanted to check whether the information posted previously was useful to you and whether you need further assistance on this matter.
Any questions, please let me know.