All Systems Keep Getting WHEA_UNRECOVERABLE_ERRORS

Stancey
Beginner

I have a set of four (4) systems that keep getting this error. I have tried various options, changing the clocking up and down the scale using the tuning tools provided by the vendor, ASUS. Even at base timing, once the workload really gets running, one machine in the multi-cluster configuration will "drop out" and recover shortly after. I have gone through multiple cycles of updates for the graphics cards and the Ethernet adapters, and I have been equally diligent about following the ASUS updates on the support page for my motherboard. The memory is from ASUS's recommended list of memory components. Each machine is water cooled (AIO pump) and stays at a comfortably moderate thermal level.

There are days when I have seen multiple faults in a day, bursting for an hour and then stopping. The systems have also gone for days without faults, and sometimes the fault occurs while the system is idling. When the application is running clean and I bring up Edge to look at something on the vendors' web sites, I get a failure, but only with Edge. Dropping back to IE, I can access web sites without issue. Yes, I keep up with malware tools. I typically run at a 2133 memory clock speed because if I go up to the rated memory speed of 3200 I often get errors under certain application loads.

 

i9-9980XE Processors (18 Cores Rated @3.01 GHz)
Asus X299-Prime Deluxe II Motherboard, BIOS 3301
G.Skill memory, 128GB per system, clockable to 3200, running at the ASUS default clock speed of 2133 (have tried others)
Micron 16TB flash memory
Intel P4600 NVMe

Thunderbolt 3 Configured (controller FW 15E6 1.2.25.0 Driver/Control sw L41 1.957.- NVM 36.0)
OneStopSystems Magma
TeraMaster TBT Controller Noontec D8-device Tower Legacy, not RAIDed

Nvidia Graphics
Quadro P400 on displays
RTX5000 GPU
Driver level 461.72

Intel Ethernet 10G 2P X520 Ethernet Adapters with SFP+ Fibre 
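
The fault pattern described in the post above (several faults bursting within about an hour, then days of quiet) is easier to compare across the four machines once the fault times are grouped into bursts. The following is only a minimal sketch. It assumes the fault timestamps have already been exported from each machine's event log as ISO 8601 strings. The sample timestamps, the `group_bursts` helper, and the one-hour gap are all hypothetical and not taken from the thread:

```python
from datetime import datetime, timedelta


def group_bursts(timestamps, max_gap=timedelta(hours=1)):
    """Group fault times into bursts.

    Two consecutive faults belong to the same burst when they are no more
    than max_gap apart, so a long burst can span more than max_gap in total.
    """
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    bursts = []
    for t in times:
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)   # close enough to the previous fault
        else:
            bursts.append([t])     # start a new burst
    return bursts


# Hypothetical sample data, not real log entries from these systems.
sample = [
    "2021-03-01T09:05:00",
    "2021-03-01T09:20:00",
    "2021-03-01T09:55:00",
    "2021-03-04T14:00:00",
]

for burst in group_bursts(sample):
    span = burst[-1] - burst[0]
    print(f"{burst[0]:%Y-%m-%d %H:%M}: {len(burst)} fault(s) over {span}")
```

Running the same grouping on each machine's timestamps would show whether the bursts line up in time across the cluster or occur independently on each system.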

Alberto_Sykes
Employee

Stancey, Thank you for posting in the Intel® Communities Support.


Regarding the "WHEA_UNRECOVERABLE_ERRORS" you are seeing, the following links contain troubleshooting steps to try. Please also look at the links under "Related topics" on those pages, since they lead to other sites with additional recommendations and a possible solution:

https://www.intel.com/content/www/us/en/support/articles/000028099/processors/intel-core-processors.html

https://www.intel.com/content/www/us/en/support/articles/000025090/processors.html

 

Regarding "I typically run at 2133 memory clock speed because if I go up to the rated memory speed of 3200 I often have errors under certain application loads.": keep in mind that the memory controller is located on the Intel® Processor, so it is the processor that determines which memory is supported. The specifications at the link below show that the supported memory type for this processor is DDR4-2666, so 2666 MHz should be the maximum memory speed to use on your platform:

https://ark.intel.com/content/www/us/en/ark/products/189126/intel-core-i9-9980xe-extreme-edition-processor-24-75m-cache-up-to-4-50-ghz.html
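
To put the memory speeds mentioned in this thread side by side, here is a rough calculation of theoretical peak bandwidth. It is only a sketch: it assumes a 64-bit (8-byte) DDR4 channel and four populated memory channels, neither of which is stated in the thread, and real application throughput will be lower than these peaks. The `peak_bandwidth_gbs` helper is a hypothetical name used for illustration:

```python
BYTES_PER_TRANSFER = 8  # assumption: each DDR4 channel is 64 bits wide
CHANNELS = 4            # assumption: four populated memory channels


def peak_bandwidth_gbs(mega_transfers_per_s, channels=CHANNELS):
    """Theoretical peak bandwidth in GB/s for a given DDR4 data rate (MT/s)."""
    bytes_per_s = mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER * channels
    return bytes_per_s / 1e9


# The three memory speeds discussed in the thread.
for speed in (2133, 2666, 3200):
    print(f"DDR4-{speed}: {peak_bandwidth_gbs(speed):.1f} GB/s")
```

Under these assumptions the 2133 setting gives about two thirds of the peak bandwidth of the 3200 rating, and the 2666 setting sits in between.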


To rule out a possible hardware problem with the Intel® Processor, please run the Intel® Processor Diagnostic Tool. It performs an overall test of the processor; if the processor passes, it is working properly and the problem is related to a different component:

https://downloadcenter.intel.com/download/19792/Intel-Processor-Diagnostic-Tool


Regarding "changing the clocking up and down", we recommend using the PC at stock configuration with the default BIOS settings. Altering clock frequency or voltage may damage or reduce the useful life of the processor and other system components, and may reduce system stability and performance. Product warranties may not apply if the processor is operated beyond its specifications. Check with the manufacturers of system and components for additional details.


We also suggest contacting ASUS directly to get instructions on how to update the BIOS to the latest version, to report this scenario, and to verify whether they have a possible fix for this issue:

https://www.asus.com/support/


Once you get the chance, please let us know the results of trying the steps above so, if necessary, we can further assist you.


Any questions, please let me know.


Regards,

Albert R.


Intel Customer Support Technician


Stancey
Beginner

The response was pretty generic and covered things I had already done, as I am experienced at troubleshooting hardware issues. My problem is that I was still getting errors after all of that, and there was no indication that it was going to improve. Some of the driver updates mitigated the problems, but I was still seeing errors characteristic of a sequencing issue in setting up transfers, resulting in a reference to an invalid memory address. When I was writing device driver logic for Linux, I saw the same kind of errors because of timing issues in setting up the registers for the transfers. I have several high-performance devices installed on this system (two with Intel-supplied drivers and two with Nvidia drivers), and I was concerned that the timing of the management of the interface to the PCIe bus was a problem.

Alberto_Sykes
Employee

Stancey, Thank you very much for providing that information.


Just to confirm, does the problem only happen while using Edge, or does it also occur when using a different browser?

Additionally, please attach the report from the Intel® PDT test so we can see the results, and also attach the SSU report so we can verify further details about the components in your platform. Please check all the options in the report, including the one that says "3rd party software logs":

https://downloadcenter.intel.com/download/25293/Intel-System-Support-Utility-for-Windows-?product=91600


Regards,

Albert R.


Intel Customer Support Technician


Alberto_Sykes
Employee

Hello Stancey, I just wanted to check if you saw the information posted previously and if you need further assistance on this matter?


Regards,

Albert R.


Intel Customer Support Technician


Stancey
Beginner

Yes, I did; it must not have gotten through. It happens with others, but it also happens spontaneously. Using Edge while running the application was a trigger. Since updating to the latest Thunderbolt driver on all machines and the latest NVIDIA drivers, it seems to have stabilized long enough to get some work put through, although it is running slower because I am using the base memory clocking for the ASUS motherboard (2133) instead of the memory rating of 3200. I am still convinced that it is linked to the network interfaces, though I did upgrade to the latest driver there about a week and a half ago.

Alberto_Sykes
Employee

Stancey, Thank you very much for sharing those updates.


If you need further assistance on this topic, please attach the report of the Intel® PDT test so we can see the results and also attach the SSU report so we can verify further details about the components in your platform.


Regards,

Albert R.


Intel Customer Support Technician


Alberto_Sykes
Employee

Hello Stancey, I just wanted to check if you saw the information posted previously and if you need further assistance on this matter?


Regards,

Albert R.


Intel Customer Support Technician


Stancey
Beginner

I have been so busy trying to catch up with the workload after the failures that I am swamped. I have not had any errors since I was able to install the latest Windows, Nvidia RTX5000, and Intel Networking drivers. My guess is that updating to the newest Intel network drivers and MEC was the real stabilizing event, but I can't prove it.

Alberto_Sykes
Employee

Stancey, Thank you very much for providing those updates.


Perfect, it is great to hear that since you installed the latest Windows*, Nvidia RTX5000, and Intel® Networking drivers, you have not seen any errors. Thank you very much for sharing the solution on this forum as well; we are sure it will be very useful for all the peers viewing this thread.


For any other inquiries, do not hesitate to contact us again.


Regards,

Albert R.


Intel Customer Support Technician

