
Individual cores gone faulty in i9-13900k, throwing general protection fault

sibidharan
New Contributor II
3,332 Views

I bought an i9-13900K and paired it with 128 GB of RAM. I run Ubuntu and host tens of VMs with KVM. I started getting random "general protection fault" kernel panics, all of which referred to some kind of cross-cache permission violation. They stopped after I added slub_debug=F to the kernel parameters, as suggested in the Red Hat Customer Portal article 'Kernel panic due to "kmem_cache_alloc+117 from mempool_alloc_slab" on RHEL 7'.

I tried booting from live USBs, and they just crashed, which was weird. The kernel was not tainted, yet it crashed with the same kind of permission violation in kmem_cache_alloc, and every live USB I booted, even with no hard disk attached, had the same issue. With some luck I was able to bring the server up, and since I had slub_debug=F on the kernel command line, it didn't crash during operation and ran for weeks at a stretch.


It worked for some time, until one day there was a power failure. When I restarted the server, it showed the error I attached here. The errors before slub_debug=F showed different addresses in the panic, but they were all page fault errors, and I was convinced they were caused by freelist pointer corruption, as Red Hat support describes. I suspected faulty RAM, so I ran memtest, and it passed. This time the error is the same across different kernels. I even tried to boot Windows from a new SSD; it couldn't boot, and I attached the BSOD here. Everything points to a "general protection fault" raised by the processor.

But now it panics in the initrd phase, while the kernel is doing some udev work. I could never find out what was causing it, because the panic happens in the initrd phase and there is nowhere to write the logs. Interestingly, the same error occurs at the same location even when I boot different kernels from a live USB. I thought I had lost the server. I ran memtest and it passed again. I removed and tested each connected peripheral one at a time; nothing helped. Then I read somewhere about using maxcpus=1 to limit the number of CPUs, and it worked: the machine booted and ran, but with only one CPU. I didn't know what was wrong until I did the same in the BIOS, enabling only one performance core and disabling all efficiency cores. I got 2 logical CPUs thanks to Hyper-Threading, and it works.
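For anyone repeating this, it helps to confirm that slub_debug=F and maxcpus=1 really reached the running kernel. Below is a minimal Python sketch, assuming a Linux system that exposes /proc/cmdline. The example command line is made up for illustration and is not copied from the affected server, and the parser ignores quoted values that contain spaces.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {parameter: value}.

    Flags without '=' map to None. Quoted values containing spaces are
    not handled; that is a deliberate simplification.
    """
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_running_cmdline(path: str = "/proc/cmdline") -> Dict[str, Optional[str]]:
    """Parse the command line of the currently running kernel (Linux only)."""
    return parse_cmdline(Path(path).read_text())


# Hypothetical command line, for illustration only.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro quiet slub_debug=F maxcpus=1"
params = parse_cmdline(example)
print("slub_debug =", params.get("slub_debug"))
print("maxcpus    =", params.get("maxcpus"))
```

On the real machine, read_running_cmdline() returns the same kind of dictionary for whatever the bootloader actually passed.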

 

I have read in a lot of places that CPU cores can go faulty, for example: https://access.redhat.com/solutions/3915511

Similar situation here: https://www.linuxquestions.org/questions/linux-desktop-74/not-present-page-kernel-panic-4175722803/

As described in the link above, I also tried re-enabling the remaining cores after booting successfully with only one core. But then the CPUs get into hard lockups or soft lockups. I even tried adding softlockup_panic=0 to the kernel parameters; it doesn't panic then, but just hangs forever. It's a lockup: the CPU is not responding. In syslog and kern.log I see something like this, a permission violation:

[2.911260] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[2.911260] BUG: unable to handle page fault for address: fffffe00000453a8
[2.911261] #PF: supervisor instruction fetch in kernel mode
[2.911261] #PF: error_code(0x0011) - permissions violation
[2.911262] PGD 87efc6067 P4D 87efc6067 PUD 87efc4067 PMD 87efc3067 PTE 000000085fc4d163
[2.911264] Thread overran stack, or stack corrupted
[2.911264] Oops: 0011:0xfffffc000000453a8
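The error_code(0x0011) in that log can be decoded by hand, and the result lines up with the two #PF lines the kernel printed. Here is a small Python sketch that does this, using the low five bits of the x86 page-fault error code as laid out in the Intel SDM. The function name and the wording of each label are my own, not kernel output.

```python
from typing import List, Optional, Tuple

# (bit, meaning when clear, meaning when set) for the low five bits of the
# x86 page-fault error code. Label wording is illustrative.
PF_BITS: List[Tuple[int, Optional[str], str]] = [
    (0, "page not present", "protection violation on a present page"),
    (1, "not a write", "write"),
    (2, "supervisor mode", "user mode"),
    (3, None, "reserved bit set in a paging-structure entry"),
    (4, "data access", "instruction fetch"),
]


def decode_pf_error_code(code: int) -> List[str]:
    """Return human-readable meanings for a page-fault error code."""
    meanings = []
    for bit, when_clear, when_set in PF_BITS:
        meaning = when_set if code & (1 << bit) else when_clear
        if meaning is not None:
            meanings.append(meaning)
    return meanings


for line in decode_pf_error_code(0x0011):
    print(line)
```

For 0x0011, bits 0 and 4 are set and bit 2 is clear: a protection violation during an instruction fetch in supervisor mode, which is what "supervisor instruction fetch in kernel mode" and "permissions violation" say.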


How did I come to the conclusion that individual cores are faulty?

I rolled up my sleeves and went further: I enabled all the efficiency cores and only one performance core, and the computer works normally. The kernel panic happens only when I enable the remaining performance cores, and it's the same error. I am now running fine with 17 cores and 18 logical CPUs. It runs amazingly well; I can boot a live USB and even run Windows.
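The core counts above follow from the i9-13900K layout of 8 performance cores (two threads each with Hyper-Threading) and 16 efficiency cores (one thread each). A short Python sketch of that arithmetic, with a helper name chosen just for this example:

```python
from typing import Tuple

P_CORES_TOTAL = 8   # i9-13900K performance cores, Hyper-Threading capable
E_CORES_TOTAL = 16  # i9-13900K efficiency cores, one thread each


def enabled_counts(p_cores: int, e_cores: int,
                   hyperthreading: bool = True) -> Tuple[int, int]:
    """Return (physical cores, logical CPUs) for a given BIOS core selection."""
    if not 0 <= p_cores <= P_CORES_TOTAL:
        raise ValueError("p_cores out of range")
    if not 0 <= e_cores <= E_CORES_TOTAL:
        raise ValueError("e_cores out of range")
    threads_per_p_core = 2 if hyperthreading else 1
    return p_cores + e_cores, p_cores * threads_per_p_core + e_cores


print(enabled_counts(1, 0))    # one P-core only: (1, 2)
print(enabled_counts(1, 16))   # one P-core plus all E-cores: (17, 18)
print(enabled_counts(8, 16))   # stock configuration: (24, 32)
```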

What is wrong here? Has an individual performance core gone faulty? I haven't experimented with the other performance cores yet, since my server is back up and I want to keep it running. I will do that experiment eventually.
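One possible way to run that per-core experiment from a booted Linux system, rather than toggling cores in the BIOS one by one, is to pin a deterministic workload to each logical CPU in turn and compare the results. The Python sketch below is only an illustration under several assumptions: it needs Linux (os.sched_setaffinity), the SHA-256 chain is an arbitrary workload chosen for this example, and a core that faults in kernel mode may well hang or panic the machine instead of returning a wrong digest. Which logical CPU numbers belong to P-cores also has to be checked separately, for example with lscpu --extended.

```python
import hashlib
import os
from collections import Counter
from typing import Dict, Iterable, Optional


def workload(rounds: int = 20000) -> str:
    """Deterministic CPU-bound work: a chain of SHA-256 hashes."""
    digest = b"seed"
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def check_cpus(cpus: Optional[Iterable[int]] = None,
               rounds: int = 20000) -> Dict[int, bool]:
    """Run the workload pinned to each logical CPU and compare results.

    The most common digest is taken as the reference, so a single
    misbehaving CPU shows up as False. Returns {cpu: matched_reference}.
    """
    original = os.sched_getaffinity(0)
    targets = sorted(original if cpus is None else cpus)
    digests = {}
    try:
        for cpu in targets:
            os.sched_setaffinity(0, {cpu})  # pin this process to one CPU
            digests[cpu] = workload(rounds)
    finally:
        os.sched_setaffinity(0, original)   # always restore the old mask
    if not digests:
        return {}
    reference = Counter(digests.values()).most_common(1)[0][0]
    return {cpu: digest == reference for cpu, digest in digests.items()}


if __name__ == "__main__":
    for cpu, ok in check_cpus().items():
        print(f"cpu {cpu}: {'ok' if ok else 'MISMATCH'}")
```

A clean run does not prove a core is healthy; it only means this particular workload did not trip it.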

5 Replies
Jose_Intel
Employee
3,309 Views

Hello @sibidharan

 

Thank you for posting on the Intel® communities.

We understand you are experiencing issues with your Intel® Core™ i9-13900K Processor. Please allow us to check the issue internally.

We appreciate the detailed explanation you provided. As soon as we have an update, we will post it here.

 

Best regards,

Jose B.

Intel Customer Support Technician


sibidharan
New Contributor II
3,277 Views

I just switched to a 14th-gen i9-14900K and all the issues are magically gone. The server boots butter-smooth, with no panics and no lockups anywhere!!

It's the bloody i9-13900K; everyone (or a subset) who bought this is silently suffering.

Please change the CPU. That's the only solution.

Jose_Intel
Employee
3,160 Views

Hello sibidharan

 

Thank you for patiently waiting.

 

We highly recommend updating the BIOS to the latest version (please contact the system manufacturer).


Also, could you try installing the latest Windows 11?

 

We will wait for the outcome to see if there is any different result. Please let us know.

 

Best regards,

Jose B.

Intel Customer Support Technician


sibidharan
New Contributor II
3,133 Views

Nothing worked; it still crashes. It's a faulty processor! My BIOS is the latest version; updating it was the first thing I did before diagnosing.

I have XMP disabled, and SpeedStep and C-states disabled. It still crashed.

When did it not crash (on both Windows and Ubuntu, with a non-tainted kernel)?

1. When I turn off all performance cores except one, or disable them all
2. When I disable Turbo Boost and Intel Speed Shift (handing P-state control over to the OS)
3. When I use maxcpus=1 as a kernel parameter
4. When I disable Hyper-Threading (it boots, but crashes under load)
5. When I set SVID behavior to Intel Fail Safe (it boots, but crashes under load; it does not crash with P-cores disabled)

I tried a different CPU, a 14900K, and it works with all default settings.
I tried a different motherboard with the same 13900K at all default settings, and it still crashes. (Disabling P-cores makes it work.)
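The reasoning behind the swap test can be written down explicitly: the suspect is whatever component appears in every crashing combination and in no working one. A tiny Python sketch follows; the board labels "board A" and "board B" are placeholders, since neither board model is named here.

```python
from typing import List, Set, Tuple

# (cpu, motherboard, crashed) with all cores enabled at default settings.
# Board labels are placeholders, not real model names.
observations: List[Tuple[str, str, bool]] = [
    ("13900K", "board A", True),
    ("14900K", "board A", False),
    ("13900K", "board B", True),
]


def suspects(runs: List[Tuple[str, str, bool]]) -> Set[str]:
    """Components present in every crashing run and in no passing run."""
    crashing = [{cpu, board} for cpu, board, crashed in runs if crashed]
    passing = [{cpu, board} for cpu, board, crashed in runs if not crashed]
    if not crashing:
        return set()
    common = set.intersection(*crashing)   # in all failing combinations
    cleared = set().union(*passing)        # seen working at least once
    return common - cleared


print(suspects(observations))
```

With these three observations the only component left is the 13900K: board A is cleared by the working 14900K run, and board B only ever appears alongside the 13900K.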

Windows crashes with General Protection Fault related errors. Ubuntu says outright that it's a GPF.

All of this narrows the issue down to a faulty CPU.

Jose_Intel
Employee
3,081 Views

Hello sibidharan

 

Thank you for working with us.

 

At this point we recommend that you process an RMA with us. After verifying that your warranty is still active here, contact Intel Customer Support directly to initiate the RMA process (you cannot do this through the forums). Here are the pages where you can look up contact information, including local/country phone numbers, by geography:

 

U.S. and Canada: Intel Customer Support 

Europe, Middle East, and Africa: Intel Customer Support EMEA 

Asia-Pacific: Intel Customer Support APAC 

Latin America: Intel Customer Support LAR 

 

We will send some information privately. Please keep in mind that this thread will no longer be monitored by Intel.

 

Best regards,

Jose B.

Intel Customer Support Technician

