14900ks unstable

Keean
Novice
5,097 Views
I have a new 14900KS installed on an ASUS Pro WS W680-ACE motherboard with 64GB of DDR5-5600 ECC (Kingston), and I am testing on Gentoo Linux using:

taskset -c 0-15 emerge -e @world

This recompiles the whole system using just the P-cores. It takes between half a day and a day to rebuild ~1400 packages.

I have rasdaemon running to log hardware errors.

With the performance profile (IccMax=307A, PL1=253W, PL2=253W) the CPU is unstable with anything less than VRM load line level 6 (ASUS BIOS).

Interestingly, it is also stable at LL6 in the extreme profile (IccMax=400A, PL1=320W, PL2=320W).

When using a lower load line level (tested from the motherboard default of 3 up to 5), RAS shows the errors are consistently on CPU 0x8, and are either instruction fetch failures from the level 0 instruction cache, or TLB errors.

I previously had a 13900KS which ran fine with unlocked power limits (IccMax=511.75A, PL1=4095W, PL2=4095W).

I have a pretty good water cooling setup (6x120 shared between CPU and GPU, but GPU is idle in all these tests). Water temperature is 31-32°C once warmed up for the duration of the test, room temp about 25°C.

- Am I right in assuming that CPU=0x8 on all these errors means that P-core 8 might be "bad"?

- Is needing load line level 6 to get the CPU stable normal, or is it something to worry about?
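
On the first question, Linux exposes each logical CPU's hyper-thread siblings in sysfs, which is one way to check which physical core "CPU 8" belongs to. A minimal sketch in Python: `parse_cpu_list` handles the kernel's list format, `siblings_of` reads the sysfs file when it exists, and `pcore_index` is a hypothetical helper that assumes each P-core owns two consecutive logical CPUs (the layout reported later in this thread, where threads 8 & 9 share one core). Under that assumption CPU 8 would be the fifth P-core (index 4) rather than "P-core 8"; other systems may number CPUs differently.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "8-9" or "0,2,4-6" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def siblings_of(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Logical CPUs sharing a core with `cpu`, or None if sysfs is unavailable."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    try:
        return parse_cpu_list(path.read_text())
    except OSError:
        return None


def pcore_index(cpu, threads_per_core=2):
    """ASSUMPTION: P-cores come first and each owns consecutive logical CPUs."""
    return cpu // threads_per_core


if __name__ == "__main__":
    print("siblings of CPU 8:", siblings_of(8))
    print("assumed P-core index of CPU 8:", pcore_index(8))
```

If `siblings_of(8)` reports 8 and 9, both hyper-threads of that core can be targeted together with `taskset -c 8-9`.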

Thanks for any help you can offer.
0 Kudos
34 Replies
Keean
Novice
3,543 Views

Even with LL6 I just got this error:

ghc-stage1[3436135]: segfault at 0 ip 00000000000014fe sp 00007ffeee8fb318 error 6 in ghc-stage1[400000+4000] likely on CPU 8 (core 16, socket 0)
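
When collecting many of these kernel messages, it can help to tally which logical CPU they land on. The sketch below is a hedged example in Python: the regular expression is written against the single line quoted above, so other kernel versions may format the message differently, and the function names are illustrative.

```python
import re
from collections import Counter

# Pattern modelled on the one segfault line quoted in this post.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+)"
    r".*?likely on CPU (?P<cpu>\d+) \(core (?P<core>\d+), socket (?P<socket>\d+)\)"
)

SAMPLE = (
    "ghc-stage1[3436135]: segfault at 0 ip 00000000000014fe "
    "sp 00007ffeee8fb318 error 6 in ghc-stage1[400000+4000] "
    "likely on CPU 8 (core 16, socket 0)"
)


def parse_segfault(line):
    """Return the process name, pid, CPU, core and socket, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "cpu": int(m["cpu"]),
        "core": int(m["core"]),
        "socket": int(m["socket"]),
    }


def faults_per_cpu(lines):
    """Count segfault messages per logical CPU across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        info = parse_segfault(line)
        if info is not None:
            counts[info["cpu"]] += 1
    return counts


if __name__ == "__main__":
    print(parse_segfault(SAMPLE))
```

Feeding it the kernel log line by line gives a per-CPU count, which makes a pattern such as "always CPU 8" easier to confirm.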

When I run the Intel Processor Diagnostic Tool it passes all the tests.

I wonder if using only P-cores is a factor, as the power limit is then not shared with the E-cores as well. It seems like the Intel diagnostic tool could do with an option to test only P-cores when stress testing.

I am worried that I won't be able to get an RMA exchange if the CPU passes the Intel diagnostic tests.

0 Kudos
Keean
Novice
3,508 Views

A further update: with conservative power limits (Iccmax=280A, pl1=253W, pl2=253W), I am able to get the CPU to fail almost instantly if I set the task affinity to both hyper-threads on the same core, and select the "problem" core directly, in my case P-cores 4 & 5 (threads 8 & 9 and 10 & 11) like this:

taskset -c 8-9 emerge -1 gcc

taskset -c 10-11 emerge -1 gcc

This fails almost instantly with a segmentation fault on P-cores 4 & 5, but runs to completion and produces identical stage 2 and stage 3 compilers on other cores.

I think compiling GCC is a good test because the bootstrap builds the compiler three times and checks that the compiler built by the newly built compiler (stage 3) is identical to the one that built it (stage 2), so it should detect random hardware errors that corrupt the output without causing a crash.
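
The idea behind that comparison can be sketched in a few lines of Python. This is a simplified stand-in, not GCC's own check: the real comparison is done by GCC's build system and may treat parts of each file specially, whereas this sketch hashes whole files. The directory names and the `demo` helper are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digests(root, pattern="*.o"):
    """Map each matching file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob(pattern)
        if p.is_file()
    }


def differing_files(stage_a, stage_b, pattern="*.o"):
    """Return files that are missing from one tree or differ between the two."""
    a = file_digests(stage_a, pattern)
    b = file_digests(stage_b, pattern)
    return sorted(n for n in a.keys() | b.keys() if a.get(n) != b.get(n))


def demo():
    """Build two tiny throwaway trees and report which files differ."""
    with tempfile.TemporaryDirectory() as tmp:
        s2, s3 = Path(tmp, "stage2"), Path(tmp, "stage3")
        for stage in (s2, s3):
            stage.mkdir()
            (stage / "a.o").write_bytes(b"same bytes")
        (s2 / "b.o").write_bytes(b"good output")
        (s3 / "b.o").write_bytes(b"good outpuT")  # simulated one-byte corruption
        return differing_files(s2, s3)


if __name__ == "__main__":
    print(demo())
```

A single flipped byte in one object file is enough to make the two stages differ, which is why a silent miscompile on a marginal core tends to show up here even when nothing crashes.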

It would appear that setting power limits to make the CPU stable is not actually a solution. All it does is throttle the CPU on all-core loads, so that the problem core(s) don't get up to full speed. Cores need to be tested individually, and doing so shows that adjusting power limits to stabilise a bad core won't work.
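
For anyone wanting to repeat the per-core test, the sketch below generates one `taskset` command per P-core so the same workload can be run on each pair of hyper-threads in turn. It is a hedged example in Python: it assumes 8 P-cores whose hyper-threads are logical CPUs 2k and 2k+1 (matching threads 8 & 9 and 10 & 11 above), and the helper names are illustrative. It only builds the commands; running them is left to the reader.

```python
import shlex


def core_thread_pairs(n_pcores=8, threads_per_core=2):
    """ASSUMPTION: P-core k owns logical CPUs k*threads_per_core and onwards."""
    return [
        tuple(range(c * threads_per_core, (c + 1) * threads_per_core))
        for c in range(n_pcores)
    ]


def taskset_command(cpus, workload):
    """Build the argv for running `workload` pinned to the given logical CPUs."""
    cpu_list = ",".join(str(c) for c in cpus)
    return ["taskset", "-c", cpu_list] + list(workload)


def per_core_commands(workload, n_pcores=8):
    """One pinned command per P-core, in core order."""
    return [taskset_command(pair, workload) for pair in core_thread_pairs(n_pcores)]


if __name__ == "__main__":
    for core, argv in enumerate(per_core_commands(["emerge", "-1", "gcc"])):
        print(f"P-core {core}: {shlex.join(argv)}")
```

Running each printed command separately and noting which ones fail isolates a marginal core far faster than a full rebuild across all cores.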

0 Kudos
Keean
Novice
3,500 Views

Update: P-cores 4 & 5, which cause the crashes, are actually the 'preferred' cores with the 6.2GHz limit.

I tried controlling this with the frequency limit per number of active cores (P-cores).

Act.cores: 1/2/3/4/5/6/7/8
Max Freq: 6.1/6.1/5.9/5.9/5.9/5.9/5.9/5.9

Re-running the compile on cores 4 & 5 still results in a crash.

Second attempt:
Act.cores: 1/2/3/4/5/6/7/8
Max Freq: 6.0/6.0/5.9/5.9/5.9/5.9/5.9/5.9

Failed as well.

Now testing: 5.9/5.9/5.9/5.9/5.9/5.9/5.9/5.9

This passes the compile gcc test on all cores.
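
The "by active cores" limits can be read as a lookup table: the number of busy P-cores selects the maximum multiplier. A minimal sketch in Python, assuming the usual 100MHz base clock so that a ratio of 59 corresponds to 5.9GHz; the names are illustrative and are not BIOS settings.

```python
BCLK_MHZ = 100  # ASSUMPTION: stock base clock

# Maximum ratio for 1..8 active P-cores, as tried in this post.
ATTEMPTS = {
    "first": [61, 61, 59, 59, 59, 59, 59, 59],   # crashed on cores 4 & 5
    "second": [60, 60, 59, 59, 59, 59, 59, 59],  # failed as well
    "third": [59, 59, 59, 59, 59, 59, 59, 59],   # passed the gcc test
}


def max_ratio(limits, active_cores):
    """Return the ratio limit that applies when `active_cores` P-cores are busy."""
    if not 1 <= active_cores <= len(limits):
        raise ValueError(f"active_cores must be between 1 and {len(limits)}")
    return limits[active_cores - 1]


def ratio_to_ghz(ratio, bclk_mhz=BCLK_MHZ):
    """Convert a multiplier to a frequency in GHz."""
    return ratio * bclk_mhz / 1000


if __name__ == "__main__":
    for name, limits in ATTEMPTS.items():
        one = ratio_to_ghz(max_ratio(limits, 1))
        eight = ratio_to_ghz(max_ratio(limits, 8))
        print(f"{name}: {one} GHz with 1 core active, {eight} GHz with 8")
```

Pinning a job to both hyper-threads of one core keeps the active-core count at one, so it always exercises the first entry of the table, which is the highest clock.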

0 Kudos
Keean
Novice
3,453 Views

Even with the maximum ratio for 1-8 active P-cores reduced to 59 across the board, it failed doing:

taskset -c 0-15 emerge -e @world

With all the 'by cores used' limits for P-cores set to 58, it still failed during the above compilation test with a hard lockup.

Trying again with 57. At this point I think it's worse than the 13900KS it replaced.

The weird thing is that it "seemed" stable in Windows: I could run the stress tests/benchmarks from the Intel CPU diagnostic tool, OCCT, XTU, and Cinebench fine under Windows.

0 Kudos
Keean
Novice
3,409 Views

Okay, I have replicated the failure in Windows. I installed Gentoo in WSL2, then did an "emerge -1 gcc" to rebuild the compiler with itself. This runs to completion fine on "default" settings.

However, if I open Task Manager, select vmmemWSL (the VM running WSL) and set the affinity to CPUs 8 & 9 only, I get a more or less immediate crash of the compiler.

The results match those in Linux: P-cores 0-3 and 6-7 seem to work fine when vmmemWSL has its affinity set to the pair of hyper-threads for that core. P-core 4 causes the compile to fail almost instantly, and P-core 5 fails after a few minutes.
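
Task Manager sets affinity with checkboxes, but tools that take a numeric affinity mask need the same selection as a bit field, with bit n standing for logical CPU n. A small sketch of the conversion in Python (the helper names are illustrative):

```python
def affinity_mask(cpus):
    """Return an integer with bit n set for every logical CPU n in `cpus`."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return mask


def cpus_from_mask(mask):
    """Inverse of affinity_mask: list the logical CPUs whose bits are set."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


if __name__ == "__main__":
    for pair in ([8, 9], [10, 11]):
        print(pair, "->", hex(affinity_mask(pair)))
```

For CPUs 8 & 9 the mask works out to 0x300, and for CPUs 10 & 11 it is 0xc00.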

0 Kudos
Norman_Trashboat
3,416 Views

Your mistake was, like mine, "upgrading" to this overvolted underbinned cashgrab.

 

You have my condolences.

Keean
Novice
3,404 Views

My conclusion is that the 14900KS cannot run two demanding control-flow threads (like a compiler, which does not use floating-point, SSE or AVX instructions, just regular x86_64 control flow, logic and integer maths) on the same P-core at full speed without crashing.

Really, Intel Turbo Boost needs to reduce the max clock speed when two threads are active on the same P-core.

Reducing the power limit is a workaround that will only work some of the time. It may reduce the frequency of crashes, but won't stop them happening.

Thread Director in Windows is doing something similar, by trying to only use one hyper-thread at a time on P-cores, until overall package power consumption is throttling the whole chip on the power limit enough that two threads won't push the core too hard.

So the only fix that will really result in a stable CPU is either to disable hyper-threading, or to reduce the max clock speed enough that two threads can run at 100% on the same core without crashing. With this done it should be possible to remove the IccMax and power limits and still have a stable CPU.

0 Kudos
redteam
Beginner
2,140 Views

I don't understand why I'd spend $800 on a CPU only to find it can't run stable.

 

It's shocking that on my first visit to this community, I've encountered so many people facing the same issue. It might be time to switch to the red team, huh? 

0 Kudos
Keean
Novice
2,117 Views
I downloaded the new (3501) ASUS BIOS for my motherboard, and applied the Intel Baseline settings. It does seem to reduce the frequency of random crashes, but it does not eliminate them.

If I run a long GCC compile task it will eventually fail.

If I bootstrap GCC with 3 worker threads and set the affinity to vCPUs 8 & 9 (both in the same 'preferred' P-core), it still fails more or less instantly; the baseline settings appear to have made no improvement to this at all.
0 Kudos
RamyerM_Intel
Moderator
3,281 Views

Hello Keean, 


Thank you for posting in the communities and for sharing a detailed description of the issue that you are experiencing. I do want to address your concern regarding the RMA. The Intel Processor Diagnostic Tool is a good way to initially diagnose the CPU, but we do not base your replacement on this tool alone. I can assure you that we value your satisfaction, and we aim to provide the best of our services. If the CPU is in need of replacement, we will certainly help you with the process. For now, it is best to continue identifying the issue with the 14900KS that you have. Since this might need a thorough investigation, it is best if we continue our conversation by email. I will be sending you an email within the day. Please check your inbox.


As for Norman_Trashboat, I can see that you have some dissatisfaction with your current system. If you are encountering issues with your own unit, we highly recommend creating a new thread so we can give you the full focus of our support. 


Ramyer M.

Intel Customer Support Technician 


0 Kudos
Norman_Trashboat
3,236 Views

I created a thread detailing my disappointment with the binning I received. Considering the IMC, P-cores, and E-cores are all weaker on my 14900KS than on the 14900K I gave to my brother, I'm pretty upset. I can't even POST with XMP on my 8000 MT/s RAM kit and have had to fall back to JEDEC on a Z790 Apex Encore. I also contacted the Intel RMA email and have yet to receive a response.

Intel has seriously ruined its reputation and needs to do something. I can't justify defending this company anymore.

0 Kudos
Keean
Novice
2,650 Views
@RamyerM_Intel I seem to be getting a reply from the System Account when replying to the issue by email, saying "Your Email To Intel Customer Support Has Not Been Delivered". Looks like the case has been closed or something? Please advise.
0 Kudos
Norman_Trashboat
2,028 Views

They're stonewalling warranty claims by blaming motherboards. My advice is to file a pro se small-claims case against Intel in the county of purchase.

 

My CPU got "stable" at 5.8 at some rather high voltages. My 14900K was a much better bin overall. This SKU is an insult to their most loyal customers.

 

0 Kudos
Keean
Novice
1,814 Views

After the above, I had one reply to my support ticket to say they were still investigating and to be patient, and then radio silence.

After a lot of testing, I have found that the motherboard power limits don't solve the issue, and neither does the Intel Baseline Profile in the latest BIOS. I was able to get my 14900KS stable at 5.8GHz on stock voltages (ASUS LLC4), which is what I am using now. I could get 5.9GHz stable on LLC6, but I'm not sure it's good to run it like that all the time. I was not able to get it stable at 6GHz and above (basically, I could not get the preferred cores to boost any faster than the other cores without stability issues). I was also able to get it stable all the way up to 6.2GHz with hyper-threading disabled and LLC3.

0 Kudos
Keean
Novice
1,725 Views
Intel support have contacted me to say:

> your processor appears to be functioning optimally

So apparently 5.8GHz is the max stable clock speed we can expect under Linux at stock settings (Asus LLC3, Asus optimisations disabled, performance or extreme power profile) when compiling.
0 Kudos
Norman_Trashboat
1,695 Views

Intel had no business with this SKU simply put. They really didn't have any business with the 14900K either...

 

We've learned our lesson, you've earned your reputation, Intel.

 

Long gone are the days when I'd purchase a 990X and push it from 3.73 to 5GHz on air... Now I buy the best you have to offer and have to underclock and overvolt it to get stability. Thanks Intel.

 

At least they denied my warranty because I had the audacity to try and cool it properly and delid mine. Your lack of a claim is only fuel to the fire of a class action lawsuit.

 

Great showing, Intel!

0 Kudos
ch94
Beginner
1,655 Views

Thanks for posting this and sharing the updates. My 14900KS (102 SP, 115p/77e) was also unstable at stock 5.9/6.2 frequencies at various reasonable (10-150mV) offsets at the 5.9/6.2 points on the V/F curve. I tried resetting everything to motherboard default settings, and applied only 307/400A ICCmax, 150/253W PL1/PL2, and Typical/Trained SVID behaviors, and still experienced strange 3DMark Timespy/Firestrike crashes.

 

My preferred cores are also 4/5, with 0 trailing closely, so I downclocked to 60x2/59x3/58x8 with per-core ratios of 59 on P0, 60 on P4/P5, and 58 on the rest, and applied LLC4 with an undervolt of AC_LL/DC_LL = 0.1/1.03 (DC_LL tuned for my board so that CPU Package Power matches my power limits when the CPU power throttles) and +35mV offsets at VF 8/9/10 (5.9 and both 6.2 points). This configuration has been stable for me on all benchmarks and stress tests, but it's a shame I had to both downclock and undervolt to (1) achieve stability and (2) recover some of the performance lost from the downclock.

 

After seeing your notes about hyper-threading being the most likely culprit on your P4/P5 cores, I'm going to try disabling it and going back to 62x2/59x8.

0 Kudos
Keean
Novice
1,621 Views
Interested to know how you get on.

Unfortunately for me even a single core boosting above 5.8GHz is a problem.

If I run an 'unzip' or 'gcc' compile with the thread affinity set to both hyper-threads in the same core, I can still get it to crash within a few minutes of load.

I suspect that something like 59x2/58x8 might be stable. However, as soon as I set any core ratio limits, the ASUS BIOS disables TVB (and enables overclocking TVB) with no setting for me to override this, which costs me another 100-200MHz of stability...
0 Kudos
ch94
Beginner
1,583 Views

I've turned HT off on all P-cores and settled in on 61x2/59x8. Was still getting occasional crashes at 6.2 and I didn't want to push the 1.53V+ required for 6.2. As this machine is primarily used for gaming and light video editing + development (C-family and some Python), I've resolved to eat the loss of HT and run with what seems like this stable config. I still have a moderate under-volt applied (AC_LL=0.08mOhm at LLC 4 corresponding to DC_LL~1.03mOhm) at the SVID level and kept small positive offsets (20mV/45mV respectively) at 5.9/6.2 points on the V/F curve.

 

In the last day of using the computer Vcore has peaked at 1.501V, which is around what the 61x cores need at light loads, so I'm going to leave it alone and start actually using the machine instead of tweaking it at this point. Pretty happy with the performance; HT off is a marginal loss to small gain depending on what I'm doing. If I were to turn it back on I'd probably go back to 60x2/58x8 though.

 

Have you tested 61x or 62x with HT off? I saw you mention that in one of your updates but I don't think you tested and mentioned how it went if you did.

0 Kudos
Keean
Novice
1,567 Views

I got 6.2x2/5.9x8 @ LLC3 (everything else auto/default except disabling ASUS enhancements) stable at extreme power profile with hyper-threading disabled.

With hyper-threading enabled, I got 5.9x8 @ LLC5 and 5.8x8 @ LLC3 stable, both at the extreme power profile.

The power profile appeared to do nothing really to stabilise the chip, it just appears to need more voltage, or lower temperatures or lower frequencies.

The chip can seem "generally stable" for normal use, but will crash more or less instantly if you run a multi-threaded 'unzip' or 'gcc compile' with affinity set to both hyper-threads in the same preferred P-core, which for me as a developer is not stable enough.

Intel basically recommend I try under-volting.

0 Kudos