
14900KS unstable

Keean
Novice
I have a new 14900KS installed on an ASUS Pro WS W680-ACE motherboard with 64GB of DDR5-5600 ECC (Kingston), and I am testing on Gentoo Linux using:

taskset -c 0-15 emerge -e @World

This recompiles the whole system using just the P-cores; it takes half a day to a day to rebuild ~1400 packages.

I have rasdaemon running to log hardware errors.

With the performance profile (IccMax=307A, PL1=253W, PL2=253W), the CPU is unstable with anything less than VRM load-line calibration level 6 (ASUS BIOS).

Interestingly, it is also stable at LLC6 in the extreme profile (IccMax=400A, PL1=320W, PL2=320W).

When using a lower load-line level (tested from the motherboard default of 3 up to 5), rasdaemon shows that the errors are consistently on CPU 0x8, and are either instruction-fetch failures from the L0 instruction cache or TLB errors.
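If anyone wants to check their own logs the same way, rasdaemon's ras-mc-ctl tool can dump the recorded machine-check events. A rough sketch (the sqlite path and table name are whatever your rasdaemon version uses, so treat them as assumptions):

# Summary of everything rasdaemon has logged, then the full error records:
ras-mc-ctl --summary
ras-mc-ctl --errors

# Or count machine-check events per logical CPU straight from the database:
sqlite3 /var/lib/rasdaemon/ras-mc_event.db \
  'SELECT cpu, COUNT(*) FROM mce_record GROUP BY cpu;'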

I previously had a 13900KS which ran fine with unlocked power limits (IccMax=511.75A, PL1=4095W, PL2=4095W).

I have a pretty good water-cooling setup (6×120mm shared between CPU and GPU, but the GPU is idle in all these tests). Water temperature is 31-32°C once warmed up, and stays there for the duration of the test; room temperature is about 25°C.

- Am I right in assuming that CPU=0x8 on all these errors means that P-core 8 might be "bad"?

- Is needing load-line calibration level 6 to get the CPU stable usual, and/or something to worry about?

Thanks for any help you can offer.
Keean
Novice
I tried under-volting as suggested by Intel, and it did not improve stability; in fact, as expected, with lower voltages I had to limit the frequency further to keep it stable.

I also tried over-volting. This was interesting because I did get it stable with 6.2GHz enabled at a +250mV offset. However, it was overall the same speed or slower, as the single P-core being used was hitting 100°C and thermally throttling, though not crashing.

So to get the chip stable during its brief spike to 6.2GHz, we end up thermally throttling after a few seconds to 5.8GHz, resulting in overall lower performance than just limiting the clock speed to 5.9GHz.

So here is where I am at:

- limiting the frequency to 5.9GHz (@ ASUS LLC5) gives the best all-core performance

- disabling hyper-threading (@ ASUS LLC3) gives the best single-threaded performance.

Neither over-volting for a higher clock speed nor under-volting to reduce temperature results in better performance than the above.

The only remaining possibility I can think of to improve on the above would be better cooling. As my water loop is only at 32°C, it's not limited by the fans or the radiators, which means de-lidding the CPU is the only way to improve things.
ch94
Beginner

Undervolting would do the opposite of stabilizing the system; I'm surprised you were recommended that, since it seems clear that you want to make the system stable. That is a substantial over-volt, and I would expect it to thermally throttle quite quickly on those cores.

 

You're not touching AC_LL or DC_LL in the internal power management, right? Just letting SVID Behavior set them for you? Which setting do you have that on? Depending on the setting, ASUS motherboards undervolt to some extent.

 

I don't think you're going to be able to squeeze much more performance out of those cores, unfortunately. Have you tried disabling HT for just those two cores and leaving it on for all the others? That should allow you to keep most of the performance while also somewhat taming temperatures and power consumption.

Keean
Novice

I have asked Intel how they recommend under-volting, since they suggested it.

My motherboard (W680) does not appear to undervolt. Auto settings result in an AC/DC LL of 1.1/1.1 on LLC3.

What I tried was reducing the AC and DC LL (as Intel say they should be set to the same value) to, say, 0.2/0.2 with a standard LLC like 3 or 4.

I was not aware HT could be disabled on individual cores; it's not something I can do with this BIOS.

I think there is an easy solution for Intel, and that is to limit P-cores with both hyper-threads busy to 5.8GHz and allow cores with only one hyper-thread active to boost up to 5.9/6.2GHz. They would then have a chip that matched the advertised multi-core and single-threaded performance, and would be stable without any specific power limits.

 

I still think the real reason for this problem is that hyper-threading creates a hot-spot somewhere in the address-arithmetic part of the core, and this was missed in the design of the chip. Had a thermal sensor been placed there, the chip could throttle back the core ratio automatically to remain stable; or perhaps the transistors needed to be bigger for the higher current, though I'm not sure that would solve the heat problem. Ultimately an extra pipeline stage might be needed, and this would be a problem because it would also slow things down when only one hyper-thread is in use. I wonder if this has something to do with why Intel is getting rid of hyper-threading in 15th gen?

ch94
Beginner

DC_LL should not be changed, as it should be tied to the LLC value that you're using. Best to leave it on auto, as the motherboard will synchronize the value with your chosen LLC. You would reduce AC_LL in that case to properly undervolt. Having AC_LL == DC_LL is going to result in your VRM delivering quite a bit of voltage to your CPU.
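Roughly, as I understand the SVID model (simplified, and the exact conventions vary by vendor, so take this as a sketch rather than gospel):

V_request = VID + AC_LL * Icc          (what the CPU asks the VRM for)
V_die     = V_request - R_LLC * Icc    (after the VRM's load-line droop, set by LLC)
V_assumed = V_request - DC_LL * Icc    (what the CPU uses for its power telemetry)

So AC_LL raises or lowers the voltage actually delivered under load, while DC_LL only needs to match the VRM's real load-line (your LLC setting) so that the reported voltage and power are accurate.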

 

I'm not 100% sure if per-core HT control is innate to the CPU or the motherboard; it would make more sense for it to be a characteristic of the CPU, imo, so perhaps you should check your BIOS version and/or get in touch with the BIOS developers for your board.

 

I'm not sure how technically challenging it would be to implement the solutions you've outlined, though they sound like reasonable ways to address the problem.

Keean
Novice

I thought the same about DC_LL, but Intel's latest guidance on 13th/14th gen stability (May 8th) is that DC_LL == AC_LL. AFAIK only AC_LL affects the voltage delivered; DC_LL affects the power calculation.

Perhaps there is some benefit in under-volting, but it would have to be combined with reducing the max core frequency. As the CPU thermally limits on all-core loads, this would improve all-core performance, as measured by work done, rather than improving the max frequency. It would be sacrificing more single-threaded performance for all-core performance, though.

An interesting thought: if under-volting works, why isn't this reflected in the VID table? For example, if we set the frequency limit at 5.9GHz and apply the voltage necessary for this to be stable (LLC5), the core throttles at 5.6GHz on sustained all-P-core loads. If we limit at 5.8GHz with the voltage required for this to be stable (LLC3), the core throttles at 5.7GHz. When the core hits the thermal limit, why doesn't the VID reduce the frequency and voltage? Then under-volting would not provide any performance advantage.

ch94
Beginner

I'll have to turn on Virtual Machine Platform and do your CPU-affinity test with the GCC emerge later -- I started testing with HT on again but with just the two "preferred" cores limited to 6.1, which has been promising so far. I replied in your other thread about the undervolting recommendation and results.

Keean
Novice

I did some tests with reducing the voltage. The results are that limiting the CPU to 5.8GHz is optimal for all-core performance. The increased voltage necessary for stability at 5.9GHz results in more thermal throttling, so the overall performance is worse. The decreased frequency of 5.7GHz is already slow enough that the decrease in voltage does not result in any better performance; it is slower than 5.8GHz.

I then tried to optimise 5.8GHz, for which I was using LLC3. Unfortunately LLC2 was not stable, so I am already using the lowest stable load-line calibration. I then bisected the AC/DC load-line value; whilst it is stable at lower values like 0.5/0.5, it runs at half the speed, so that's no good. The lowest stable (full-speed) value for AC/DC was 0.83/0.83. This was slightly faster than 1.0/1.0, but both were slower than leaving AC/DC on the motherboard's auto setting, so I think this is all just noise and there is no benefit to this degree of slight undervolting. It seems sensible to leave it on Auto/Auto.

 

So for my chip, the best multi-core performance turns out to come from limiting the P-cores to 5.8GHz @ LLC3.

 

herbms
Beginner

Really glad I found your series of posts here – this is very similar to what I'm seeing with my i9 14900K. I don't have any interest in gaming or overclocking, so this is a stock build without any attempt to push the processor beyond what the BIOS is doing by default. I'm running Windows 11 and use the PC exclusively for C++ software development. For the first few months the processor was stable, but I'm now getting multiple random clang compiler crashes that go away after retrying.

I'm now considering buying a new system. My work cannot withstand the downtime of taking out the CPU and doing an RMA exchange.

 

In my case, compiling a codebase like Chromium from scratch with a pristine, known-good git checkout has a 100% chance of a clang ICE. The stack traces from these crashes never make any sense either – that's made it hard to narrow down. The clang crash report might show invalid syntax encountered while parsing some C++ AST, yet the compile succeeds on retry. It's stochastic in nature – never the same error twice, nor with the same file. I also see crashes in Python scripts that run as part of the build. I tried compiling the same project on Ubuntu and saw the same results.

 

I've upgraded the BIOS (I'm using an ASUS ProArt Z790), enabled the "Intel Baseline Profile", and tried setting PL1 to 125W and PL2 to 253W (the stock limits), and I'm still seeing it.

 

Your theory about a potential bug in hyper-threading is the most plausible explanation I've encountered for this stability issue. Normal stress tests like Prime95, MemTest86, and a collection of tests on my NVMe drive all come back clean. I suspect the all-out workload of many competing processes scheduled by a parallel build system such as Ninja flushes out issues that the standard stress tests do not.

 

I've just configured a frequency limit of 5.6GHz – I need something that works at this point. On the first few compiles things are looking good. I will update with any other findings.

 

I was trying to follow your series – are you still running with Hyper-Threading disabled, or just the frequency limit?

Keean
Novice

I found that power limits were mostly irrelevant; once I had the individual cores stable, I could increase the power limits without problems, as the CPU will just thermally throttle anyway.

My final stable config was x58 with LLC3 and the extreme power profile, which maximises multi-threaded performance. (I could get x59 stable with LLC5, but the extra heat resulted in slower compile times.)

The other was to just disable hyper-threading, leave the frequency unlimited, extreme power profile, LLC3. This maximised single-threaded performance.

As I mostly use the machine for software development, I am using the first config for optimal multi-threaded performance.

With these settings it seems completely stable. I would recommend testing each P-core individually, though, by running a multi-threaded compile and setting the thread affinity to each P-core. Because there are two vCPUs in each P-core, that means pairs like 0 & 1, 2 & 3, 4 & 5, etc.

I have found that if you are right on the edge of stability, it's also worth running pairs of P-cores and sets of 4, as well as all 8.

I have written a testing script that bootstraps GCC using the following vCPU sets (a rough sketch of the idea follows the list):

- single p-cores: 0-1, 2-3, 4-5, 6-7, 8-9, 10-11, 12-13, 14-15
- 2 p-cores: 0-3, 4-7, 8-11, 12-15
- 4 p-cores: 0-7, 8-15
- all p-cores: 0-15
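Something like this (a sketch of the approach, not my exact script; it assumes a GCC tree already configured in /tmp/gcc-build, so substitute whatever compile job you trust to load both hyper-threads):

#!/bin/bash
# Run the same multi-threaded compile pinned to each vCPU set in turn;
# a crash or a new rasdaemon MCE then identifies the failing P-core(s).
for cpus in 0-1 2-3 4-5 6-7 8-9 10-11 12-13 14-15 0-3 4-7 8-11 12-15 0-7 8-15; do
    echo "=== testing vCPUs $cpus ==="
    make -C /tmp/gcc-build clean >/dev/null    # start each run from scratch
    taskset -c "$cpus" make -C /tmp/gcc-build -j16 bootstrap || {
        echo "FAILED on vCPUs $cpus"
        exit 1
    }
done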

As the problem is specific to hyper-threading, you don't need to test e-cores.

That set of tests takes about 8 hours to run, but if a config passes, it seems to be completely stable.

Have you updated the BIOS at all? I wonder if motherboard vendors have changed the settings (under-volting) to gain performance?

I am not sure I believe that there is silicon degradation going on. I guess it's possible, but I have had two 14900KS chips that have both behaved like this from brand new, and I now suspect my 13900KS had the same problems, just happening much less often, and I blamed the software. I haven't seen any thorough testing that passed on a new CPU and then failed on the same CPU after months of use.

herbms
Beginner

Huge kudos for this find.

 

Prior to finding your post, I was trying extremely conservative power limits and capping everything at the Intel specifications, but I couldn't get 20% into any build that fully saturates all cores.

 

So one of the things I find curious: to try to get stability at any cost, I had been running PL1=125, PL2=253, and IccMax=307 (Intel specs for the 14900K) for a few weeks. I don't think I ever saw the frequency on any core hit anywhere near 6GHz (I generally see ~5.0GHz under sustained load), yet I was still experiencing a lot of instability. Perhaps when a new compile workload launches there's a momentary frequency spike that falls outside my ~1s sensor sampling in XTU.

Do you have any ideas?

 

I've now settled on x57 and I haven't experienced any instability since. I've done 5-6 clean Chromium builds (3-4 hours each) on this machine since then, and 100% of them ran to completion without a hitch.

 

I did upgrade the BIOS within the last few weeks in an attempt to resolve this. ASUS released one that allows you to enable the "Intel Baseline Profile", but that didn't help at all.

 

So if it's useful for others, my setup is basically:

  • Reset BIOS to defaults, configure XMP
  • P-core ratio capped at x57, all-core.
  • E-core ratio capped at x44, all-core.

That's it! Everything runs great now.
Keean
Novice
I think the frequency reporting has a fairly low sample rate, but the CPU can change P-state quickly, so it was boosting up to high clock speeds and crashing too quickly for you to see. It might look power-limited to 5GHz, but that's when all cores are running. At max frequency (with both hyper-threads) a P-core uses about 60-70W and the package baseline is about 10W, so let's call it 10W plus 50W for each P-core under all-core load. All eight P-cores running flat out would then use about 410W, so setting a power limit of, say, 250W is going to throttle all-core loads; but when only 4 P-cores are active it does nothing (4*50+10 = 210W). So with only 4 P-cores running, they can go as fast as the thermals allow.
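If you want to catch those transients on Linux, turbostat (from the kernel tools) can sample much faster than a ~1s GUI poll. Something like this is what I'd try (the exact column names can vary a little between turbostat versions):

turbostat --interval 0.25 --show Core,CPU,Bzy_MHz,CoreTmp,PkgWatt

Watch Bzy_MHz per core while a compile starts up and you should see the brief excursions to max boost that a slower sampler misses.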

Unlike synthetic benchmarks, which run multiple threads for the same length of time, starting all the cores together and having them all finish together, compiling runs multiple single-threaded tasks of different lengths at the same time. This means that when compiling a large program, cores are starting and stopping all the time; in those transient conditions there is a finite probability that fewer than 5 cores are active with at least one of them running two hyper-threads, which allows it to boost up to max frequency and crash.

If you were to use set-affinity to restrict the compile to each P-core, you could see the crash much quicker and identify which P-core(s) are causing the problems. I found some of my P-cores were stable at 5.9GHz (14900KS), but the BIOS doesn't provide a way to set frequency limits on individual P-cores, only by the number of active cores.
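One OS-level workaround for that BIOS limitation: on Linux with the intel_pstate driver, each logical CPU exposes its own scaling_max_freq in sysfs, so you can in principle cap just the weak P-core. A sketch (needs root, and it assumes vCPUs 8 and 9 are the pair on the suspect core):

# Cap one physical P-core (both its hyper-threads) to 5.8GHz; value is in kHz.
for cpu in 8 9; do
    echo 5800000 > /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_max_freq
done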
RamyerM_Intel
Moderator

Hello Keean,

 

I assure you that I am right here and I want to help you with your Intel® Core™ i9 processor 14900KS (36M Cache, up to 6.20 GHz). I have sent you an email continuing our conversation; please check your inbox for more details. I would like to apologize for the delay in my response, and I hope you allow me the opportunity to make it up to you.

 

As for Norman_Trashboat, I understand that you are frustrated as the warranty claim for the 14900K has been denied. Seeing that you have delidded the CPU, I want to set the expectation that physical damage can indeed void the warranty of the CPU. You may also visit this article for more information: Warranty Guide for Intel® Processors

 

 

As for ch94, I can see that you are actively engaging with Keean in this thread. Feel free to do so, as it is the best way to empower our community. However, you may also create a new thread so we can tend to your concern separately.

 

Ramyer M. 

Intel Customer Support Technician 

 

Keean
Novice
It appears that even at 5.8GHz the CPU has become unstable again. I'm not sure whether this is due to 'degradation' (electromigration results from a combination of high heat and high current), or just because warmer weather here has reduced cooling efficiency.

As I have Intel-approved current limits set, my concern is that any electromigration could be caused by current density/heat in individual cores when hyper-threading. It's possible the only safe option is to disable hyper-threading, as no matter what power limits you set (above about 60W), it's possible for all that power to go to a single P-core.

If a single P-core with both hyper-threads fully loaded has a high enough current density at 100°C to cause electromigration, then there is no way to limit the current per core. So there are three options to mitigate this: reduce the max core frequency, disable hyper-threading, or reduce the max temperature. I tend to think thermal throttling is slow(er) to respond, so it is not a good option.
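For what it's worth, the first two mitigations can be tried at runtime on Linux without touching the BIOS (needs root; per-core frequency caps via scaling_max_freq were sketched earlier in the thread):

# Disable hyper-threading system-wide until the next reboot:
echo off > /sys/devices/system/cpu/smt/control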

It is interesting that Intel's 15th gen appears to have both a lower max clock speed (5.5GHz is rumoured) and no hyper-threading.

For now I have reduced the frequency to 5.7GHz and it is stable again; however, at this point it's barely faster than the 13900KS it replaced.