
The fault is not in your processor, but in your motherboard

AlHill
Super User
6,017 Views

Are you one of those who are constantly complaining about 13th and 14th gen stability and BSODs?  Well, read this:

https://videocardz.com/newz/asus-adds-intel-baseline-profile-to-its-z790-motherboards-amid-core-i9-stability-issues

 

AlHill_0-1713872110993.png

While this specifically mentions ASUS, MSI and Gigabyte are in this mess also.

 

Doc (not an Intel employee or contractor)
[If you find any Intel driver you might need, download and save it now.]

0 Kudos
61 Replies
KrissyG
New Contributor II
2,988 Views

Actually, ASUS did a very good job of turning their motherboards into an AMD BBQ, frying AMD CPUs, often without destroying the board itself.
There are videos of it.

As for motherboards made for Intel CPUs, the story started about a year ago; the ASUS forum and Reddit are full of such topics.

However, what the linked page suggests is setting limits on an i9 14900KS.

The CPU should work with no limits, since it was made for this. So now the question: if the motherboard by default runs with no limits, why is the system unstable if the CPU was made to run with no limits as well?

I do not see this as a solution; it literally turns an i9 into an i7.

0 Kudos
RyanFeeko
New Contributor I
2,980 Views

Meaning to say the CPU is not the issue, and the mobo just cannot keep up with the CPU's high performance.

Just wondering whether that is what makes the CPU unstable and unoptimized.

0 Kudos
KrissyG
New Contributor II
2,975 Views

hmm

1. The CPU makes the system crash when overclocked.

2. The CPU at -200 degrees does not make the system crash when overclocked.

If the motherboard were the only issue, the system would not boot regardless of what temperature the CPU is at.

The CPU takes too much power, which makes it unstable.

0 Kudos
RyanFeeko
New Contributor I
2,964 Views

Gotcha!

Thank you for the clarification.

If the CPU consumes so much power, can the mobo set a limit, or is there a BIOS setting for it?

0 Kudos
KrissyG
New Contributor II
2,942 Views

Obviously, all hardware components have a limit. Intel provides specs to motherboard manufacturers on power, voltage, etc.

So you check the specs for an overclockable CPU and it says 125W but also 253W, and you think, OK, I guess it will draw 253W at peak... except that on a Z-series motherboard, its power limit will be somewhere between 400W and maybe 700W peak.

A CPU with air cooling, however, has different limits than one with water cooling, and very different limits than one with liquid helium cooling.
That is why motherboards have, for a while now, had options to choose power profiles.

The CPU has the lower limit, and if it's a K version of the CPU... it is designed to run with no limits... right?

A B760 Gigabyte motherboard is not for overclocking, so I was not expecting the CPU to draw a lot of power.
By default it allowed the CPU to draw 250W.

However, the peak power that my motherboard can deliver is not 250W; I have seen it draw slightly over 300W.
Funny, because the XTU graph can show a max of 300W. So here is the result of a short test where the CPU did not even reach its specified frequency, but went over the top on power usage:

KrissyG_0-1713938972635.png


To put it simply, AMD forced Intel to use ridiculous means to achieve the same performance, except that the AMD CPUs that beat Intel in benchmarks and games use less power than Intel does. At the same power consumption, AMD annihilates Intel CPUs, a total reversal of the situation in 2010.

0 Kudos
Keean
Novice
2,940 Views

I disagree that motherboard power limits are the problem. I don't believe motherboard power limits will protect your CPU. I think damage is caused by a hot-spot which is caused by a combination of core-temperature, core-voltage, and instruction pattern of the code being executed. So damage can occur even when a single p-core is active.

Motherboard power limits will limit all-core performance, but they won't help if only a few cores are active. A limit of 253W is going to result in about 30W being available for each p-core if all 8 are loaded (on an i9); however, if only 4 are loaded they each get about 60W, if only 2 are loaded they each get about 120W, and if only 1 is loaded it gets about 250W. Let's say that at 100°C 45W of power is enough for a core to have errors; then you can see the 253W power limit only really works when all p-cores are active.
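The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the even-split reasoning in this post: it assumes the whole package limit is shared equally among the loaded p-cores (ignoring e-cores, uncore and idle power), and the 45W error threshold is the hypothetical figure from the post, not a measured value. The function names are made up for this sketch.

```python
def per_core_budget(package_limit_w: float, active_p_cores: int) -> float:
    """Watts available to each loaded p-core if the package limit is split evenly."""
    if active_p_cores < 1:
        raise ValueError("need at least one active core")
    return package_limit_w / active_p_cores


def limit_protects(package_limit_w: float, active_p_cores: int,
                   error_threshold_w: float = 45.0) -> bool:
    """True if the even-split budget stays below the hypothetical per-core error threshold."""
    return per_core_budget(package_limit_w, active_p_cores) < error_threshold_w


if __name__ == "__main__":
    # Walk through the cases from the post: 8, 4, 2 and 1 loaded p-cores at 253W.
    for cores in (8, 4, 2, 1):
        budget = per_core_budget(253, cores)
        print(f"{cores} loaded p-cores: {budget:.1f}W each, "
              f"protected={limit_protects(253, cores)}")
```

Under these assumptions only the 8-core case stays below the threshold, which is the point being made: a package limit says little about what a single loaded core may draw.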

When the CPU is designed, potential hot-spots are identified, and thermal sensors positioned so the core can be throttled before they get hot enough to cause errors or damage the core. It seems like one hot-spot was missed in the CPU design related to hyper-threading and integer address arithmetic.

So any fix for this has to cope with situations when only a few p-cores are fully loaded. From my testing the only solutions are to limit max core ratio and optionally max core temperature. Limiting max core temperature as well lets you run a slightly higher core ratio, boosting single-core performance, but will limit all-core performance more.

It might be possible to fix this by Intel adjusting the turbo boost algorithm, by limiting the max boost if both hyper-threads in a p-core are active, this would recover the single-threaded performance lost by limiting the core ratio limits. However I don't know if this is possible with a firmware change only.

0 Kudos
KrissyG
New Contributor II
2,934 Views

@Keean wrote:


When the CPU is designed, potential hot-spots are identified, and thermal sensors positioned so the core can be throttled before they get hot enough to cause errors or damage the core. It seems like one hot-spot was missed in the CPU design related to hyper-threading and integer address arithmetic.


Intel says 100°C is not only the max temperature, but also that it is fine if the CPU runs at that temperature.

That is a hint. Also, at 40W a core would draw 28A (at roughly 1.4V), and at 120W it would draw 85A; there is no way any spot on the die can handle that much power.


0 Kudos
Keean
Novice
2,911 Views
To the first point, a core is not a single thing, different parts can be different temperatures, so thermal sensors need to be positioned at the hottest points. If they are not, some part can be hotter than 100°C even if the sensor says the core is cooler than that.

To the second point power limits reflect the available power, not the power draw. The point is the core will scale frequency based on available power and thermal headroom. In this case because it gets the temperature wrong it's scaling too aggressively. You can't control this with package power limits, you would need a per-core power limit, which is something we don't have.
0 Kudos
KrissyG
New Contributor II
2,867 Views

@Keean wrote:
....a core is not a single thing....

Yeah, that is why the specs say that a CPU has, for example, 20 cores, and every single one of them has at least one temperature sensor.
The highest temperature on any core will trigger thermal throttling for all cores (although you can set Tmax for each core, I guess). The resistors used for thermal sensing are on the bottom of the die, which is where the pins are, so the IHS has a different temperature than the sensors show; none of the sensor readings will match the IHS temperature when under load.


@Keean wrote:

To the second point power limits reflect the available power, not the power draw. 

I am not talking about the CPU being idle, but under full load, which will then be at the TDP limit, because that is when the BSOD, thermal throttling and other unwanted stuff happens.


@Keean wrote:
The point is the core will scale frequency based on available power and thermal headroom. In this case because it gets the temperature wrong it's scaling too aggressively.

"Scale frequency based on available power and thermal headroom" = thermal throttling; that is the word you are looking for.

So the planned frequency for the core will be the maximum that has been set, not anything else; only after reaching Tmax will it get throttled down. You can observe it on any graph that shows frequency for the cores.

And as I said, the temperature sensor sits on the opposite side from the IHS, so it actually shows the hottest spot on a core; there is no better placement than the side that is not being cooled.
If you have a CPU that has an integrated GPU, then the temperature on that GPU (if the GPU is unused) will be the lowest temperature on the whole CPU... and that will in fact be the IHS temperature, as seen here on the graph:

KrissyG_0-1713953877585.png

 




@Keean wrote:
You can't control this with package power limits, you would need a per-core power limit, which is something we don't have.

Hmm, in a single-core scenario, you can set a power limit. Remember, P = I x U: power equals current multiplied by voltage, and you can set both.
So in a single-core scenario, you set the voltage to 1.4V on that specific core and ICCmax to 10A = 14W.
14W is then the power of the whole CPU package, since only one core is active.

And it makes no sense to set a power limit per core; they shuffle the load in order to optimize the distribution of heat, so hitting a higher load for a shorter period of time is better than a constant load on all cores... that is how you get better average performance.



0 Kudos
Keean
Novice
2,762 Views

> Highest Temperature on any core will trigger thermal throttling for all cores.

Recent CPUs throttle each core independently. Even if you only set one Tj Max, each core can have a different temperature. It's even more complex with Thermal Velocity Boost, as max boost is limited by -1 at 60°C and -2 at 85°C. Some overclocking motherboards even let you change the temperature inflection points and core ratio offsets on a per-core basis... Intel XTU does this if the CPU and motherboard support it. 13th & 14th gen rely on scaling the frequency of each core separately, dependent on available power and thermal headroom, to maximise performance.
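A rough sketch of the temperature-dependent ratio offsets described above, in Python. It only encodes the two inflection points quoted in this post (-1 at 60°C, -2 at 85°C); treating the thresholds as inclusive and the -2 as the total offset rather than an additional step are assumptions, and the function name and the example ratio of 60 are hypothetical. It is not a model of Intel's actual boost algorithm.

```python
# (temperature in deg C, core ratio offset), checked from hottest to coolest.
# Values are the ones quoted in the post; inclusive thresholds are an assumption.
TVB_STEPS = [(85.0, -2), (60.0, -1)]


def tvb_limited_ratio(max_ratio: int, core_temp_c: float) -> int:
    """Return the highest core ratio allowed at the given core temperature."""
    for threshold_c, offset in TVB_STEPS:
        if core_temp_c >= threshold_c:
            return max_ratio + offset
    return max_ratio  # below every inflection point: full boost ratio


if __name__ == "__main__":
    # Example with a hypothetical max ratio of 60 (6.0GHz at a 100MHz base clock).
    for temp in (45, 60, 72, 85, 99):
        print(f"{temp} deg C -> ratio {tvb_limited_ratio(60, temp)}")
```

The point of the sketch is that the allowed frequency already steps down well before Tj Max, per core, so "full speed until Tmax" is not the whole picture.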

 

> none of the sensors reading will match IHS temperature when under load.

You can read the individual core temperatures in HWiNFO, OCCT or another monitoring tool. Overall package temperature does not mean much and is not really used any more. When we talk about Tj Max it is the "junction" temperature that is limited, the actual temperature measured on the CPU by embedded sensors, not the IHS temperature. When the chip is designed, temperature sensors are placed at the points expected to be the hottest parts of the chip under load. This is not an exact science, and requires the designer to choose the places to put the sensors. In the case of the 13th and 14th gen I think they missed a hot-spot in the instruction fetch or address arithmetic part of the core, which is why it can get too hot without thermally throttling the core.

> I am not talking about CPU being idle, but under full load, which will then be at the TDP limit, because then the BSOD, thermal throttling and other unwanted stuff happens.

Whilst overheating the whole CPU can cause problems, this particular issue that is causing shader compilation errors and game crashes in 13th & 14th gen CPUs can happen with only one or two cores loaded. Motherboard power limits can prevent the cores overheating under full load, but cannot protect when say only half the cores are loaded.

> So the planned frequency for the core will be the maximum what has been set, not anything else, only after reaching Tmax will it get throttled down. You can observe it on any graph that shows frequency for the cores.

This is not true since Thermal Velocity Boost was introduced.

> And as I said, the temperature sensor sits on the opposite side as the IHS, it actually shows the hottest spot on a core, there is no better placement than the side that is not being cooled down.

There is not a single hot-spot but multiple. The FPU can get hot, the SSE ALU can get hot; there are multiple points where the chip can get hot, and which point is hottest depends on the code being executed. A tight loop running SSE code will heat up the SSE ALU, floating point maths the FPU, so there needs to be a sensor in each potential hot-spot. It is not obvious exactly where the hottest point of the SSE ALU is, for example; the designer will place the sensor based on "guesswork" and experience, where they think it will be hottest, or where the simulator suggests it will be hottest using certain testing instruction patterns.

> So in a single core scenario, you set voltage to 1.4V on that specific core and ICCmax to 10A = 14W.
14W is then the power on the whole CPU package, since only one core is active.

 

We can do some measurements:

- All cores idle package power: 12W

- One core with both hyperthreads active: about 70W

 

A single core is drawing nearly 60W, and a package power limit of 253W is doing nothing to throttle this core on its own, it is free to boost all the way up until it hits the Tj Max and if the thermal sensor is not in the right place it will fail. If you wanted to throttle this single core using power limits you would need to set a limit less than 70W.

 

We can test this by running a compiler-like load and setting the thread affinity to both hyper-threads in the same p-core, showing that even with power limits as low as 150W a single core can fail.
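As a minimal sketch of the affinity part of that test, the Python below works out which logical CPUs belong to one p-core and the bitmask that selects them. It assumes the p-cores are enumerated first and that the two hyper-threads of p-core k are logical CPUs 2k and 2k+1; that numbering is an assumption and should be checked against the topology your OS reports. The helper names are hypothetical. The actual pinning uses `os.sched_setaffinity`, which is only available on some platforms (Linux), so on Windows the mask would be applied through Task Manager or a similar tool instead.

```python
import os


def sibling_threads(p_core_index: int) -> tuple[int, int]:
    """Logical CPU ids of both hyper-threads of one p-core (assumed 2k, 2k+1 numbering)."""
    if p_core_index < 0:
        raise ValueError("p-core index must be non-negative")
    return (2 * p_core_index, 2 * p_core_index + 1)


def affinity_mask(cpus) -> int:
    """Bitmask with one bit set per logical CPU id."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def pin_current_process(cpus) -> bool:
    """Pin this process to the given logical CPUs; returns False where the call is unavailable."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, set(cpus))  # pid 0 means the calling process
    return True


if __name__ == "__main__":
    cpus = sibling_threads(3)  # fourth p-core, purely as an example
    print(cpus, hex(affinity_mask(cpus)))
```

Calling `pin_current_process(sibling_threads(k))` before starting the workload would then keep both of its threads on the same p-core, which is the situation the post says the scheduler normally tries to avoid.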

 

I posted here about how you can test the CPU to demonstrate the problem: https://linustechtips.com/topic/1567689-intel-13th14th-gen-how-to-test-for-a-bad-core-causing-game-crashes/

 

0 Kudos
KrissyG
New Contributor II
2,704 Views

If you do not set Tmax or voltage per core, then the core with the highest temperature will indeed trigger thermal throttling.

As I said, the CPU shuffles the load between cores... this may not happen if the CPU is at or near 100% load; to mitigate that, all cores will get some throttling.

"none of the temperature sensors will match IHS temp when under load"

This will stay true; even 2 or 3mm of distance may result in a large difference in temperature, as the die has been made thinner from 12th gen onwards.
Proof:

KrissyG_0-1713993170546.png

None of the temperatures is as low as the unused graphics on the CPU... because that temperature is literally the IHS temperature.


@Keean wrote:

> So the planned frequency for the core will be the maximum what has been set, not anything else, only after reaching Tmax will it get throttled down. You can observe it on any graph that shows frequency for the cores.

This is not true since Thermal Velocity Boost was introduced.

Thermal Velocity Boost for specific cores... this applies only if you change the settings and interfere with the defaults. The set frequency will still be reached when the PC switches from idle to full load and will stay there until Tmax is reached... this still will not change.

Quote from Intel: "The frequency gain and duration is dependent on the workload, capabilities of the processor and the processor cooling solution."
The last part of the definition of Thermal Velocity Boost literally says what I said.


@Keean wrote:

> And as I said, the temperature sensor sits on the opposite side as the IHS, it actually shows the hottest spot on a core, there is no better placement than the side that is not being cooled down.

There is not a single hot-spot but multiple. The FPU can get hot, the SSE ALU can get hot, there are multiple points where the chip can get hot and which point is hottest depends on the code being executed.


Quote from Google "CPU cores are physical components of a processor that are responsible for executing instructions"

So I don't understand what you think the SSE ALU or FPU is. The whole die is made of transistors, resistors and maybe some capacitors, but it is the transistors that convert electricity into heat: the transistors that sit in the cores and also in the integrated graphics. There is nothing else that gets hot. The temperature-sensing resistors are placed under the cores, in other words, under everything that can get hot.
This here is apparently the i7 13700K, which I use. The physical size is about 24mm x 12mm (1 inch x 0.5 inch?), so a P-core would have a physical size of about 3mm x 4mm. For all I care, the temperature-sensing resistor has exactly that size, and if that is the case, the whole surface under the P-core is sensing temperature.

KrissyG_2-1713994744986.png

There is nothing else there.

 


@Keean wrote:

So in a single core scenario, you set voltage to 1.4V on that specific core and ICCmax to 10A = 14W.
14W is then the power on the whole CPU package, since only one core is active.

 

We can do some measurements:

- All cores idle package power: 12W

- One core with both hyperthreads active: about 70W

 

A single core is drawing nearly 60W, and a package power limit of 253W is doing nothing to throttle this core on its own, it is free to boost all the way up until it hits the Tj Max and if the thermal sensor is not in the right place it will fail. If you wanted to throttle this single core using power limits you would need to set a limit less than 70W.

 

At idle the P-cores are almost dead, and most of the E-cores too; at idle you may be using just 2~4 cores, so the 12W (which at idle would be terrible; I get 3W) is the power drawn by just a few cores at a time. XTU actually shows the number of cores that are active.

And beware of single-thread benchmarks... yes, 1 core can have only 1 thread, therefore such a benchmark will utilize the whole core.
I get max 34W total CPU power consumption on such benchmarks.
Which means that the one core that was getting roasted was using less than 34W.
And if you think that was an E-core, nope, I double-checked it; the benchmark hits a P-core each time. I would need to disable the P-cores in the BIOS for the benchmark to use E-cores.

So for my i7 13700K with 8 P-cores and 8 E-cores, the power usage would be something like 8P x 30W + 8E x 10W = 240W + 80W = 320W of total TDP without overclocking. And around 314W was the max I ever saw it draw, on a non-OC motherboard.
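The estimate above is just a weighted sum, sketched here in Python so the numbers can be swapped out. The 30W and 10W per-core figures are the rough values from this post rather than specifications, and the function name is hypothetical.

```python
def estimated_package_power(p_cores: int, watts_per_p_core: float,
                            e_cores: int, watts_per_e_core: float) -> float:
    """Sum of per-core power, assuming every core of a type draws the same."""
    return p_cores * watts_per_p_core + e_cores * watts_per_e_core


# Rough per-core figures from the post for an 8P + 8E part.
estimate_w = estimated_package_power(8, 30, 8, 10)
observed_peak_w = 314  # highest draw reported in the post

if __name__ == "__main__":
    print(f"estimate: {estimate_w}W, observed peak: {observed_peak_w}W, "
          f"difference: {estimate_w - observed_peak_w}W")
```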

Load shuffling can be seen in Task Manager and XTU as well, since both can show the load on all cores.
If the load remained on a few specific cores only, then you would see higher temperatures on those cores.

0 Kudos
Keean
Novice
2,646 Views
> For all I care, the temperature-sensing resistor has exactly that size, and if that is the case, the whole surface under the P-core is sensing temperature.

The temperature sensors are a lot smaller than you seem to think.

> yes, 1 core can have only 1 thread

No, p-cores can each run two threads at the same time. This is called hyper-threading by Intel. So you need to measure with both threads in the core running.

> Load shuffling can be seen in Task Manager and XTU as well, since both can show load on all cores.

Load shuffling is hiding the failure and making it harder to see, you want to use set affinity to lock the process to one core to make the problem easier to see. The load shuffling is not perfect, and occasionally you will end up with both hyper-threads on the same p-core loaded and it will cause a crash.


0 Kudos
Keean
Novice
2,766 Views

I just typed a long reply that seems to have been deleted for some reason... I'll try a shorter response.

 

> Highest Temperature on any core will trigger thermal throttling for all cores

Read the link below; you can see that several Intel boost technologies affect individual cores rather than the whole chip, provided "current, power, and thermal headroom exists".

https://www.intel.com/content/www/us/en/gaming/resources/how-intel-technologies-boost-cpu-performance.html

It looks like the following boost some cores, not all:

- Intel® Turbo Boost Max Technology 3.0

- Intel® Thermal Velocity Boost

- Single-Core TVB

And the following boost all cores the same:

- Intel® Turbo Boost 2.0

- All-Core TVB

- Intel® Adaptive Boost Technology

 

0 Kudos
Keean
Novice
2,762 Views

> And it makes no sense to set a power limit per core, they shuffle the load in order to optimize the distribution of heat, so hitting higher load for a shorter period of time is better, than constant load on all cores....that is how you get a better average perfomance.

 

If you take a game or other program that you know crashes when running all-core without power limits, then set a power limit of say 253W, use Task Manager to set the thread affinity to half of the p-cores (the half with the preferred cores), and run the game, does it still crash? (It's hard to do this because you have to get to the process in Task Manager and set the affinity before it finishes shader compiling.) Can you explain why it still crashes?

0 Kudos
vmovups
New Contributor I
2,871 Views

For me the degradation happened when playing games that made the CPU average 110W and sometimes hit 180+W, if I remember correctly.

I do not think the CPU can exceed the 307A current limit with such workloads.

My MSI motherboard had a 288W limit preset (the one named tower cooler), and when crashes occur they sometimes only affect specific cores, and the cores often crash at the exact same instruction or in the same part of the program, so I think it's hasty to say the motherboards are the cause.

0 Kudos
KrissyG
New Contributor II
2,865 Views

@vmovups wrote:

I do not think the CPU can exceed the 307A current limit with such workloads.

My MSI motherboard had a 288W limit preset

 P = I x U
288W ≈ 307A x 0.94V. So theoretically you can reach any current value if only the TDP has been limited. Obviously, at such a low voltage the CPU will not hit any higher frequency, so the power drawn will be small.

Under load most CPUs will be at 1.4V or so; 288/1.4 ≈ 205A, so you would be about 100A away from your 307A ICCmax.
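The same P = I x U relation can be checked with a short Python sketch. The 288W and 307A figures are the ones from the posts above; the 1.4V load voltage is the rough value assumed in this reply, and the function names are hypothetical.

```python
POWER_LIMIT_W = 288.0  # motherboard preset quoted above
ICC_MAX_A = 307.0      # current limit quoted above


def current_draw_a(power_w: float, voltage_v: float) -> float:
    """Current implied by a given power and core voltage (I = P / U)."""
    if voltage_v <= 0:
        raise ValueError("voltage must be positive")
    return power_w / voltage_v


def crossover_voltage_v(power_limit_w: float, icc_max_a: float) -> float:
    """Voltage at which the power limit and the current limit are hit together."""
    return power_limit_w / icc_max_a


if __name__ == "__main__":
    load_current = current_draw_a(POWER_LIMIT_W, 1.4)
    print(f"crossover voltage: {crossover_voltage_v(POWER_LIMIT_W, ICC_MAX_A):.3f}V")
    print(f"current at 1.4V:   {load_current:.1f}A")
    print(f"headroom to ICCmax: {ICC_MAX_A - load_current:.1f}A")
```

Above the crossover voltage the power limit is reached before the current limit, which is why a power-limited CPU at typical load voltages should stay well under ICCmax.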

0 Kudos
chugzillafx
New Contributor I
2,810 Views

Yeah, it is def the MB maker's fault. Even the so-called Intel POR setting on the Gigabyte MBs isn't true Intel defaults. It sets them at 253-253-517.75 on any CPU I have tested.

I had a 12700F in a brand-new Gigabyte Aorus Elite X Wi-Fi 7 MB, and its default settings were pl1=pl2=4096 with an iccmax of 517.75. Good thing the 12700F is truly locked, so it did nothing to it. The 12700F runs the same no matter what settings are applied; it's really strange, but good.

After that I bought and tested a 14700, thinking OK, it's going to be just like the 12700F. NOPE. With those crazy default settings it was hitting 100C instantly in CBr23, and I couldn't figure out how to adjust it, so it got returned. It was strange that it wasn't locked like the 12700F. The 12700F never went over 55C on any settings, so however it's made, locked is really good and a solid build.

Then I got a 14700K. It did the same thing, and of course I had to change all those settings and then use a loadline calibration on top of that to keep it cool.

I really want to keep the 14700K, but all these negative reports have me worried. Even though I'm not having any issues, I'm thinking long term about how long this chip will stay good.

So yeah, for users that have no idea, I can see why they are having issues, because I was at first, until I started asking, watching videos and getting help.

Because I didn't know what was going on, it took me many weeks to months to learn what I know now, whereas if these MBs used true Intel specs out of the gate, more users would be happy and not confused about the settings.

Then let them make changes themselves after that if they want.

 

 

 

0 Kudos
KrissyG
New Contributor II
2,695 Views

@chugzillafx wrote:

Yeah, it is def the MB maker's fault. Even the so-called Intel POR setting on the Gigabyte MBs isn't true Intel defaults. It sets them at 253-253-517.75 on any CPU I have tested.




On the intel page for i7 14700k, it actually says this next to '125W':

KrissyG_3-1713997716033.png

So 125W was not exceeded while testing the CPU... and then there is this next to '253W':

KrissyG_0-1713997820532.png


253W is not the real limit, though; it is the limit for a longer period of time.
And the part about 10ms... it actually covers the 'spikes' that have been the topic here so often.

0 Kudos
chugzillafx
New Contributor I
2,691 Views

Yeah, it's confusing, because everything I have read says 125W is the TDP and not pl1.

pl1=pl2 is 253.

So who really knows anymore.

And trying to understand that Intel wording is more confusing than asking someone who has no idea what's going on, IMO lol.

It may as well be in a foreign language; they should just break it down in layman's terms.

I have tried testing my 12700K at 125-253-307 and of course 253-253-307.

0 Kudos
KrissyG
New Contributor II
2,685 Views

Yes, that 125W figure got me too; I was not expecting the CPU to draw more than twice that, and without OC.

For my own use, 125W actually hits the sweet spot, as at 250W the CPU delivers only a few % more performance in the applications I use, so my TDP limit is always at 125W.

Only when I replace the thermal paste do I set it to unlimited, and if I see the CPU hitting 300W on a stress test = successfully applied thermal paste and successfully mounted water block.

0 Kudos