Re: The fault is not in your processor, but in your motherboard - Page 2

AlHill · ‎04-23-2024

Are you one of those who are constantly complaining about 13th and 14th gen stability and BSODs? Well, read this:

https://videocardz.com/newz/asus-adds-intel-baseline-profile-to-its-z790-motherboards-amid-core-i9-stability-issues

While this specifically mentions ASUS, MSI and Gigabyte are in this mess also.

Doc (not an Intel employee or contractor)
[If you find any Intel driver you might need, download and save it now.]

chugzillafx · ‎04-24-2024

Well, I'm using a Noctua NH-U12A air cooler on my 14700K, and with 253-253-307 and 56 seconds on the timeout, then MCE/CEP off, I'm good. I also have load-line calibration on, set at 5, and a few of the auto settings on standard.

Seems to work well, but I really don't know if that's the best way or whether I should undervolt... and then there are 2 different sections where an undervolt can be applied.

So yes, very confusing for the new or average user trying to learn.

Then what's better, BIOS or XTU?

Intel needs official undervolting guides to help us out.

KrissyG · ‎04-24-2024

@Keean wrote:
> for what i care, the temperature sensing resistor - has exactly that size, and if that is the case, the whole surface under the Pcore is sensing temperature.

The temperature sensors are a lot smaller than you seem to think.

> yes, 1 core can have only 1 thread

No, p-cores can each run two threads at the same time. This is called hyper-threading by Intel. So you need to measure with both threads in the core running.

> Load shuffling can be seen in Task Manager and XTU as well, since both can show load on all cores.

Load shuffling hides the failure and makes it harder to see. You want to use Set affinity to lock the process to one core, which makes the problem easier to see. The load shuffling is not perfect, and occasionally you will end up with both hyper-threads on the same p-core loaded, and it will cause a crash.
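
A minimal sketch of the affinity idea, in Python. It assumes (this is not stated in the thread) that with hyper-threading enabled the two hardware threads of P-core k are enumerated as logical processors 2k and 2k+1, which is the usual layout on Windows but should be checked on the actual system. The helper names and `stress.exe` are made up for illustration. The code only builds the hexadecimal mask that the Windows `start /affinity` command expects, so that a stress test loads both threads of one P-core and nothing else.

```python
def pcore_affinity_mask(pcore_index: int, threads_per_core: int = 2) -> int:
    """Return a bitmask selecting every logical processor of one P-core.

    Assumption: logical processors are numbered core by core, so P-core k
    owns logical CPUs k*threads_per_core up to (k+1)*threads_per_core - 1.
    """
    if pcore_index < 0 or threads_per_core < 1:
        raise ValueError("pcore_index must be >= 0 and threads_per_core >= 1")
    first_logical = pcore_index * threads_per_core
    mask = 0
    for offset in range(threads_per_core):
        mask |= 1 << (first_logical + offset)
    return mask


def start_command(pcore_index: int, program: str) -> str:
    """Format a Windows cmd line that pins `program` to one P-core."""
    mask = pcore_affinity_mask(pcore_index)
    return f"start /affinity {mask:X} {program}"


# "stress.exe" is a placeholder name, not a real tool.
print(start_command(0, "stress.exe"))  # logical CPUs 0 and 1
print(start_command(3, "stress.exe"))  # logical CPUs 6 and 7
```

Pinning to both logical CPUs of one core is deliberate here: the point made above is that the crash shows up when both hyper-threads of the same p-core are loaded.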

You could turn pretty much any metal into a temperature-sensing resistor. In general the size does not matter, but for a core it actually would.
Too small and it would not pick up a possible hot spot; too big and it could end up reading just the average temperature instead of the highest value. If a P-core is about 3x4 mm, then having a 3x3 mm temperature sensor is not a wrong thing.
Pretty sure it is huge compared to a chunk of transistors, and pretty sure it sits dead centre of a core.

And it is not what you think it is: such a resistor can be 3x3 mm, but with a thickness of 1µm or less.

As an example, SMD resistors, sometimes used as a fuse, and two normal resistors in the middle:

The ones in the middle actually have a core made of some ceramic material. The actual resistive part is wrapped like a spring around the core, and its diameter can be as much as 1mm and as little as 0,1mm.

Not sure how thick the layer on the SMD resistors is. 10µm? Who knows. These SMD resistors also have a ceramic core, except the core is flat to reduce height.

...it does not matter, if it's a Pcore or Ecore, a single core has 1 thread only, you need multiple cores for 2x number of threads.
That is why you can run a single thread benchmark, and multithread. The first option actually tests just one core but at full load, the second one does all cores at full load, and all their capabilities too.

So if you run a single core benchmark, there will be only one thread.....

Edit

So I went into the BIOS and disabled all E-cores and all P-cores but one.
The Task Manager actually showed 2 threads, but the TDP while running a stress test never exceeded 42W.
The thing with 2 threads but 1 core makes no sense.

Keean · ‎04-24-2024

> temperature sensing resistor

The sensors are not resistors but Digital Thermal Sensors that are not mounted to the core but actually part of the silicon core layout. They are actually 'inside' parts of the core, and they are really based on Bipolar Junction Transistors (BJT) that can be etched into the silicon as part of the normal CPU manufacturing process. In other words they are just some of the billion or so transistors that are etched onto the chip.
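
As a rough illustration of how a BJT-based sensor turns a voltage into a temperature: when two matched junctions run at current densities in a ratio N, the difference of their base-emitter voltages is ΔVbe = (kT/q)·ln N, which is proportional to absolute temperature. The sketch below uses only that textbook relation. The ratio N = 8 and the function names are assumptions for the example, not details of Intel's Digital Thermal Sensor.

```python
import math

BOLTZMANN = 1.380649e-23            # J/K
ELECTRON_CHARGE = 1.602176634e-19   # C


def delta_vbe(temp_kelvin: float, density_ratio: float) -> float:
    """Base-emitter voltage difference (V) of two junctions whose
    current densities differ by `density_ratio`."""
    return (BOLTZMANN * temp_kelvin / ELECTRON_CHARGE) * math.log(density_ratio)


def temperature_from_delta_vbe(dv_volts: float, density_ratio: float) -> float:
    """Invert the relation above: absolute temperature (K) from the measured difference."""
    return dv_volts * ELECTRON_CHARGE / (BOLTZMANN * math.log(density_ratio))


# Assumed ratio N = 8, purely for illustration.
for celsius in (30.0, 100.0):
    kelvin = celsius + 273.15
    dv = delta_vbe(kelvin, 8.0)
    print(f"{celsius:5.1f} C -> dVbe = {dv * 1000:.2f} mV")
```

The voltage difference grows linearly with absolute temperature, which is why this kind of sensor can be built from ordinary transistors in the same process as the rest of the chip.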

> single core has 1 thread only, you need multiple cores for 2x number of threads.

Read about hyper-threading here: https://www.intel.com/content/www/us/en/gaming/resources/hyper-threading.html

> TDP while running a stress test never exceeded 42W.

What CPU was that? My measurements were on a 'KS' CPU, but I would expect similar from a 'K' CPU. I notice the boost ratio is only x53, so it will use a lot less than mine, which was boosting to x58 (actually reduced from the stock Intel x59 for the KS).

I would not expect you to have any stability issues at all on a CPU that only boosts to x53 - have you had any problems?

KrissyG · ‎04-25-2024

Silicon bandgap temperature sensor.....yes, it measures voltage and compares them.... voltage difference , now Google how a temperature sensing resistor works.....by measuring voltage drop..... voltage difference.

Yeah, temperature causes silicon to change its resistance, of course not in a linear way like it happens with the Pt100, which is commonly used as a temperature sensor.

The CPU name is in the screenshot, it's i7 13700k , the P and E cores are almost exactly same as in any other 13 gen and 14 gen CPU, they will draw similar power.

Also, there is no option to disable P cores and run only on E cores.

And yes, the page about hyper-threading does not explain how a single core can have two threads while all other cores are seemingly not there. This makes no sense, you need more than one core to even be able to use hyper-threads... hmm, anyway, this whole thing here went way off topic.

Fact is, Intel gave instructions to motherboard manufacturers, which is why you can run the CPUs at 250W, which is also stated in the CPU specs. Apart from ASUS, I would say Intel was literally surprised by how well AMD is winning the market, and by now AMD is number 1.

Seems like only a miracle can save Intel CPU technology.

Keean · ‎04-25-2024

My CPU only seems to be stable if I limit it to x58 on all p-cores, doesn't matter what I set the motherboard power limits to (I didn't try lower than 150W though).

vmovups · ‎04-25-2024

The CPU heats up in no time when executing AVX FMA and blends. If you enable only the 2 pcores that go to the max boost frequency, stability goes in the trash bin too.

What Keean meant is that it's going to be difficult to measure the local temperature increase in the parts that execute the code. With AVX FMA code and only two cores you'll have a small region that is hotter than the rest of the core, and the temperature sensor can only measure the heat that has spread to it. So if it measures 100°C, then inside the FPU it is probably hotter?

There is also a difference between Windows 10 and 11 with how it manages cores because they handle thread director differently, does this match the thread shuffling behavior? I haven't seen any behavior like it on Windows 10 and i don't believe it would be possible to do something like that without hampering performance.

KrissyG · ‎04-25-2024

Temperature sensors are not on the same side of the die as the IHS is, my screenshots are the proof of that. The GPU integrated in my CPU shows the IHS temperature, I have a dedicated GPU.

Keean · ‎04-25-2024

Have a look at this thermal image of a chip. You can see that different structures are at different temperatures; it's not as simple as one whole core being the same temperature. Hotspots can be as small as an individual transistor or a single connection. There is always some temperature loss between the hotspot and the temperature sensor, which is why they place multiple sensors even within one core, and try to get them as close to the hotspots as possible.

https://www.infratec.eu/thermography/non-destructive-testing/e-lit/

In particular, look at the microscopic lens x1 and x3 images.

KrissyG · ‎04-25-2024

You mean this image here? (microscopic 3x)

Did you actually look at the scale on the right side? The coldest temperature in the picture is 20°C, the hottest is 27°C, and the areas around the 27°C parts are green, so 23~24°C.
The scale goes up to 31°C, but I can't seem to find anything with that color.

A difference of 3°C is not a hot spot.

The back of my Gigabyte B760 motherboard (the white cable that runs to the backplate is a temperature sensor, and the spot it sits on is not a coincidence... the CPU on the other side is at a similar height to that rectangle in the middle of the backplate; the stock backplate is underneath):

Thermal view CPU in idle:
Unfortunately the backplate reflects the surroundings, so the lowest reading is incorrect, and the highest reading of 30°C is actually a reflection of my face on the surface of a bolt or a screw.
However, the highest temperature in the picture, even if it is a reflection, is 30°C, so the red parts on the right are less warm than 30°C.

Thermal view XTU stress test at 250W for 5 minutes:
This is the interesting part: the backplate still reflects some, but its real temperature is about 50°C, even though it is separated by a layer of non-conductive plastic.

Most thermal cameras have an adaptive temperature scale, this is why same color will be shown as different temperatures, this is why on the left side you see 53°C shown as same color as 83°C which you see on the right side - both bright yellow color.

The VRM MOS never went over 55°C. I usually have one fan cooling the VRM MOS and one that cools the backplate and the motherboard itself.

The distance I took the pictures from is not ideal when recording high temperatures, because the rising hot air prevents an accurate reading.
The motherboard can get as hot as 98°C, at which point the CPU is at 100°C with continuous thermal throttling and the VRM MOS is at 65°C. Yes, eventually the motherboard is the highest temperature in the system, or the second highest.
It takes about 15 minutes for the motherboard to reach idle temps, while it takes 2 minutes for the VRM MOS to reach idle temps.

Continuous 125W load looks like this:

And at those 125W and the situation as seen above, the gigabyte motherboard shows me this:

Real hot spots cannot happen in an environment that spreads the heat: an IHS prevents that, for example, and so does a heatsink or the surface of the motherboard.
Only a thermally non-conductive material would allow a hotspot where the temperature difference is very high.

Keean · ‎04-29-2024

It all depends how small the hotspot is. In this case it could be one, or a few of the billion or so transistors on the chip. Think of it this way, imagine welding steel using an electric current (spot welding), the point of contact gets hot enough to melt the steel, yet the bulk of the steel hardly changes temperature at all.

levicki · ‎04-27-2024

It is totally disingenuous to say this is only mainboard manufacturers' fault.

Intel is the one who is selling unlocked K SKUs which will accept any values for voltage / power / current.

It doesn't really matter who sets them -- BIOS or the user using Intel XTU from Windows.

The result is the same.

If you want to stop CPU from operating at 1.5 V then don't allow CPU voltage control MSR to accept value >= 1.5 V. It's as simple as that. Washing hands and throwing mainboard manufacturers under the bus doesn't help Intel's image at all.

Keean · ‎04-27-2024

No matter what power limits I set, even as low as 150W for a 14900KS, my CPU was not stable for compiling code. The only way I could get it stable was to set the p-core ratio limit by cores used to x58 for 1 to 8 cores used. Interestingly, at that speed I can raise the power limits back up and it remains stable...

KrissyG · ‎04-27-2024

@levicki wrote:
It is totally disingenuous to say this is only mainboard manufacturers' fault.

Intel is the one who is selling unlocked K SKUs which will accept any values for voltage / power / current.

It doesn't really matter who sets them -- BIOS or the user using Intel XTU from Windows.

The result is the same.

If you want to stop CPU from operating at 1.5 V then don't allow CPU voltage control MSR to accept value >= 1.5 V. It's as simple as that. Washing hands and throwing mainboard manufacturers under the bus doesn't help Intel's image at all.

Exactly.

Nichronos · ‎04-28-2024

@levicki I fully agree with you!

But the reality is even darker: the 14900K/KF request above 1.50v for the 6.0GHz single/two-core boost, and they do that every time you alt-tab, open a program or browse with Chrome! I measured it with a multimeter behind the CPU socket on the back of the mobo: voltage spikes up to 1.65v, which is insane. Leaving the CPU at that voltage for 1-2 months won't end well and degrades it by more than 100mV. Then the constant BSODs and crashes start to happen! I have never disabled my default power limits, which are 251W/307A, and had multiple chips degrade in under a month, except one which had a fixed core ratio and voltages!

Currently the only solution to actually preserve your new i9 is not applying some forced power limits, which cripple your CPU, but setting the turbo ratio to all-core boost instead of the default per-core one and manually limiting the maximum voltages! It is also good to disable Turbo Boost 3.0, which forces more load onto those two favored cores and has nothing to do with the actual turbo.

I have written a small guideline for people to follow so their i9 stays safe in this post:
https://community.intel.com/t5/Processors/The-reason-behind-the-fast-13-14th-gen-i9-degrade-and-how-to/m-p/1593145#M71800

KrissyG · ‎04-28-2024

@Nichronos wrote:
@levicki I fully agree with you!
But the reality is even darker: the 14900K/KF request above 1.50v for the 6.0GHz single/two-core boost, and they do that every time you alt-tab, open a program or browse with Chrome! I measured it with a multimeter behind the CPU socket on the back of the mobo: voltage spikes up to 1.65v, which is insane.

If you look at the thermal cam images I posted, you will realize that the motherboard itself has a voltage drop: the space between the VRM MOS and the CPU, where there is nothing but copper, loses some voltage. I would actually expect 1,8V on the VRM MOS and 1,5V on the CPU.

I measured the 12V EPS that delivers power to the CPU: a single 8-pin EPS has 12,28V at the power supply, and at full CPU load it goes down to 12,02V at the EPS power connector on the motherboard. That is a 0,26V voltage drop on thick cables that deliver about 3,5~4A per rail/pin.
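
The arithmetic behind that measurement, as a small sketch. It uses only the numbers quoted in this post (12.28 V at the power supply, 12.02 V at the motherboard connector, 3.5 to 4 A per pin), whose difference comes to about 0.26 V. It assumes the whole drop appears across one pin's supply-and-return path, which is a simplification, and the function names are made up for the example.

```python
def voltage_drop(v_source: float, v_load: float) -> float:
    """Voltage lost between the power supply and the connector."""
    return v_source - v_load


def path_resistance(drop_volts: float, current_amps: float) -> float:
    """Effective resistance of the path from Ohm's law, R = U / I."""
    if current_amps <= 0:
        raise ValueError("current must be positive")
    return drop_volts / current_amps


def power_lost(drop_volts: float, current_amps: float) -> float:
    """Heat dissipated in the path, P = U * I."""
    return drop_volts * current_amps


drop = voltage_drop(12.28, 12.02)
print(f"drop: {drop:.2f} V")
for amps in (3.5, 4.0):
    r_milliohm = path_resistance(drop, amps) * 1000
    print(f"{amps} A per pin -> {r_milliohm:.0f} mOhm, "
          f"{power_lost(drop, amps):.2f} W lost per pin")
```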

What does XTU or other software tell you the CPU is getting? More than 1,5V?

hong620 · ‎04-29-2024

Want to see the reality?

Inspected as defective ->

got it exchanged for a new one ->

applied 'Intel MainLine' in the new firmware (a GIGABYTE mobo, so it even got much more conservative values than others) ->

it still pushes much more insane voltages than before, with tons of errors and instability.

And it's not even a hardcore chip like the 14900K, it's just a 13700K.

Of course each defect has its own cause, but as far as people around the world have experienced and researched, those causes connect straight to a fundamental cause on Intel's side.

Whether the values come from the mobo or the CPU, it seems the yield rate and silicon lottery of most of the unlocked CPU lineup did not fit their clock goals.

Even with insane limits, lower clocks and water cooling, it somehow still keeps reaching the throttling limit but keeps on running, and then damages a certain part of the core.

Whether you have the super user role or are not an Intel employee or contractor,

this kind of post on the forum doesn't help much; it looks like trolling to many Intel customers here.

Still, I could waste tons of time and money on an 'infinite number of CPU exchanges' with Intel distributors,

but they don't have much buffer stock; I've even seen someone in the community complain about '5 exchanges and still defective'.

At least by 'dictionary definition', picking an Intel CPU and mobo was the choice of a customer like me.

I can't really get a refund for my rig from either the distributor or the seller at this rate.

At this time Intel isn't considering a recall at all, because they don't have any 'alternative product' free from this issue, the same as in the 'MELTDOWN-SPECTRE' CPU-gate cases.

But one thing I'm sure of is that if Intel doesn't recall the products, customers will turn their backs on Intel for good.

Even with Intel getting an astronomical federal government subsidy, what if they spend government money on recalling the product instead of building a fab in America?

Uncle Sam will be very mad about that.

And if they refuse the recall and use the money to build a fab?

No one will purchase Intel products in that 'credit default' state, so even when the day comes that those fabs start mass-producing customer products, they will remain 'dead inventory', just as happened with Boeing.

If an Intel employee officially responds to this issue: please refund my CPU and mobo, then I'll be out of Intel's life.

KrissyG · ‎04-29-2024

@Keean wrote:
It all depends how small the hotspot is. In this case it could be one, or a few of the billion or so transistors on the chip. Think of it this way, imagine welding steel using an electric current (spot welding), the point of contact gets hot enough to melt the steel, yet the bulk of the steel hardly changes temperature at all.

A spot weld has about the surface area of a single core. Here the power is divided among the P and E cores, so you don't get 200~300A on a single spot, you get 35A per spot at most. The CPU die is about 24mm x 12mm... that is far from a spot-welding surface.
It is a design flaw, plus the stock frequency should not be as high.

I'm not into overclocking because the gains are so small that it makes no sense. However, I ordered a Z790 motherboard with very similar specs, just for testing.
Somehow my theory that the motherboard heats up the CPU from the pin side becomes more and more plausible.

Keean · ‎04-29-2024

You are still thinking much too big, even a single transistor can overheat and there are over a billion of them on a modern CPU.

KrissyG · ‎04-30-2024

You feed the same voltage across a core, some 1,5V max, so how do you overheat 1 out of, let's say, 500 million identical transistors when they are all fed from the same 1,5V rail?

Keean · ‎04-30-2024

Short version: Power Density

Density: each transistor might generate a certain amount of heat, so putting too many too close together will create a hotspot.

Power: is dependent on current (I^2 * R) so the more current the hotter a transistor will be. Too much current through too small a transistor will create a hotspot.

There is also thermal runaway as the resistance of silicon decreases with temperature. As the silicon gets hotter more current will flow for the same voltage, more current increases the temperature, hotter silicon has lower resistance, which allows more current to flow until failure...
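
The loop described above can be written out with Ohm's law. This is a simplified lumped-resistor picture, which is an assumption for illustration since a real transistor is not a plain resistor. It takes the premise stated above, that the resistance R falls as the silicon heats up, and assumes the supply voltage V is held constant:

```latex
P = V I = I^{2} R = \frac{V^{2}}{R}
\qquad\text{so, at fixed } V:\qquad
R \downarrow
\;\Rightarrow\; I = \frac{V}{R} \uparrow
\;\Rightarrow\; P = \frac{V^{2}}{R} \uparrow
```

The three forms of P are the same quantity. Which one is convenient depends on what is held fixed: with a voltage regulator holding V, the V²/R form shows directly that a falling resistance means more dissipated power.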

KrissyG · ‎04-30-2024

@Keean wrote:
Short version: Power Density
Density: each transistor might generate a certain amount of heat, so putting too many too close together will create a hotspot.
Power: is dependent on current (I^2 * R) so the more current the hotter a transistor will be. Too much current through too small a transistor will create a hotspot.
There is also thermal runaway as the resistance of silicon decreases with temperature. As the silicon gets hotter more current will flow for the same voltage, more current increases the temperature, hotter silicon has lower resistance, which allows more current to flow until failure...

....this is so beyond the topic.

"Power: is dependent on current "
No... what is changing, as you mentioned yourself, is the resistance in I^2 * R, because of the temperature.
The VRM MOS (the voltage regulator) feeds the CPU with a variable voltage, and with rising resistance it needs a higher voltage; higher voltage at the same current = more power = more heat.

Resistance decides how high current can become, but to achieve a higher current, you need higher voltage...I= U/R
For a cold CPU, you can easily achieve higher current, for a warm CPU you will not see as high of a current, but higher voltage instead.

"so the more current the hotter a transistor will be".....
.....current is the result of voltage running through a resistance, the voltage decides the current in the first place, later resistance changes with temperature.
Higher voltage -> higher current-> higher power->more heat

"resistance of silicon decreases with temperature"

No....resistance of a metal increases not decreases .....the number is getting bigger....you are confusing something here....

You can limit the current in order not to melt the CPU, because high current at low voltage will result in a huge voltage drop on elements that are not designed for it. The voltage drop will NOT be on the transistors but somewhere else... so the VRM MOS will try to compensate for the voltage drop and feed the CPU a higher voltage, so that the core has its 1,4V. That is what kills a CPU.

The record of 9GHz happened with a voltage of 1,85V, at over -200°C or so.

Yeah, so at 300W I get a Vcore of 1,56V, so ~190A running through the point at which Vcore is measured. Surprisingly, VID is at 1,45V.

That is literally the reason why, at high power consumption, I get barely any better performance: it's not the transistors that get the power.

At 1,85V in theory, i would get about 420W at the CPU
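
The 420W figure is consistent with treating the CPU as a fixed resistance, so that power scales with the square of the voltage. A quick check under that assumption (a simplification, since real CPU power also depends on frequency and temperature), using the 300 W and 1.56 V reading quoted above. The function names are illustrative.

```python
def current_from_power(power_watts: float, volts: float) -> float:
    """I = P / U."""
    return power_watts / volts


def scaled_power(p_ref_watts: float, v_ref: float, v_new: float) -> float:
    """Power at v_new if the load behaves like a fixed resistance:
    P = U^2 / R, so P_new = P_ref * (v_new / v_ref)^2."""
    return p_ref_watts * (v_new / v_ref) ** 2


amps = current_from_power(300.0, 1.56)
watts = scaled_power(300.0, 1.56, 1.85)
print(f"current at 300 W and 1.56 V: {amps:.0f} A")
print(f"estimated power at 1.85 V:   {watts:.0f} W")
```

This gives roughly 192 A and 422 W, in line with the ~190A and ~420W quoted in the post.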