Core i5-4300U thermal throttling stuck

Matt_V_1 · ‎09-22-2015

Edit: Since my original posting, I've discovered that bit 1 (Thermal Status) of MSR_CORE_PERF_LIMIT_REASONS is asserted, which I'm guessing is why the CPU is running slowly. Bit 1 means "Thermal Status (R0) When set, frequency is reduced below the operating system request due to a thermal event."

The CPU temperature is low (as shown by my pcm.x output), and the TM1, TM2, CRIT, and PROCHOT status bits are all 0 in IA32_THERM_STATUS. I can't find why Thermal Status is currently asserted, and it seems as if it might have gotten stuck.
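For anyone who wants to check the same bits, here is a minimal Python sketch of how the two registers could be read and decoded on Linux. It assumes the msr kernel module is loaded and that the script runs as root. The register addresses (0x690 and 0x19C) and the bit positions are assumptions based on the Intel SDM layout for this CPU generation and should be verified against it. The function names are made up for illustration.

```python
import os
import struct

MSR_CORE_PERF_LIMIT_REASONS = 0x690  # assumed address for Haswell
IA32_THERM_STATUS = 0x19C            # assumed architectural address


def read_msr(cpu, reg):
    """Read one 64-bit MSR via /dev/cpu/<cpu>/msr (needs root and 'modprobe msr')."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr driver uses the file offset as the register address.
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def decode_limit_reasons(value):
    """Decode the PROCHOT/thermal bits of MSR_CORE_PERF_LIMIT_REASONS."""
    return {
        "prochot_status": bool(value & (1 << 0)),
        "thermal_status": bool(value & (1 << 1)),   # the bit discussed above
        "prochot_log": bool(value & (1 << 16)),     # sticky log bits (assumed positions)
        "thermal_log": bool(value & (1 << 17)),
    }


def decode_therm_status(value):
    """Decode the status bits and digital readout of IA32_THERM_STATUS."""
    return {
        "thermal_status": bool(value & (1 << 0)),
        "prochot_event": bool(value & (1 << 2)),
        "critical_temp": bool(value & (1 << 4)),
        "degrees_below_tjmax": (value >> 16) & 0x7F,  # bits 22:16
        "reading_valid": bool(value & (1 << 31)),
    }
```

On the affected machine, something like decode_limit_reasons(read_msr(0, MSR_CORE_PERF_LIMIT_REASONS)) would be expected to report thermal_status as set while decode_therm_status shows no active thermal event.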

I found erratum HSD54 "TM1 Throttling may continue indefinitely" which sounds similar, though I haven't been able to replicate on another system, and I find it hard to believe that the processor would've gotten so hot as to trigger TM1. I'm wondering if there might be other causes for the thermal throttling to get stuck, such as ESD. Any ideas?

Original post:

I'm seeing a problem with an embedded motherboard that uses the Core i5-4300u CPU, where every now and then it will slow way down and stay that way until power-cycled. The problem is very rare and I can't seem to trigger it. I've seen it on two separate systems, both using the same model motherboard.

It's running Linux with a 3.4.13 kernel, and the cpufreq governor is set to "performance".

When this happens, cpufreq-info reports that the CPU is running at 800MHz, but it performs much, much slower.

I've run pcm.x v2.9 when this happens, and see that AFREQ sits at 0.32 on all cores and NEVER changes regardless of load. FREQ will never go higher than 0.09 even when running four instances of burnP6 (a cpu-burn utility). FREQ will go down to 0 when I kill the burnP6 processes.

I've noticed that the C1 residency is 72.11%, which seems very high given that we're putting a heavy load on the CPU.
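The numbers above are at least consistent with each other. Going by pcm's own definitions, FREQ should equal AFREQ multiplied by the C0 residency, and both are expressed relative to the nominal frequency pcm reports (2.5 GHz). A quick sketch of that arithmetic, with variable names chosen only for illustration:

```python
# Values taken from the pcm.x output in this post.
nominal_hz = 2_500_000_000   # "Nominal core frequency" reported by pcm
afreq = 0.32                 # AFREQ: clock ratio while in C0
c1_residency = 0.7211        # C1 core residency (all deeper C-states are 0%)

c0_residency = 1.0 - c1_residency            # fraction of time not halted
active_mhz = afreq * nominal_hz / 1e6        # clock while actually running
freq = afreq * c0_residency                  # should match pcm's FREQ column
effective_mhz = freq * nominal_hz / 1e6      # average clock over wall time

print(f"active clock    ~{active_mhz:.0f} MHz")
print(f"FREQ            ~{freq:.2f}")
print(f"effective clock ~{effective_mhz:.0f} MHz")
```

So the 800MHz from cpufreq-info matches AFREQ, and the 0.09 in the FREQ column is what you get once the forced idle time is factored in, which works out to a little over 220MHz of effective clock.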

In addition, running 'top' in Linux shows 100% CPU usage on all cores.

I'm running out of ideas of how to troubleshoot this. Can anyone offer any suggestions?
Thanks!

Matt

root@linuxs# ./pcm.x 5

Intel(r) Performance Counter Monitor V2.9 (2015-08-07 10:23:17 +0200 ID=721d9e3)

Number of physical cores: 2
Number of logical cores: 4
Number of online logical cores: 4
Threads (logical cores) per physical core: 2
Num sockets: 1
Physical cores per socket: 2
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 4
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 2500000000 Hz
Package thermal spec power: 15 Watt; Package minimum power: 0 Watt; Package maximum power: 0 Watt;
Delay: 5
Trying to use Linux perf events...
Can not use Linux perf because your Linux kernel does not support PERF_COUNT_HW_REF_CPU_CYCLES event. Falling-back to direct PMU programming.

Detected Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz "Intel(r) microarchitecture codename Haswell"

EXEC : instructions per nominal CPU cycle
IPC : instructions per CPU cycle
FREQ : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state' (includes Intel Turbo Boost)
L3MISS: L3 cache misses
L2MISS: L2 cache misses (including other core's L2 cache *hits*)
L3HIT : L3 cache hit ratio (0.00-1.00)
L2HIT : L2 cache hit ratio (0.00-1.00)
L3MPI : number of L3 cache misses per instruction
L2MPI : number of L2 cache misses per instruction
READ : bytes read from memory controller (in GBytes)
WRITE : bytes written to memory controller (in GBytes)
IO : bytes read/written due to IO requests to memory controller (in GBytes); this may be an over estimate due to same-cache-line partial requests
TEMP : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | READ  | WRITE |  IO   | TEMP

   0    0     0.09   0.97   0.09    0.32     238     141 K    1.00    0.07    0.00    0.00     N/A     N/A     N/A     60
   1    0     0.09   0.98   0.09    0.32      34     156 K    1.00    0.07    0.00    0.00     N/A     N/A     N/A     56
   2    0     0.09   1.09   0.09    0.32       3      89 K    1.00    0.13    0.00    0.00     N/A     N/A     N/A     60
   3    0     0.10   1.12   0.09    0.32     116     104 K    1.00    0.14    0.00    0.00     N/A     N/A     N/A     56
-----------------------------------------------------------------------------------------------------------------------------
 SKT    0     0.09   1.04   0.09    0.32     391     492 K    1.00    0.10    0.00    0.00    0.72    0.00    0.72     56
-----------------------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.09   1.04   0.09    0.32     391     492 K    1.00    0.10    0.00    0.00    0.72    0.00    0.72     N/A

Instructions retired: 4612 M ; Active cycles: 4453 M ; Time (TSC): 12 Gticks ; C0 (active,non-halted) core residency: 27.89 %

C1 core residency: 72.11 %; C3 core residency: 0.00 %; C6 core residency: 0.00 %; C7 core residency: 0.00 %;
C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %; C8 package residency: 0.00 %; C9 package residency: 0.00 %; C10 package residency: 0.00 %;

PHYSICAL CORE IPC : 2.07 => corresponds to 51.79 % utilization for cores in active state
Instructions per nominal CPU cycle: 0.18 => corresponds to 4.62 % core utilization over time interval
----------------------------------------------------------------------------------------------

package/CPU energy (Joules) DIMM energy (Joules)
----------------------------------------------------------------------------------------------
SKT 0 18.11 N/A
----------------------------------------------------------------------------------------------
^CDEBUG: caught signal to interrupt (Interrupt).
Cleaning up
Zeroed PMU registers

TimP · ‎09-25-2015

With an older HSW laptop, I noticed that using all logical processors would depress the clock rate for several seconds after the threads terminated, so it was never advantageous to do that (but there is no BIOS option to disable Hyper-Threading). I didn't succeed in investigating a possible correlation with the temperatures. The most frequent remaining problem (under Windows 8.1) is the occasional unexpected presence of multiple spinning Common Language Runtime processes (possibly started by some web page), which have to be killed. I would think top should show the presence of excessive threads.

Matt_V_1 · ‎09-25-2015

Thanks for the post, Tim. In my case, I've killed all processes other than the kernel threads and sshd. I've also removed most kernel modules, but nothing seems to have done the trick.

Running 'top' shows 0% load. If I spawn a number of burnP6 processes, the load in 'top' goes up to 100%, but pcm.x shows that the CPU is spending 78% of its time in C1. Other tools such as i7z and powertop show the CPU running at an 800MHz clock, but it is effectively running at ~230MHz because of the large amount of time it spends sleeping.

I checked MSR 0x198 to see the currently reported multiplier, and I see a value of 0x08. I then checked MSR 0x199 to see the requested multiplier, and it's 0x13. It seems like the CPU is being forced into a low-power state of some sort.
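To make the mismatch concrete, the multipliers can be converted to clock speeds. This is a rough sketch that assumes a 100MHz bus clock and that the ratio sits in bits 15:8 of both registers, which is my understanding for this CPU generation but worth checking against the SDM. The helper names are hypothetical.

```python
BUS_CLOCK_MHZ = 100  # assumption: 100 MHz base clock on Haswell


def ratio_field(msr_value):
    """Extract the multiplier from bits 15:8 of a raw 0x198/0x199 value (assumed layout)."""
    return (msr_value >> 8) & 0xFF


def ratio_to_mhz(ratio):
    """Convert a core multiplier to a clock speed in MHz."""
    return ratio * BUS_CLOCK_MHZ


current = ratio_to_mhz(0x08)    # multiplier reported by 0x198
requested = ratio_to_mhz(0x13)  # multiplier requested via 0x199

print(f"current {current} MHz, requested {requested} MHz")
```

That gives 800MHz delivered against 1900MHz requested, so the OS is asking for the full 1.90GHz and something below it is holding the clock at the minimum ratio.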

I've checked the PROCHOT and over-temperature bits, but don't see anything wrong. I've also tried disabling the SpeedStep bit, with no luck.

Does anyone know if there is a particular MSR that I can poke to try to 'wake' it up?