Diagnosing Severe Throttling on Xeon-D

Greg_S · ‎11-11-2019

One of my Xeon D1521 based servers (Linux 4.15 w/Ubuntu 18.04) is experiencing bizarre performance throttling and I am struggling to diagnose the cause. The four cores (hyperthreading disabled) normally run at 800Mhz in power saving mode with a modest application load. After 2-3 days uptime, core frequencies suddenly plummet under 250Mhz. On two occasions, performance mysteriously recovered after 2-3 days of degraded performance. I captured the following data during the most recent failure/recovery:

MSR_IA32_THERM_STATUS (0x19c): Bit 10 (Power Limitation Status) goes HIGH in all four cores for the event duration.

MSR_IA32_PACKAGE_THERM_STATUS (0x1b1): Bit 10 (Power Limitation Status) goes HIGH for the event duration.

MSR_CORE_PERF_LIMIT_REASONS (0x690): Bit 2 (Power Budget Management), Bit 13 (Core Frequency P1 Status) and Bit 15 (Core Frequency Limiting Status) all go HIGH for the event duration.

MSR_PKG_PERF_STATUS (0x613) and MSR_DRAM_PERF_STATUS (0x61b) both count up RAPL throttling during the event.

MSR_TURBO_ACTIVATION_RATIO (0x64c) is updated every second or so with values from 7-9 (vs 24 before/after an event).
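For anyone wanting to watch the same flags, here is a minimal Python sketch of reading an MSR and testing the bits listed above. It assumes the Linux msr driver is loaded (modprobe msr) and root access. The helper names (read_msr, decode_perf_limit, power_limited) are made up for illustration and do not come from any tool in this thread, and the only bit meanings used are the ones quoted above.

```python
import os
import struct

# Bit positions exactly as quoted in the post above; nothing else is assumed.
THERM_STATUS_POWER_LIMIT = 1 << 10        # 0x19c and 0x1b1, bit 10
PERF_LIMIT_BITS = {                       # 0x690
    2: "power budget management",
    13: "core frequency P1",
    15: "core frequency limiting",
}


def read_msr(cpu, reg, template="/dev/cpu/{cpu}/msr"):
    """Read one 64-bit MSR via the Linux msr driver (the file offset is the MSR address)."""
    fd = os.open(template.format(cpu=cpu), os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def decode_perf_limit(value):
    """Names of the known limit reasons set in a MSR_CORE_PERF_LIMIT_REASONS value."""
    return [name for bit, name in sorted(PERF_LIMIT_BITS.items()) if value & (1 << bit)]


def power_limited(therm_status):
    """True if bit 10 (Power Limitation Status) is set in a THERM_STATUS value."""
    return bool(therm_status & THERM_STATUS_POWER_LIMIT)
```

On a live system something like decode_perf_limit(read_msr(0, 0x690)) would be polled once per second and logged only when the result changes.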

From my reading of the SDM, this indicates the RAPL power (but not thermal) limits have been exceeded and the cores are being forcefully throttled. However, MSR_PKG_ENERGY_STATUS/MSR_DRAM_ENERGY_STATUS report power usage of 11.0-11.5W / 0.4-0.6W respectively. Package power consumption actually increases to the 12-14W range during an event (likely due to the cores being overloaded). Core/package temperatures are always under 40C before/during/after an event (so hard to believe the cause is thermal). I have increased the power limits in MSR_PKG_POWER_LIMIT to the max but no change.

The application load includes 12 processes performing significant AVX2 256-bit calculations (real-time image reprocessing from IP cameras). While this increases power / reduces max frequency, could that be relevant when already running in power savings mode @800Mhz on a 2.4Ghz (non-turbo) rated part? The motherboard includes a BMC but it reports no events. Any thoughts welcome as I am running out of ideas of what else to monitor to track down the cause.

Thanks.

ArvindS_Intel · ‎11-12-2019

Hi Greg,

Thank you for raising this query. We suggest you post it in the following forum for better discussion.

https://forums.intel.com/s/topic/0TO0P00000018NWWAY/processors

Have a nice day ahead.

Regards,

Arvind

Intel Developer Zone Support

Greg_S · ‎11-12-2019

Arvind-- Appreciate the reply. I had intended to post this in Performance and Platform Monitoring since my goal is understanding how the platform monitoring MSRs indicate the throttling cause. Since it was my first post, it appears to have gone to moderation and I believe it was redirected here (else I clicked something very wrong when posting). If you think the processors forum is more relevant than performance and platform monitoring, I can (re)post it there.

Is there an option to move a post (don't see one) or just repost?

Thanks.

jimdempseyatthecove · ‎11-13-2019

>>After 2-3 days uptime, core frequencies suddenly plummet under 250Mhz.
>> normally run at 800Mhz in power saving mode

What happens if you periodically, say once a day, exit power saving mode, then re-enter power saving mode?

I am thinking either a counter overflowed, or at reduced voltage a higher current is induced and is causing some other event that you are not looking at and causes the frequency to drop.

Also, find a posting by John McCalpin. He is likely the leading expert on MSR_... status.

Jim Dempsey

Greg_S · ‎11-13-2019

Jim-- appreciate the reply. Agree the semi-periodic nature has a potential overflow feel to it. I actually logged the raw energy counter MSR values to see if they looked like overflow candidates, but nothing obvious. At the most recent event start, they were 611=bee4de9c 613=0156721d 619=a312f51a 61b=0002fc0c. Interesting thought about toggling between performance and power-savings to see if that changes/resets anything.
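A quick back-of-the-envelope check on the overflow idea, as a hedged sketch: it assumes the energy counters are 32 bits wide and use the 2^-14 joule unit that the turbostat dump later in this thread reports for MSR_RAPL_POWER_UNIT. The function name wrap_seconds is illustrative only.

```python
ENERGY_UNIT_J = 2.0 ** -14  # assumed: MSR_RAPL_POWER_UNIT=0x000a0e03, bits 12:8 = 14


def wrap_seconds(watts, width_bits=32, unit_j=ENERGY_UNIT_J):
    """Seconds until a width_bits-wide energy counter wraps at a steady power draw."""
    return (1 << width_bits) * unit_j / watts


for watts in (45.0, 12.0):
    print(f"{watts:4.1f} W: counter wraps every {wrap_seconds(watts) / 3600:.2f} hours")
```

At 45W this gives the same roughly 5825 second range that turbostat prints, and at around 12W the package energy counter would wrap about every six hours. Under these assumptions a wrap of that counter alone looks like a poor match for a 2-3 day period, though it says nothing about other internal counters.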

My recent experiments have focused on trying to force the severe throttling on demand. By reducing the RAPL package power limit below actual power consumption (by modifying MSR_PKG_POWER_LIMIT), I can force throttling and the same MSR flags trigger (power limitation, power budget, core freq p1, core freq limiting). However, the MSR_PKG_PERF_STATUS counter increments nonstop (not surprising since I set it for 8W and actual load is around 12W-- no amount of throttling will make up the difference). There was no DRAM throttling (not surprising since I did not change that limit). In the "real" event, both PKG and DRAM are throttled, but only around 10% of the time. Am wondering if the short-term power consumption is bursty enough to occasionally trigger the short-term RAPL limits (though they claim to be disabled) and once engaged, does it create some kind of feedback loop due to power fluctuation to keep re-triggering.

>> Also, find a posting by John McCalpin. He is likely the leading expert on MSR_... status.

Indeed-- have read many of his posts. That was the other reason I tried to post this in the Performance & Power Monitoring forum. He seemed the most likely to understand what the MSR reason/status bits really mean. Once I finish with my next set of experiments, will repost with more data over there in hopes he might have an idea.

Thanks!

jimdempseyatthecove · ‎11-15-2019

Could you state your requirements to operate at such a reduced power setting? Apparently you are attempting to slowly run an application.

It appears that you are on an edge situation (After 2-3 days uptime...), if this is power related, can you remove components from your system. Perhaps use one memory stick, un-mount a storage device (O/S may periodically poll mounted devices), kill unnecessary services (e.g. oomkiller, software update, cloud synchronization, etc...).

Jim Dempsey

Greg_S · ‎11-15-2019

Apologize I was not clear-- this is a standard Ubuntu 18.04.3 LTS install on a basic server motherboard (ASRock D1521D4I). It hosts four LXC-based containers running basic applications of which the most resource intensive makes extensive use of AVX2 256-bit. (The latter may or may not be relevant). That the server defaults to "powersaving" vs "performance" is arguably an 18.04 distro bug. That is why the base core speed is 800Mhz (with performance dynamically increasing under higher loads).

My particular application load is efficient enough to perform fine at 800Mhz and consumes only 12-13W (on a 45W rated CPU).

When the bizarre throttling occurs, the core clocks reduce to 250-350Mhz, the server becomes sluggish and application performance suffers.

My goal with this investigation was to get the hardware to indicate why it periodically throttles way below the OS-requested frequency (ignoring that the OS frequency choice is inappropriately low for a server). Seems like there should be some platform monitoring capability to provide this info.

Unfortunately, power/thermal management has turned into a hodgepodge of different technologies/controllers/drivers/applications all trying to do their semi-independent thing. This system has a dedicated BMC (AST2400), the Intel ME (whatever it is doing), firmware/SMM, drivers compiled into the Linux kernel, system apps (thermald, ondemand and who knows what else), tuning applications, and multiple processor technologies (RAPL, HWP, Speedstep, etc). This makes trial-and-error blind changes (versus hardware feedback) pretty rough.

Note the problem has not occurred for six days, likely due to some change I made during my "force the problem on demand" experiments. Am making some final changes to my monitoring code (which now includes BMC data, Linux cpufreq data and lots of MSR data) and plan to reboot so that the problem can reoccur. Hopefully I capture enough data during the next event to narrow things down or at least have detailed baseline data for more extreme changes (such as reducing DRAM to lower its power as you suggest).

Thanks.

Greg_S · ‎11-17-2019

And for anyone curious, here is what an event looks like via platform telemetry:

2019-11-17 09:10:41.002 (1.996s):
BMC0: +1.05_PCH=1.05V +1.50_PCH=1.52V +12V=13.70V +3.3=3.32V +3.3VSB=3.34V +5V=5.34V ATX+5VSB=5.10V BAT=3.12V VCCM=1.21V VCORE=1.81V
BMC1: CPU=43C CPU_FAN1=5000RPM MB=29C
BMC2: CASE_OPEN=OK CPU_CATERR=OK CPU_ERR2=OK CPU_FIVR_FAULT=OK CPU_PROCHOT=OK CPU_THERMTRIP=OK
LOAD: 1.43 1.54 1.37 1/384 11913
FRQ0: cmax=2400 cmin= 800 scur=2379.057 smax=2400 smin= 800
FRQ1: cmax=2400 cmin= 800 scur= 843.732 smax=2400 smin= 800
FRQ2: cmax=2400 cmin= 800 scur= 799.980 smax=2400 smin= 800
FRQ3: cmax=2400 cmin= 800 scur= 799.977 smax=2400 smin= 800
CPU0: 46C cd cl cr hot pl th1 th2 thm
CPU1: 46C cd cl cr hot pl th1 th2 thm
CPU2: 46C cd cl cr hot pl th1 th2 thm
CPU3: 47C cd cl cr hot pl th1 th2 thm
PKG_: 46C cd cl cr hot pl th1 th2 thm
RAPL: cmt cnt cos! cp1! edp hot pb pc thm ub vr
POWR: pkg=11.913W (23.781j) ram=0.468W (0.935j)
CNTR: PKGCLIP=00000000 PKGPWR=881ca832 RAMCLIP=00000000 RAMPWR=908b8c22
MISC: Speedstep-Target= Turbo-Ratio=24
VOLT: 0.933V 0.935V 0.838V 0.935V
CHNG: A_RATIO 1:0f->18 2:18->14 3:16->18
CHNG: S_RATIO 0:10->18 1:0c->12 2:18->0d 3:0c->18

Above is immediately prior to event start. Base CPU is 800Mhz with individual cores dynamically scaling up with the load. BMC is happy. Power usage is low at 12W. S_RATIO is the O/S requested SpeedStep multiplier and A_RATIO is the CPU reported value (due to latency). PKGCLIP/RAMCLIP are the RAPL power clipping counters-- zero since no event since boot.
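The hex multipliers in the CHNG lines are easier to read as frequencies. A small sketch, assuming the 100 MHz bus clock shown in the turbostat dump further down ("24 * 100.0 = 2400.0 MHz"); the CHNG line format belongs to the custom monitoring tool used here, and parse_ratio_changes is a hypothetical helper written only for this illustration.

```python
import re


def parse_ratio_changes(line, bus_mhz=100):
    """Map core number -> (old MHz, new MHz) from a 'CHNG: x_RATIO ...' log line."""
    changes = {}
    # Each entry looks like '<core>:<old hex ratio>-><new hex ratio>'.
    for core, old, new in re.findall(r"(\d+):([0-9a-f]{2})->([0-9a-f]{2})", line):
        changes[int(core)] = (int(old, 16) * bus_mhz, int(new, 16) * bus_mhz)
    return changes


print(parse_ratio_changes("CHNG: A_RATIO 1:0f->18 2:18->14 3:16->18"))
```

For the A_RATIO line in the sample above, core 1 moves from 1500 to 2400 MHz, core 2 from 2400 to 2000 MHz and core 3 from 2200 to 2400 MHz.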

2019-11-17 09:10:43.031 (2.030s):
BMC0: +1.05_PCH=1.05V +1.50_PCH=1.52V +12V=13.70V +3.3=3.32V +3.3VSB=3.34V +5V=5.34V ATX+5VSB=5.10V BAT=3.12V VCCM=1.21V VCORE=1.81V
BMC1: CPU=43C CPU_FAN1=5000RPM MB=29C
BMC2: CASE_OPEN=OK CPU_CATERR=OK CPU_ERR2=OK CPU_FIVR_FAULT=OK CPU_PROCHOT=OK CPU_THERMTRIP=OK
LOAD: 1.47 1.55 1.37 7/385 11919
FRQ0: cmax=2400 cmin= 800 scur= 415.696 smax=1700 smin= 800
FRQ1: cmax=2400 cmin= 800 scur= 415.502 smax=1500 smin= 800
FRQ2: cmax=2400 cmin= 800 scur= 416.152 smax=1500 smin= 800
FRQ3: cmax=2400 cmin= 800 scur= 415.512 smax=1500 smin= 800
CPU0: 43C cd cl cr hot PL th1 th2 thm
CPU1: 43C cd cl cr hot PL th1 th2 thm
CPU2: 43C cd cl cr hot PL th1 th2 thm
CPU3: 43C cd cl cr hot PL th1 th2 thm
PKG_: 43C cd cl cr hot PL th1 th2 thm
RAPL: cmt cnt COS CP1 edp! hot PB pc thm ub vr
POWR: pkg=12.020W (24.396j) ram=0.518W (1.051j)
CLIP: CLIP: pkg=1249 (1.219727) ram=1 (0.000977)
CNTR: PKGCLIP=000004e1 PKGPWR=8822c185 RAMCLIP=00000001 RAMPWR=908c9860
MISC: Speedstep-Target= Turbo-Ratio=16
VOLT: 0.643V 0.639V 0.645V 0.638V
CHNG: A_RATIO 0:18->08 1:18->08 2:14->08 3:18->08
CHNG: S_RATIO 0:18->12 2:0d->12 3:18->12
CHNG: T_RATIO 18->10

The event has started. The RAPL power clipping counters are incrementing (mostly package but some RAM). All core+package report PL (power limit) clamping and the package reports PB (power budget) clamping. All this while consuming only 12W of power. T_RATIO is MSR_TURBO_ACTIVATION_RATIO which drops into the 7-9 range during the event.
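The POWR and CLIP figures can be cross-checked from the raw CNTR values in the two samples above. This is a minimal sketch, assuming 32-bit counters, the 2^-14 joule energy unit and the 2^-10 second time unit from MSR_RAPL_POWER_UNIT (0x000a0e03 in the turbostat dump below), and that PKGCLIP counts in that time unit, which is what the parenthesized value in the CLIP line suggests. The function names are illustrative.

```python
ENERGY_UNIT_J = 2.0 ** -14  # assumed from MSR_RAPL_POWER_UNIT bits 12:8 = 14
CLIP_UNIT_S = 2.0 ** -10    # assumed from MSR_RAPL_POWER_UNIT bits 19:16 = 10


def counter_delta(before, after, width_bits=32):
    """Difference between two counter reads, tolerating a single wrap."""
    return (after - before) % (1 << width_bits)


def average_watts(before, after, seconds):
    """Average power over the interval from two energy counter reads."""
    return counter_delta(before, after) * ENERGY_UNIT_J / seconds


def throttled_fraction(before, after, seconds):
    """Share of the interval spent clipped, from two throttle counter reads."""
    return counter_delta(before, after) * CLIP_UNIT_S / seconds


interval_s = 2.030  # from the second sample header
print(average_watts(0x881CA832, 0x8822C185, interval_s))       # PKGPWR before/after
print(throttled_fraction(0x00000000, 0x000004E1, interval_s))  # PKGCLIP before/after
```

With these inputs the package average comes out at about 12.0W, in line with the logged pkg figure, and the clip counter covers roughly 60% of that first two-second interval.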

2019-11-17 10:53:11.031 (3.003s):
BMC0: +1.05_PCH=1.05V +1.50_PCH=1.52V +12V=13.70V +3.3=3.32V +3.3VSB=3.34V +5V=5.34V ATX+5VSB=5.10V BAT=3.12V VCCM=1.20V VCORE=1.81V
BMC1: CPU=39C CPU_FAN1=5000RPM MB=26C
BMC2: CASE_OPEN=OK CPU_CATERR=OK CPU_ERR2=OK CPU_FIVR_FAULT=OK CPU_PROCHOT=OK CPU_THERMTRIP=OK
LOAD: 5.11 4.04 3.65 4/395 17904 (up 23.50 hours)
FRQ0: crange= 800..2400 srange= 800.. 900 cur= 417.918 Mhz
FRQ1: crange= 800..2400 srange= 800.. 900 cur= 416.067 Mhz
FRQ2: crange= 800..2400 srange= 800.. 900 cur= 415.837 Mhz
FRQ3: crange= 800..2400 srange= 800.. 900 cur= 415.452 Mhz
CPU0: 39C cd cl cr hot PL th1 th2 thm 
CPU1: 40C cd cl cr hot PL th1 th2 thm 
CPU2: 40C cd cl cr hot PL th1 th2 thm 
CPU3: 39C cd cl cr hot PL th1 th2 thm 
PKG_: 40C cd cl cr hot PL th1 th2 thm 
RAPL: cmt cnt COS CP1 edp! hot PB pc thm ub vr 
POWR: pkg=11.774W (35.357j) ram=0.566W (1.700j)
CLIP: CLIP: pkg=393 (0.383789) ram=4 (0.003906)
CNTR: PKGCLIP=000ba104 PKGPWR=ce7d0152 RAMCLIP=00001b0a RAMPWR=9de14fbc
MISC: Speedstep-Target= Turbo-Ratio=8
VOLT: 0.647V 0.647V 0.650V 0.643V
CHNG: T_RATIO 09->08

Above is 90 minutes into the event. The lower clock has dropped the temps slightly, but this is not a thermal event. Power is largely unchanged and likely why there is no recovery. Seems like something has decided that 12W is too much power (for a 45W TDP part) and trying to reduce it. Voltages are 30-50% below pre-event values but the CPUs are saturated so no savings there. The obvious answer would be the RAPL power limits are set wrong, but the controlling package limit register (MSR_PKG_POWER_LIMIT) is set for 45W/1s and 54W/.008s and the DRAM limit is disabled. Full turbostat dump below for complete details.

turbostat version 17.06.23 - Len Brown <lenb@kernel.org>
CPUID(0): GenuineIntel 20 CPUID levels; family:model:stepping 0x6:56:3 (6:86:3)
CPUID(1): SSE3 MONITOR SMX EIST TM2 TSC MSR ACPI-TM TM
CPUID(6): APERF, No-TURBO, DTS, PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, EPB
cpu2: MSR_IA32_MISC_ENABLE: 0x4000850089 (TCC EIST No-MWAIT PREFETCH No-TURBO)
CPUID(7): No-SGX
cpu2: MSR_MISC_PWR_MGMT: 0x00002000 (ENable-EIST_Coordination DISable-EPB DISable-OOB)
RAPL: 5825 sec. Joule Counter Range, at 45 Watts
cpu2: MSR_PLATFORM_INFO: 0x20080833f2811800
8 * 100.0 = 800.0 MHz max efficiency frequency
24 * 100.0 = 2400.0 MHz base frequency
cpu2: MSR_IA32_POWER_CTL: 0x29040059 (C1E auto-promotion: DISabled)
cpu2: MSR_TURBO_RATIO_LIMIT: 0x1b1b1b1b1b1b1b1b
27 * 100.0 = 2700.0 MHz max turbo 8 active cores
27 * 100.0 = 2700.0 MHz max turbo 7 active cores
27 * 100.0 = 2700.0 MHz max turbo 6 active cores
27 * 100.0 = 2700.0 MHz max turbo 5 active cores
27 * 100.0 = 2700.0 MHz max turbo 4 active cores
27 * 100.0 = 2700.0 MHz max turbo 3 active cores
27 * 100.0 = 2700.0 MHz max turbo 2 active cores
27 * 100.0 = 2700.0 MHz max turbo 1 active cores
cpu2: MSR_CONFIG_TDP_NOMINAL: 0x00000018 (base_ratio=24)
cpu2: MSR_CONFIG_TDP_LEVEL_1: 0x6402d000130168 (PKG_MIN_PWR_LVL1=100 PKG_MAX_PWR_LVL1=720 LVL1_RATIO=19 PKG_TDP_LVL1=360)
cpu2: MSR_CONFIG_TDP_LEVEL_2: 0x6402d000000000 (PKG_MIN_PWR_LVL2=100 PKG_MAX_PWR_LVL2=720 LVL2_RATIO=0 PKG_TDP_LVL2=0)
cpu2: MSR_CONFIG_TDP_CONTROL: 0x00000000 ( lock=0)
cpu2: MSR_TURBO_ACTIVATION_RATIO: 0x00000009 (MAX_NON_TURBO_RATIO=9 lock=0)
cpu2: MSR_PKG_CST_CONFIG_CONTROL: 0x00008401 (locked: pkg-cstate-limit=1: pc2)
cpu2: POLL: CPUIDLE CORE POLL IDLE
cpu2: C1: MWAIT 0x00
cpu2: C1E: MWAIT 0x01
cpu2: C3: MWAIT 0x10
cpu2: C6: MWAIT 0x20
cpu2: cpufreq driver: intel_pstate
cpu2: cpufreq governor: powersave
cpufreq intel_pstate no_turbo: 1
cpu2: MSR_MISC_FEATURE_CONTROL: 0x00000000 (L2-Prefetch L2-Prefetch-pair L1-Prefetch L1-IP-Prefetch)
cpu0: MSR_IA32_ENERGY_PERF_BIAS: 0x00000006 (balanced)
cpu0: MSR_RAPL_POWER_UNIT: 0x000a0e03 (0.125000 Watts, 0.000061 Joules, 0.000977 sec.)
cpu0: MSR_PKG_POWER_INFO: 0x2f02d000c80168 (45 W TDP, RAPL 25 - 90 W, 0.045898 sec.)
cpu0: MSR_PKG_POWER_LIMIT: 0x781b000158168 (UNlocked)
cpu0: PKG Limit #1: ENabled (45.000000 Watts, 1.000000 sec, clamp ENabled)
cpu0: PKG Limit #2: ENabled (54.000000 Watts, 0.007812* sec, clamp ENabled)
cpu0: MSR_DRAM_POWER_INFO,: 0x2f002800070026 (5 W TDP, RAPL 1 - 5 W, 0.045898 sec.)
cpu0: MSR_DRAM_POWER_LIMIT: 0x00000000 (UNlocked)
cpu0: DRAM Limit: DISabled (0.000000 Watts, 0.000977 sec, clamp DISabled)
cpu0: MSR_IA32_TEMPERATURE_TARGET: 0x00660a00 (102 C)
cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x883d0c00 (41 C)
cpu0: MSR_IA32_PACKAGE_THERM_INTERRUPT: 0x00000003 (102 C, 102 C)
cpu2: MSR_PKGC3_IRTL: 0x00000000 (NOTvalid, 0 ns)
cpu2: MSR_PKGC6_IRTL: 0x00000000 (NOTvalid, 0 ns)
cpu2: MSR_PKGC7_IRTL: 0x00000000 (NOTvalid, 0 ns)
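The PKG Limit lines in the dump can be re-derived by hand from the raw register values. The sketch below follows my understanding of the SDM layout for MSR_PKG_POWER_LIMIT (power in bits 14:0, enable in bit 15, clamp in bit 16, time window as 2^Y * (1 + Z/4) with Y in bits 21:17 and Z in bits 23:22, the second limit in the same positions of the upper half, lock in bit 63); treat that layout as an assumption to verify for this part. The function names are illustrative.

```python
def rapl_units(power_unit_msr):
    """Return (watts, joules, seconds) per count from MSR_RAPL_POWER_UNIT."""
    power_w = 1.0 / (1 << (power_unit_msr & 0xF))
    energy_j = 1.0 / (1 << ((power_unit_msr >> 8) & 0x1F))
    time_s = 1.0 / (1 << ((power_unit_msr >> 16) & 0xF))
    return power_w, energy_j, time_s


def decode_limit(field, power_w, time_s):
    """Decode one 32-bit half of MSR_PKG_POWER_LIMIT (assumed layout, see above)."""
    y = (field >> 17) & 0x1F
    z = (field >> 22) & 0x3
    return {
        "watts": (field & 0x7FFF) * power_w,
        "enabled": bool(field & (1 << 15)),
        "clamp": bool(field & (1 << 16)),
        "seconds": (1 << y) * (1 + z / 4.0) * time_s,
    }


def decode_pkg_power_limit(value, power_unit_msr):
    """Return (limit 1, limit 2, locked) for a raw MSR_PKG_POWER_LIMIT value."""
    power_w, _, time_s = rapl_units(power_unit_msr)
    limit1 = decode_limit(value & 0xFFFFFFFF, power_w, time_s)
    limit2 = decode_limit((value >> 32) & 0x7FFFFFFF, power_w, time_s)
    return limit1, limit2, bool((value >> 63) & 1)


# Raw values copied from the turbostat dump above.
print(decode_pkg_power_limit(0x781B000158168, 0x000A0E03))
```

For the values in the dump this reproduces 45W over 1s and 54W over about 0.0078s, both enabled with clamping and unlocked, which agrees with what turbostat printed.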

jimdempseyatthecove · ‎11-19-2019

This is way beyond my background. This said, I found this article:

https://csis.pace.edu/~mqiu/papers/ACM_TECS_a24.pdf

Search for scur

On 24:10 I find three references (the only references) to Scur

These all appear to be Input values to statements, however Smin <- Scur would seem to indicate that Scur is a current (at this moment in time) measurement.

As to how Scur gets below Smin without resetting Smin I haven't a clue.

Because the above is listed as The TARS algorithm, and the previous page, 24:9 states the algorithm is a Thermal-Aware Task Scheduling in 3D Chip Multiprocessor, I suggest that instead of trying to determine the specific cause, that instead you try a work-around (aka hack).

What I would suggest you do is to affinity pin your processes and/or threads within each process to specific logical CPUs (hardware threads). As to the pinning arrangement, that will have to be determined through experimentation.
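As a starting point for that experimentation, here is a minimal Linux-only sketch of pinning processes with Python's os.sched_setaffinity. The round-robin layout and the helper names are arbitrary illustrations, not a recommended arrangement, and the PIDs in the example are made up.

```python
import os


def plan_affinity(pids, cpus):
    """Round-robin assignment of PIDs to logical CPUs (an arbitrary first layout)."""
    cpus = sorted(cpus)
    return {pid: {cpus[i % len(cpus)]} for i, pid in enumerate(pids)}


def apply_affinity(plan):
    """Pin each PID to its planned CPU set (Linux only; needs permission on the PIDs)."""
    for pid, cpuset in plan.items():
        os.sched_setaffinity(pid, cpuset)


# Hypothetical example: 12 worker PIDs spread across the 4 cores.
example_pids = list(range(5001, 5013))
print(plan_affinity(example_pids, [0, 1, 2, 3]))
```

Each core ends up with three of the twelve processes; from there the layout can be changed and re-measured.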

Jim Dempsey

Greg_S · ‎11-20-2019

Interesting paper-- thanks for the link. If I understand correctly, the initial scheduling phase iterates through multiple methods of generating the schedule DAG (via "rotation" of the DFG). Smin represents the particular DAG schedule with the lowest peak temperature (PT) during its run. It becomes the input to the subsequent optimization step which tries to reduce the frequency of individual tasks without decreasing performance beyond the execution time target. Assume the intent is for long running tasks where a final semi-static schedule provides predictable results over time.

To your point, I could potentially use a similar approach of allocating different image processing tasks to different cores (via affinity), measuring for a semi-optimal result, processing with that affinity until performance measurements showed it becoming non-optimal (from changes in source image characteristics due to daily environmental factors). Would probably squeeze out some additional performance.

My immediate problem is more fundamental. I am only getting 25-35% of rated server performance before it goes into power limiting mode (not coming remotely close to power/thermal limits). Suspect the cause is buggy firmware. When the slowdown occurs, ACPI events are generated telling the OS to clamp to the 800Mhz base frequency. I can tell the OS to ignore the ACPI events and update appropriate MSRs to what would seem normal. However, requests for any frequencies above 800Mhz (via MSR_PERF_CTL 0x199) are ignored suggesting there is a frequency multiplier clamp register I have yet to discover.
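For reference, a hedged sketch of how such a request can be formed and checked. It assumes the target ratio sits in bits 15:8 of the 0x199 value and that the current ratio can be read back from the same bits of the status register at 0x198, which is how I understand the Linux intel_pstate driver to encode it for Core-family parts; verify against the SDM for this model. The 100 MHz bus clock comes from the turbostat dump, and the function names are illustrative.

```python
BUS_CLOCK_MHZ = 100  # from the turbostat dump: "24 * 100.0 = 2400.0 MHz"


def perf_ctl_for_mhz(mhz):
    """Value to write to 0x199 to request mhz (assumes ratio in bits 15:8)."""
    return (mhz // BUS_CLOCK_MHZ) << 8


def ratio_from_status(value):
    """Current ratio from a 0x198 read (assumes ratio in bits 15:8)."""
    return (value >> 8) & 0xFF


request = perf_ctl_for_mhz(2400)
print(hex(request), ratio_from_status(request))
```

If a write of the 2400 MHz request is accepted but the status ratio stays at 8, that would be consistent with some other clamp overriding the request.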

jimdempseyatthecove · ‎11-21-2019

Greg, I cannot think of any more straws to grasp at. It is not unusual for Intel to have undocumented features. One may be diagnostic, while another may be tuning (behavioral).

I do not know if this link will provide additional assistance: https://access.redhat.com/articles/3436091

While this article doesn't directly address your issue, it may provide you with information as to who to ask (or what type of entity to ask).

Good luck.

Jim Dempsey

Greg_S · ‎11-22-2019

Appreciate the different ideas you have provided Jim. Am now testing your something-microcode/RedHat suggestion. Another completely-reliable D1521 server has Mar-2019 microcode while the problematic server had Jun-2019. Could not find the Mar-2019 package for this OS version so removed the microcode update outright (easy enough to reinstall later). Will see if the default microcode makes a difference.

Once the problem occurs, it will survive a Linux restart and persist. I believe that restart/enter BIOS/(possibly change a setting)/exit BIOS clears it. Power-down and then power-up definitely clears it. Don't know if a regular Linux restart performs a hard processor reset or not. That might be the differentiator. My current guess is a BIOS/firmware bug, potentially related to uninitialized memory, making it believe power limits are being exceeded and then using an undocumented P-state/multiplier clamp as its mitigation.

What surprised me is how little actionable platform troubleshooting data is provided by the CPU. Decent data to determine whether the CPU is operating within spec, but if not, pretty much trial and error to troubleshoot.

jimdempseyatthecove · ‎11-23-2019

>>Once the problem occurs, it will survive a Linux restart and persist. I believe that restart/enter BIOS/(possibly change a setting)/exit BIOS clears it. Power-down and then power-up definitely clears it.

That is strange. If I were paranoid, I might think that there could be an issue with the Intel Management Engine as opposed to the CPU. You might take a look at https://www.intel.com/content/www/us/en/support/products/34227/software/chipset-software/intel-management-engine.html as a place to start.

EDIT: Start here first: https://www.intel.com/content/www/us/en/support/articles/000008927/software/chipset-software.html

(Low-power, out-of-band (OOB) management services)

Jim Dempsey

Greg_S · ‎11-25-2019

Since removing the microcode update and power-cycling, the issue has not occurred in 85 hours. If still good tomorrow, will do a live install of the Mar-2019 microcode from the working server and let it bake further. Could be coincidence (especially if the real cause is uninitialized memory) but is interesting.

>> If I were paranoid, I might think that there could be an issue with the Intel Management Engine as opposed to the CPU.

This is absolutely possible, but incredibly difficult to diagnose/confirm/deny. Both the main BMC (AST2400 running AMI Megarac) and the Intel ME have the capability to cause this issue. They both have direct access to the required registers, persist beyond reboots, etc.

Megarac is a low-end low-quality management implementation, but don't think it is the cause. Its sensor-reading interface is super fragile. Have seen it get wrong MB temperature and get stuck until reboot (temp never changed). It has lost all the sensor readings (its current state-- assume they will come back at reboot). Have seen it think PROCHOT/THERMHOT,FIVR-FAULT/etc were all triggered (but they were not as confirmed by the CPU itself). If it had the capability to throttle the system, seems like one of the bogus sensor events would have done so (but they happened during non-events). Seems like Megarac monitors and logs but does not take direct action when it detects a fault. It is Linux 2.6 based (with root access) so is easy to poke around.

The Intel ME implementation in Xeon-D includes "Node Manager Base for adaptive power management". Unclear whether the power management is enabled on this motherboard (my reading suggests the BIOS and/or BMC must do it). I have been unable to access it via ipmitool. Found an Intel project called NMPRK that claimed to access it, but has compile errors and is quite dated (2015). I would disable the entire ME if possible, but Intel has worked hard to prevent that. That CVE page is a horror show made worse because it is impossible to know what actually changed and whether the newer firmware is more secure or simply replaces old exploits with new exploits.

jimdempseyatthecove · ‎11-26-2019

Greg,

I replied to this thread yesterday, but the website ate my reply (logged me out while replying).

You haven't disclosed the reason for wanting to reduce power consumption from 45W to ~10/0.5 (active/inactive). I can only guess that is to maintain uptime in the event of running off an uninterruptible power supply. Have you considered using a lower performing processor such as

https://ark.intel.com/content/www/us/en/ark/products/87261/intel-pentium-processor-n3700-2m-cache-up-to-2-40-ghz.html

Jim Dempsey

Greg_S · ‎12-03-2019

Jim-- apologies for the slow reply as I took some time off for Thanksgiving.

>> You haven't disclosed the reason for wanting to reduce power consumption from 45W to ~10/0.5 (active/inactive). I can only guess that is to maintain uptime in the event of running off an uninterruptible power supply.

I am not purposely forcing low power. The servers are all DC powered and runtime during power outage is a consideration, but only to the point of choosing a 45W TDP CPU (vs some monster 125W) and disabling Turbo-mode (to reduce peak power usage). If the CPU used full power while at 100% application load, that would be fine.

This is stock Ubuntu 18.04.3 LTS w/kernel 4.15.0-70. Likely due to a bug in the distro or one of its components, it defaults to "power savings" mode (it should be "performance" for a headless server). In this mode, the OS uses the max-efficiency Mhz of the CPU as its base (800Mhz for the Xeon D1521) and dynamically increases individual core frequencies up to the max-non-Turbo limit (2.4Ghz for the Xeon D1521) when under load. Because of the light application load (not adding more to a server with issues), busy core frequency averages ~900Mhz (taking the small dynamic increases into account). During one experiment, I captured the ACPI _PSS table which includes estimated power usage for each frequency multiplier from 800-2400Mhz. The estimate for 800Mhz was around 11.5W which is close to what I measure via RAPL counters when running in this mode.

Right before the holiday break, I forced all cores into "performance" mode to see the difference. Frequency under load is now closer to 2000Mhz and power consumption is 15-16W. Though the busy frequency has roughly doubled, much more time is spent in low-power C-states (thus only 25% increase in power). So whether the OS is mistakenly set to "power savings" or correctly set to "performance", it seems to operate as expected within each mode.

During a throttling event, the OS requested frequency is clamped to 800Mhz (likely because this is "max efficiency" for the D1521) and some other mechanism (RAPL power clamping?) further reduces performance. However, it never saves any more power because 11-12W is pretty much max efficiency for this processor with any kind of load (as demonstrated by the OS "power savings" operation).

The throttling event has not occurred since the last power-off restart w/microcode-update removed (11 days ago). Am pretty convinced the cause is either BIOS/firmware, microcode and/or Intel ME (possibly combined with uninitialized memory). Next experiment is probably another cold-boot after power-off (w/o microcode) to see if the problem reoccurs. Is my 11 days of non-issue due to the missing microcode or is it uninitialized memory that randomly came up "good" or something else entirely?

Thanks, -Greg

jimdempseyatthecove · ‎12-04-2019

>>or something else entirely?

My lifetime (50+ years) of programming experience with intermittent bugs is that if you do not know the cause (cannot point to the error in the code), but some other change makes the symptom go away, you cannot be assured that the bug was fixed. If the problem is/was due to uninitialized variables then the appearance of the symptom could be due to the value observed in the uninitialized variable. The system change (firmware) may still have the bug, however the uninitialized value is different and your system works by chance.

Given the behavior observed, the following is pure guesswork:

At some point the controlling system (CPU or ME) observes something that initiates a power saving mode. Power savings is typically done by manipulating a multiplier. Reducing power may involve decrementing a counter. The assumption is that on trigger, the controlling system has a go/nogo based on if a measured value indicates NOT already at minimum power... but this is for one core or HW thread or region of die whatever. Now then after that observation (NOT already at minimum power) the decision is made to decrement the power settings of all manageable power settings. However, either NOT all such settings were NOT at minimum power setting OR immediately after the go/nogo indicates go, something else decremented a power setting (or multiple settings). One of the possible actors is your code that is tweaking the power settings, another possibility is a firmware bug.

Jim Dempsey

Greg_S · ‎12-05-2019

>> One of the possible actors is your code that is tweaking the power settings, another possibility is a firmware bug.

Except that I am not. While I have done some select experiments to try and trigger the issue or to try and recover after the issue started (and those experiments involved manipulating registers/settings/etc), this problem happens without me changing anything. If I was tweaking power registers and such, would indeed assume that was the cause. This happens with stock OS, BIOS/Intel ME as shipped on the motherboard, etc. The only semi-unusual thing under my control is running AVX2 heavy applications.

>> At some point the controlling system (CPU or ME) observes something that initiates a power saving mode.

Agreed. I believe this is due to an outright bug as I cannot detect any legit reason to trigger an event.

>> Power savings is typically done by manipulating a multiplier.

It clamps the multiplier to 8 as one of its mitigations. This makes sense as 800 Mhz is the CPUs self-reported "max efficiency" frequency.

>> Now then after that observation (NOT already at minimum power) the decision is made to decrement the power settings of all manageable power settings.

Agreed. Other performance clamping mechanisms are clearly invoked. The CPU reports that RAPL power-clamping is occurring.

>> However, either NOT all such settings were NOT at minimum power setting OR immediately after the go/nogo indicates go, something else decremented a power setting (or multiple settings).

Am not convinced of this. I believe the explanation is that the original event triggering is an outright bug. Once triggered, per above, it engages all the power mitigation mechanisms it can. Then comes the failure to recover (otherwise, the event might be unnoticed or just appear as the occasional performance glitch). Two obvious explanations: the triggering bug condition never clears (counter overflow that never recovers) or it calculates a recovery power point based on some expected power reduction. For example, if the logic waited for power usage to reduce 25% below the trigger point (12W), it would be hard pressed to ever recover as the throttling would not reduce power to 9W due to minimum power consumption.
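To make the second explanation concrete, here is a toy model of that release rule. It is purely a sketch of the hypothesis above, not of anything the firmware or microcode is known to do; the 25% release margin is the example figure from the paragraph, the roughly 11W floor is the minimum loaded power measured on this server, and clamp_can_release is a made-up name.

```python
def clamp_can_release(trigger_w, release_fraction, floor_w):
    """True if throttling could ever pull power down to the hypothetical release point.

    trigger_w:        power at the moment the (possibly bogus) event fired
    release_fraction: how far below the trigger power must fall to release
    floor_w:          lowest power the loaded system can reach however hard it is throttled
    """
    release_w = trigger_w * (1.0 - release_fraction)
    return floor_w <= release_w


print(clamp_can_release(12.0, 0.25, 11.0))  # trigger at 12W: needs 9W, floor is about 11W
print(clamp_can_release(45.0, 0.25, 11.0))  # trigger near TDP: needs 33.75W
```

Under this model an event that fires at 12W can never clear, while one that fires near the 45W limit clears easily, which would also fit the problem going unnoticed on heavily loaded servers.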

Did a power-off reboot this morning (no microcode update) and will see what happens. Since it takes days to occur, the sample size for these experiments will always be limited and some guessing is required.

Thanks, -Greg

jimdempseyatthecove · ‎12-05-2019

I wish I had some chicken bones to toss and read....

A potential cause (from prior experience on an unrelated issue) could be that one of the controlling bit fields in a power management setting changed from reserved to defined, and that there are two places in the code (firmware, OS, ME) that needed to be changed (throttle down and throttle up), but only one place was addressed. IOW throttle down is working, throttle up is not. The symptoms sound like the throttle-up code is reading a control register, ANDing a bit mask, shifting left (doubling), and inserting back into the bit field. The problem being that a newer architecture expanded the bit field in the lsb direction, e.g. added a 1/2 increment so to say. If the throttle-down code used this knowledge in the bit mask manipulation, but the throttle-up did not, then this would cause the symptom observed.

Is the CPU a (much) newer generation than the motherboard? Or if this is the case for an issue in the O/S, how much newer is the CPU versus the O/S?

Jim Dempsey

Greg_S · ‎12-05-2019

>> Is the CPU a (much) newer generation than the motherboard? Or if this is the case for an issue in the O/S, how much newer is the CPU versus the O/S?

The Xeon D series are system-on-chip offerings in FCBGA (soldered to motherboard) packages. Not possible to mismatch the CPU and motherboard. Moreover, there is no external support chipset (platform control hub) as it is integrated into the CPU package. These are server class solutions intended for the microserver/edge-server market. The D1521 was released 2015Q4 so a bit dated but still in active production (with the first two Xeon D15xx chips being released nine months prior).

From what I can tell, the same motherboard is also available with the D1541. I have previously reviewed the bug (errata) documentation, but don't recall any issues specific to one or the other. With a plug-in CPU and discrete PCH, there are lots of combinations. With Xeon D15xx, it is extremely difficult to mismatch anything.

>> I wish I had some chicken bones to toss and read....

Don't forget the entrails. ;-) There is an MSR present on many CPUs (but missing from the D15xx series) that (according to the SDM) provides much more specific throttling causes. No idea if it would have helped in this case. It is troubling that divination via bones/entrails could possibly provide a more accurate diagnosis than the platform monitoring registers provided by the Xeon D series. An opportunity for future improvement should any of our Intel friends be following this thread.

Greg_S · ‎12-10-2019

Latest update-- the microcode update is almost certainly part of the issue. With the microcode removed and two different power-off reboots and 12 days of uptime, no issues. After a hot-reload of the latest microcode (no reboot required), the problem resurfaced within 48 hours.

Killing all processes on the server will cause the power consumption to drop from 11.25-11.50W (when triggered in this case) to 11.10-11.20W. That is enough to cause the power clamp to disengage. However, running the most basic process will increase power consumption by a couple hundred milliwatts and the power clamp reengages. It seems fairly certain that once the false power limit overload is "detected", the power clamp hardware is reprogrammed to keep the power below whatever the power consumption was at the trigger point in time. When it occurs at very low power utilization (11-12W in my case), it renders the server effectively useless.

If buggy microcode is the primary cause, then why is this not more widely reported? I suspect it is a combination of issues-- possibly microcode plus firmware. How the microcode operates is another closely guarded secret (with customers being punished when it goes wrong). Seems likely that as part of AVX2 opcode decoding, the microcode flags specific AVX2 instructions as "power intensive". That would be an input to other power management sub-systems to compensate for the additional AVX2 power draw (as is documented). Guessing 24x7 AVX2 applications are probably unusual (or maybe unusual for D15xx CPUs), reducing the potential impacted population. If this issue triggered on a high-load server, am not certain it would be easily noticed since power utilization is already high.

Tomorrow I will install the older microcode used on my other D15xx server and see if that also works around the issue.