Diagnosing Severe Throttling on Xeon-D

Greg_S

One of my Xeon D-1521 based servers (Linux 4.15 with Ubuntu 18.04) is experiencing bizarre performance throttling, and I am struggling to diagnose the cause. The four cores (hyperthreading disabled) normally run at 800 MHz in power-saving mode under a modest application load. After 2-3 days of uptime, core frequencies suddenly plummet below 250 MHz. On two occasions, performance mysteriously recovered after 2-3 days in the degraded state. I captured the following data during the most recent failure/recovery:

MSR_IA32_THERM_STATUS (0x19c): Bit 10 (Power Limitation Status) goes HIGH in all four cores for the event duration.

MSR_IA32_PACKAGE_THERM_STATUS (0x1b1): Bit 10 (Power Limitation Status) goes HIGH for the event duration. 

MSR_CORE_PERF_LIMIT_REASONS (0x690): Bit 2 (Power Budget Management), Bit 13 (Core Frequency P1 Status) and Bit 15 (Core Frequency Limiting Status) all go HIGH for the event duration.
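For anyone wanting to watch the same bits, here is a minimal sketch of how they could be polled on Linux. It assumes the `msr` kernel module is loaded and the script runs as root; the helper names are hypothetical, and the bit positions are simply the ones quoted above, so check them against the SDM for your part.

```python
import os
import struct

# Register addresses and bit positions as quoted in this post
# (assumption: not re-verified against the SDM for every CPU model).
MSR_CORE_PERF_LIMIT_REASONS = 0x690
THERM_STATUS_POWER_LIMIT_BIT = 10
PERF_LIMIT_BITS = {
    2: "Power Budget Management",
    13: "Core Frequency P1 Status",
    15: "Core Frequency Limiting Status",
}


def read_msr(cpu, reg):
    """Read one 64-bit MSR through /dev/cpu/<cpu>/msr (root + `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def bit_set(value, bit):
    """Return True if the given bit is set in a raw MSR value."""
    return (value >> bit) & 1 == 1


def decode_perf_limit_reasons(value):
    """List the names of the limit-reason bits (from the table above) that are set."""
    return [name for bit, name in sorted(PERF_LIMIT_BITS.items())
            if bit_set(value, bit)]


if __name__ == "__main__":
    raw = read_msr(0, MSR_CORE_PERF_LIMIT_REASONS)
    print(hex(raw), decode_perf_limit_reasons(raw))
```

Logging the decoded list with a timestamp once a second would show exactly when the limit bits assert and clear.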

MSR_PKG_PERF_STATUS (0x613) and MSR_DRAM_PERF_STATUS (0x61b) both count up RAPL throttling during the event.
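The throttle counters can be turned into a "fraction of time throttled" figure. The sketch below assumes the low 32 bits of the register hold accumulated throttled time, expressed in the time unit from bits 19:16 of MSR_RAPL_POWER_UNIT (0x606); that is my understanding of the SDM and should be confirmed for this part. Function names are hypothetical.

```python
COUNTER_MASK = 0xFFFFFFFF  # assumption: throttle time lives in bits 31:0


def time_unit_seconds(rapl_power_unit):
    """Time unit from MSR_RAPL_POWER_UNIT bits 19:16, as 1/2^TU seconds."""
    tu = (rapl_power_unit >> 16) & 0xF
    return 1.0 / (1 << tu)


def throttled_fraction(raw_before, raw_after, interval_seconds, unit_seconds):
    """Fraction of the sampling interval spent throttled.

    The modulo handles a single wraparound of the 32-bit counter.
    """
    before = raw_before & COUNTER_MASK
    after = raw_after & COUNTER_MASK
    delta = (after - before) % (1 << 32)
    return delta * unit_seconds / interval_seconds
```

Sampling the register twice, a known interval apart, and passing both raw values in gives a number that is easy to graph alongside core frequency.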

MSR_TURBO_ACTIVATION_RATIO (0x64c) is updated every second or so with values from 7-9 (vs 24 before/after an event).

From my reading of the SDM, this indicates that the RAPL power (but not thermal) limits have been exceeded and the cores are being forcibly throttled. However, MSR_PKG_ENERGY_STATUS and MSR_DRAM_ENERGY_STATUS report power usage of 11.0-11.5 W and 0.4-0.6 W respectively. Package power consumption actually increases to the 12-14 W range during an event (likely because the cores are overloaded). Core and package temperatures are always under 40 °C before, during, and after an event, so it is hard to believe the cause is thermal. I have raised the power limits in MSR_PKG_POWER_LIMIT to the maximum, with no change.
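For reference, the wattage figures come from differencing the energy counter over time. A minimal sketch of that arithmetic, assuming the energy unit is taken from bits 12:8 of MSR_RAPL_POWER_UNIT (0x606) and the counter is 32 bits wide; the function names are hypothetical, and the DRAM domain may use a different, fixed unit on some server parts, so check the SDM before applying this to MSR_DRAM_ENERGY_STATUS.

```python
ENERGY_COUNTER_BITS = 32  # assumption: energy status counter width


def energy_unit_joules(rapl_power_unit):
    """Energy status unit from MSR_RAPL_POWER_UNIT bits 12:8, as 1/2^ESU joules."""
    esu = (rapl_power_unit >> 8) & 0x1F
    return 1.0 / (1 << esu)


def average_watts(raw_before, raw_after, seconds, unit_joules):
    """Average power between two raw energy-counter samples.

    The modulo handles a single wraparound of the counter; sample often
    enough that it cannot wrap more than once per interval.
    """
    delta = (raw_after - raw_before) % (1 << ENERGY_COUNTER_BITS)
    return delta * unit_joules / seconds
```

With an energy unit of 1/16384 J, a counter increase of 180224 over one second corresponds to 11.0 W.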

The application load includes 12 processes performing significant AVX2 256-bit calculations (real-time image reprocessing from IP cameras). While this increases power and reduces the maximum frequency, could it be relevant when already running in power-saving mode at 800 MHz on a part rated at 2.4 GHz (non-turbo)? The motherboard includes a BMC, but it reports no events. Any thoughts are welcome, as I am running out of ideas for what else to monitor to track down the cause.

Thanks.

21 Replies
jimdempseyatthecove

Greg,

You certainly have a demonstrable issue with the power management. Have you contacted Intel Support about it?

Note that your first-level Intel contact will likely be Software Support, who may claim this is not a software issue. You will probably need to prod him/her to move this up and over to the correct channel so it can be reported and resolved. I am sure Intel will address this issue once it reaches the right hands.

You did an excellent job of diagnosing your problem.

Jim Dempsey
