One of my Xeon D-1521 based servers (Linux 4.15 w/Ubuntu 18.04) is experiencing bizarre performance throttling and I am struggling to diagnose the cause. The four cores (hyperthreading disabled) normally run at 800 MHz in power saving mode with a modest application load. After 2-3 days of uptime, core frequencies suddenly plummet to under 250 MHz. On two occasions, performance mysteriously recovered after 2-3 days of degraded operation. I captured the following data during the most recent failure/recovery:
MSR_IA32_THERM_STATUS (0x19c): Bit 10 (Power Limitation Status) goes HIGH in all four cores for the event duration.
MSR_IA32_PACKAGE_THERM_STATUS (0x1b1): Bit 10 (Power Limitation Status) goes HIGH for the event duration.
MSR_CORE_PERF_LIMIT_REASONS (0x690): Bit 2 (Power Budget Management), Bit 13 (Core Frequency P1 Status) and Bit 15 (Core Frequency Limiting Status) all go HIGH for the event duration.
MSR_PKG_PERF_STATUS (0x613) and MSR_DRAM_PERF_STATUS (0x61b) both count up RAPL throttling during the event.
MSR_TURBO_ACTIVATION_RATIO (0x64c) is updated every second or so with values from 7-9 (vs 24 before/after an event).
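For anyone who wants to watch the same status bits, here is a minimal Python sketch of how the flags listed above could be polled and decoded. It assumes the Linux `msr` driver is loaded (`modprobe msr`) and the script runs as root; the helper names `read_msr` and `decode_bits` are mine, and the bit labels are simply the ones I quoted above, so please check them against the SDM for your own part.

```python
import os
import struct

# Bit labels as quoted in the post above (assumption: only these bits are decoded).
THERM_STATUS_BITS = {10: "Power Limitation Status"}
CORE_PERF_LIMIT_REASONS_BITS = {
    2: "Power Budget Management",
    13: "Core Frequency P1 Status",
    15: "Core Frequency Limiting Status",
}

def read_msr(cpu, reg):
    """Read one 64-bit MSR through /dev/cpu/<cpu>/msr (needs root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr driver uses the file offset as the register address.
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)

def decode_bits(value, names):
    """Return the labels of all named bits that are set in value, lowest bit first."""
    return [name for bit, name in sorted(names.items()) if (value >> bit) & 1]
```

Usage would be along the lines of `decode_bits(read_msr(0, 0x690), CORE_PERF_LIMIT_REASONS_BITS)`, called once per core in a loop and logged with a timestamp so the start and end of an event can be pinned down.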
From my reading of the SDM, this indicates the RAPL power (but not thermal) limits have been exceeded and the cores are being forcefully throttled. However, MSR_PKG_ENERGY_STATUS/MSR_DRAM_ENERGY_STATUS report power usage of 11.0-11.5 W / 0.4-0.6 W respectively. Package power consumption actually increases to the 12-14 W range during an event (likely due to the cores being overloaded). Core/package temperatures are always under 40 °C before/during/after an event (so it is hard to believe the cause is thermal). I have increased the power limits in MSR_PKG_POWER_LIMIT to the max, but there was no change.
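In case it helps to cross-check the wattage figures, this is roughly how average package power can be derived from two samples of the energy counter. It is a sketch under a few assumptions: the energy unit is taken from bits 12:8 of MSR_RAPL_POWER_UNIT (0x606), the energy counter is treated as 32 bits wide with wraparound, and the function names are hypothetical. The DRAM domain may use a different fixed energy unit on some server parts, so I would only trust this as written for the package domain.

```python
def energy_unit_joules(rapl_power_unit):
    """Joules per counter tick: 1 / 2**ESU, with ESU in bits 12:8 of MSR_RAPL_POWER_UNIT."""
    esu = (rapl_power_unit >> 8) & 0x1F
    return 1.0 / (1 << esu)

def average_power_watts(raw_start, raw_end, seconds, joules_per_count):
    """Average power between two raw energy-status samples taken `seconds` apart."""
    # The counter is assumed to be 32 bits wide, so mask the difference to
    # handle a single wraparound between the two samples.
    delta = (raw_end - raw_start) & 0xFFFFFFFF
    return delta * joules_per_count / seconds
```

The sampling interval should be short enough that the counter cannot wrap more than once between reads; at these power levels a one-second interval is comfortably within that bound.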
The application load includes 12 processes performing significant AVX2 256-bit calculations (real-time image reprocessing from IP cameras). While this increases power and reduces the maximum frequency, could it be relevant when already running in power saving mode at 800 MHz on a 2.4 GHz (non-turbo) rated part? The motherboard includes a BMC, but it reports no events. Any thoughts are welcome, as I am running out of ideas for what else to monitor to track down the cause.
Thanks.
Greg,
You certainly have a demonstrable issue with the power management. Have you contacted Intel Support about it?
Note, I am aware that your first-level Intel contact will likely be Software Support, who may claim this is not a software issue. You will likely need to prod him/her to escalate this to the correct channel for resolution. I am sure Intel will address this issue once it reaches the right hands.
You did an excellent job at diagnosing your problem.
Jim Dempsey