There doesn't appear to be much documentation on the power management features of the Haswell processors, or on what is deemed "normal". Apologies for this post, but the only detailed answers I have found were on these forums.
1.) I have seen issues on several servers running the E5 v3 Haswell processors, namely CPU frequency throttling. This is apparent from "turbostat" output, which shows the cores operating at under 800 MHz with bit 2 active in the MSR "MSR_CORE_PERF_LIMIT_REASONS".
This would occur after a PSU flapped down and back up again. Even once PSU redundancy was restored, the processor would stay stuck in this state until reboot.
Upgrading the Dell BIOS from 1.5.4 to 2.1.7 alleviated this issue. Is this a known bug?
2.) I am still seeing further strange CPU-related issues, which may be attributable to a similar clocking issue, but I have not caught it in flight.
I often see servers with the following bits always set to active:
cpu0: MSR_CORE_PERF_LIMIT_REASONS, 0xc500c100 (Active: bit15, bit14, Amps, ) (Logged: bit31, bit30, PkgPwrL1, Amps, )
cpu1: MSR_CORE_PERF_LIMIT_REASONS, 0xc500c100 (Active: bit15, bit14, Amps, ) (Logged: bit31, bit30, PkgPwrL1, Amps, )
Bit 14 is the "Core Max n-core Turbo Frequency Limiting Status" and bit 15 is the "Core Frequency Limiting Status". Is this deemed a normal state? From memory, the servers I reviewed on the older 1.5.4 BIOS did not have any Active bits set at all (post reboot).
If this is not normal, is there documentation that explains why these bits are being set, or a way of disabling these power-saving settings completely?
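For anyone wanting to check the turbostat decoding by hand, here is a minimal sketch in Python. It assumes the layout implied by the output above: the low 16 bits are the "Active" flags and the high 16 bits are the matching "Logged" flags, shifted up by 16. The only labels used are the two that appear in the quoted output (Amps for bit 8, PkgPwrL1 for bit 10). The function name and the label table are my own, not part of turbostat.

```python
# Labels are limited to the ones visible in the quoted turbostat output;
# every other bit is reported by number only.
LABELS = {8: "Amps", 10: "PkgPwrL1"}


def decode_limit_reasons(value):
    """Split a 32-bit MSR_CORE_PERF_LIMIT_REASONS value into (active, logged) lists."""

    def names(half, offset):
        # Walk from bit 15 down to bit 0, the same order turbostat prints in.
        # `offset` only affects the printed number of unlabeled bits.
        return [
            LABELS.get(bit, f"bit{bit + offset}")
            for bit in range(15, -1, -1)
            if (half >> bit) & 1
        ]

    active = names(value & 0xFFFF, 0)            # bits 15:0
    logged = names((value >> 16) & 0xFFFF, 16)   # bits 31:16
    return active, logged


active, logged = decode_limit_reasons(0xC500C100)
print("Active:", ", ".join(active))
print("Logged:", ", ".join(logged))
```

For the value 0xc500c100 this gives bit15, bit14, Amps as active and bit31, bit30, PkgPwrL1, Amps as logged, which matches the turbostat lines quoted above.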
I have only looked at this MSR very briefly. We had a system with a bad cooling system that allowed a good exploration of the IA32_THERM_STATUS MSR, but I did not learn about MSR_CORE_PERF_LIMIT_REASONS until after we had returned that bad node...
From a single test that I ran with an OpenMP version of the STREAM benchmark (using all cores in a 2-socket system), I saw a total of 8 bits that were set at least some of the time.
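A rough sketch of how such a "which bits were ever set" test could be scripted on Linux follows. It is not the script I used. It assumes the msr kernel module is loaded (`modprobe msr`), that it runs as root, and that the register lives at address 0x690 on this processor family (please check the SDM for your model). The helper names `read_msr`, `bits_ever_active` and `sample_cpu` are made up for illustration.

```python
import os
import struct
import time

# Assumed address for the Xeon E5 v3; verify against the SDM for your CPU model.
MSR_CORE_PERF_LIMIT_REASONS = 0x690


def read_msr(cpu, address):
    """Read one 64-bit MSR through the Linux msr driver (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the MSR address.
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


def bits_ever_active(samples):
    """OR together the low 16 (active) bits of every sample; return set bit numbers."""
    seen = 0
    for value in samples:
        seen |= value & 0xFFFF
    return [bit for bit in range(16) if (seen >> bit) & 1]


def sample_cpu(cpu, count=100, interval=0.1, reader=read_msr):
    """Poll the MSR `count` times (while a benchmark runs elsewhere) and summarize."""
    samples = []
    for _ in range(count):
        samples.append(reader(cpu, MSR_CORE_PERF_LIMIT_REASONS))
        time.sleep(interval)
    return bits_ever_active(samples)
```

Running `print(sample_cpu(0))` as root while the benchmark is active would list the bit numbers seen at least once on cpu0. Polling can miss short-lived events, so the "Logged" half of the register is still worth checking as well.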
I don't understand the difference between the "Multi-core Turbo Status" and the "Core Max n-core Turbo Frequency Limiting Status" -- in particular why I never saw the former set, but often saw the latter set.
I have noticed a degree of inconsistency in the definition of related flags across Intel designs. For example, on our Xeon E5 v3 systems, the RAPL MSR_PKG_PERF_STATUS MSR increments whenever the actual frequency is below the requested frequency. Since the requested frequency is the max single-core, non-AVX Turbo frequency, and we almost always run with many cores active, this MSR increments almost all the time. On the Xeon Phi 7250 processors (Knights Landing), on the other hand, this counter only increments when the actual frequency drops below the nominal (1.5 GHz) frequency due to power limitations (even though the requested frequency is 1.6 GHz on this system).
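To put a number on "almost all the time", two readings of the counter can be turned into a fraction of wall-clock time. This is only a sketch under my reading of the SDM: I am assuming the counter occupies the low 32 bits of MSR_PKG_PERF_STATUS and counts in the RAPL time unit of 1/2^TU seconds, with TU held in bits 19:16 of MSR_RAPL_POWER_UNIT. The function names and the example values are illustrative, not measurements.

```python
def rapl_time_unit_seconds(power_unit_msr):
    """Return the RAPL time unit in seconds (assumes TU is in bits 19:16)."""
    tu = (power_unit_msr >> 16) & 0xF
    return 1.0 / (1 << tu)


def throttled_fraction(before, after, elapsed_seconds, time_unit_seconds):
    """Fraction of the interval during which the counter was advancing."""
    # Treat the counter as 32 bits wide and tolerate a single wraparound.
    delta = ((after & 0xFFFFFFFF) - (before & 0xFFFFFFFF)) & 0xFFFFFFFF
    return delta * time_unit_seconds / elapsed_seconds


# Illustrative values only: TU = 0xA gives a unit of 1/1024 s.
unit = rapl_time_unit_seconds(0x000A0E03)
print(unit)
print(throttled_fraction(1000, 1512, 1.0, unit))
```

With these made-up inputs the counter advanced by 512 units of 1/1024 s over one second, so the result is 0.5. On a Xeon E5 v3 behaving as described above, I would expect the real figure to sit close to 1.0 whenever many cores are busy.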
Since there are two sets of "n-core Turbo" frequencies for the Xeon E5 v3 ("normal" and "256-bit AVX"), I have to wonder how each of the frequency limit flags defined in the MSR_CORE_PERF_LIMIT_REASONS register relates to this new feature. But I have not wondered enough to try to figure out the answers...
Lots more information may be available via the performance counters of the Power Control Unit in the Xeon E5 v3 uncore, but those counters are also challenging to understand...