
Performance monitoring event for frequency transitions

Travis_D_
New Contributor II

Is there any performance monitoring event that would let me capture, e.g., with `perf record`, where code-related frequency transitions occur?

For example, when some code drops into the L1 or L2 frequency license due to some AVX/AVX2/AVX-512 instruction.

McCalpinJohn
Honored Contributor III

Transitions related to SIMD instruction set width appear to be (at least somewhat) asynchronous relative to the code that triggers them.

On a Xeon Platinum 8160, I did some experiments to try to understand the transitions, but I have not written up the results.   (I was mostly trying to work out a protocol to ensure that the transitions occurred *before* the code that I wanted to test.  This was not as easy as I expected.)

Starting from an "idle" core (i.e., after a "sleep(1);" call in C, I started executing a short AVX-512 code loop (a few thousand register-contained FMAs), with frequent reads of the TSC, the 3 fixed-function performance counters, and the new CORE_POWER.LVL*_TURBO_LICENSE core performance counter event.  I don't recall the specific delays and durations, but the sequence of events was:

  • Start out running AVX-512 code at 1.0 GHz at very low IPC
    • This phase was very short
    • IPC was consistent with running the AVX-512 code through the bottom 128-bits of the SIMD units behind Port 0 and Port 1.
  • Halt for a short period (microseconds)
  • Continue running AVX-512 code at 1.0 GHz at a higher (but still low) IPC
    • This phase was also very short, but slightly longer than the first phase
    • IPC was consistent with running the AVX-512 code through the full 256-bits of the SIMD units behind Port 0 and Port 1.
  • Halt for a short period (microseconds)
  • Continue running AVX-512 code at 1.0 GHz at the expected IPC
    • This phase was longer than the previous two phases -- maybe 10's of microseconds?
  • Halt for a short period (about 6 microseconds, if I recall correctly)
  • Continue running AVX-512 code at 3.5 GHz at the expected IPC
    • 3.5 GHz is the maximum single-core Turbo frequency for this processor model.
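
For reference, here is a rough sketch of the kind of measurement loop I am describing -- this is NOT my actual test code.  It assumes Linux on an AVX-512-capable core, gcc or icc with "-O2 -mavx512f", RDPMC enabled for user space ("echo 2 > /sys/devices/cpu/rdpmc"), and the fixed-function counters already enabled (e.g., by the perf subsystem or by direct MSR programming).  Reads of the programmable CORE_POWER.LVL*_TURBO_LICENSE counter are omitted for brevity:

```c
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <immintrin.h>
#include <x86intrin.h>

/* Read one of the fixed-function counters: bit 30 of ECX selects the
 * fixed-function set for RDPMC; idx 0/1/2 are INST_RETIRED.ANY,
 * CPU_CLK_UNHALTED.CORE, and CPU_CLK_UNHALTED.REF_TSC. */
static inline uint64_t read_fixed_ctr(uint32_t idx)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"((1u << 30) | idx));
    return ((uint64_t)hi << 32) | lo;
}

#define SAMPLES 2000
#define INNER     64        /* register-contained FMAs between counter reads */

int main(void)
{
    static uint64_t tsc[SAMPLES], inst[SAMPLES], core[SAMPLES], ref[SAMPLES];
    __m512d a = _mm512_set1_pd(1.0);
    __m512d b = _mm512_set1_pd(1.0000001);
    __m512d c = _mm512_setzero_pd();

    sleep(1);                                   /* start from an "idle" core */

    for (int s = 0; s < SAMPLES; s++) {
        tsc[s]  = __rdtsc();
        inst[s] = read_fixed_ctr(0);
        core[s] = read_fixed_ctr(1);
        ref[s]  = read_fixed_ctr(2);
        for (int i = 0; i < INNER; i++)         /* the AVX-512 work */
            c = _mm512_fmadd_pd(a, b, c);
    }

    /* Per-interval deltas: core/ref gives the frequency ratio, inst/core the
     * IPC, and intervals where the TSC delta greatly exceeds the reference-
     * cycle delta contain halted time. */
    for (int s = 1; s < SAMPLES; s++)
        printf("%llu %llu %llu %llu\n",
               (unsigned long long)(tsc[s]  - tsc[s-1]),
               (unsigned long long)(inst[s] - inst[s-1]),
               (unsigned long long)(core[s] - core[s-1]),
               (unsigned long long)(ref[s]  - ref[s-1]));

    printf("# %f\n", _mm512_cvtsd_f64(c));      /* keep the FMA chain live */
    return 0;
}
```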

Interpreting the results is tricky because it is not possible to read the performance counters atomically.   The processor halts for the SIMD transitions and frequency transitions *usually* happen in the middle of the AVX-512 code, but when they happen in the middle of the performance counter reads, the results are very confusing.  I have not figured out a way to automate identification of these confusing cases.

Miscellaneous Observations:

  • I have never found any performance counters that identify cycles during which the processor core is "emulating" wider SIMD instruction sets before the corresponding pipelines are powered up.
    • This emulation appears to be occurring below the level of uops, and is transparent to the counters at the level of uop issue.
  • Halted cycles show up during SIMD transitions (except for Sandy Bridge, where the transition from emulation mode to full speed requires no halt).
    • On Haswell, SIMD transitions were limited to less than ~1000 per second, so the 10 microsecond halt that I measured corresponds to less than 1% worst-case throughput reduction.  I have not repeated this testing on SKX.
  • I have not figured out any way to prevent my SKX (Xeon Platinum 8160) processors from dropping to 1.0 GHz when idle.
    • The transition from 1.0 GHz to maximum frequency has a latency that depends on lots of BIOS and MSR settings.
    • The latency between the initial use of SIMD registers and the subsequent processor halt is variable.
    • The duration of the processor halt during the final frequency transition seems shorter than on some other recent processors.
  • Be careful of the compiler inserting code that you don't control!
    • For example, gcc will replace a simple assignment loop with a call to "memset()", even at very low optimization levels.  If you are trying to measure scalar code, the use of wider SIMD registers in "memset()" will queue up a SIMD transition that is quite likely to halt the processor (and change the frequency) during the loop you are trying to measure.
    • I did not have this problem with icc (though it can happen at higher optimization levels).
    • My workaround was to include an assignment that the compiler did not recognize as a zeroing idiom (a short before/after sketch follows this list):
      • for (j=0; j<SIZE; j++) array[j] = j>>30;
      • SIZE is always small for these L1-contained tests.
      • Someday some clever compiler writer will notice this case and "fix" it.
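
The before/after for the gcc/memset() issue looks roughly like this (a hypothetical illustration -- whether the memset() substitution actually happens depends on the compiler version and optimization level):

```c
#define SIZE 1024                      /* small, L1-contained */

void fill_naive(double *array)
{
    /* gcc commonly recognizes this as a zero-fill and emits a call to
     * memset(), whose use of wide SIMD registers can queue up a SIMD
     * transition in the middle of the code you are trying to measure: */
    for (int j = 0; j < SIZE; j++)
        array[j] = 0.0;
}

void fill_workaround(double *array)
{
    /* j>>30 is still zero for every small j, but the compiler does not
     * (yet) recognize it as a zeroing idiom, so no memset() call is
     * generated and no wide SIMD registers are touched: */
    for (int j = 0; j < SIZE; j++)
        array[j] = j >> 30;
}
```
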
McCalpinJohn
Honored Contributor III

I forgot to add that I have not found any events in the Power Control Unit of the Uncore that look like they will provide what you are looking for.  

The PCU section of the Uncore Performance Monitoring guides has been trimmed down a lot in SKX, and does not appear to be internally consistent.   For example, the performance counter events implicitly referred to in Table 2-274 and 2-276 of the "Intel Xeon Processor Scalable Memory Family Uncore Performance Monitoring Reference Manual" (document 336-274-001, July 2017), are not even listed in the rest of the section.   These events appear unchanged from the earlier Xeon E5 processors, so they might work, but I have not yet tried to combine the information in the corresponding sections of the Xeon E5 v1/v2/v3/v4 manuals to see if there is a consistent interpretation....

Travis_D_
New Contributor II

Thanks, Dr. McCalpin, for sharing your testing results.

If I interpret what you wrote correctly, the test starts off at 1 GHz because the processor was idle before the start of the test and so had dropped to a lower p-state? So your test is essentially measuring the p-state ramp up to full frequency at the same time as the various AVX-512-related transitions.

I am surprised you were not able to find a way to prevent the CPU from dropping down to the lowest p-states. Assuming Linux, did you try idle=poll as a kernel boot parameter, and also the "userspace" acpi_cpufreq governor setting? That is, assuming you are using the acpi_cpufreq CPU frequency driver, which I think is a given since intel_pstate, as far as I know, does not yet support Skylake server.
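
For reference, roughly what I mean (a sketch only -- the sysfs paths assume the acpi_cpufreq driver, and the governor change has to be applied per core):

```
# kernel boot parameter to keep cores out of idle states:
#   idle=poll
# pin the frequency with the userspace governor (repeat for each cpuN):
echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq \
    > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
```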

When you say 3.5 GHz max

About my original question, I did have some luck with the event CORE_POWER.THROTTLE, which is described as:

Core cycles the out-of-order engine was throttled due to a pending power level request.

I am not exactly sure how to interpret it: I get a value of about 20,000 for a process that I expect to do one transition, but for processes that should do many more I get only about 100,000, despite expecting more than 100 times as many transitions. I admit my testing was very brief, however. Using `perf report` I was able to get it to point almost exactly to the AVX-512 instructions that would have caused a transition. However, if this event actually counts the cycles that the core is halted due to transitions (the types of transitions you mention above in your testing results), it isn't exactly clear how the events will be delivered, e.g., all in a burst when the core comes unhalted?

The text does not make it clear what it is measuring. A careful reading would seem to indicate some type of throttling short of a "full stop", since it mentions "the out-of-order engine" being "throttled" rather than, say, the "core" being "halted". However, I cannot be sure it was written as carefully as I am trying to read it.

If you do further experiments, you might want to check out this counter. If I find anything I'll report it here.
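
For concreteness, the kind of invocation I mean looks roughly like this (the symbolic core_power.* event names are the ones recent perf versions expose on Skylake-SP; "./test" is just a placeholder for the binary under test):

```
# aggregate counts of license-level cycles and throttle cycles:
perf stat -e core_power.lvl0_turbo_license,core_power.lvl1_turbo_license \
          -e core_power.lvl2_turbo_license,core_power.throttle ./test

# sample on the throttle event to see which instructions the samples land on:
perf record -e core_power.throttle -c 1000 ./test
perf report
```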

I did do some tests on "steady state" behavior of AVX and AVX-512 instructions on a Skylake server CPU. By "steady state" I mean running the same sequence enough that all the transitions mentioned above were complete. I'll mention what I found in case it is interesting.

There are three frequency levels a chip may run at, with different names. Sometimes Intel calls them "turbo", "AVX2 turbo", and "AVX-512 turbo", but I think these names are misleading. I prefer to call them the L0, L1, and L2 "licenses", since that's what Intel calls them in the technical descriptions of the performance counter events. The L0 license is the fastest, and the single-core L0 license is the number you see "on the box" for turbo speed. L1 is almost without exception lower than L0, and L2 almost without exception lower than L1. The magnitude of the gap varies by processor model, but the general trend is that the more active cores there are, the larger the relative gap between L0 and the other levels. Some chips run at less than half speed in L2 versus L0 for the same active core count!

Despite L1 being called "AVX2 turbo", I found that you can execute most AVX2 instructions indefinitely without leaving the L0 license. The only exceptions were the so-called "heavy" instructions, i.e., those that use the FP unit: floating-point add, multiply, FMA, etc. This also includes integer multiplication, since it is executed on the FP unit! All other instructions, including all shuffles, integer instructions other than multiplies, and loads and stores, don't reduce the CPU speed at all.

If you do execute the heavy instructions, the CPU generally still stays in L0, unless you execute them "densely", i.e., at a high IPC. For example, I found that a single dependent chain of FP FMAs (latency 4, so one FMA every 4 cycles) stayed in L0 no matter how long it ran. Two parallel chains (one FMA every 2 cycles) also stayed in L0. At one FMA every cycle (half the theoretical maximum), however, the processor transitions to L1 (although I didn't probe the transition behavior, e.g., how many instructions you have to execute at an IPC of 1 before the transition occurs).
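
A minimal sketch of the two cases (hypothetical code, assuming AVX2/FMA intrinsics and gcc or icc with -mfma; the loops have to run long enough to reach steady state):

```c
#include <immintrin.h>

/* One dependent chain: each FMA waits ~4 cycles for the previous one, so the
 * "density" of heavy instructions is low -- this pattern stayed in L0. */
__m256d fma_one_chain(__m256d a, __m256d b, __m256d c, long iters)
{
    for (long i = 0; i < iters; i++)
        c = _mm256_fmadd_pd(a, b, c);
    return c;
}

/* Four independent accumulators: about one FMA per cycle, i.e., a high
 * density of heavy instructions -- this pattern dropped the core to L1. */
__m256d fma_four_chains(__m256d a, __m256d b, long iters)
{
    __m256d c0 = _mm256_setzero_pd(), c1 = c0, c2 = c0, c3 = c0;
    for (long i = 0; i < iters; i++) {
        c0 = _mm256_fmadd_pd(a, b, c0);
        c1 = _mm256_fmadd_pd(a, b, c1);
        c2 = _mm256_fmadd_pd(a, b, c2);
        c3 = _mm256_fmadd_pd(a, b, c3);
    }
    return _mm256_add_pd(_mm256_add_pd(c0, c1), _mm256_add_pd(c2, c3));
}
```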

So except for numerical codes that make optimized use of the FP unit, you'll stay in L0 with AVX/AVX2 instructions.

The situation for AVX-512 is analogous, except replace all occurrences of L0 with L1, and L1 with L2. That is, any AVX-512 instruction causes a drop to the L1 license (although I didn't investigate the detailed transition behavior), while only the "dense" FP instructions described above cause a drop to the L2 frequency.

Note that this is true even on a machine with 1 FMA unit. On those machines, you are probably better off writing your FP code with 256-bit instructions (including 256-bit variants of AVX-512 instructions), since the throughput is the same (two 256-bit FMA units vs. one 512-bit unit) but you run at the L1 frequency rather than the L2 frequency.

McCalpinJohn
Honored Contributor III

I just wrote a substantial response that disappeared when the Intel website decided to log me out.  I don't have time to reproduce it.

I think my processors are starting at 1.0 GHz because either (1) the core is in C1E state (due to an MWAIT hint), or (2) the package is in C1E state (because all the cores are in C1).  It is hard to tell what is going on because Intel has not released Volume 2 of the SKX datasheet (with the register descriptions), and much of the pre-release documentation that I have is clearly wrong.  If I have some time to get back to this, I will try to disable Core C1E (using /dev/cpu_dma_latency) and keep a "spinner" process on another core to prevent Package C1E. 
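
For what it's worth, the /dev/cpu_dma_latency approach I have in mind is the standard PM QoS interface -- a sketch only (the request is a binary 32-bit microsecond value and stays in force only while the file descriptor is held open; whether this actually keeps these cores out of C1E is exactly what I still need to test):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int32_t max_latency_us = 0;  /* 0 -> ask the kernel to avoid deep C-states */
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);
    if (fd < 0) { perror("open /dev/cpu_dma_latency"); return 1; }

    if (write(fd, &max_latency_us, sizeof(max_latency_us)) != sizeof(max_latency_us)) {
        perror("write");
        close(fd);
        return 1;
    }

    /* Keep the fd open for the duration of the experiment (run the
     * measurement code from another shell, or fork/exec it from here). */
    pause();

    close(fd);
    return 0;
}
```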

Your observations on the power licenses are interesting, and provide the opportunity for another set of transitions (e.g., with all AVX-512 units turned on at low power, but then starting to run a high density of high-power instructions).  The uncertainty about the computation of power license levels as a function of instruction "density" is irritating -- more undocumented features, more uncertainty about what is supposed to be happening, and more difficulty in generating controlled experiments. 


Travis_D_
New Contributor II

McCalpin, John wrote:

I just wrote a substantial response that disappeared when the Intel website decided to log me out.  I don't have time to reproduce it.

Bummer. The Intel forum implementation is ... well ... let's just say it leaves something to be desired (and this is largely a solved problem elsewhere). I have gotten in the habit of Ctrl+A, Ctrl+C before I hit submit.

I think my processors are starting at 1.0 GHz because either (1) the core is in C1E state (due to an MWAIT hint), or (2) the package is in C1E state (because all the cores are in C1).  It is hard to tell what is going on because Intel has not released Volume 2 of the SKX datasheet (with the register descriptions), and much of the pre-release documentation that I have is clearly wrong.  If I have some time to get back to this, I will try to disable Core C1E (using /dev/cpu_dma_latency) and keep a "spinner" process on another core to prevent Package C1E. 

To be clear, you start your test (and the reporting of the results above) "from scratch", without any warmup period, right? So yes, I would expect them to start at a low frequency. I'm not totally familiar with the C1E state as opposed to p-states: is this some state the entire socket can enter when all cores are in the C1 state (or a deeper C-state)? Does it take a while to ramp up when exiting C1 if the socket was in C1E?

Your observations on the power licenses are interesting, and provide the opportunity for another set of transitions (e.g., with all AVX-512 units turned on at low power, but then starting to run a high density of high-power instructions).  The uncertainty about the computation of power license levels as a function of instruction "density" is irritating -- more undocumented features, more uncertainty about what is supposed to be happening, and more difficulty in generating controlled experiments. 

I think there are at least two types of transitions: plain frequency transitions, which happen all the time, e.g., due to the active core count changing or the CPU entering a higher license, and transitions which involve actually powering up and enabling additional circuitry, such as the upper 128-bit lanes for AVX/AVX2 or the upper 256-bit lanes for AVX-512. It is possible that the "halt" portion of the latter type of transition takes longer, although here is some discussion (which seems to have been written by someone you might know) that indicates a value of 10 us, which in my experience is fairly similar to the plain transitions, which I have measured at around 8 us. That said, the effective total transition period is larger than the pure halted portion since, as you mention above, and as Agner found for earlier chips, there is a period where the chip is running but executing instructions at a reduced rate.

McCalpinJohn
Honored Contributor III

These tests are starting from idle on purpose, specifically so I could monitor the initial transitions.   The code is single-threaded, there is nothing else running on the node, and I execute a "sleep(1);" in the C code immediately before the loop. 

"Core C1E" state is a variant of the "Core C1" state.  In "Core C1", the core clocks are halted and there is no p-state transition.  In "Core C1E", the core clocks are halted and then there is a p-state transition to the "maximum efficiency frequency" (1.0 GHz on my processors, but 1.2 GHz on most earlier Xeons).    "Core C1E" is entered explicitly by the kernel using the MWAIT instruction with a "hint" of 0x1 in the EAX register.

"Package C1E" state is a hardware-controlled state that causes a p-state transition to the "maximum efficiency frequency" if all cores are in C1 or higher-numbered Core C-states. 

Documentation on these states is included in the datasheets for the Xeon E5 v1 and Xeon E5 v2, but this documentation has been dropped from the datasheets for newer processors.

The discussion at the link above contains a lot more detail than I was able to recall from memory.  All of the transitions are similar in that they involve halting the core for a period of time.  Sometimes this involves a frequency transition, sometimes it involves a voltage transition, but it is possible that a significant increase in load (e.g., from turning on extra pipelines) can require a halt even without a frequency or voltage change.   All of these are tricky to measure, and processors have increasing autonomy in controlling their operations -- starting with Haswell, for example, there is no longer a 1:1 relationship between p-states and voltage.

Travis_D_
New Contributor II

Thanks for the details, Dr. McCalpin.

Is the code you used to test this available anywhere or is it proprietary?
