I was wondering if there is any way to guarantee that the E5 v3 range will not perform frequency transitions.
I am developing a real-time audio DSP application using an RTOS on dual E5-2687Wv3 and have run into problems relating to the uncore PCU event FREQ_TRANS_CYCLES. A certain amount of processing must happen every ~20 µs, and most of the time the execution time is stable to within 1 µs. However, occasionally the execution time spikes considerably. Whenever we get a spike, we also get a burst of a few tens of thousands of counts of the PCUBox event FREQ_TRANS_CYCLES. When using Sandy Bridge or Ivy Bridge chips (E5-1650, E5-1650v2, E5-2697v2) we do not get these spikes in execution time.
We have the chip running in a fashion where all cores are busy all the time and uncore performance counters confirm that all the cores stay in C0.
EIST and Turbo Mode are disabled in the BIOS, and we have tried various poking of power management MSRs to try and prevent the FREQ_TRANS_CYCLES. I can go into detail of what we have tried if you like.
I thought I would ask in the general sense whether these transitions are something that we can actually avoid, or will the chip essentially do its own thing? Are there any guarantees as to what operating frequency the chip can run at if EIST and Turbo Mode are disabled? If it has to drop for some reason (I have read about the AVX base frequency), is there a way to fix it at the highest guaranteed frequency? By enabling EIST and writing to MSR 0x199, I can fix the frequency of all the cores at 1.4 GHz and avoid the transitions, but 1.4 GHz seems rather low! (The nominal frequency is 3.1 GHz for this chip.)
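For reference, the arithmetic behind a write to MSR 0x199 can be sketched as follows. This is only a sketch under two assumptions: a 100 MHz bus clock, and the target ratio sitting in bits 15:8 of the register. The helper names are hypothetical, and the privileged, platform-specific MSR write itself is left out.

```c
#include <stdint.h>

/* Assumption: 100 MHz bus clock (BCLK). */
#define BCLK_MHZ 100u

/* Hypothetical helper: convert a target core frequency in MHz to a ratio. */
static uint32_t ratio_from_mhz(uint32_t mhz)
{
    return mhz / BCLK_MHZ;
}

/* Hypothetical helper: build the value to write to MSR 0x199,
   assuming the target ratio occupies bits 15:8. */
static uint64_t perf_ctl_from_ratio(uint32_t ratio)
{
    return (uint64_t)(ratio & 0xFFu) << 8;
}
```

Under these assumptions, 1.4 GHz corresponds to ratio 14 (register value 0x0E00) and the nominal 3.1 GHz to ratio 31 (0x1F00).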
Any advice on this is appreciated, and let me know if you would like more details or information.
Three possibilities come to mind:
(1) As discussed at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring..., the Haswell Xeon E5 v3 cores experience a ~25,000 cycle stall when the processor shifts into "256-bit mode". The processor will shift out of "256-bit mode" if no 256-bit registers have been used in the last millisecond. I have not looked to see if there is also a stall introduced in that transition.
You can avoid these transitions by avoiding use of 256-bit registers. (You can still use the AVX2 instruction set, but limited to the 128-bit versions of the instructions.) It does not appear that Intel has documented any way to keep the processor in "256-bit mode" all the time.
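As an illustration of staying within 128-bit registers, here is a minimal sketch of a multiply-add loop written with 128-bit intrinsics only. The function name is hypothetical and the kernel is not taken from the application discussed above. When built for an AVX target, compilers normally emit these intrinsics in their VEX-encoded 128-bit forms, but it is worth checking the generated assembly for any ymm register use.

```c
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical kernel: y[i] += a * x[i], using only 128-bit (XMM) vectors. */
static void saxpy_128(float a, const float *x, float *y, size_t n)
{
    const __m128 va = _mm_set1_ps(a);
    size_t i = 0;

    /* Four floats per iteration instead of the eight a __m256 would hold. */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);
        _mm_storeu_ps(y + i, vy);
    }

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

The trade-off is twice as many iterations as a 256-bit version of the same loop.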
(2) Haswell has a new MSR (MSR_TURBO_ACTIVATION_RATIO) that can change the Turbo behavior and I have seen the BIOS program this incorrectly on a number of systems. This is discussed in the thread at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring... This MSR basically tells the hardware to ignore the specific frequency request above this frequency and simply let the hardware run at the fastest frequency possible. This is not a big change on my systems, since Linux "acpi-cpufreq" only allows requesting explicit frequencies up to the nominal value. For higher frequencies the kernel requests the max Turbo frequency available for single-core operation. The problem that I ran into was the BIOS programming 0x18 (24d) into this register on a 2.5 GHz system. When I requested 2.5 GHz operation, the hardware noted that the requested multiplier (0x19=25d) was bigger than the value in MSR_TURBO_ACTIVATION_RATIO, so it allowed the frequency to jump to the maximum value.
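The comparison described above can be written down as a small sketch. It assumes that the ratio field is the low 8 bits of the register value; the function names are hypothetical, and reading the raw MSR value is platform-specific and not shown.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumption: the ratio field is the low 8 bits of MSR_TURBO_ACTIVATION_RATIO. */
static uint32_t max_non_turbo_ratio(uint64_t msr_value)
{
    return (uint32_t)(msr_value & 0xFFu);
}

/* True when the requested ratio is above the programmed value, which is the
   case where the hardware ignores the specific request and picks its own
   frequency. */
static bool request_is_ignored(uint64_t msr_value, uint32_t requested_ratio)
{
    return requested_ratio > max_non_turbo_ratio(msr_value);
}
```

With the values from my system, `request_is_ignored(0x18, 0x19)` is true, which matches the jump to the maximum frequency when 2.5 GHz was requested.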
(3) If you run into package power or temperature limits the Power Control Unit in the uncore will initiate frequency changes as often as once per millisecond. So far we have only been able to run into power limits by running highly optimized DGEMM or LINPACK codes using more than 1/2 of the cores on the chip, and we have only run into temperature limits on parts with defective thermal solutions. The Xeon E5 v3 provides an MSR (MSR_CORE_PERF_LIMIT_REASONS, described in Chapter 35 of Volume 3 of the SW Developer's Manual) that will tell whether any of a variety of throttling events have occurred since the last reset and whether the package is currently experiencing any of those throttling events.
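A minimal sketch of how a raw value from this MSR might be split into "currently throttling" and "throttled since last reset" parts is shown below. It assumes that the status bits occupy bits 15:0 and the matching log bits occupy bits 31:16; the exact bit assignments differ between processor models and should be checked against the MSR tables in the manual. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: status bits (throttling right now) are in bits 15:0. */
static uint32_t limit_status_bits(uint64_t v)
{
    return (uint32_t)(v & 0xFFFFu);
}

/* Assumption: log bits (throttled since last reset) are in bits 31:16. */
static uint32_t limit_log_bits(uint64_t v)
{
    return (uint32_t)((v >> 16) & 0xFFFFu);
}

/* Reasons that were logged at some point but are not active at the moment. */
static uint32_t limit_past_only(uint64_t v)
{
    return limit_log_bits(v) & ~limit_status_bits(v);
}
```

A value whose log half is all zero means that none of the tracked throttling reasons has been recorded since the log bits were last cleared.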
Many thanks for the suggestions,
I am in the process of doing some tests but will be off over the weekend so will come back next week with some results. Broadly so far:
1) This is promising and compiling with /arch:SSE4.1 does have an effect, but I need to do further tests as our code relies heavily on __m256 intrinsics, so the nature of the code changes greatly if these bits of code are removed and it is compiled with /arch:SSE4.1.
A question: Do you know whether this type of stall triggers FREQ_TRANS_CYCLES events? If so, can the FREQ_TRANS_CYCLES event be considered more of a catch-all for these types of stalls and not strictly for frequency transitions?
2) Also kind of promising: reading MSR 0x64C (MSR_TURBO_ACTIVATION_RATIO) came back as 0ULL. I tried writing 0x1f to the MAX_NON_TURBO_RATIO bits and it stuck, however this did not seem to have any effect as the spikes & FREQ_TRANS_CYCLES events still happen.
3) MSR_CORE_PERF_LIMIT_REASONS log bits all came back as zero.
The 256-bit transitions seem like the most likely candidate for the transitions you are seeing. I have not tried any of the uncore PCU counters, so I don't know if this is an event that would be picked up by the FREQ_TRANS_CYCLES event (or if there are any other PCU counters that might detect this). Neither the "256-bit emulation" nor the transition to 256-bit mode displayed any clear signals in the core performance counters -- except for the ~25,000 cycles that the core was halted.
BTW, I also noted that the 10-microsecond stall for transitioning into 256-bit mode disappeared at the lowest available frequency on my Xeon E5-2660 v3 systems. I did not check carefully to see if it was still there at slightly higher frequencies, but it was definitely present everywhere from the "minimum AVX frequency" to the maximum Turbo frequency.
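The relation between the ~25,000 cycle figure and the ~10 microsecond figure is simple arithmetic, sketched here under the assumption that the stall is counted in core cycles at the running frequency. The function name and the 2.5 GHz figure are illustrative only.

```c
/* Duration in microseconds of a stall lasting `cycles` core cycles
   at a core frequency of `ghz` GHz (1 GHz = 1000 cycles per microsecond). */
static double stall_us(double cycles, double ghz)
{
    return cycles / (ghz * 1000.0);
}
```

At 2.5 GHz, 25,000 cycles is 10 microseconds; at 3.1 GHz the same cycle count is about 8 microseconds, a large fraction of a ~20 microsecond processing period.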
I don't think that the Intel compilers have this option, but the GNU compilers have an option to generate 128-bit AVX vectors. This will (of course) reduce the peak theoretical performance compared to the CORE-AVX2 target, but should provide significant advantages over the SSE4.1 target (such as 3-operand instructions, FMA instructions, etc). Intrinsics are probably going to be a problem....
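As a sketch of what that might look like, the loop below is a hypothetical example of code the GCC auto-vectorizer can handle, with possible build lines in the comment. The flag spelling depends on the GCC version, and these flags only steer the auto-vectorizer; explicit __m256 intrinsics are unaffected.

```c
#include <stddef.h>

/* Possible builds (check the documentation for your GCC version):
     gcc -O3 -mavx2 -mfma -mprefer-avx128 ...            (older GCC releases)
     gcc -O3 -mavx2 -mfma -mprefer-vector-width=128 ...  (GCC 8 and later)
   Both ask the auto-vectorizer to prefer 128-bit vectors while still
   allowing VEX-encoded 3-operand and FMA instructions. */
static void scale_add(float *restrict out, const float *restrict a,
                      const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}
```

Inspecting the generated assembly for ymm registers is the simplest way to confirm what the compiler actually produced.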
I have not looked at the MSR_TURBO_ACTIVATION_RATIO register across a lot of systems (or operating systems), so I don't know whether it is handled differently. I use it in conjunction with changes to MSR_TURBO_RATIO_LIMIT, MSR_TURBO_RATIO_LIMIT1, and MSR_TURBO_RATIO_LIMIT2 to obtain stable frequencies between the nominal frequency and the max Turbo frequency for varying active core counts. This is discussed at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring...
I am not surprised that the MSR_CORE_PERF_LIMIT_REASONS log bits came out zero, but it is important to cover all the bases.