I'm trying to determine if (what appears to be) unexpected (below base frequency) throttling on my new system is being caused by AVX usage when I run various stress programs like Prime-95 and the Intel Processor Diagnostic Tools Floating Point and Prime number tests.
I came across an Intel document at http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-up... that indicates reduced processor speeds may be encountered when AVX instructions are run on an E5-1650 v3.
I have an E5-1650 v4, but cannot immediately find similar specifications for it (that is, whether AVX usage is expected to cause throttling, and by how much).
The symptom I'm seeing on my new system is that some stress tests (like Intel XTU Stress CPU, and the Intel Processor Diagnostic tool CPU Load test) will run up to all available cores at 38x.
But running one or more threads of Prime-95 (which normally has 'AVX' and 'AVX2' listed as enabled) will cause the cores running these threads to down-clock to 35x. Cores not running Prime-95 still report 38x (under the Windows 7 High performance profile). Temperatures all appear to be fine.
Similarly, I see the slower 35x multiplier when running the XTU benchmark and the Intel Processor Diagnostic Tool's Floating Point and Prime Number tests.
I'm trying to figure out if what I see is expected for this CPU. Any help would be appreciated.
Unfortunately, Intel only published the full set of frequency values for the Xeon E5 v3 family...
Going back to the Xeon E5-1650 v3, there are three tables in the specification update (linked in the original note).
- Table 1 gives an overview of the processor: nominal frequency, DDR frequency, number of cores, TDP, etc.
- Xeon E5-1650 v3 says 3.5GHz, 6 cores, 140 Watts
- Table 2 shows the maximum Turbo frequency as a function of the number of cores in use
- Xeon E5-1650 v3 says max Turbo is 3.8 GHz using 1-2 cores and 3.6 GHz using 3-6 cores
- Table 3 shows the "AVX Core Frequency", and the max "AVX Turbo" frequency as a function of the number of cores in use.
- Xeon E5-1650 v3 says "AVX Core frequency" is 3.2 GHz, while "AVX Turbo" frequency is 3.7 GHz using 1-2 cores and 3.5 GHz using 3-6 cores.
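The three table entries for the Xeon E5-1650 v3 quoted above can be collected in one place, which makes the size of the 256-bit penalty easier to see. This is only a small Python sketch: the variable and function names are made up for illustration, and the numbers are just the ones listed above.

```python
# Values for the Xeon E5-1650 v3 as quoted above from the specification update.
NOMINAL_GHZ = 3.5   # Table 1
AVX_CORE_GHZ = 3.2  # Table 3, "AVX Core Frequency"

# Maximum Turbo frequency (GHz) keyed by number of active cores.
MAX_TURBO = {n: 3.8 if n <= 2 else 3.6 for n in range(1, 7)}      # Table 2
MAX_AVX_TURBO = {n: 3.7 if n <= 2 else 3.5 for n in range(1, 7)}  # Table 3


def avx_turbo_drop(active_cores):
    """Reduction in max Turbo (GHz) when 256-bit registers are in use."""
    return round(MAX_TURBO[active_cores] - MAX_AVX_TURBO[active_cores], 1)


for n in range(1, 7):
    print(f"{n} cores: {MAX_TURBO[n]:.1f} -> {MAX_AVX_TURBO[n]:.1f} GHz "
          f"(drop {avx_turbo_drop(n):.1f} GHz)")
```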
This information is all rather confusing, so I will try to clarify what it means, and then guess how it might apply to the Xeon E5-1650 v4.
- First, the label "AVX" is a poor choice. The difference between Tables 2 and 3 is the use of 256-bit registers, not the use of AVX/AVX2 instructions. The processor will run at the speeds shown in Table 2 (the "non-AVX" case) if the code uses only scalar AVX/AVX2 instructions and/or 128-bit SIMD AVX/AVX2 instructions.
- Second, the "max Turbo" frequencies shown in Tables 2 and 3 are the maximum frequencies that will be allowed for the various core counts assuming that no other limit has been reached. Power and Temperature are the most important limiters.
- Third, the "AVX Core frequency" is effectively a *lower bound* on frequency due to package power limitations.
- In some places Intel has stated that with 256-bit register use, *all* operation above the "AVX Core frequency" is "opportunistic" Turbo behavior.
- The "AVX Core frequency" is set so that running an extremely computationally intense job using 256-bit registers will hit the power limit at or slightly above this frequency. I have tested this on a fairly large number of Xeon E5 v3 processors running the Intel HPL benchmark on all cores, and all of them hit the power limit at somewhere between the "AVX Core Frequency" and 0.1 GHz above the "AVX Core Frequency".
- I have not tested this on our Xeon E5 v4 processors, but the principle is likely the same --- running 256-bit registers requires that the upper 128-bit pipeline be enabled, and that should allow you to run out of power before you hit the maximum Turbo frequency.
- Fourth, the same mechanism that throttles core frequency when hitting the power limit also throttles core frequency when hitting temperature limits.
- I had one Xeon E5 v3 node with a cooling problem, and found that the processors were throttled by hitting the temperature limit long before they were throttled due to power limits.
- Since a significant portion of the power consumption under heavy load is due to leakage current (which increases very rapidly with temperature), a more aggressive cooling solution that keeps the die temperature lower should also help the processor attain higher Turbo frequencies before hitting the power limit.
- Depending on how much of the power consumption is leakage current and how much is dynamic power consumption, an extremely aggressive cooling system might enable running at the "Max AVX Turbo" frequency before hitting the power limit.
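One way to read the four points above is that the sustained frequency is whichever ceiling is reached first: the max Turbo for the active core count, the power limit, or the temperature limit. The Python sketch below is a simplification of that idea, not a model of the actual control loop; the function name is made up, and the power- and thermal-limited numbers are hypothetical illustrations, not measurements.

```python
def sustained_ghz(max_turbo_ghz, power_limited_ghz, thermal_limited_ghz):
    """Return the lowest of the applicable frequency ceilings (GHz)."""
    return min(max_turbo_ghz, power_limited_ghz, thermal_limited_ghz)


# Xeon E5-1650 v3, 256-bit code on all 6 cores: max "AVX Turbo" is 3.5 GHz.
# The other two arguments are hypothetical values chosen for illustration.
power_bound = sustained_ghz(3.5, power_limited_ghz=3.3, thermal_limited_ghz=3.6)
thermal_bound = sustained_ghz(3.5, power_limited_ghz=3.3, thermal_limited_ghz=3.0)

print(f"well cooled:   {power_bound:.1f} GHz (power limit reached first)")
print(f"poorly cooled: {thermal_bound:.1f} GHz (temperature limit reached first)")
```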
The Xeon E5-1650 v4 is a 6-core processor with a nominal frequency of 3.6 GHz and a maximum Turbo frequency of 4.0 GHz. I don't see the table of maximum Turbo frequency as a function of the number of cores in use in the datasheet or specification update, but the information is available from MSR 0x1AD (MSR_TURBO_RATIO_LIMIT). I will assume that the values at https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors are correct, giving a maximum Turbo frequency of 4.0 GHz for 1-2 cores and 3.8 GHz for 3-6 cores. Assuming that the Xeon E5 v4 is governed by similar physics, I would expect it to have a power-limited frequency of about 3.3 or 3.4 GHz. With a good cooling system that would correspond to a sustained (power-limited) frequency of about 3.5 GHz when running a compute-intensive workload using 256-bit registers on all the cores.
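For anyone who does read MSR 0x1AD (for example with rdmsr from msr-tools on Linux), the value has to be split into per-core-count ratios. The Python sketch below assumes the common layout in which each successive byte holds the maximum ratio for one more active core, and a 100 MHz bus clock; check both against the documentation for your processor. The raw value is hypothetical, constructed to match the Wikipedia figures above rather than read from real hardware.

```python
BCLK_MHZ = 100  # assumed bus clock; a ratio of 38 then means 3.8 GHz


def decode_turbo_ratio_limit(raw, cores=6):
    """Split a 64-bit MSR 0x1AD value into max ratios per active-core count.

    Assumes byte k (bits 8k+7:8k) is the max ratio with k+1 active cores.
    """
    return [(raw >> (8 * k)) & 0xFF for k in range(cores)]


# Hypothetical raw value (0x28 = 40, 0x26 = 38), not read from hardware.
raw = 0x262626262828

for n, ratio in enumerate(decode_turbo_ratio_limit(raw), start=1):
    print(f"{n} active cores: {ratio}x = {ratio * BCLK_MHZ / 1000:.1f} GHz")
```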
Unfortunately I don't know how to obtain the "AVX Core Frequency" or the "Max AVX Turbo Frequency" as a function of core count by reading MSRs on any Intel systems, so if Intel does not publish these numbers, then we are limited to guessing and measuring.
Curiously, the same issue applies to the Xeon Phi x200 (Knights Landing) processors, but the formula for determining the maximum frequency when using 512-bit registers was published as a footnote in the Xeon Phi x200 product brief. Maybe the values for Xeon E5 v4 are also hiding in plain sight in a different document....
I've been playing around seeing what programs trigger the slower speeds and what ones don't. (As an aside, so far I've never managed to see 40X used in any situation, for what it's worth. CPU package temp never gets above 62 C, and the cores always report less than that).
Since Intel implemented a separate set of speeds for 'AVX' (or rather register-size) use, one would have thought that putting these on the matching ARK pages would have made things much easier...
It seems that the limit I'm seeing may not be total power (if the readings reported by AIDA64 are correct) but perhaps some internal subset limit.
I found that Prime-95 has a configuration file that says you can disable 'AVX' and 'AVX2' usage. (As you mentioned, it may not really be AVX per se so much as the associated register size.)
So, I tried running Prime-95 with and without 'AVX' and 'AVX2' enabled under the 'Balanced' Windows 7 power setting. Running without these enabled shows all busy cores running at 38x (no throttling).
The peak power readings I see (as reported by AIDA64) are:
Idle: 12x, 8.47 W, 11 A
Prime-95, one thread, AVX/AVX2 enabled, throttling at 35x (one core): 24.46 W, 22 A, 1.112 V
Prime-95, 12 threads, AVX/AVX2 enabled, throttling at 35x (all cores): 59.60 W, 59 A, 1.010 V
Prime-95, one thread, AVX/AVX2 disabled, no throttling, 38x (one core): 28.58 W, 24 A, 1.191 V
Prime-95, 12 threads, AVX/AVX2 disabled, no throttling, 38x (all cores): 71.44 W, 60 A, 1.191 V
Intel Processor Diagnostic Tool, 'AVX test', no throttling, 38x (one core): 26.19 W, 22 A, 1.191 V
Intel Processor Diagnostic Tool, 'Floating Point test', throttling at 35x (all cores): 37.80 W, 34 A, 1.112 V
If that's to be believed (and I'm guessing the real power should be a bit higher), there's a throttling case of one-core at only 24.46 W total, yet a Non-throttling all-cores case with higher total power of 71.44 W.
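As a sanity check on those readings, the reported wattage can be compared against voltage times current for each row. The Python sketch below does just that, using the values listed above (the row labels are abbreviated by me). The products agree with the reported power to within a few hundredths of a watt, which suggests the three columns are not independent measurements, so they may deserve the same caution as the power figure itself.

```python
# (label, reported watts, amps, volts) from the AIDA64 readings listed above.
readings = [
    ("P95 1 thread, AVX on, 35x", 24.46, 22, 1.112),
    ("P95 12 threads, AVX on, 35x", 59.60, 59, 1.010),
    ("P95 1 thread, AVX off, 38x", 28.58, 24, 1.191),
    ("P95 12 threads, AVX off, 38x", 71.44, 60, 1.191),
    ("IPDT AVX test, 38x", 26.19, 22, 1.191),
    ("IPDT Floating Point, 35x", 37.80, 34, 1.112),
]

for label, watts, amps, volts in readings:
    product = volts * amps
    print(f"{label:30s} reported {watts:6.2f} W, "
          f"V*I = {product:6.2f} W, diff {watts - product:+.2f} W")
```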
I also found a tool called 'Limit reasons' that claims to indicate the reason for throttling (by showing little 'flags'). Not sure if it really works for Xeons, but when I see the 35x mode, it sets a flag labeled 'Core P1'. Unfortunately, it doesn't bother to explain what that means...
Would all of that seem to indicate that what I'm seeing are the 'normal' limits expected from this type of CPU and usage (perhaps except for 40x never being seen)?
The "throttling" when using a single core with 256-bit instructions is characteristic of all of the Xeon E5 v3 processors that support Turbo (according to the tables in the Xeon E5 v3 Specification Update). Max Turbo frequency drops of 0.2 GHz are most common, but 0.1 GHz and 0.3 GHz are also common in the tables. (It is a pain to compute these, since the three tables present the SKUs in different orders, so you have to do a lot of searching to find the info for any single product.) It is not clear whether this reduction in maximum Turbo frequency is due to critical path differences when the upper 128-bit pipeline is enabled or whether it is due to local power density limitations (or both, or something else), but there are lots of plausible mechanisms that would account for this.
It looks like the Xeon E5-1650 v4 is a 140 Watt part, so none of your numbers are very close to the limit. This may indicate that the Prime-95 benchmark does not make terribly effective use of vectorization. You might want to try the Intel xHPL benchmark, which generates the highest power consumption (and lowest sustained frequencies) of any benchmark we have looked at. The benchmarks are available in the packages at https://software.intel.com/en-us/articles/intel-mkl-benchmarks-suite
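To put "not very close to the limit" in numbers, the reported peak readings can be expressed as a fraction of the 140 W figure. A quick Python sketch, using the AIDA64 values from the earlier post (which may well under-report real package power; the labels are abbreviated for illustration):

```python
TDP_WATTS = 140  # package power figure cited above

# Peak power readings from the earlier post, in watts.
peak_watts = {
    "Prime-95, 12 threads, AVX on (35x)": 59.60,
    "Prime-95, 12 threads, AVX off (38x)": 71.44,
    "IPDT Floating Point (35x)": 37.80,
}

for label, watts in peak_watts.items():
    print(f"{label}: {watts:.2f} W = {watts / TDP_WATTS:.0%} of {TDP_WATTS} W")
```

Even the highest reading is only about half of the package limit, which is consistent with the throttling to 35x not being a package power limit.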
Intel has a number of mechanisms that influence the Turbo frequency used, and it can be hard to understand what is controlling the frequency. In particular, I find that "Energy Efficient Turbo" often leads to lower frequencies (for both the cores and the uncore), as does anything other than "performance" in the "Energy-Performance Bias" setting. The behavior is generally entirely reasonable for general-purpose use, but to reduce performance variability I usually pin everything to "max performance" values when I am doing measurements that feed into performance models.