On Linux systems, if you have root access, you can use the "rdmsr.c" tool from "msr-tools-1.2" to read MSRs 0x198 and 0x199 on each core.
According to section 35.1 of Volume 3 of the Intel Architecture Software Developer's Manual (document 325284-051, June 2014), bits 15:0 of MSR 0x199 provide the "target performance state value". If the OS is configured to request maximum performance (for example, by using the "cpufreq" utility with the "performance" governor), then these bits will be programmed with the highest CPU multiplier supported by the processor.
- For example, on my Xeon E5-2680 (Sandy Bridge EP) processors, the nominal frequency is 2.7 GHz (multiplier=27 decimal, 1B hex), and the highest supported Turbo frequency is 3.5 GHz (multiplier=35 decimal, 23 hex), so the OS will program all cores with the value of 0x23 in bits 15:0 of MSR 0x199.
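If you would rather not build rdmsr.c, the same two registers can be read through the Linux "msr" driver. The following is only a rough sketch, assuming the msr kernel module is loaded, the script runs as root, and the bus clock is 100 MHz (consistent with multiplier 27 giving 2.7 GHz above). The function names are made up for illustration, and how the multiplier is packed inside bits 15:0 can differ between processor models, so check the raw field against the SDM for your part.

```python
import os
import struct

IA32_PERF_STATUS = 0x198  # "current performance state value"
IA32_PERF_CTL = 0x199     # "target performance state value"


def read_msr(cpu, address):
    """Read one 64-bit MSR on a logical CPU via /dev/cpu/N/msr (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, address)  # the file offset selects the MSR address
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]


def perf_state_field(msr_value):
    """Return bits 15:0 of an MSR value."""
    return msr_value & 0xFFFF


def multiplier_to_ghz(multiplier, bclk_mhz=100.0):
    """Convert a multiplier to GHz; the 100 MHz bus clock is an assumption."""
    return multiplier * bclk_mhz / 1000.0


def report():
    """Print the raw target and current fields for every logical CPU."""
    for cpu in range(os.cpu_count() or 1):
        target = perf_state_field(read_msr(cpu, IA32_PERF_CTL))
        current = perf_state_field(read_msr(cpu, IA32_PERF_STATUS))
        print(f"cpu {cpu}: target=0x{target:x} current=0x{current:x}")
```

Calling report() as root prints one line per logical CPU. With the values quoted above, multiplier_to_ghz(0x23) corresponds to the 3.5 GHz maximum Turbo frequency.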
Bits 15:0 of MSR 0x198 provide the "current performance state value". This is an instantaneous value, and so may not represent the average over any period of time. It is also possible that the act of interrupting the processor core to ask it to read this MSR will change the value. Intel recommends using the ratio of elapsed core cycles to elapsed reference cycles over an interval to obtain the average "actual" frequency during the interval. The "fixed-function" performance counters include the ability to measure both "actual" and "reference" core cycles, and these can be read in user space using the RDPMC instruction if the kernel has set the CR4.PCE bit. (This is the default in RHEL 6.4 and newer -- I don't know about other kernels and distributions).
- For example, on my (aggressively cooled) Xeon E5-2680 (Sandy Bridge EP) processors, "current performance state" almost always returns 0x1f (decimal 31), which corresponds to the 3.1 GHz that is the maximum Turbo frequency allowed when all cores are active.
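The ratio method that Intel recommends is just arithmetic on two pairs of counter readings. Here is a minimal sketch, assuming the reference-cycle counter ticks at the nominal (base) frequency and that neither counter wrapped during the interval. The function name and arguments are illustrative, and reading the counters themselves (RDPMC or otherwise) is left out.

```python
def average_frequency_ghz(core_start, core_end, ref_start, ref_end, nominal_ghz):
    """Average "actual" frequency over an interval from two counter snapshots.

    core_*: unhalted core cycle counts at the start and end of the interval
    ref_*:  unhalted reference cycle counts at the same two points
    nominal_ghz: rate at which reference cycles are assumed to tick
    """
    delta_core = core_end - core_start
    delta_ref = ref_end - ref_start
    if delta_ref <= 0:
        # No reference cycles elapsed (core halted) or the counter wrapped
        raise ValueError("reference cycle count did not advance")
    return nominal_ghz * delta_core / delta_ref
```

For a 2.7 GHz part, 3,100,000 core cycles elapsed against 2,700,000 reference cycles works out to an average of about 3.1 GHz. Because both counters only count unhalted cycles, this is the average frequency while the core was running, not averaged over idle time.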
I have not done a lot of testing on the Haswell EP, but my initial results showed that the "current performance state" was not very stable -- it fluctuated somewhat randomly between the base frequency and the maximum all-core Turbo frequency, with an average value somewhere in the middle. It will take experience with more systems and more workloads to draw any conclusions.
In case you are not comfortable with reading MSRs directly, there are two tools that might be of help:
Thomas's answer is much more practical than mine.
On Linux systems you can use "perf stat" on an executable -- the default output includes the average frequency (computed using the performance counters for unhalted core cycles and unhalted reference cycles). I don't know if Windows has any comparable tools.
For Xeon E5 Haswell processors (the "Grantley" platform) the maximum Turbo frequency depends on the number of cores "active" (C0 or C1 states) as well as whether the core is executing 256-bit AVX/AVX2 instructions. The full list of maximum Turbo frequencies for "normal" and "AVX" operation for these processors is contained in the "Intel Xeon Processor E5 v3 Product Families Specification Update" (document 330785, version 002, October 2014).
On the Haswell, allowing applications running on Intel OpenMP to use more than one thread per core seems to depress Turbo boost, cutting performance of interspersed single-thread regions. The Haswells I've seen either don't have a BIOS option to disable HT, or (on E7) won't boot up with HT disabled. It seems necessary to limit the number of threads to the number of cores and use affinity settings to keep the threads on separate cores, in order to avoid the single-thread slowdown (which persists far longer than the milliseconds I would have expected).
I don't get the same effect when running applications with libgomp. The dual core Haswell runs well with 3 threads and no affinity setting. One difference we've noted is that libgomp takes significantly longer than libiomp5 to enter parallel regions.
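The workaround described above for Intel OpenMP (one thread per physical core, each kept on its own core) might be scripted roughly as follows. This is only a sketch: it assumes an x86 Linux /proc/cpuinfo layout in which "physical id" precedes "core id" for each logical CPU, and it uses the standard OpenMP 4.0 variables OMP_PLACES and OMP_PROC_BIND rather than whatever affinity setting was actually used in the tests above (Intel's runtime also has its own KMP_AFFINITY variable). The helper names are invented for illustration.

```python
import os


def count_physical_cores(cpuinfo_text):
    """Count distinct (package, core) pairs in /proc/cpuinfo-style text."""
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            cores.add((physical_id, value.strip()))
    return len(cores)


def openmp_env(num_cores, base_env=None):
    """Return an environment that runs one pinned OpenMP thread per core."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_cores)  # no more threads than cores
    env["OMP_PLACES"] = "cores"              # each place is one physical core
    env["OMP_PROC_BIND"] = "spread"          # bind threads across those places
    return env
```

A launcher could then read /proc/cpuinfo, pass the text to count_physical_cores(), and hand openmp_env(n) as the env argument of subprocess.run() for the application binary.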
My Xeon E5 v3 (Haswell EP) systems (three different model numbers) are all running with HyperThreading disabled in the BIOS. The option we disabled might have been called "logical processors". (I'm not sure about the exact wording -- I do recall that it was not completely unambiguous.)
I verified that the lower frequencies on the Haswell EP systems are triggered by the use of 256-bit registers, not by AVX encodings of scalar or 128-bit register operations.
I have not tried to disable HyperThreading on my client Haswell processor (Core i7-4960HQ). These don't appear to have a *specified* decrease in maximum Turbo frequency when 256-bit AVX instructions are in use, but it would be easy to believe that the max frequency is lower in practice due to current, power, or temperature limitations.
I wouldn't be surprised if use of AVX-256 instructions influenced Turbo stepping, but I didn't see such an effect.
I'm using mostly AVX2 with both Intel and gnu compilers. The gnu compilers run slightly faster at 3 threads than 2, with no impact on the single threaded regions.