I have a few random number generators, each with ANSI C, SSE, AVX2 and AVX512 versions. The SIMD functions are implemented as inline assembly in the C code.
On CPUs without AVX512 but with AVX2, the AVX2 versions are about 30-40% faster. The AVX512 versions, however, improve on AVX2 by less than 10%, and are sometimes even slower.
From what I found by googling, this phenomenon can be explained by AVX512 frequency throttling.
So two questions arise: how can I measure the CPU frequency while executing the asm code, and could there be another reason for such small improvements?
If the fixed-function hardware performance counters are enabled, you can read the core cycle counter with simple inline assembly code. The function "rdpmc_actual_cycles()" in "low_overhead_timers.c" at https://github.com/jdmccalpin/low-overhead-timers provides a specific example (using mixed C and assembly code).
Because of the strong variations in clock frequency with recent processors, I strongly support your idea of measuring the actual frequency that the processor was running at during the test. I usually read the elapsed time with the RDTSCP instruction, the reference cycles not halted with the "rdpmc_reference_cycles()" function, and the actual cycles not halted with the "rdpmc_actual_cycles()" function. From these you can easily compute the fraction of time the processor was not halted and the average frequency during the interval. Note that the performance counters are separate for each logical processor, so you will need to bind the execution to a single logical processor for the measurement.
You could also take a look at avx-turbo - I wrote this to do more or less what you are trying to do: detect code sequences that trigger the various AVX turbo frequencies (Intel calls them licenses, and there are three of them: L0, L1 and L2).
Its primary approach is to read the APERF and MPERF MSRs, which count actual cycles and nominal (reference) cycles, and to calculate the frequency from those. You could put your code in there as a new test. It only runs on Linux, but I guess you could use the same techniques on other platforms if you find a way to read those MSRs.
You might also be interested in this article which explains in some detail the AVX frequency mechanism and gives some advice. Finally, as also mentioned in that article, you can actually measure directly the amount of time the CPU spends in each of the three frequency licenses using the events CORE_POWER.LVL0_TURBO_LICENSE, CORE_POWER.LVL1_TURBO_LICENSE, CORE_POWER.LVL2_TURBO_LICENSE. You can use these to confirm your frequency measurements, and to rule out other effects such as power limit or thermal throttling.
APERF and MPERF have the advantage of being 64-bit counters, so there are no overflows to worry about. If I recall correctly, some earlier versions of the Linux kernel (2.6 maybe?) cleared these counters when they were read, but later kernels (3.0+) just leave them running and take differences. Like the fixed-function performance counters, these have logical-processor scope, so you have to pin the code under test to one logical processor if you want differences to make any sense.
APERF and MPERF have the disadvantage of being MSRs, so if you are not running inside the kernel you have to do a system call to read them. This increases the overhead from ~20 cycles (to execute RDPMC) to >2000 cycles (to execute a pread()) on the /dev/cpu/[nn]/msr device driver for the logical processor ("[nn]") of interest -- and also requires root privileges. Starting in the 4.3 kernels, the Linux "perf" subsystem supports APERF and MPERF, so it is easy to use for whole program measurements with something like "perf stat -a -e msr/aperf/,msr/mperf/ a.out". (Ref: http://www.paradyn.org/petascale2016/slides/CSCADS2016_perf_events_status_update.pdf)
McCalpin, John (Blackbelt) wrote:
APERF and MPERF have the disadvantage of being MSRs, so if you are not running inside the kernel you have to do a system call to read them. This increases the overhead from ~20 cycles (to execute RDPMC) to >2000 cycles (to execute a pread()) on the /dev/cpu/[nn]/msr device driver for the logical processor ("[nn]") of interest -- and also requires root privileges.
Yes, for sure. It is only appropriate when you are measuring large regions, such as those taking millions of cycles or more. You can actually control the privileges required to read the MSRs by adjusting the permissions on the msr device file, but that's probably a bad idea unless the machine only runs trusted code, or unless you lock access down to a specific group.
rdpmc is the best approach for short code regions, but you need to ensure the fixed-function counters are enabled.