Hello! I noticed that code alignment influences the result of my benchmarks.
For example, consider the following code, which has nothing between the rdtsc at the beginning and the rdtscp at the end except the mov that saves the start value.
    asm volatile(
        // ".align 16\n"
        "lfence\n"
        "rdtsc\n"
        "mov %%eax, %%r8d\n"
        // nothing in between
        "rdtscp\n"
        "sub %%r8d, %%eax\n"
        : "=a"(timing)
        :
        : "rcx", "rdx", "r8", "memory");
Without the `.align 16` directive, the latency measured by this code averages 33.32 cycles (with a standard deviation of 1.65 cycles).
With the `.align 16` directive, the measured latency averages 32.44 cycles (with a standard deviation of 1.55 cycles).
With `.align 32`, the measured latency averages 32.20 cycles (with a standard deviation of 1.39 cycles).
Why does code alignment affect the results in this way?
Thanks in advance!
It looks like all of your averages are within one standard deviation of each other?
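As a rough way to check this, the difference between two averages can be compared against the standard error of the means rather than against the per-sample standard deviation. That needs the number of samples per configuration, which is not given above. The following is only a sketch under that assumption: the names `summary_t`, `summarize`, and `differs_by_2se` are made up for illustration, and the "more than two standard errors" threshold is a common rule of thumb, not a rigorous test.

```c
#include <assert.h>
#include <stddef.h>

/* Summary statistics for one set of timing samples (hypothetical helper type). */
typedef struct {
    size_t n;        /* number of samples */
    double mean;     /* arithmetic mean, in cycles */
    double variance; /* sample variance (n - 1 denominator), in cycles^2 */
} summary_t;

/* One-pass (Welford) computation of mean and sample variance. */
static summary_t summarize(const unsigned *samples, size_t n)
{
    assert(n >= 2);
    double mean = 0.0, m2 = 0.0;
    for (size_t i = 0; i < n; i++) {
        double x = (double)samples[i];
        double delta = x - mean;
        mean += delta / (double)(i + 1);
        m2 += delta * (x - mean);
    }
    summary_t s = { n, mean, m2 / (double)(n - 1) };
    return s;
}

/* Returns 1 if the two means differ by more than two standard errors of
   their difference. Squared quantities are compared so that no square
   root (and no libm) is needed. */
static int differs_by_2se(summary_t a, summary_t b)
{
    double diff = a.mean - b.mean;
    double se2 = a.variance / (double)a.n + b.variance / (double)b.n;
    return diff * diff > 4.0 * se2;
}
```

With a few thousand samples per configuration, a difference of about one cycle against a standard deviation of about 1.6 cycles would exceed this threshold easily, so the sample count matters when deciding whether the alignment effect is real.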
RDTSCP is a microcoded instruction that executes 15 to 24 micro-ops, depending on the Intel processor architecture. I don't know of any studies of the impact of code alignment on microcoded instructions. RDTSCP also acts as a one-way instruction execution fence: it cannot begin execution until all prior instructions in program order have executed (not necessarily retired). This may also interact with microcode execution in unexpected ways.
I have done a number of studies of the overhead and variability of various timers. My tests typically include saving the results from each call (sometimes 32 bits, sometimes all 64 bits), which requires careful attention to implementation so that the stores of the results don't miss in the L1 Data Cache. In one set of tests on a Xeon E5-2680 system, a set of ~560 individual measurements showed 36 cycles 382 times, 40 cycles 180 times, and 56-108 cycles five times. RDTSCP is much faster on a Xeon Platinum 8160 (Skylake Xeon), returning 18 cycles about 40% of the time and 20 cycles about 60% of the time. Some of my tests used a single RDTSCP instruction per loop iteration, so they will always have the same alignment, while other tests unrolled the loop to see if loop control overhead made a difference. I was mostly thinking about the differences between consecutive RDTSCP calls and those separated by the loop compare and branch, but alignment could have had an influence on these results as well.
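To illustrate the kind of measurement described above, here is a minimal sketch that times back-to-back RDTSCP calls, stores the low 32 bits of each difference in a small buffer that has been touched beforehand, and then counts how often each value occurs. This is not the code from the tests mentioned above. The names `collect`, `histogram`, `NSAMPLES`, and `MAXBIN` are invented for the example, and it assumes GCC or Clang on x86-64, where `__rdtscp` is available from `<x86intrin.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>   /* __rdtscp (GCC/Clang, x86-64 assumed) */

#define NSAMPLES 1024    /* 4 KiB of results, small enough to stay in L1D */
#define MAXBIN   256     /* differences >= MAXBIN land in an overflow bin */

static uint32_t deltas[NSAMPLES];

/* Time n back-to-back RDTSCP pairs, keeping the low 32 bits of each
   difference. The buffer is written once before the timed loop so that
   its pages are mapped and its lines are likely to be in the L1 data cache. */
static void collect(uint32_t *out, size_t n)
{
    unsigned aux;  /* receives IA32_TSC_AUX; not used here */
    memset(out, 0, n * sizeof out[0]);
    for (size_t i = 0; i < n; i++) {
        uint64_t t0 = __rdtscp(&aux);
        uint64_t t1 = __rdtscp(&aux);
        out[i] = (uint32_t)(t1 - t0);
    }
}

/* Count occurrences of each difference; hist must have MAXBIN + 1 entries. */
static void histogram(const uint32_t *in, size_t n, size_t *hist)
{
    assert(hist != NULL);
    memset(hist, 0, (MAXBIN + 1) * sizeof hist[0]);
    for (size_t i = 0; i < n; i++) {
        size_t bin = in[i] < MAXBIN ? in[i] : MAXBIN;
        hist[bin]++;
    }
}
```

Calling `collect(deltas, NSAMPLES)` followed by `histogram(deltas, NSAMPLES, hist)` gives a distribution that can be compared across alignments. Looking at the whole histogram rather than only the mean makes it easier to see whether a shift comes from the common case or from a handful of outliers.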
A more recent version of my tests is included in the LowOverheadTimersTests directory in https://github.com/jdmccalpin/low-overhead-timers and there is a discussion of some of the issues in timing very short code sections at http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/