In profiling our market feed handling application, VTune flagged a high rate of calls to "rdtscp" as a major contributor to the application's high CPI. I'm trying to convince our team that a serializing call such as this is necessary only when *benchmarking* our code, and that a vanilla "rdtsc" should be fine in production to get a general understanding of where time is spent among the major blocks of our code.
However, in testing out the effects of using vanilla "rdtsc", VTune flags this call with the same fingerprint that I attributed to the serializing effect of "rdtscp"; namely, a high count for L1 stalls that are not caused by data misses, and high counts for cycles with 0 or only 1 execution port utilized. At least I *thought* those were the symptoms of a serializing call, but perhaps I was mistaken.
Why would this be for calls to "rdtsc"?
RDTSC and RDTSCP are microcoded instructions that require roughly 15-25 micro-ops and have a latency of 25-35 cycles (this varies a bit across processor models; see http://www.agner.org/optimize/instruction_tables.pdf).
For an aggressive out-of-order processor like recent Intel models, I don't know whether the presence of long-latency instructions in the pipeline influences how the processor reports the program counter when a sampling interrupt is received. It is easy to imagine that a long-latency instruction like one of these could show up disproportionately in the profile. Does anyone here have any insight on this particular issue? I have not found the discussion in Chapter 6 of Volume 3 of the SW Developer's Manual to be particularly helpful...
I was hoping you would see this and respond! :)
Regarding the earmarks of a serializing instruction: wouldn't you expect to see notable differences in the referenced PMU counters (e.g., L1 stalls without a data miss, and the percentage of cycles with only 0 or 1 execution ports in use), since only one of these two instructions is negatively impacting Broadwell's large OoO window?
I recently did a lot of experiments with RDTSC and RDTSCP to see how much wall clock time they added to the total execution time in realistic situations.
The experiment was simple -- execute either RDTSC or RDTSCP before and after a simple summation loop of variable length. The summation is on a scalar double precision variable, and the compiler leaves the loop with a single VADDSD instruction which should have an execution cost (due to pipeline latency) of 3 cycles per iteration. The experiment was done on a Xeon E5-2680 (Sandy Bridge), with the frequency pinned to 2.70 GHz so that TSC cycles and core cycles are the same.
For each inner loop length, the (RDTSC/P, inner loop, RDTSC/P) sequence was repeated 100 times and the average elapsed cycle count per inner loop was computed. If the RDTSC/P and loop control had zero overhead, then the expected execution time would be 3 cycles per element of the inner loop.
The results are not simple, but they are certainly very interesting...
The attached image ("RDTSC_overhead.png") shows the average cycle counts as a function of inner loop length when using either RDTSC or RDTSCP.
Both curves show a clear break in behavior starting at an inner loop length of 33. This corresponds to the expected behavior of the branch predictor -- for loops with a trip count of 32 or less, the branch predictor "remembers" which iteration will be the final iteration and correctly predicts that the conditional branch at the end of the loop will fall through on that last iteration. For inner loop lengths of 33 or larger, the branch predictor does not "remember" which iteration is the final one, so there is an extra ~20 cycle penalty for mispredicting the direction of the branch at the end of the final inner loop iteration.
For k>=33, the two curves are very clean -- both have a slope of 3 cycles per inner loop iteration, but with different intercepts. With the RDTSC instruction as the timer, the intercept (i.e., the extrapolated value for k=0) is 17 cycles, while the case with RDTSCP as the timer shows an intercept of 43 cycles.
- With RDTSC, the overhead of 17 cycles shows that the latency of the RDTSC instruction is clearly overlapped with the latency of the mispredicted branch at the end of the inner loop, and likely also with the 3-cycle latency of the final VADDSD instruction.
- With RDTSCP, the overhead of 43 cycles shows that the latency of the RDTSCP instruction is not overlapped with the latency of the mispredicted branch at the end of the inner loop or with the latency of the final VADDSD instruction.
- Assuming 17 cycles for the mispredicted branch and 3 cycles for the VADDSD, the effective latency of the RDTSCP instruction is 23 cycles, which is plausible. (Repeated executions of RDTSCP give slightly higher latency, but that may be an artifact of the requirement that consecutive RDTSCP instructions must return monotonically increasing values.)
For k<33 the interpretation is less clear. Both cases show two sub-regions, with different behavior:
- Using RDTSC:
- For k<12, the average execution time is a constant 28 cycles -- similar to the minimum repeat time of the RDTSC instruction in isolation.
- For 11<k<33, the average execution time is roughly the constant 28 cycles plus one cycle for each additional element (beyond the 11 from the first segment).
- Using RDTSCP:
- For k<10, the average execution time is approximately 39 cycles plus one cycle per element -- similar to the minimum repeat time of the RDTSCP instruction in isolation.
- For k>10, the average execution time increases by three cycles per additional element.
These behaviors probably indicate interesting and profound things about the microarchitectural implementation of the RDTSC/RDTSCP instructions, but I am unable to come up with a single coherent story right now....
The overall lessons are pretty clear:
- Once the loop is long enough to include a mispredicted terminal branch, the effective latency of the two RDTSC instructions is near zero, while the effective latency of the RDTSCP instruction is in the range of 24 cycles.
- If the loop takes less than ~40 cycles, the overhead of the RDTSC or RDTSCP instructions will make interpretation extremely difficult.
Very interesting test. One thing is for sure from this graph -- "rdtscp" is certainly an improvement over the old "cpuid;rdtsc" sequence. And in testing on our production code, switching out "rdtscp" for "rdtsc" made no noticeable dent in tick-to-trade latency through the System Under Test (SUT) in our lab, as measured via network taps attached to a Corvil latency monitoring appliance. Hmmm...
If changing from RDTSCP to RDTSC does not change the overall performance by a detectable amount, then it is likely that neither of them is actually a problem, and the VTune analysis is biased.
In my post above, I made reference to the minimum latency for consecutive RDTSC and RDTSCP instructions, but I did not include the details. On my Xeon E5-2680 (Sandy Bridge) with the frequency pinned to 2.70 GHz (so TSC clocks and core clocks are the same):
- The latency for a sequence of repeated RDTSC instructions (including storing the 64-bit output) is almost always 28 cycles.
- The latency for a sequence of repeated RDTSCP instructions (including storing the 64-bit output) is almost always either 36 or 40 cycles.
For these experiments I only did 64 measurements per test and made sure that the array that I wrote the timing values to was dirty in the L1 data cache at the beginning of the tests. I manually unrolled the loop by 8 and ignored any results that look like they were biased by the overhead of loop control (typically only happened between the last read of the first iteration of the loop and the first read of the second iteration of the loop).
The fixed-function performance counters for unhalted cycles have very slightly higher minimum latencies:
- The latency for a sequence of repeated RDPMC instructions reading Fixed-Function Counter 1 ("Actual Cycles Not Halted") varies from 40-42 cycles.
- The latency for a sequence of repeated RDPMC instructions reading Fixed-Function Counter 2 ("Reference Cycles Not Halted") is either 27 or 54, with an average of about 42 cycles.
- The programmable counter counts only the number of reference clocks, while this fixed-function counter increments 27 times (the TSC ratio) for every tick of the reference clock.