In profiling our market feed handling application, VTune flagged a high rate of calls to "rdtscp" as a culprit for the application's high CPI. I'm trying to convince our team that a serializing call such as this is necessary only when *benchmarking* our code, and that a vanilla "rdtsc" should be fine in production for a general understanding of where time is spent between major blocks of our code.
However, when I tested vanilla "rdtsc", VTune flagged that call with the same fingerprint I had attributed to the serializing effect of "rdtscp": namely, a high count of L1 stalls that are not caused by data misses, and high counts of cycles with only 0 or 1 execution ports utilized. At least I *thought* those were the symptoms of a serializing call, but perhaps I was mistaken.
Why would this be for calls to "rdtsc"?
RDTSC and RDTSCP are microcoded instructions that require roughly 15-25 micro-ops and have a latency of 25-35 cycles (this varies a bit across processor models; see http://www.agner.org/optimize/instruction_tables.pdf).
For an aggressive out-of-order processor like recent Intel models, I don't know whether the presence of long-latency instructions in the pipeline influences how the processor reports the program counter when a sampling interrupt is received. It is easy to imagine that a long-latency instruction like one of these could show up disproportionately in the profile. Does anyone here have any insight on this particular issue? I have not found the discussion in Chapter 6 of Volume 3 of the SW Developer's Manual to be particularly helpful...
I was hoping you would see this and respond! :)
Regarding the earmarks of a serializing instruction, wouldn't you expect to see notable count differences in the PMU counters referenced (e.g., L1 latency w/o data miss, and percentage of time with only 0 or 1 execution ports in use), since only one of the two instructions should be hampering Broadwell's large OoO window?
I recently did a lot of experiments with RDTSC and RDTSCP to see how much wall clock time they added to the total execution time in realistic situations.
The experiment was simple -- execute either RDTSC or RDTSCP before and after a simple summation loop of variable length. The summation is on a scalar double precision variable, and the compiler leaves the loop with a single VADDSD instruction which should have an execution cost (due to pipeline latency) of 3 cycles per iteration. The experiment was done on a Xeon E5-2680 (Sandy Bridge), with the frequency pinned to 2.70 GHz so that TSC cycles and core cycles are the same.
For each inner loop length, the (RDTSC/P, inner loop, RDTSC/P) sequence was repeated 100 times and the average elapsed cycle count per inner loop was computed. If the RDTSC/P and loop control had zero overhead, then the expected execution time would be 3 cycles per element of the inner loop.
The results are not simple, but they are certainly very interesting...
The attached image ("RDTSC_overhead.png") shows the average cycle counts as a function of inner loop length when using either RDTSC or RDTSCP.
Both curves show a clear break in behavior starting at an inner loop length of 33. This corresponds to the expected behavior of the branch predictor -- for loops with a trip count of 32 or less, the branch predictor "remembers" which iteration will be the final iteration and correctly predicts that the conditional branch at the end of the loop will fall through on that last iteration. For inner loop lengths of 33 or larger, the branch predictor does not "remember" which iteration is the final one, so there is an extra ~20 cycle penalty for mispredicting the direction of the branch at the end of the final inner loop iteration.
For k>=33, the two curves are very clean -- both have a slope of 3 cycles per inner loop iteration, but with different intercepts. With the RDTSC instruction as the timer, the intercept (i.e., the extrapolated value for k=0) is 17 cycles, while the case with RDTSCP as the timer shows an intercept of 43 cycles.
For k<33 the interpretation is less clear. Both cases show two sub-regions, with different behavior:
These behaviors probably indicate interesting and profound things about the microarchitectural implementation of the RDTSC/RDTSCP instructions, but I am unable to come up with a single coherent story right now....
The overall lessons are pretty clear:
Very interesting test. One thing is for sure from this graph - "rdtscp" is certainly an improvement over the old "cpuid;rdtsc" sequence. And from testing on our production code, switching "rdtscp" out for "rdtsc" made no noticeable dent in tick-to-trade latency through the System Under Test (SUT) in our lab, as measured via network taps attached to a Corvil latency monitoring appliance. Hmmm...
If changing from RDTSCP to RDTSC does not change the overall performance by a detectable amount, then it is likely that neither of them is actually a problem, and the VTune analysis is biased.
In my post above, I made reference to the minimum latency for consecutive RDTSC and RDTSCP instructions, but I did not include the details. On my Xeon E5-2680 (Sandy Bridge) with the frequency pinned to 2.70 GHz (so TSC clocks and core clocks are the same):
For these experiments I only did 64 measurements per test and made sure that the array I wrote the timing values to was dirty in the L1 data cache at the beginning of the tests. I manually unrolled the loop by 8 and ignored any results that looked like they were biased by the overhead of loop control (this typically happened only between the last read of one loop iteration and the first read of the next).
The fixed-function performance counters for unhalted cycles have very slightly higher minimum latencies: