I'm the developer of a profiler.
I use RDTSC (Read Time-Stamp Counter) to obtain timestamps for measuring the wall time of functions.
Some time ago I ran into quite a strange problem. On one of the Pentium D computers in our company, this instruction began to return non-monotonic values (the other Pentium D machines are fine).
I think it happens because the TSCs on the cores of this processor got out of sync, i.e.:
1) t1 = RDTSC
2) ... perform some operation...
3) t2 = RDTSC
I think I'm getting t2 < t1 because at step (2) the OS moves the thread from one core to another.
My questions are:
1) Is it possible for the TSCs on the cores of a Pentium D to be out of sync?
2) Are there any instructions that would let me find out whether the TSCs on the cores are synchronized?
3) Are there any instructions that would let me synchronize the cores' TSCs?
Thank you in advance.
One of our engineers responds as follows:
RDTSC is not a serializing instruction. On out-of-order machines, there is no guarantee that back-to-back RDTSC instructions will return monotonically increasing values. There is a well-known technique to ensure monotonic behavior: place a serializing instruction immediately before RDTSC. A common choice of serializing instruction is CPUID.
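A minimal sketch of that CPUID-then-RDTSC pattern, assuming GCC or Clang inline assembly on an x86 target. The helper name serialized_rdtsc is made up for illustration and is not from any library.

```c
#include <stdint.h>

/* Hypothetical helper: CPUID with EAX=0 serves as the serializing
   instruction, then RDTSC reads the time-stamp counter. */
static inline uint64_t serialized_rdtsc(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    uint32_t lo, hi;

    /* CPUID does not execute until all earlier instructions have completed. */
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    /* The CPUID results themselves are not needed here. */
    (void)eax; (void)ebx; (void)ecx; (void)edx;

    /* RDTSC returns the 64-bit counter split across EDX:EAX. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));

    return ((uint64_t)hi << 32) | lo;
}
```

Calling serialized_rdtsc() before and after the code under test gives t1 and t2. Note that this only addresses out-of-order execution on one core; it does nothing about a thread being moved between cores.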
Naturally, adding a serializing instruction before RDTSC adds to the overhead of the timing measurement. Depending on your timing measurement philosophy, you have to decide whether to (a) measure frequently (thereby requiring extra overhead to ensure monotonic RDTSC) or (b) minimize measurement overhead through amortization of each RDTSC over many instructions (if RDTSC is not executed too frequently, the finite length of the out-of-order (OOO) window effectively guarantees monotonic behavior). If you choose (a), you may have to invest in other techniques to calibrate how much overhead your in-situ CPUID+RDTSC measurement costs you.
Taking approach (a) also means you may have to characterize the statistical variance of the measurement overhead: both CPUID and RDTSC are complex instructions consisting of relatively long sequences of micro-ops, so they likely execute and complete in a different number of cycles each time. In particular, executing CPUID with different input values is likely to take a varying number of cycles.
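One way to calibrate that overhead and its variance could look like the sketch below. It assumes GCC or Clang on x86 and times back-to-back CPUID+RDTSC pairs with nothing in between. The names overhead_stats and measure_overhead are hypothetical, and the running mean and variance use Welford's method, which is a choice made here rather than anything prescribed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: statistics of the measurement overhead in TSC ticks. */
typedef struct {
    uint64_t min;      /* smallest observed overhead */
    double   mean;     /* average overhead */
    double   variance; /* sample variance of the overhead */
} overhead_stats;

/* CPUID (EAX=0) followed by RDTSC, as in approach (a). */
static inline uint64_t serialized_rdtsc(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx, lo, hi;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)eax; (void)ebx; (void)ecx; (void)edx;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Time `samples` empty measurement intervals and summarize their cost. */
static overhead_stats measure_overhead(size_t samples)
{
    overhead_stats s = { UINT64_MAX, 0.0, 0.0 };
    double m2 = 0.0;   /* running sum of squared deviations (Welford) */
    size_t n = 0;      /* number of accepted samples */

    for (size_t i = 0; i < samples; i++) {
        uint64_t t1 = serialized_rdtsc();
        uint64_t t2 = serialized_rdtsc();
        if (t2 < t1)
            continue;  /* discard backwards readings, e.g. after a core migration */

        uint64_t d = t2 - t1;
        n++;
        if (d < s.min)
            s.min = d;

        double delta = (double)d - s.mean;
        s.mean += delta / (double)n;
        m2 += delta * ((double)d - s.mean);
    }
    s.variance = (n > 1) ? m2 / (double)(n - 1) : 0.0;
    return s;
}
```

A profiler could subtract the minimum (or the mean) from each measured interval. Which of the two is the better correction depends on the workload and is left open here.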
1. Back-to-back RDTSC returning non-monotonic values is not unexpected.
2. If your application requires frequent and monotonic RDTSC, you must add CPUID (with EAX=0) immediately before each RDTSC. You may need to decide what to do about the increased measurement overhead.
3. If your app's sampling period can be sufficiently large, you may be able to use RDTSC without a serializing instruction and still get monotonic behavior. The key is to ensure you unroll to have a large enough number of instructions between two RDTSC instructions.
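Point 3 can be sketched as follows, again assuming GCC or Clang on x86. Here a plain RDTSC brackets a large number of calls and the cost is amortized over them. The names plain_rdtsc, sample_work, and ticks_per_call are hypothetical, and sample_work is only a stand-in for the code being profiled.

```c
#include <stdint.h>

/* Plain RDTSC with no serializing instruction in front of it. */
static inline uint64_t plain_rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Stand-in workload; the volatile sink keeps the compiler from removing it. */
static volatile uint64_t sink;
static void sample_work(void)
{
    sink += 1;
}

/* Average TSC ticks per call of fn over `calls` calls.
   Returns -1.0 if calls is 0 or the counter did not advance. */
static double ticks_per_call(void (*fn)(void), uint64_t calls)
{
    if (calls == 0)
        return -1.0;

    uint64_t t1 = plain_rdtsc();
    for (uint64_t i = 0; i < calls; i++)
        fn();
    uint64_t t2 = plain_rdtsc();

    if (t2 <= t1)
        return -1.0;   /* non-monotonic reading: reject the sample */
    return (double)(t2 - t1) / (double)calls;
}
```

With a large enough call count, for example ticks_per_call(sample_work, 1000000), the cost of the two RDTSC reads becomes negligible relative to the measured interval.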