As I'm not sure where to post this question, I hope someone can at least give me a hint where it should go :).
I am currently doing a performance analysis of a larger software project by timing the code with some macros that eventually use the RDTSC assembler instruction to read the time stamp counter. This works fine so far, but to judge the results I need to know as much about the generated overhead as I can. As I am coming from the low-level, hardware-near side of programming, my first idea was to look up the cycle count of each assembler instruction I use and come up with some rough estimates. Unfortunately, I found no cycle count specification for the RDTSC instruction in the Intel docs for the IA-64 / IA-32 architecture.
I know that different architectures like the P4 Netburst and the Centrino / Core Duo platforms have large differences in low-level design (pipeline etc.), so I was wondering whether RDTSC is implemented differently in these designs and therefore has no cycle count specification in the standard documents. The reason for this conclusion was a huge difference in the results produced by the following code:
```c
#include <windows.h>   /* LONG, LARGE_INTEGER */

void main (void)
{
    static LONG shi, slo, ehi, elo;
    static LARGE_INTEGER s, e, r;
    static int i = 0;

    for ( i = 0 ; i < 1000 ; i++ )
    {
        __asm
        {
            rdtsc           ; first read of the TSC into edx:eax
            mov shi, edx
            mov slo, eax
            rdtsc           ; second read, back to back
            mov ehi, edx
            mov elo, eax
        }
        s.HighPart = shi;
        s.LowPart  = slo;
        e.HighPart = ehi;
        e.LowPart  = elo;
        r.QuadPart = e.QuadPart - s.QuadPart;
    }
}
```
I wanted to use this routine to identify the minimum overhead of each call to RDTSC, but I got very different results. For example, a P4 630 (3 GHz, Prescott, HT disabled, EIST disabled) needed approximately 102 cycles, and a slightly older P4 (3 GHz, non-HT model) needed about 98 cycles. The funny thing is that my one-year-old laptop (1.5 GHz Pentium M / Centrino) needed only 43 cycles for the calls to RDTSC.
OK, this is already too much text I guess - sorry for that - so the final question is:
Does anybody know of more specific documents on this, and where to find them? Or will I just have to live with these facts?
Thanks a lot for reading and maybe even answering this post.
(In 64-bit mode, RDTSC still returns the counter in EDX:EAX, so you needn't adjust your code between 32- and 64-bit mode.)
Recent Intel CPUs don't report an RDTSC result proportional to actual CPU clock ticks. Instead they count bus clock ticks, multiplied by the intended CPU-to-bus clock ratio, so as to avoid breakage when the CPU clock speed is adjusted. As a result, the resolution is limited by the bus clock speed.
MS Vista, as well as recent Linux kernels, is supposed to support HPET timers, so you may be interested in those.
One of our engineers responds as follows:
Please see the replies in the other RDTSC and Pentium D thread.
The idea of measuring how many cycles it takes to execute RDTSC in a stand-alone harness and transferring that result to a different setup is flawed. In an out-of-order machine, many boundary and initial conditions can contribute to the execution cycles of an instruction or a sequence of micro-ops. RDTSC is one of those complex instructions that consists of a sequence of micro-ops. How that sequence is dispatched through which set of issue ports and binds to available execution units is a dynamic situation inside the hardware that software cannot control. Thus, the visible latency of such an instruction in a measurement harness may not be the same each time a program issues RDTSC to the CPU.
In most situations, the in-situ measurement philosophy of amortizing each RDTSC over many instructions is more practical and easier to implement than assuming one can inject RDTSC frequently and apply some constant execution cycle count to account for the RDTSC overhead. There is also the consideration that frequent in-situ RDTSC (and its preceding serializing instruction) will significantly perturb the interaction between your target workload and the out-of-order machine.