Intel® ISA Extensions

Q&A: RDTSC to measure performance of small # of FP calculations


The following is a question received by Intel Software Network Support, followed by the responses provided by our Application Engineering team:

Q. I am developing on a Pentium 4 with Windows XP and the Dev-C++/GCC compiler. I need to measure the performance of a small number of floating point calculations (for example, fadd, fsub, and so on). The Enhanced Timer is not suitable because of its overhead. Now I am looking for examples of how to measure it using processor clocks (the RDTSC instruction). The most frequently referenced information I found was in "Using the RDTSC Instruction for Performance Monitoring", an Intel Corporation document from 1997. The source code portion of this document (3A) is exactly what I need. To use it, it is necessary to calculate the overhead first (the variable base) and to "warm up" the cache, due to the effects of cache misses and of other processes using the same processor. Unfortunately, the variable base changes all the time, so it is not possible to produce repeatable measurements. I would be grateful for some advice.

A. We forwarded this question to several engineers, and received the following responses:

#1. My first question in response to "I need to measure a small number of floating point operations" would be WHY? If it's only a small number, it can't be performance critical!

If it's a large number (even if occasionally in some inner loop), then "overhead" doesn't matter and you can use your timing routines (ETimer, RDTSC, QueryPerfCounter, etc.). You may have to re-harness your code to do this, of course.

#2. Measurements with RDTSC are most credible if the number of clocks between the pair of RDTSC instructions is at least a few thousand, preferably tens of thousands.

Typically, one wraps the interesting code sequence in a loop. Also (for certain OS reasons), you should repeat this multiple times -- the first measurement is usually wrong.


repeat 5 times
    start = RDTSC
    loop 50,000 times
        the small number of FP instructions I want to test
    end loop
    end = RDTSC
end repeat
When you do it this way, loop overhead is pretty incidental, and you can just compute (end-start)/50000 for each iteration and get your performance. I would print each of the 5 trials.

I would expect the first iteration result to be quite different from the following 4 results.
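The pseudocode above can be sketched in C, assuming GCC or Clang on x86 (the __rdtsc() intrinsic from x86intrin.h); the two FP operations and the function name measure_cycles_per_iter are placeholders, not from the original post:

```c
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(), GCC/Clang on x86 */

/* Placeholder for the small FP sequence under test. */
static double kernel(double a, double b)
{
    a = a + b;
    a = a - b;
    return a;
}

/* Runs 5 trials of 50,000 iterations each, printing each trial, and
   returns the cycles-per-iteration figure from the last trial. */
double measure_cycles_per_iter(void)
{
    volatile double sink = 0.0;   /* defeats dead-code elimination */
    double last = 0.0;

    for (int trial = 0; trial < 5; trial++) {
        uint64_t start = __rdtsc();
        double a = 1.0;
        for (int i = 0; i < 50000; i++)
            a = kernel(a, 3.0);
        uint64_t end = __rdtsc();

        sink = a;
        last = (double)(end - start) / 50000.0;
        printf("trial %d: %.2f cycles per iteration\n", trial, last);
    }
    (void)sink;
    return last;
}
```

As the engineer notes, expect the first trial to differ noticeably from the later, warmed-up ones; trust the later trials.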

#3. You should also be aware of the caveats around measuring something on a Pentium 4 processor. It will be significantly different on our new cores. I recommend you get a copy of the Intel VTune Performance Analyzer.

#4. Our guess is that you might be reverse engineering performance for key FP sequences and working out cache latencies and stride/timing semantics using these annotations.

If this is the case, our guess is that you are likely doing this while playing off requisite algorithm/blocking strategies, perhaps even while comparing our u-arch with a competitive one.

Less likely, but if it turns out that you are tuning small code sequences on an out-of-order (OOO) machine, our recommendation would be to guide you otherwise.

RDTSC on the Pentium 4 processor is noisy, synchronizes the pipeline, and at last check had a latency of ~90-120 clocks in the Pentium 4 (former codename Northwood) implementation.

This would certainly introduce "Heisenberg" uncertainty aspects into your measurements.

Which version of GCC are you using? Hopefully something after 3.3.x; 4.1.x would be even better.

In the end, if you choose to use a counter, you will be challenged by signal-to-noise issues unless you account for them in the set-up and design of your performance experiments.

Q. In response to #1:
For my scientific project, I need to measure the performance of a small algorithm on different architectures (processors, operating systems, and so on). The algorithm contains just additions and subtractions of floating point numbers (4-5 operations). I have already measured it with counters like QueryPerfCounter and obtained some results. To get them, I had to deal with effects such as loop overhead, the cost of calling QueryPerfCounter through the Windows API, cache refresh, and others, which produce a large overhead compared with the operations I need to measure. Even after taking all of these effects into account, the results are unfortunately not precise enough. For this reason, I have decided to measure primarily with RDTSC.

In response to #2:
I have found two methods of measurement in the document "Using the RDTSC Instruction for Performance Monitoring". One of them deals with short code sequences, as in my case. To overcome the effects of instruction and data cache misses, the technique of cache warming is applied. Here is the assembler code (it should be repeated 3 times):

cpuid
rdtsc
mov cyc, eax
cpuid
rdtsc
sub eax, cyc
mov base, eax

Since the variable base is changing for each measurement, it is impossible to get repeatable results. This is my main problem at the moment.

In response to #3:
Is the Intel VTune Performance Analyzer also suitable for a small number of operations (as in my case)?

In response to #4:
I am using DevC++ with GCC. I would be glad if you describe more details about these issues.

A. Our engineers responded:

#1. Here's some additional data covering RDTSC operation. I took the time to dust off previous work and the corresponding diagnostic programs and re-examined them for validity.

First, as to whether or not executing RDTSC distorts the measurement: the fact is that it will, for shorter instruction sequences that execute within the "shadow" of an instance of RDTSC execution. Presuming that no power/thermal events that affect core clock frequency take place, RDTSC is ~80 clocks on the Pentium 4 microarchitecture and ~65 clocks on the Intel Core microarchitecture. This was the basis of my Heisenberg allusion: upon inserting a pair of RDTSC instructions, one essentially cannot measure time spans less than about twice the pipeline "shadow". Even then, one must be cognizant that the recovered precision is in direct proportion to the measured span's duration relative to twice the pipeline "shadow". So this is the lower bound on what minimum time span can be measured using present instruction-based technology.

Second, as to whether there is jitter among pairs of RDTSC used to measure the time span of an instruction sequence. If the code executes purely within the core (e.g. recurrence relations), the Pentium 4 is very faithful here when executing from the trace cache, and no jitter is ever seen (at least by me). On the Core microarchitecture, one will experience jitter, perhaps up to ~25% but typically ~5% of the time span being measured. I attribute this to variances in instruction fetch/decode operations when code is not ideally placed relative to the measurement and control-flow groups of instructions. There is almost always jitter among pairs of RDTSC if there are outstanding memory operations in the pipeline: the standard deviation of measured values ranges up to ~30% for short, less iterative sequences and around 5-10% for long, more iterative sequences.

This is true on both the Pentium 4 and Core microarchitectures, and the use of simple binning is advised, especially when looking for best-case performance.
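A sketch of the binning idea, on the assumption that interference (interrupts, competing processes) only ever inflates a sample, so the minimum of many repeated timings approximates the undisturbed best case (GCC/Clang on x86; min_rdtsc_delta is an illustrative helper name, not from the thread):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(), GCC/Clang on x86 */

/* Time the same (here, empty) sequence many times and keep the smallest
   delta; jitter only adds cycles, so the minimum is the best-case bin. */
uint64_t min_rdtsc_delta(int samples)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < samples; i++) {
        uint64_t t0 = __rdtsc();
        /* the sequence under test would go here */
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```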

#2. Regardless of what the PRM says, the sequence:

rdtsc
{a small number of instructions with a cumulative latency less than hundreds or thousands of clocks}
rdtsc

is very unlikely to yield a reliable result. The CPUID step is probably not needed either, although there are other opinions on that point.

The rdtsc instruction is serializing only with respect to itself; it is not a fully serializing instruction.

Fundamentally, what all the respondents are saying is that this is an out-of-order machine, and the very notion of determining the latency of a 3-instruction sequence is quite slippery. You can get very reliable measurements of larger blocks of code (with a few caveats, as noted). But don't try to measure something small. And check that your result is repeatable and your measurement stable.


Lexi S.

Intel Software Network Support

I would like to add that from my experience it is possible to measure even single instruction performance (approximate of course due to dependencies and what not) if you repeat it at least 100,000 times in a loop between two RDTSC instructions.
In response to the original question, I suggest that on late PIV hardware (Northwood and Prescott core machines) you have little chance of getting reliable timings for a short instruction sequence, for a variety of reasons.

In the Intel staff responses it has already been mentioned that the first iteration is almost always slower than later iterations, but there is another factor that has always affected timings under ring3 access in 32-bit Windows OS versions. Because higher-privileged processes can interfere with lower-privilege-level operations, you will generally see at least a few percent variation on small samples, and it gets worse as the sample gets smaller.

You can reduce this effect by setting the process priority to high or time critical but you will not escape this effect under ring3 access. I have found from practice that for real time testing you need a duration of over half a second before the deviation comes down to within a percent or two.

What I would suggest is that you isolate the code in a separate module in assembler and write code of this type.

push esi
push edi

mov esi, large_number
mov edi, 1
align 16
@@:
; your code to time here
sub esi, edi
jnz @B

pop edi
pop esi

Adjust the immediate "large_number" so that the code you are timing runs for over half a second (over 1 second is better), set your process priority high enough to reduce the higher-privilege interference to some extent, and you should start to get timings with variation around 1% or lower.
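A portable sketch of that adjustment, assuming nothing beyond the C standard library (calibrate_iters and the doubling strategy are illustrative, not from the post):

```c
#include <time.h>

static volatile unsigned long dummy;   /* stand-in work, never optimized out */

/* Double the iteration count until the timed region consumes at least
   min_secs of CPU time; pass NULL to time the built-in stand-in work. */
unsigned long calibrate_iters(void (*kernel)(void), double min_secs)
{
    unsigned long n = 1000;
    for (;;) {
        clock_t t0 = clock();
        for (unsigned long i = 0; i < n; i++) {
            if (kernel)
                kernel();
            else
                dummy++;
        }
        clock_t t1 = clock();
        if ((double)(t1 - t0) / CLOCKS_PER_SEC >= min_secs)
            return n;
        n *= 2;   /* run too short to trust -- double and retry */
    }
}
```

With the loop body from the assembly above as the kernel and min_secs around 1.0, the returned count plays the role of large_number.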

Two trailing comments. First, the next generation of Intel cores will behave differently, on a scale something like the differences between the PIII and PIV processors, so be careful not to lock yourself into one architecture. Second, as far as I remember, the x87 FP instruction range, while still available on current core hardware, is being superseded by much faster SSE/SSE2/SSE3 instructions, so if your target hardware is late enough to support them, you will probably get a big performance boost from using the later instructions.


hutch at movsd dot com


Here's a link to a related post: Problem with rdtsc on Pentium D processor

Also, here is another variation we recently received on the same basic question, included here to increase the probability of this solution coming up in keyword searches:

I am trying to use the RDTSC instruction to time my high performance code. I was trying to figure out how many cycles the RDTSC instruction itself takes (this isn't documented anywhere as far as I can tell). I have a small bit of assembly code that demonstrates a problem I'm having. It compiles and runs fine with both the Intel and GNU compilers (3.3, 4.0, etc.).

When I compile this and execute it under Cygwin (running on Windows XP) on an AMD 4200, I get

ticks per rdtsc 6

which isn't 1 or 2, but I can live with 6 clock ticks to process a seldom-called op.

If I compile and run this under Mac OS X (on a new Apple MacBook Pro with an Intel Core 2), I get 65?!?!

If I compile and run this on SuSE Linux on a Xeon processor, I get 85?!?! (The Intel and GNU compilers agree on this.)

I'm not even putting in serializing instructions. Does that look right to anyone?

Can anyone verify they get the same results on their x86 machines?

The code:


#include <stdio.h>

int main(void)
{
    unsigned long long int t0, t1;
    int result;
    unsigned int ret0[2];
    unsigned int ret1[2];

    __asm__ __volatile__("rdtsc" : "=a"(ret0[0]), "=d"(ret0[1]));
    __asm__ __volatile__(
        "xorl %%ecx, %%ecx\n"
        "1:\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "rdtsc\n\t"
        "addl $16, %%ecx\n\t"
        "cmpl $8192, %%ecx\n\t"
        "jne 1b"
        : /* no outputs */ : /* no inputs */ : "eax", "ecx", "edx");
    __asm__ __volatile__("rdtsc" : "=a"(ret1[0]), "=d"(ret1[1]));

    t0 = ret0[0] | ((unsigned long long)ret0[1] << 32);
    t1 = ret1[0] | ((unsigned long long)ret1[1] << 32);
    result = (int)((t1 - t0) / 8192);
    printf("ticks per rdtsc %d\n", result);
    return result;
}
I've run this on various AMD and Intel machines. Most of the AMD machines return in 6 to 8 cycles; most of the Intel machines I've tried return in 60 to 80 cycles, sometimes as high as 100 cycles. It would be nice if there were some way to query the time register faster; this makes performance measuring and tuning a sketchy affair on Intel chips. Is the rdtsc instruction serializing for some reason (draining the pipelines...)?

Our engineers agree this question is also addressed by the solution in the first Q&A above.


Lexi S.

Intel Software Network Support


Q. I read Mike Stoner's article entitled Portable Performance Measurement Macros for Intel Architecture. I am studying your IAPERF.H file for using the RDTSC instruction to read the time-stamp counter. I noticed that the CPUID instruction immediately precedes the RDTSC. Why? I cannot find these two instructions used together in the IA-32 documentation.

A. The CPUID instruction serializes the processor pipeline so that all of the preceding instructions must retire before it begins execution. Likewise, the following code will not begin execution until the CPUID retires. This is thought to provide a more accurate cycle count on the code being measured. Really, it shouldn't matter very much if you are measuring something that executes for a million cycles or more.
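The pairing described above can be sketched in GCC inline assembly; serialized_rdtsc is an illustrative name, not the actual IAPERF.H macro:

```c
#include <stdint.h>

/* CPUID fences the pipeline, so the RDTSC that follows cannot issue
   until every preceding instruction has retired. */
static inline uint64_t serialized_rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "cpuid\n\t"          /* serialize */
        "rdtsc"              /* then read the time-stamp counter */
        : "=a"(lo), "=d"(hi)
        : "0"(0)             /* CPUID leaf 0 in eax */
        : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
```

As the answer notes, the CPUID overhead is irrelevant when the measured region runs for a million cycles or more, but it dominates for short sequences.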


Lexi S.

Intel Software Network Support
