Hey there I found my problem at old topic here
but I can not understand which solution was true and I decide repeated question
In response to the original question, I suggest that on late PIV
hardware (Northwood and Prescott core machines) that you have little
chance of getting reliable timings for a short instruction sequence for
a variety of reasons.
In the Intel staff responses it has already been mentioned that the
first iteration is almost exclusively slower than later iterations but
there is another factor that has always effected timings under ring3
access in Windows 32 bit OS versions. Faced with higher privileged
processes being able to interfere with lower privilege level
operations, you will generally get at least a few percent variation on
small samples and it gets worse as the sample gets smaller.
You can reduce this effect by setting the process priority to high or
time critical but you will not escape this effect under ring3 access. I
have found from practice that for real time testing you need a duration
of over half a second before the deviation comes down to within a
percent or two.
What I would suggest is that you isolate the code in a seperate module in an assembler and write code of this type.
mov esi, large_number
mov edi, 1
; your code to time here
sub esi, edi
Adjust the immediate "large_number" so that the code you are timing
runs for over a half a second, over 1 second is better, set you process
priority high enough to reduce the higher privilege interference to
some extent and you should start to get timings around the 1% or lower
Two trailing comments, the next generation Intel cores will behave
differently on a scale something like the differences between the PIII
and PIV processors so be careful not to lock yourself into one
architecture. The other comment is as far as I remember the FP
instruction range while still being available on current core hardware
is being replaced by much faster SSE/2/3 instructions so if your target
hardware is late enough to support these instructions, you will
probably get a big performance hit if you can use the later
Those discussions are really old, and Intel microarchitectures have changed a fair amount in the intervening 11+ years.
There are lots of topics that you need to be aware of when attempting fine-grain timing. A few of the more important ones are:
For measurements of short duration (<< 1 second)
For measurements of very short duration (< 100's of cycles)