
Drum__Anthony · ‎07-20-2018

Hey there, I found my problem described in an old topic here:

https://software.intel.com/en-us/forums/intel-isa-extensions/topic/306222

but I could not tell which solution was correct, so I decided to ask the question again.

In response to the original question, I suggest that on late PIV
hardware (Northwood and Prescott core machines) you have little
chance of getting reliable timings for a short instruction sequence,
for a variety of reasons.

The Intel staff responses have already mentioned that the
first iteration is almost always slower than later iterations, but
there is another factor that has always affected timings under ring3
access in 32-bit Windows versions. Because higher-privileged
processes can interfere with lower privilege level
operations, you will generally get at least a few percent variation on
small samples, and it gets worse as the sample gets smaller.

You can reduce this effect by setting the process priority to high or
time critical but you will not escape this effect under ring3 access. I
have found from practice that for real time testing you need a duration
of over half a second before the deviation comes down to within a
percent or two.

What I would suggest is that you isolate the code in a separate assembler module and write code of this type:

push esi
push edi

mov esi, large_number
mov edi, 1
align 16
@@:
; your code to time here
sub esi, edi
jnz @B

pop edi
pop esi

Adjust the immediate "large_number" so that the code you are timing
runs for over half a second (over 1 second is better), set your process
priority high enough to reduce the higher-privilege interference to
some extent, and you should start to get timings with around 1% or lower
variation.
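
For readers working from a compiled language rather than an assembler, the same "make the loop run long enough" idea can be sketched in C. This is only an illustration under stated assumptions: `kernel_under_test`, `time_loop`, and `calibrate_iterations` are hypothetical names, the clock is the POSIX `clock_gettime` monotonic clock rather than anything mentioned in the post, and the doubling search stands in for adjusting "large_number" by hand.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for "your code to time here". The volatile
   sink keeps the compiler from deleting the work. */
static volatile double sink = 1.0;
static void kernel_under_test(void) { sink = sink * 0.5 + 1.0; }

/* Seconds from a monotonic wall clock. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run the kernel `iterations` times and return the elapsed seconds. */
static double time_loop(uint64_t iterations) {
    double start = now_seconds();
    for (uint64_t i = 0; i < iterations; i++) kernel_under_test();
    return now_seconds() - start;
}

/* Double the iteration count until a single run lasts at least
   min_seconds; this plays the role of tuning "large_number". */
static uint64_t calibrate_iterations(double min_seconds) {
    assert(min_seconds > 0.0);
    uint64_t n = 1024;
    while (time_loop(n) < min_seconds) n *= 2;
    return n;
}
```

Calling `calibrate_iterations(0.5)` once and then reusing the returned count for the real measurement keeps the timed run above the half-second mark suggested above.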

Two trailing comments. First, the next generation of Intel cores will behave
differently, on a scale something like the differences between the PIII
and PIV processors, so be careful not to lock yourself into one
architecture. Second, as far as I remember, the FP
instruction range, while still available on current core hardware,
is being replaced by the much faster SSE/2/3 instructions, so if your target
hardware is late enough to support these instructions, you will
probably get a big performance gain if you can use the later
instructions.

Regards,

https://phonty.com/

McCalpinJohn · ‎07-20-2018

Those discussions are really old, and Intel microarchitectures have changed a fair amount in the intervening 11+ years.

There are lots of topics that you need to be aware of when attempting fine-grain timing. A few of the more important ones are:

The RDTSC instruction increments at the rate of the "base" (or "nominal") processor frequency, while instructions are executed at the "core frequency". The "core frequency" may be higher or lower than the "base" frequency, and it may change during your measurement interval.
- If you have the ability to "pin" the processor frequency to match the "base" frequency, interpreting the results is often easier.
- Whether you can fix the frequency or not, you will still need to measure several different things to be sure that you can unambiguously interpret the results. More on this below.
With Turbo mode enabled, Intel processors will change their frequency based on how many cores are active. When running a single user thread, you will often get the advertised single-core Turbo frequency, but if the operating system enables more cores to handle (even very short-lived) background processes, your frequency may drop unexpectedly.
Recent Intel processors often throttle down to a low frequency when not in use, and (depending on processor generation, BIOS settings, and OS settings) it may take longer than expected for the frequency to ramp back up to the expected values.
- I usually precede the code that I want to test with a "warm-up" loop consisting of at least a few seconds of execution of instructions using the same SIMD width as the code that I want to test.
Always pin the thread you want to test to a single logical processor (if possible).
- This allows you to use the RDPMC instruction to read the logical processor's fixed-function performance counters.
- It also reduces the chance of frequency changes or other stalls that may be incurred when moving a thread context to a different core.
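
The pinning and warm-up steps above could look roughly like this on Linux. This is a sketch under assumptions, not the method used in this post: it relies on the glibc `sched_getaffinity`/`sched_setaffinity` calls, the helper names are hypothetical, and the warm-up loop uses scalar arithmetic as a placeholder. To follow the advice about SIMD width, the loop body would need to use instructions of the same width as the code under test.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <time.h>

/* Pin the calling thread to the first logical processor it is
   currently allowed to run on. Returns that CPU number, or -1 on
   failure. Linux-specific. */
static int pin_to_first_allowed_cpu(void) {
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof(one), &one) != 0) return -1;
            return cpu;
        }
    }
    return -1;
}

/* Keep the core busy for at least `seconds` so the frequency has time
   to ramp up before the measurement. Returns the time actually spent. */
static volatile double warm_sink = 1.0;
static double warm_up(double seconds) {
    assert(seconds > 0.0);
    struct timespec t0, t1;
    double elapsed;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do {
        for (int i = 0; i < 100000; i++) warm_sink = warm_sink * 0.5 + 1.0;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        elapsed = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    } while (elapsed < seconds);
    return elapsed;
}
```

A typical order would be to pin first, then warm up for a few seconds, then start the timed section on the same logical processor.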

For measurements of short duration (<< 1 second)

Intel processors will be halted during frequency changes, and recent Intel processors (Haswell and newer) will also be halted when activating and/or deactivating the portions of the pipeline(s) needed for 256-bit SIMD instructions and for 512-bit SIMD instructions.
- The duration of these halts varies by product and in some cases by the amount of the frequency change. I have seen values as low as 6 microseconds and as high as 50 microseconds for these types of transitions.

For measurements of very short duration (< 100's of cycles)

The RDTSC instruction is not ordered with respect to the execution of other instructions. Intel processors have gained increasing ability to execute instructions out of order over the past decade, allowing the execution of these instructions to be moved further away from where one might expect -- in either direction.
The RDTSCP instruction is partially ordered -- it will not execute until all prior instructions in program order have executed.
- RDTSCP can still be executed later than expected, but not earlier.
- This partial ordering can help expose the execution time of long-latency instructions (such as memory accesses or mispredicted branches) that occur shortly before the final value of the TSC is read using RDTSCP.
The Intel branch predictors are stranger than you might expect, and branch misprediction overheads are not trivial.
- If you repeatedly execute an inner loop with a trip count of less than about 30, the branch predictor will "remember" which iteration is the final iteration of the loop, and it will correctly predict the loop exit.
- If you increase the inner loop trip count to 35 or more, the branch predictor will not "remember" which iteration is the final iteration, so the final loop iteration will include a mispredicted branch, with an associated overhead of 15-20 cycles.
- This can be very hard to understand if you are looking at results for loop trip counts from (for example) 16 to 64 and you see an unexpected bump of 15-20 cycles once the trip count exceeds a limit (typically in the 32-34 range).
- This is even more confusing when you consider vectorization and loop unrolling, which the compiler may change significantly from one compilation to the next as you fiddle with your code.
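
To make the RDTSC/RDTSCP ordering points concrete, here is a minimal C sketch using the GCC/Clang intrinsics from `<x86intrin.h>`. The LFENCE placement is an assumption taken from Intel's general guidance on serializing TSC reads, not something stated in this thread, and `kernel_under_test`, `timed_tsc_cycles`, and `best_of` are hypothetical names. The result is in TSC ticks (base-frequency cycles), not core cycles.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>

/* Hypothetical stand-in for the short code sequence being measured. */
static volatile double sink = 1.0;
static void kernel_under_test(void) {
    for (int i = 0; i < 64; i++) sink = sink * 0.5 + 1.0;
}

/* Time one execution of the kernel in TSC ticks. */
static uint64_t timed_tsc_cycles(void) {
    unsigned int aux;               /* receives IA32_TSC_AUX; unused here */
    _mm_lfence();                   /* let earlier instructions finish first */
    uint64_t start = __rdtsc();
    _mm_lfence();                   /* do not start the kernel before the read */
    kernel_under_test();
    uint64_t stop = __rdtscp(&aux); /* waits for prior instructions to execute */
    _mm_lfence();                   /* do not let later code start before the read */
    return stop - start;
}

/* Repeat the measurement and keep the smallest value, which is the
   run least disturbed by interrupts and other interference. */
static uint64_t best_of(int trials) {
    assert(trials > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t c = timed_tsc_cycles();
        if (c < best) best = c;
    }
    return best;
}
```

Taking the minimum is one common choice; for trip-count sweeps like the 16 to 64 example above, it helps to record the full distribution per trip count so that a 15-20 cycle step stands out from noise.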

Some recommendations:

A set of very low-overhead interfaces to the RDTSC and RDPMC instructions is available at low-overhead-timers
I recommend measuring a minimum of four values:
- Elapsed TSC cycles (using RDTSC or RDTSCP)
- Instructions -- using the RDPMC instruction with counter number (1<<30)+0
- Core Cycles not halted -- using the RDPMC instruction with counter number (1<<30)+1
- Reference Cycles not halted -- using the RDPMC instruction with counter number (1<<30)+2
If you have the ability to program the general-purpose core performance counters, I also recommend measuring at least two more values:
- Instructions executed in kernel mode.
- Core cycles not halted in kernel mode.
Compute these metrics:
- Core Utilization = (Elapsed Reference Cycles not Halted) / (Elapsed TSC cycles)
  - If this is not very close to 1.000, the processor has been halted for frequency and/or pipeline activation issues, and you need to try to figure out why.
- Average frequency while not halted = (Elapsed Core Cycles not Halted) / (Elapsed Reference Cycles not Halted) * Base_GHz
  - This should be compared to the expected frequency for your processor, given the number of cores that you think should be active.
- Average net frequency = (Elapsed Core Cycles not Halted) / (Elapsed TSC cycles) * Base_GHz
  - This will tell you how much of your expected frequency has been lost due to processor halts.
- Instructions Retired / Instructions Expected
  - For simple loops, you can look at the assembly code and count instructions.
  - This value will change significantly (and repeatably) if the compiler changes the vectorization of the loop.
  - This will change randomly (upward) if the OS schedules another process on the same logical processor during your measured section.
  - For measurements of 10,000 instructions or less, this will increase by a noticeable amount if an OS timer interrupt occurs during your measured section.
- Kernel instructions / Total instructions
  - Should be zero for short intervals (<1 millisecond) that don't include a kernel timer interrupt. Discard tests with non-zero values for these short cases.
  - Should be very small (<<1%) for any test that does not include an explicit call to a system routine.
- Core Cycles not Halted in Kernel Mode / Core Cycles not Halted
  - Should be zero for short intervals (<1 millisecond) that don't include a kernel timer interrupt. Discard tests with non-zero values for these short cases.
  - Should be very small (<<1%) for any test that does not include an explicit call to a system routine.
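
The first four metrics above are simple ratios of counter differences, so they can be sketched as a small C helper. The struct and function names are hypothetical, and the code assumes the start and end counter values have already been read and subtracted; it does not execute RDTSC or RDPMC itself.

```c
#include <assert.h>
#include <stdint.h>

/* Differences (end minus start) of the four recommended values. */
typedef struct {
    uint64_t tsc;           /* elapsed TSC cycles */
    uint64_t instructions;  /* RDPMC counter (1<<30)+0 */
    uint64_t core_cycles;   /* RDPMC counter (1<<30)+1, core cycles not halted */
    uint64_t ref_cycles;    /* RDPMC counter (1<<30)+2, reference cycles not halted */
} counter_deltas;

typedef struct {
    double core_utilization;   /* ref_cycles / tsc */
    double avg_unhalted_ghz;   /* core_cycles / ref_cycles * base_ghz */
    double avg_net_ghz;        /* core_cycles / tsc * base_ghz */
    double instruction_ratio;  /* instructions retired / instructions expected */
} timing_metrics;

/* base_ghz is the nominal (TSC) frequency; instructions_expected comes
   from counting instructions in the assembly of the measured loop. */
static timing_metrics compute_metrics(counter_deltas d, double base_ghz,
                                      uint64_t instructions_expected) {
    assert(d.tsc > 0 && d.ref_cycles > 0 && instructions_expected > 0);
    timing_metrics m;
    m.core_utilization  = (double)d.ref_cycles / (double)d.tsc;
    m.avg_unhalted_ghz  = (double)d.core_cycles / (double)d.ref_cycles * base_ghz;
    m.avg_net_ghz       = (double)d.core_cycles / (double)d.tsc * base_ghz;
    m.instruction_ratio = (double)d.instructions / (double)instructions_expected;
    return m;
}
```

As a made-up example, deltas of 2000 TSC cycles, 1000 reference cycles, and 1500 core cycles on a 2.0 GHz base clock give a core utilization of 0.5, an average unhalted frequency of 3.0 GHz, and an average net frequency of 1.5 GHz, which would indicate that the core was halted for half of the interval.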

Drum__Anthony · ‎07-23-2018

Thanks very much, John. I found my answer in your reply. Good luck to you!

RDTSC to measure performance of small # of FP calculations