
Olaf_Krzikalla · ‎03-21-2014

Hi @all,

is there an in-depth explanation of how performance counters (especially cache-miss counters) interact in time with the rest of the code? Maybe there is a specific section in Appendix B of the Optimization Reference Manual that I have missed so far?

An example:

(pmc configured for counting L1D cache misses; ecx selects the counter)
rdpmc
(store eax)
movaps xmm0, [esi]  // read from [esi]
movaps xmm1, [edi]  // read from [edi]
rdpmc

Now assume that esi and edi both point to the same location, which is initially not in L1. What difference in the L1 pmc will then be observable?
And why? IMHO there are a lot of things (pipelining, out-of-order execution, stalling) that can influence the result. Is this documented?

Thanks for your help
Olaf

Bernard · ‎03-21-2014

Shouldn't you use a serializing instruction such as cpuid before rdpmc?

Olaf_Krzikalla · ‎03-21-2014

With frequent rdpmc calls, using cpuid or a similar instruction would probably render the measurement invalid.

Suppose, for example, that some data is being prefetched. Because of the delay introduced by cpuid, the prefetch completes in time, so you won't observe a miss in the instrumented run even though it occurs in the non-instrumented code.

Bernard · ‎03-21-2014

Looking at the posted assembly snippet, it seems that the pointers are not incremented, and the reciprocal throughput of rdpmc is ~39 cycles (per Agner Fog's tables),

so I suppose that movaps xmm0,[esi] will not be "noticed" by the rdpmc instruction, because the load into the xmm register will execute concurrently with rdpmc.

Bernard · ‎03-21-2014

IIRC because of out-of-order execution the second rdpmc instruction could retire before the first one was executed.

TimP · ‎03-21-2014

Intel(r) VTune(tm) has relatively low limits and defaults on sampling rate so it seems that overhead of counter use can't be ignored unless such limits are observed. The strategy of reserving a core for VTune seems more important for Intel(r) Xeon Phi(tm) than for host (at least after tinkering with the graphics options so as to reduce those interruptions).

In my experience, adding serialization instructions adds more overhead than simply sampling at large enough intervals to be able to neglect pipelining and out-of-order variations, but I don't put a lot of credence in simple statements on this.

McCalpinJohn · ‎03-21-2014

As noted elsewhere, the RDPMC instructions are not ordered with respect to other instructions, so they might be executed at unexpected times. Although Intel processors "try" to execute instructions in program order, they will go out of order whenever an instruction has a delay. For performance counters, the problem typically shows up when an RDPMC instruction follows a long-latency instruction (in program order). The hardware will generally issue the instructions in order, but the RDPMC instruction may start execution at the same time as the preceding long-latency instruction, so it will not catch the full latency of that preceding instruction.

There is a "trick" that might work to provide partial ordering on RDPMC. Since RDPMC has an input argument (the counter number), it is possible to build a dependence between the result of the instruction that you want to test and the input argument to the RDPMC instruction. Historically, people have used instructions like XOR to take the output of one instruction and create a false dependency into the input argument of the RDPMC instruction. However, recent Intel processors actually recognize idioms like XOR %eax,%eax as clearing a register (and therefore breaking any potential dependency between prior and future uses of %eax). Agner Fog's microarchitecture documentation discusses which instruction sequences are recognized in this fashion. From a quick look at his documentation, it looks like SBB (Subtract with Borrow) is not subject to this idiom recognition, so it could be used to establish a fairly low latency false dependency between instructions to enforce ordering.

In general, you would need to create this false dependency on both sides of the instruction sequence under test. I.e., the output of the initial RDPMC would need to be a false input to the first instruction under test and the last instruction under test needs to be a false input to the final RDPMC. Unfortunately even this is not enough if the sequence of instructions under test is not serialized, and it is nearly impossible to set up a case for which the initial RDPMC is a false input to *all* of the instructions under test and the final RDPMC has a false input dependency on *all* the instructions under test.

One half of the problem can be solved with the RDTSCP instruction, which will not execute until all prior instructions (in program order) have executed. The output of the RDTSCP can then be run through an SBB instruction to create a false input dependency for some subsequent instruction.

I don't know how the RDTSCP definition of "execute" works with respect to instructions that can be rejected and retried. This occurs frequently with floating-point instructions -- they are issued to the execution units after the instructions that define their inputs have been issued, but if those instructions include memory accesses that miss the cache the floating-point instructions may try to execute and find that their arguments are not actually present. They are then rejected and retried some time later. One would prefer that the RDTSCP instruction not execute until all prior instructions have *completed execution*, but I can't tell whether support for such semantics exists in the hardware.
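A minimal sketch of bracketing a code region with RDTSCP, assuming GCC or Clang on x86-64. Note that it swaps in an LFENCE after the read instead of the SBB false dependency described above, on the assumption that LFENCE keeps later instructions from starting before the timestamp is taken. The names `read_tsc_ordered`, `sum_array` and `time_sum` are hypothetical, and the workload is only a placeholder. It reads the time-stamp counter rather than a PMC, because RDPMC faults in user mode unless the OS has enabled it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>
/* RDTSCP waits for earlier instructions to execute before reading the TSC.
   The LFENCE afterwards is an assumption of this sketch (not from the thread):
   it is meant to stop later instructions from starting before the read. */
static inline uint64_t read_tsc_ordered(void) {
    unsigned int aux;              /* receives IA32_TSC_AUX; unused here */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}
#else
#include <time.h>
/* Fallback so the sketch still compiles elsewhere; NOT the technique discussed. */
static inline uint64_t read_tsc_ordered(void) { return (uint64_t)clock(); }
#endif

/* Hypothetical workload: sum n elements through a volatile pointer so the
   compiler cannot remove the loads. */
static uint64_t sum_array(const volatile uint32_t *data, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}

/* Returns the elapsed ticks for summing n elements; the sum goes to *out. */
static uint64_t time_sum(const volatile uint32_t *data, size_t n, uint64_t *out) {
    assert(n > 0);
    uint64_t t0 = read_tsc_ordered();
    *out = sum_array(data, n);
    uint64_t t1 = read_tsc_ordered();
    return t1 - t0;
}
```

Calling `time_sum(buf, 4096, &s)` on a 4096-element buffer gives a region of at least a few thousand cycles, which is long enough for the remaining ordering slack at either end to matter little. For a two-instruction region like the one in the original question, the same bracket would mostly measure its own overhead.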

All of this leads to the oft-repeated advice -- don't expect the performance counters to provide "in-order" counts for very short code sections. Measuring sections that take a minimum of many hundreds of cycles is usually necessary to make the uncertainty in the exact time of execution of the RDPMC instructions irrelevant.
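To put rough numbers on that advice, here is a small sketch that treats the uncertainty in when the counter reads execute as a fixed skew and expresses it as a fraction of the measured section. The 40-cycle skew in the usage note is purely an assumed figure for illustration (about the size of the RDPMC throughput quoted earlier in the thread), and `relative_uncertainty` is a hypothetical helper.

```c
#include <assert.h>

/* Fraction of a measured section that could be misattributed because the
   counter reads execute earlier or later than program order suggests.
   skew_cycles is an assumed bound on that timing slack, not a measured value. */
static double relative_uncertainty(double skew_cycles, double section_cycles) {
    assert(section_cycles > 0.0);
    return skew_cycles / section_cycles;
}
```

With an assumed 40-cycle skew, `relative_uncertainty(40.0, 100.0)` is 0.4, so a 100-cycle section could be off by 40%, while `relative_uncertainty(40.0, 4000.0)` is 0.01, or about 1% for a 4000-cycle section.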

An exception to the above is Xeon Phi. The RDTSC instruction takes only 5-6 cycles and the core executes in order, so RDTSC can be used on a very fine granularity -- for example to time the latency of individual load instructions that miss the L1 and L2 caches.

Olaf_Krzikalla · ‎03-24-2014

Hi again,

thank you all for your helpful answers. Maybe some background: I am trying to trace an application and understand its cache behavior more precisely by recording the pmc at the start of each basic block. I now know that at least three things (time overhead, space overhead, and out-of-order execution) will certainly distort the measurement. The interesting question is: by how much? And I think this question is best answered by knowing what is going on behind the scenes.

Partially working around the out-of-order problem by introducing false dependencies might add other errors due to the increased time overhead (besides the slowdown). That is why I am interested not only in making the measurement as precise as possible but also in knowing and understanding the errors introduced by a fast yet imprecise measurement.

Best Olaf

Bernard · ‎03-24-2014

>>>In my experience, adding serialization instructions adds more overhead than simply sampling at large enough intervals>>>

That is understandable, given the large performance impact of the cpuid instruction.

Timely interaction of performance counters