Message Edited by email@example.com on 06-23-2006 07:07 AM
Message Edited by firstname.lastname@example.org on 06-23-2006 07:08 AM
Here's some additional clarifying information the Intel Software Network Support team received about this from our engineering contacts:
Some clarification might help. For the sake of argument, let's assume this is being done on an Intel Core2 processor. The Core2 processor executes instructions out of order (unlike an Intel486 processor), dispatching them to the execution units as their inputs become available rather than in the programmed order. They sit in the Reorder Buffer (ROB) and are retired in programmed sequence. Up to 4 instructions can be retired per clock cycle. This out-of-order (OOO) execution can result in bursts of retirement.
When the Performance Monitoring Unit (PMU) is used to sample on the occurrence of a performance event, the counter is programmed to count the desired event and is initialized to the Sample After Value (SAV). With each occurrence of the event, the counter is decremented.
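To make the counting scheme concrete, here is a minimal Python sketch of the countdown just described. It is a toy model under stated assumptions, not the PMU's actual register behavior: the names `SamplingCounter` and `interrupts_for` are hypothetical, and reaching zero is treated as the underflow point for simplicity.

```python
class SamplingCounter:
    """Toy model of a counter used for event-based sampling (illustrative only)."""

    def __init__(self, sample_after_value):
        if sample_after_value <= 0:
            raise ValueError("sample_after_value must be positive")
        self.sample_after_value = sample_after_value
        self.remaining = sample_after_value  # initialized to the SAV

    def count_event(self):
        """Count one event; return True if this event triggers an interrupt."""
        self.remaining -= 1
        if self.remaining == 0:
            # Assumption: reload the SAV so sampling continues periodically.
            self.remaining = self.sample_after_value
            return True
        return False


def interrupts_for(num_events, sample_after_value):
    """Return how many sampling interrupts num_events events would raise."""
    counter = SamplingCounter(sample_after_value)
    return sum(counter.count_event() for _ in range(num_events))
```

In this model, a SAV of 3 raises one interrupt for every third event, so a larger SAV means fewer samples and less sampling overhead.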
When the counter underflows, the hardware raises an interrupt and the processor branches to the address of the interrupt handler specified by the interrupt vector, which in this case is the VTune Analyzer driver. The driver may not actually start executing for some number of cycles. For example, if the processor is executing a critical piece of ring 0 OS code such as a page fault handler, that activity will not be interrupted by the performance monitoring interrupt. The point is that the driver acquires the IP of the last instruction retired before it took over control.
Long-latency instructions, such as loads from memory, sqrt, and divide, will have larger windows during which they are the oldest instruction.
The net effect of these three points (the OOO execution, the larger windows for long-latency instructions, and the variable interrupt response time) is the effect called skid. One particular point is that the combination of OOO execution and the possibility of multiple instructions being retired per cycle can result in certain instructions never being assigned a single sample, so the ratio of samples on successive instructions in the VTune disassembly display (or any other sampling tool, for that matter) can be infinite.
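The way bursts of retirement starve some instructions of samples can be illustrated with a small Python simulation. This is only a sketch under loud assumptions, not a model of any real pipeline: every instruction of a loop iteration is assumed to start executing when the iteration starts, the sampled event is clock cycles, the interrupt delay is a fixed number of cycles, and each sample is charged to the oldest instruction not yet retired. In this toy model, charging the last retired instruction instead would simply shift every count by one position. The function names are hypothetical.

```python
from collections import Counter

RETIRE_WIDTH = 4  # figure from the text: up to 4 instructions retired per cycle


def oldest_unretired_per_cycle(latencies, iterations):
    """For each cycle, return the index of the oldest instruction not yet retired.

    Toy assumptions: instruction i completes latencies[i] cycles after its
    loop iteration starts, and completed instructions retire in program
    order, at most RETIRE_WIDTH per cycle.
    """
    n = len(latencies)
    timeline = []
    cycle = 0
    for _ in range(iterations):
        start = cycle
        next_to_retire = 0
        while next_to_retire < n:
            retired = 0
            while (next_to_retire < n and retired < RETIRE_WIDTH
                   and start + latencies[next_to_retire] <= cycle):
                next_to_retire += 1
                retired += 1
            # Index n means "first instruction of the next iteration".
            timeline.append(next_to_retire % n)
            cycle += 1
    return timeline


def sample_histogram(timeline, sample_after_value, delay=0):
    """Sample every sample_after_value cycles, observed delay cycles late."""
    hits = Counter()
    for trigger in range(sample_after_value, len(timeline), sample_after_value):
        seen_at = trigger + delay
        if seen_at < len(timeline):
            hits[timeline[seen_at]] += 1
    return hits


# A 20-cycle load followed by five single-cycle instructions (made-up numbers).
timeline = oldest_unretired_per_cycle([20, 1, 1, 1, 1, 1], iterations=1000)
hits = sample_histogram(timeline, sample_after_value=7)
for index in range(6):
    print(index, hits[index])
```

In this run only instructions 0 (the load) and 4 ever receive samples. Instructions 1 to 3 retire in the same cycle as the load and instruction 5 retires together with instruction 4, so none of them is ever the oldest instruction when a sample is taken.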
Unless you are using precise events, where the hardware captures the IP of the instigating instruction (e.g. mem_load_retired.l2_line_miss for L2 cache misses caused by loads), the exact IP value associated with the samples should only be viewed as an estimate of the region. In the case of a loop, the event probably occurred in the loop, but even that might not be the case if you rig the test case carefully.
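One practical way to act on this is to total the samples per region (a loop or a function) instead of reading per-instruction counts. The Python sketch below shows the idea; the function name `samples_by_region`, the region table layout, and the addresses in the usage example are all hypothetical and are not taken from the VTune Analyzer.

```python
from bisect import bisect_right
from collections import Counter


def samples_by_region(sample_ips, region_starts):
    """Total sampled IPs per region.

    region_starts is a list of (start_address, name) pairs sorted by
    address; each region is assumed to extend to the next start address.
    """
    starts = [start for start, _ in region_starts]
    totals = Counter()
    for ip in sample_ips:
        index = bisect_right(starts, ip) - 1
        # IPs below the first known region are counted separately.
        name = region_starts[index][1] if index >= 0 else "<unknown>"
        totals[name] += 1
    return totals


# Made-up addresses, for illustration only.
regions = [(0x1000, "setup"), (0x1040, "hot_loop"), (0x1080, "cleanup")]
ips = [0x1044, 0x1048, 0x1048, 0x1060, 0x1080, 0x0FFF]
print(samples_by_region(ips, regions))
```

Totals taken at this granularity are less sensitive to skid of a few instructions, although a sample that skids across a region boundary is still charged to the wrong region.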