Community support for Analyzers (Intel VTune™ Profiler, Intel Advisor, Intel Inspector)

Measuring accurate samples

I am trying to profile and optimize an algorithm. When I profile it with VTune, I get different numbers for CPU_UNHALTED.CORE samples (varying from 96 to 123) and CPU_UNHALTED.CORE % (varying from 6.53% to 8.8%) for the same function on consecutive runs. As a result, I cannot judge whether my intrinsic optimization gives any gain. Is there a way to get more precise profiling information, or am I looking at the wrong parameters?

Many things can cause a particular function to report different numbers from one run to another.

Here is a small list of possible reasons:
1) Your code really does have a different performance profile from one run to another. Many factors can contribute to this. You can test this by adding a timer to your program: if the elapsed time changes from run to run, you will know that it isn't the measurement tool.
A partial list of things that can change run-to-run performance:

Memory layout: when your program runs, it requests memory (both for variables/arrays on the stack and for allocations from the heap). Sometimes that memory is aligned to cache lines and sometimes it isn't, which can greatly affect performance, even more than the roughly 20% change you are seeing for this function.

Other programs: it is best to do analysis on a system that is running only your program. Even a virus checker can have a substantial impact on run-to-run performance.

CPU affinity: especially if you have a multi-threaded program, depending on how the OS schedules your thread, it may place the thread on a hardware thread/core that is sharing (or not sharing) resources with another hardware thread/core. Turning off Hyper-Threading and pinning the affinity of your threads can make it easier to measure the performance of your application (but could slow down or speed up your application).

Turbo mode/CPU sleep states:
The frequency of Intel processors can change over time, affecting the performance of your application. It is easier to measure the performance of an application if you turn these features off in the BIOS (but doing so could slow down or speed up your application).
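The timer check from point 1 could look roughly like the sketch below. It is only an illustration, written in Python for brevity: `example_workload` is a hypothetical stand-in for the function being optimized, and `pin_to_one_cpu` relies on `os.sched_setaffinity`, which exists only on Linux (the helper silently does nothing elsewhere). The same idea applies to a C/C++ program using its own timer.

```python
import os
import statistics
import time


def pin_to_one_cpu():
    """Pin this process to one allowed CPU (Linux only); return the CPU or None."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # affinity API not available on this platform
    try:
        cpu = min(os.sched_getaffinity(0))  # pick the lowest CPU we may run on
        os.sched_setaffinity(0, {cpu})
        return cpu
    except OSError:
        return None


def time_runs(workload, repeats=10):
    """Return the wall-clock duration in seconds of each of `repeats` calls."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (minimum, median, relative spread = stdev / mean) of the timings."""
    mean = statistics.mean(durations)
    if len(durations) > 1 and mean > 0:
        spread = statistics.stdev(durations) / mean
    else:
        spread = 0.0
    return min(durations), statistics.median(durations), spread


def example_workload():
    # Hypothetical stand-in for the function being optimized.
    return sum(i * i for i in range(200_000))


if __name__ == "__main__":
    pin_to_one_cpu()
    best, median, spread = summarize(time_runs(example_workload))
    print(f"min {best:.6f}s  median {median:.6f}s  relative spread {spread:.1%}")
```

If the relative spread of the plain timings is already in the range of the variation seen in the profiler, the variation comes from the program and the system rather than from the measurement tool.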

2) Heisenberg uncertainty principle of measurement tools.
The same effect applies to performance analysis in software.
You are asking how to get more "precise" performance analysis. The problem is that the act of measuring performance affects the performance of your application. We have tried hard in Intel VTune Amplifier XE to provide a very low overhead performance analysis tool. Hardware event-based sampling analysis takes an interrupt when the event it is monitoring overflows the counter, and this interrupt takes cycles to process. The more samples you take per second, the more overhead will occur, affecting the performance of your application. Fewer samples per second give you a statistically less accurate performance profile. If you remember your probability and statistics, any value estimated from samples comes with a "confidence interval", and the standard deviation behind that interval determines how accurate the data is. Example: 6% with a standard deviation of 20%.
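To put rough numbers on the sampling noise, the sketch below treats the sample count for a function as a Poisson-like count, so its standard deviation is about the square root of the count. That model is an assumption made for illustration (it ignores skid, overhead and real run-to-run changes), and the function names are hypothetical, not part of any VTune API.

```python
import math


def relative_sampling_error(samples):
    """Approximate relative standard deviation of a count of `samples`."""
    return 1.0 / math.sqrt(samples)


def z_score(count_a, count_b):
    """How many standard deviations apart two independent counts are."""
    return (count_a - count_b) / math.sqrt(count_a + count_b)


def samples_needed(target_relative_error):
    """Samples required so the relative standard deviation reaches the target."""
    # The small epsilon guards against floating-point rounding just above an integer.
    return math.ceil(1.0 / target_relative_error ** 2 - 1e-9)


if __name__ == "__main__":
    for count in (96, 123):
        print(f"{count} samples: relative error ~{relative_sampling_error(count):.1%}")
    print(f"z-score between the two runs: {z_score(123, 96):.2f}")
    print(f"samples needed for ~1% error: {samples_needed(0.01)}")
```

Under this model, a function that collects only about 100 samples carries roughly 10% noise from sampling alone, and the two runs in the question differ by less than two standard deviations. Collecting more samples for the function (a longer run, or repeating the algorithm in a loop) narrows the interval.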

3) Event skid:
When the PMU interrupt occurs, it can be caught slightly after the code where the event occurred. The OS can also delay interrupts from being processed, and this lag time can vary slightly for each interrupt.
Also, on out-of-order (OOO)/superscalar/pipelined processors, multiple instructions can be in flight at any one time. Intel VTune Amplifier XE has some logic to correct for all of this, but it isn't "precise". The PMU has some counters which are "precise": the address of the instruction which caused the interrupt is delivered as part of the interrupt. CPU_UNHALTED.CORE is not one of these counters. There is no "time" counter which is precise, but depending on the processor there may be a counter similar to time that is not affected by event skid. Note, however, that none of these precise counters will measure "time" accurately.
How large is the code whose performance you are trying to measure? You indicate that an intrinsic is being used in the function. If the function is big, this is probably not an issue. If the function is fewer than 100 assembly instructions, then event skid could have a substantial impact on measuring its performance accurately.
