Variance in Testing Results

patze · ‎05-13-2009

Hi all,

I'm using VTune Analyzer 9.0 Build 719 on Windows Vista. I create call-graph analyses for a piece of software I'd like to test. To get comparable results, I ran my application twice, each time after a cold start of my system. Each test ran for 4 hours, so I expected comparable results from the two runs, but I see a pretty big variance. For example, one method was profiled with an average self time of 20 ms per call in the first run, while in the second run its average self time was 25 ms. This seems to be a pretty big variance, and it shows up in all the methods I profiled, even though the testbed should be the same.

Some additional info:

I set the instrumentation level for all modules except mine to "Minimal" because during my first tests the application didn't produce any results. Now this works, though I get widely varying results.

Thanks a lot for your help!

Patrick

TimP · ‎05-13-2009

You don't give any indication what kind of help you want, so I'll just throw out some comments.
VTune isn't designed for repeatability of runs of multiple hours. You would have to take precautions to limit the data collection so that disk response issues don't introduce variable performance.
With Vista on a multi-core platform, it may not be practical to get sufficient affinity for repeatability, but you could try whichever affinity scheme may be suitable. If it's this kind of issue, maybe Windows 7 would improve on it.

patze · ‎05-13-2009

Thanks for your answer. First I'll tell you what kind of answer I'd like to have, though you're heading in the right direction. I'm looking for an explanation for this behaviour or, even better, a way to get comparable results. Maybe I'm misunderstanding the intention of VTune, but what's the gain of a performance profiler if the results of the performance analysis differ over such a big range? In the first run I see a method that needs too much computation time; in the next run it needs 40% less time. If I had only done the second run, I would never think about optimizing that method, whereas I would if I had only done the first run.

Your advice concerning the amount of data collection seems to be a good step towards making my results more reliable, though. Could you tell me how this could be achieved? I already introduced a limit on the amount of data collected by the call-graph analyzer (which was activated by default). As I wrote, I also set the instrumentation level of all loaded modules except mine to "Minimal", so that the amount of data should be reduced. What other options do I have?

TimP · ‎05-13-2009

Assuming you have calibration disabled (as you should have done, for comparison of separate runs), the default sample after values should be reasonable for test runs of a minute or so. You might want to increase them by a factor of 100, for example, so that running 100 times as long saves a similar amount of data in the files, and so the impact of VTune's own file system operations would be small.
If your application spends a lot of time working on the file system, you may have to figure out how to restore the file system to the same state for each run. In some cases, it may involve rebooting so as to have the memory initialized the same each time.
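
A minimal sketch of the scaling arithmetic suggested above, assuming a sample is taken once every Sample After Value events. The 2 GHz event rate and the default SAV of 2,000,000 are made-up illustration values, not VTune settings, and `expected_samples` is a hypothetical helper rather than anything the tool provides.

```python
def expected_samples(event_rate_hz, run_seconds, sample_after_value):
    """Approximate sample count: one sample per `sample_after_value` events."""
    total_events = event_rate_hz * run_seconds
    return total_events // sample_after_value


# Assumed values for illustration only.
EVENT_RATE_HZ = 2_000_000_000  # hypothetical 2 GHz clocktick rate
BASE_SAV = 2_000_000           # hypothetical default, roughly one sample per ms

short_run = expected_samples(EVENT_RATE_HZ, 60, BASE_SAV)
long_unscaled = expected_samples(EVENT_RATE_HZ, 60 * 100, BASE_SAV)
long_scaled = expected_samples(EVENT_RATE_HZ, 60 * 100, BASE_SAV * 100)

print(short_run, long_unscaled, long_scaled)
```

Under these assumptions, the run that is 100 times longer with a 100 times larger SAV stores the same number of samples as the one-minute run, while the unscaled long run stores 100 times as many.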

Thomas_W_Intel · ‎05-13-2009

Tim,

Patrick is doing call-graph analysis. However, in my opinion he should do a sampling run and follow your advice to increase the Sample After Value by at least a factor of 100.

Kind regards
Thomas

robert-reed · ‎05-13-2009

Quoting - Thomas Willhalm (Intel)

[Patrick] is doing call-graph. However, in my opinion he should do a sampling run and follow your advice to increase the sample-after value by at least a factor of 100.

Picking up on comments by Thomas, which mirror my own thoughts when I was reading this thread earlier, I have a question for Patrick. Is there something specific you hope to learn by using call-graph analysis on your application? Call-graph analysis in VTune analyzer is generated by instrumenting each of the modules and capturing the calls as they occur one by one. These numbers are not sampled, so there's no way to attenuate the counts to accommodate long collection runs (four hours is definitely a long run).

Our usual recommendation for performance analysis is to start with hot spot analysis, wherein the Instruction Pointer is queried at regular intervals to track where the program is spending its time. This sampling can be varied to accommodate long runs by changing the ratio between the number of events seen and the number of samples taken (the so-called SAV or Sample After Value). By raising the SAV (which by default is set for a particular machine to assure sample collection every 1 ms or so), you can uniformly decrease the frequency of samples, guaranteeing the sample counts stay within a measurable range. The functions that collect the most counts are the places where the processor is spending most of its time and thus the most likely place for changes to affect performance.
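
To make the sampling idea concrete, here is a toy simulation, not VTune's actual collector. It assumes a trace with one entry per event naming the function that was executing, and records a sample every Sample After Value events. The function names, weights, and the `sample_hotspots` helper are all hypothetical.

```python
import random
from collections import Counter


def sample_hotspots(event_trace, sample_after_value):
    """Count one sample for whichever function is running at every SAV-th event."""
    counts = Counter()
    for index, function_name in enumerate(event_trace, start=1):
        if index % sample_after_value == 0:
            counts[function_name] += 1
    return counts


# Hypothetical workload: 70% of events in parse, 20% in render, 10% in log.
rng = random.Random(0)  # fixed seed so the toy trace is reproducible
trace = rng.choices(["parse", "render", "log"], weights=[7, 2, 1], k=100_000)

fine = sample_hotspots(trace, sample_after_value=10)
coarse = sample_hotspots(trace, sample_after_value=1000)

print(fine.most_common())
print(coarse.most_common())
```

Raising the SAV from 10 to 1000 cuts the number of samples by a factor of 100, yet the function that dominates the trace still collects the most samples, which is the property hot spot analysis relies on.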

patze · ‎05-14-2009

First of all, thanks for your replies. To answer your question, my intention is to compare two versions of my software. The first one is the original version; the second one is a refactored version in which some components have been changed. I'd like to prove that the refactored version at least isn't much slower than the original version. As not all parts of the system were modified during the refactoring, I chose call-graph analysis so that I'm able to compare the performance of the relevant functions. And this is exactly my problem: if I have such big variances in my profile data, I won't be able to state that version A or version B is faster or slower.

My application doesn't perform many operations that require hard disk access, so the file system shouldn't be the cause of these results. Also, after every test run I shut down my computer entirely for 30 min so that I can guarantee a cold start with free RAM.

One last question regarding the ratio that has been mentioned in several posts: where do I find the option to modify this ratio, and is it also applicable to call-graph analysis?
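
One rough way to frame the problem described above is to treat the run-to-run spread as noise and only call a difference between versions real when it clearly exceeds that noise. The sketch below is a crude heuristic, not a formal statistical test. The 20 ms and 25 ms values are the two runs reported earlier in this thread; every other number, the `noise_factor` threshold, and the `compare_versions` helper are assumptions for illustration.

```python
from statistics import mean, stdev


def compare_versions(times_a, times_b, noise_factor=2.0):
    """Classify B against A using average self times (ms) from repeated runs."""
    difference = mean(times_b) - mean(times_a)
    # Use the larger run-to-run spread of the two versions as the noise level.
    noise = max(stdev(times_a), stdev(times_b))
    if abs(difference) <= noise_factor * noise:
        return "inconclusive"
    return "B slower" if difference > 0 else "B faster"


# Original version: the two runs from the thread. Refactored: hypothetical.
noisy = compare_versions([20.0, 25.0], [22.0, 24.0])

# Hypothetical runs with little spread, where a 5 ms gap stands out.
stable = compare_versions([20.0, 20.4, 19.8], [25.0, 25.3, 24.9])

print(noisy, stable)
```

With only two runs that differ by 5 ms, any gap of similar size between versions stays inconclusive; more runs with less spread are needed before a per-function comparison means much.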

TimP · ‎05-14-2009

I didn't pick up on the idea that you were trying to use call graph for performance comparisons, so I apologize for getting into the discussion.

patze · ‎05-14-2009

No problem! I should have made my intention clearer. Thanks anyway for your suggestions regarding data collection!

robert-reed · ‎05-18-2009

Quoting - patze

First of all, thanks for your replies. To answer your question, my intention is to compare two versions of my software. The first one is the original version; the second one is a refactored version in which some components have been changed. I'd like to prove that the refactored version at least isn't much slower than the original version. As not all parts of the system were modified during the refactoring, I chose call-graph analysis so that I'm able to compare the performance of the relevant functions. And this is exactly my problem: if I have such big variances in my profile data, I won't be able to state that version A or version B is faster or slower.

To a certain degree it should be possible to do such comparisons using hot spot analysis as well, though the effectiveness of either technique degrades as the sources continue to diverge, such that functions and eventually whole topologies of functions are unique between the two branches. On long runs such as the ones you've described, a hot spot collection of each variant, run with an appropriately scaled Sample After Value to guarantee the counters don't overflow, could be beneficial. Just note how much the samples shift among the common functions in the before and after runs.
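
As a sketch of that before/after comparison, assuming each hot spot run has been reduced to a mapping from function name to sample count: the function names, counts, and the `share_shift` helper below are hypothetical, and each share is taken relative to all samples in its run.

```python
def share_shift(before, after):
    """Change in each common function's share of samples, largest shift first."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    common = before.keys() & after.keys()  # only functions present in both runs
    shifts = {
        name: after[name] / total_after - before[name] / total_before
        for name in common
    }
    return dict(sorted(shifts.items(), key=lambda item: abs(item[1]), reverse=True))


# Hypothetical sample counts per function for the two versions.
before = {"parse": 700, "render": 200, "log": 100}
after = {"parse": 500, "render": 300, "log": 100, "validate": 100}

shifts = share_shift(before, after)
for name, shift in shifts.items():
    print(f"{name}: {shift:+.1%}")
```

Functions that exist in only one version, such as `validate` here, drop out of the comparison, which is the degradation mentioned above as the two source trees diverge.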

My application doesn't perform many operations that require hard disk access, so the file system shouldn't be the cause of these results. Also, after every test run I shut down my computer entirely for 30 min so that I can guarantee a cold start with free RAM.

One last question regarding the ratio that has been mentioned in several posts: where do I find the option to modify this ratio, and is it also applicable to call-graph analysis?

I'm not sure what the value is of shutting down your computer before each collection run, especially for a task that takes four hours of collection to produce a meaningful result. It will likely make the collected data more complex, since uncached accesses are more likely at the start of such an application, and might even account for some of the variability you report. I'm not sure which ratio is being referred to. Certainly there are no such ratios that I am aware of in call-graph analysis, though with lots of sample counts there are lots of possibilities for ratios in Event Based Sampling.