We're trying to get started using VTune and it's obvious that we've got a lot to learn. Can anyone point me toward some good learning resources? Websites and books would both be great. If Intel or any other companies offer training courses that you know of, I'd also like to know about them so they can be considered. I saw this post which is fairly helpful but it's still a little bit over my head I think: http://software.intel.com/en-us/forums/showthread.php?t=71883&o=d&s=lr
I've got one other more specific question as well. A very frequent use case that we're going to have is optimizing very short functions and looking for places where we can shave off a handful of milliseconds. For example, I'm currently working on a function that takes less than 0.01 milliseconds to execute on my computer but it's called many thousands of times, frequently totaling up to about 20 milliseconds after all is said and done. Can VTune help much at all with something so short or is that below a threshold where it just can't get enough information to work with? Most of our stuff does some serious computations on lots of data so in a lot of cases I'd like to be able to determine with some certainty whether we're up against cache problems or just really hammering away on the CPU.
Thanks in advance,
There are several books available regarding the VTune analyzer. I highly recommend the 2nd edition of The Software Optimization Cookbook, http://www.intel.com/intelpress/sum_swcb2.htm. There is also VTune Performance Analyzer Essentials, http://www.intel.com/intelpress/sum_vtune.htm.
We also have many articles available in our KnowledgeBase, for example "How do I Profile a Microsoft* .NET* Web Application?" and "Locating Thread Contention with VTune Performance Analyzer". There are lots of other resources on our community web site as well.
I'm sorry, but I don't know of any companies providing training on the VTune analyzer. Perhaps others on the forum have suggestions?
Finally, regarding your last question, it depends on a couple of things. First, for sampling, if 20 milliseconds is a significant amount of the overall application time, then you can adjust the collection configuration to collect data at this granularity. For example, if the application completes in one second, then you can decrease the sample-after-value so that samples are collected more frequently. The downside is that more frequent sampling introduces more overhead and perturbs the system more. And, if you aren't careful, you can lock up the system! For example, if you increase the sampling rate so that, say, 10,000 samples are collected each second, the system will become unresponsive.
Second, call graph will give you function timing information, so you may be able to use this information to measure performance improvement. However, timing information from call graph is not completely accurate, since there are some heuristics used to reduce overhead and estimate collection overhead.
I don't know if you are aware of it, but you can try a free 30-day evaluation and see if you can collect any meaningful data: http://www.intel.com/software/products/global/eval.htm
I think the best bet is to add timing code around the function call and either use a large enough workload to keep the function busy (which would also allow you to use sampling effectively) or aggregate the timing information for each call and spit out total and average execution times.
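As a minimal sketch of the "aggregate the timing information" approach, something like the following could work, assuming a C++11 or later compiler. CallTimer and its members are hypothetical names for illustration, not anything shipped with the VTune analyzer.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: accumulates elapsed time across many short calls so
// that total and average execution times can be reported at the end.
struct CallTimer {
    using Clock = std::chrono::steady_clock;  // monotonic clock

    Clock::duration total{};   // running sum of elapsed time, starts at zero
    std::uint64_t calls = 0;   // number of timed invocations

    // Times one invocation of f and adds it to the running total.
    template <typename F>
    void measure(F&& f) {
        auto start = Clock::now();
        f();
        total += Clock::now() - start;
        ++calls;
    }

    // Total time across all timed calls, in milliseconds.
    double total_ms() const {
        return std::chrono::duration<double, std::milli>(total).count();
    }

    // Average time per call in milliseconds (0 if nothing was timed).
    double average_ms() const {
        return calls ? total_ms() / static_cast<double>(calls) : 0.0;
    }
};
```

You would wrap each call as timer.measure([&] { your_function(args); }) and print timer.total_ms() and timer.average_ms() when the run finishes. Keep in mind that reading the clock twice per call adds its own overhead, which matters for a function this short, so compare before and after numbers taken the same way rather than treating them as absolute times.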
The sampling feature of the analyzer is going to give you what you need to determine what is impacting performance of the function, e.g. cache efficiency, instruction mix, etc. But, you are going to have to keep it executing long enough to collect that data.