a. I am trying to generate a call graph for my application on linux. When I start VTune for graph generation, the log says Static Instrumentation started. Does that mean that the results in the graph will not be based on the actual CPU usage of my code executing on the machine? If that is so, then how do I configure the graph to collect results dynamically, similar to the Sampling Wizard? And what is VTune presenting the results based on if it is not profiling the application from the CPU?
b. Is there a way to filter/profile a particular shared object file for an application? Since many .so files are part of the application, the entire process of call graph generation takes over 2 hours.
c. How helpful/different is running VTune on a debug binary vs. a regular (non-debug) binary?
Call graph instrumentation may add significant overhead, so it may not accurately reflect how your optimized code performs outside call graph collection. If you are running on Linux, or with certain other compilers on Windows, you can get a call graph from other profilers (e.g. gprof) with less overhead. There isn't enough information available for call graph profiling without some form of instrumentation; gprof does it by adding a monitor-function layer for each function call, which records who called and totals up time (in aggregate, not split by caller). Event sampling is most often done on an optimized build with debug symbols, possibly one without interprocedural optimizations. Debug symbols don't have a significant influence on performance, and they allow the VTune analyzer to point to your source code. Optimization usually does influence performance, which matters if you are interested in performance profiling.
The 'static' in Static Instrumentation refers to when the instrumentation is performed, not when the data is collected. Static instrumentation is performed before the application runs (more precisely, when each module is loaded). The time for each function is collected as the instrumented program runs.
The newer version of VTune (VTune Amplifier XE) can capture performance information statistically: it collects data with less overhead, although it does not generate the complete call graph.
I was re-reading your reply and have some fresh doubts. When generating the call graph with VTune, I did get information such as the time spent executing a particular function, the total time spent in that function across all its calls, etc. So when you compared it with gprof, were you suggesting that the information provided by gprof is more sophisticated/elaborate than VTune's? (Please note that I have not run gprof yet; that will require quite a bit of effort, as I'll need to customize it for the specific daemon that I intend to profile, so I am clarifying certain things with you first.)
Also, referring to your first sentence about overhead: when we run the code under a tool with higher overhead, in what way does it fail to profile the application accurately, and is it in any way better than a tool with less overhead?
Call graph instrumentation takes additional time on every function call and return, so it is not representative of release performance for functions that run only a short time per call. gprof might be considered less sophisticated than VTune (no GUI is provided, and there is no analysis of how much of a function's time is attributable to each caller), but it normally has considerably less overhead and often introduces little distortion of the results.
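To see why per-call probes distort short functions, here is a small C sketch that times a trivial function with and without entry/exit timestamps. The function names (`tiny`, `tiny_instrumented`) and the timestamp-based probe are illustrative assumptions, not what VTune or gprof actually inserts; the point is only that fixed per-call bookkeeping is large relative to a function that does almost no work per call:

```c
#include <stdio.h>
#include <time.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static double probe_time = 0.0;

/* A function with very little work per call. */
static long tiny(long x) { return x + 1; }

/* The same function wrapped with entry/exit probes: the probes cost
   roughly the same regardless of how much work the callee does, so
   for a short function they dominate the measured time. */
static long tiny_instrumented(long x) {
    double t0 = now_sec();        /* entry probe */
    long r = tiny(x);
    probe_time += now_sec() - t0; /* exit probe */
    return r;
}

int main(void) {
    const long N = 1000000;
    long a = 0, b = 0;

    double t0 = now_sec();
    for (long i = 0; i < N; i++)
        a += tiny(i);
    double raw = now_sec() - t0;

    t0 = now_sec();
    for (long i = 0; i < N; i++)
        b += tiny_instrumented(i);
    double inst = now_sec() - t0;

    printf("raw loop: %.4f s, instrumented loop: %.4f s\n", raw, inst);
    printf("results match: %d\n", a == b);
    return 0;
}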