a newbie question!

ahpoddar · ‎11-11-2005

Hi everyone,

I recently used the demo version of VTune and was wondering about the features of the full version. My specific requirement is to be able to extract the number of clock cycles spent in ALL the functions, and in ALL the individual statements in those functions. If that would be too taxing on the resources of the system as a whole, is it possible to set up special bookmark pairs in the code and get the number of clock cycles spent between each pair?

Any answer to this will be highly appreciated!

with regards,

Ashish.

David_A_Intel1 · ‎11-12-2005

Hi ahpoddar:

What you are asking for is basically what a simulator would do. If you tried to measure every single instruction as it executed on the processor, it would basically make the system crawl, even between two "bookmarks."

In the past, there were such things as in-circuit emulators to get this info. However, as processors got faster and faster, it was not possible to do it, even in hardware. I don't know what is available nowadays.

The VTune analyzer can give you function-level timings using Call Graph, or statistical processor utilization using Sampling. Neither is measured at the instruction level.

Regards,

TimP · ‎11-12-2005

You could collect an approximate number of cycles between two points in the code by inserting _rdtsc() intrinsics (Intel or Microsoft C).

ahpoddar · ‎11-12-2005

Hi DaveA and tim18,

Thanks for the quick response!

I have just a couple more questions following on from this. Right now, in the demo version, I was able to get statistics for some of the functions, and for some specific lines in those functions.

1) In the full version, will I be able to get statistics for all the functions, or will it still be similar to what I saw in the demo version? From your replies I understand that by inserting "__rdtsc()" in the code I will be able to find the time spent between two checkpoints. Is there any such modifier for a whole function as well, or will I have to use the same thing at the entry point and exit point of the function? The difference I have in mind here is the overhead of copying the parameters onto the stack, which might be a big factor during execution.

2) I understand that in the full version it won't be possible to get the time spent on every line of code, and that wherever we suspect a lot of time is being spent I can insert "__rdtsc()" checkpoints (please correct me if I am wrong). However, in the demo version I was able to get the time spent in some of the loops and in the places where the majority of the time was spent. Will I similarly get information for all or most of the functions, or not?

I will appreciate your response to this query.

regards,

Ashish.

TimP · ‎11-13-2005

VTune, or other profilers, including gprof, will give you a measure of the time spent per function, for those functions built with the debug symbol option (-g for Linux VTune). VTune does it simply by adding up the timer interrupt events which occurred in each such function. The part of your code which is built with -g should work with VTune much as the example does. Information on functions where few timer events occur won't be very accurate, but you don't want to collect events at default rates for more than a few minutes.
gprof (option -pg) does actually insert a timer function call which times from the beginning of a function to the return, but part of the function entry time either gets allocated to the calling function or is not allocated at all. The VTune call graph option does something similar, but adds significant overhead.

ahpoddar · ‎11-14-2005

Hello tim18,

Thanks once again !

One part is still not clear to me. As you mentioned, the function calls are monitored by counting the timer interrupts which occur during the function call. Do these interrupts count from the moment the call is made, or only after the function's call stack has been created (with the pass-by-value parameters having been copied to the called function's parameter variables)?

I just hope that my post here is not too confusing...

thanks for all the replies so far, and a comment on this will be highly appreciated.

regards,

ashish.

David_A_Intel1 · ‎11-15-2005

Hi ashish:

I think it is confusing because the VTune analyzer displays something close to what you expected.

While the data is displayed for some source lines, it does not mean we collected detailed data for that source line. Rather, the VTune analyzer performs statistical sampling: it periodically interrupts the processor and collects the EIP, Process ID, and Thread ID. This gives you a representative view of what your code is doing, i.e., where it is spending significant time. It does not count cycles for each instruction. When you view the source, the samples collected on instructions within a source line are aggregated and displayed for that source line. What that tells you is that, in general, this source line represents x% of all the time spent executing your application. You should focus your optimization efforts on the lines with a significant amount of time, as opposed to optimizing code that had few or no samples collected on it.

The VTune analyzer's crowning feature is the ability to base the sampling period on the processor's performance monitoring events. So, for example, you could collect samples based on 2nd-level Cache Read Misses to highlight code that may be suffering performance problems due to cache issues.

Hope that helps.