I am trying to use Intel VTune to get a breakdown of the execution time of a C/C++ program. I do not want to instrument the code. I have compiled and run the code with debugging/profiling enabled in the compiler.
I would like to calculate a break down of the following :
Can this breakdown be obtained using VTune? If so, how? Here is what I have tried:
In the "General Exploration" viewpoint, under "Bottom-up", I can see a nice breakdown (in percentages): Core Bound + Retiring. Is it safe to assume that this sum is the pure 'computation cost', and that "Front-End Bound" plus "Memory Bound" is the 'total portion of time spent on the memory system'?
What about the time spent on disk I/O? Where is this information? What is the disk latency?
Also, I can see 'modules' within my process: vmlinux, libtpsstool, libc, libstdc and other libXX modules. Is it safe to assume all of these are 'kernel' related?
I have attached a screenshot.
Many thanks for your help.
"Effective CPU time" in the advanced-hotspots report is what you want; it does not include wait time. You can ignore the other items, "Spin time" and "Overhead time".
"Clockticks" in the general-exploration report is what you want; it includes front-end bound + bad speculation + back-end bound + retiring. If you are interested in analyzing memory-bound data, it is in the back-end bound category.
thank you for your reply.
I have the following measures:
Elapsed time : 16.839s and Effective time : 11.833s, I can see CPUTime = 11.837s , so roughly EffectiveTime = CPUTime
But when I look at the "Memory usage viewpoint" it says 23.2% is Memory Bound. And in the "General exploration" viewpoint I can see 42% as 'Retiring' and 29.9% as Core Bound.
Does this mean my 'Pure Computation time' is simply (100-23.2) = 76.8% ? (rather high)
Or is it only 42% (i.e. retiring) ?
The metric terminology is a bit confusing. By Core-bound I assume the application is 'using the core' by that percentage.
> Elapsed time : 16.839s and Effective time : 11.833s, I can see CPUTime = 11.837s , so roughly EffectiveTime = CPUTime
CPU time being roughly equal to Effective time means there is almost no spin time or overhead time. I think your program is serial code, because CPU time < Elapsed time.
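The relationship between the three timings quoted above can be checked with a few lines of arithmetic. This is only a sketch using the numbers from this thread; the variable names are illustrative, and reading the gap between elapsed and CPU time as time spent off the CPU (waiting, which may include disk I/O) is an assumption that holds only for a single-threaded run.

```python
# Timings reported in this thread, in seconds.
elapsed_s = 16.839
cpu_s = 11.837
effective_s = 11.833

# CPU time that is not counted as effective: spin + overhead time.
spin_overhead_s = cpu_s - effective_s

# Assumption: one thread, so elapsed - CPU is time the thread was not running
# (blocked or waiting). VTune does not say here what it was waiting on.
off_cpu_s = elapsed_s - cpu_s
cpu_utilization = cpu_s / elapsed_s

print(f"spin + overhead: {spin_overhead_s:.3f} s")
print(f"off-CPU (waiting): {off_cpu_s:.3f} s")
print(f"CPU utilization: {cpu_utilization:.1%}")
```

With these inputs the off-CPU gap is about 5.0 s, roughly 30% of the elapsed time, which is where any disk I/O wait would be hiding.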
> But when I look at the "Memory usage viewpoint" it says 23.2% is Memory Bound. And in the "General exploration" viewpoint I can see 42% as 'Retiring' and 29.9% as Core Bound.
I am confused by your mention of the "Memory usage viewpoint", since it cannot be in the General-exploration report. Did you mean the HPC Performance Characterization viewpoint? Memory bound (like core bound) is only one part of back-end bound:
CPU time (Effective) = Front-end + Bad-speculation + back-end + retiring
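As a rough illustration of how the four categories fit together, the percentages quoted in this thread can be plugged into that identity. This is a sketch under a loud assumption: it treats the 23.2% Memory Bound figure (taken from a different viewpoint) and the 29.9% Core Bound figure as being on the same basis and together making up back-end bound. The function name is hypothetical, not part of VTune.

```python
def back_end_and_remainder(retiring, memory_bound, core_bound):
    """Return (back_end, front_end_plus_bad_speculation), all in percent.

    Assumes memory bound + core bound = back-end bound, and that the
    four top-level categories sum to 100%.
    """
    back_end = memory_bound + core_bound
    remainder = 100.0 - retiring - back_end
    if remainder < 0:
        raise ValueError("categories exceed 100%; inputs are inconsistent")
    return back_end, remainder


# Percentages quoted in this thread.
back_end, remainder = back_end_and_remainder(
    retiring=42.0, memory_bound=23.2, core_bound=29.9
)
print(f"back-end bound: {back_end:.1f}%")
print(f"front-end + bad speculation: {remainder:.1f}%")
```

If the remainder comes out negative or implausibly large, the inputs were probably collected on different bases and should not be combined this way.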
Thank you so much for taking time to reply, and sorry for my late reply.
Yes, you are right, it is serial code. I am trying to profile an HEVC decoder. For reference, I am using VTune Amplifier XE 2016 (update 2 build 444464). I've attached my profile data to this response.
I ran 'Memory Access' under 'Analysis Type'.
I am then checking the analysis tab "Memory usage viewpoint". There are other viewpoints as well, such as 'general exploration' , 'hardware events' etc.
I understand that [CPU time (Effective) = Front-end + Bad-speculation + back-end + retiring]. So maybe what I need is not [CPU time]. I want the time, or percentage of effective time, taken to perform ONLY the computation, EXCLUDING time spent waiting on stalls, on memory requests/responses, on computation units becoming available, etc. Just the pure time taken to compute: I think this is called 'Retiring'? For my profile, Retiring = 42%.
What is Core-bound? The explanation on the Intel website is a bit unclear (or maybe it's only unclear to me). Is it the time the CPU 'waited', not doing anything, because one or more of the computation units were occupied?
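To put a number on that idea, the Retiring percentage can be converted into an approximate time. This is a minimal sketch, assuming the Retiring share of pipeline slots can stand in for a share of effective CPU time; that is only a proxy, since the metric counts issue slots rather than seconds.

```python
effective_cpu_s = 11.833   # Effective CPU time reported in this thread
retiring_fraction = 0.42   # Retiring, from the General Exploration viewpoint

# Assumption: slot share is used as a stand-in for time share.
retiring_s = retiring_fraction * effective_cpu_s
not_retiring_s = effective_cpu_s - retiring_s

print(f"approx. time retiring useful work: {retiring_s:.2f} s")
print(f"approx. time in stalls or wasted work: {not_retiring_s:.2f} s")
```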
In the user manual: ... On the other hand, “core bound” which corresponds to stalls due to either the Execution- or OOO-clusters, is a bit trickier. These stalls can manifest either with execution starvation or non-optimal execution ports utilization. For example, a long latency divide operation may serialize the execution causing execution starvation for some period, while pressure on an execution port that serves specific types of uops, might manifest as small number of ports utilized in a cycle.
In my view, computation time should include all four categories; they are all execution time. What you need to do is optimize them, for example memory bound and core bound (port utilization).