I am reviving this post because the issue is still very relevant.
This is what I need, and maybe a lot of people around here have the same needs.
The title is: REAL time matters, CPU time doesn't (when you are comparing parts of code)
In the picture there is an example of the current VTune output for the function "a" and what I suggest adding.
a_mean is the mean of the time spent by the function "a" in each thread
a_dev is the max deviation from this mean value (or the mean deviation)
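A minimal sketch of the two suggested quantities, assuming the per-thread times for "a" are already available as plain numbers. The function name `mean_and_deviation` and the sample values are hypothetical; nothing here is a VTune API or real VTune output.

```python
from statistics import mean

def mean_and_deviation(per_thread_seconds):
    """Return (mean, max deviation, mean deviation) of per-thread times."""
    avg = mean(per_thread_seconds)
    deviations = [abs(t - avg) for t in per_thread_seconds]
    return avg, max(deviations), mean(deviations)

# Hypothetical time (seconds) spent in function "a" on each of 4 threads.
a_times = [2.0, 2.5, 1.5, 2.0]

a_mean, a_dev_max, a_dev_mean = mean_and_deviation(a_times)
cpu_sum = sum(a_times)  # what a summed CPU-time column would report

print(f"a_mean={a_mean} a_dev(max)={a_dev_max} a_dev(mean)={a_dev_mean}")
print(f"sum over threads={cpu_sum}")
```

With well-balanced threads, a_mean approximates the wall-clock time of "a", while the sum is roughly that value multiplied by the number of threads.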
So, I AM NOT interested in:
- the SUM of the time spent in every core by each function (so the solution cannot be just "Filter IN by selection" on a specific function)
- the time spent in EACH core by each function (so the solution cannot be just "Filter by thread")
but I AM interested in:
- the AVERAGE time spent by each function
Moreover, the load on each core should be balanced, so the average value is much more interesting.
Why? Because REAL time matters, not CPU time!!!
I would like to COMPARE different parts of the code and see which of them take more wall-clock time.
Then, and only then, I will focus on the functions that take longer (checking whether there is room for more parallelization, and so on).
Having the SUM of the time spent makes sense only once you are already at the next step: improving the parallelized code. It does not make sense before that, when you still have to compare parts of the code.
I know the sum has some meaning, because logically every improvement in parallelized code is amplified by a factor of num_threads, but this is not what I want to look at in the first step.
Please ask if the idea is not clearly explained.
P.S.: please, do not tell me that I need Frames, Tasks, Events and all the __itt_blablabla stuff!
Instrumenting the code takes more time and requires changing the code, the library linking and the include files... it makes no sense to do it when you already have all the information from just a sampling profile!!!!
There is no actual need to instrument the code! It is like buying a truck to go to work: isn't a car enough?
I have some experience with tuning OpenMP apps (which might help here, since the picture you provided is close to the "fork-join" model), so let me make a few points.
In terms of imbalance, it is not always enough to look at what you want for a single function: a might be imbalanced and leave some cores idle, b might leave others idle, and c can behave differently again, so individually they can be imbalanced but together they will be OK. So it is important to calculate the per-thread time wasted on the barrier before s2 for the full group a, b, c.
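The point about group-level imbalance can be sketched as follows. This assumes a, b and c run back to back on each thread with a single barrier before s2 and no barrier between them; the function name `barrier_wait_per_thread` and all the numbers are hypothetical.

```python
def barrier_wait_per_thread(times_by_function):
    """Per-thread idle time at the barrier that closes the whole group.

    times_by_function maps a function name to its list of per-thread times.
    Assumes no synchronization between the functions inside the group.
    """
    n_threads = len(next(iter(times_by_function.values())))
    busy = [sum(times[t] for times in times_by_function.values())
            for t in range(n_threads)]
    region_wall = max(busy)  # the slowest thread releases the barrier
    return [region_wall - b for b in busy]

# Each function is imbalanced on its own (spread of 2.0 across threads)...
balanced_group = {
    "a": [3.0, 1.0, 2.0],
    "b": [1.0, 2.0, 3.0],
    "c": [2.0, 3.0, 1.0],
}
# ...but every thread does 6.0 in total, so nobody waits at the barrier.
waits_balanced = barrier_wait_per_thread(balanced_group)

# Same a and b, but c is now slower on thread 2: threads 0 and 1 wait.
skewed_group = dict(balanced_group, c=[2.0, 3.0, 2.0])
waits_skewed = barrier_wait_per_thread(skewed_group)

print(waits_balanced, waits_skewed)
```

Looking at a, b or c alone would flag an imbalance in both cases; only the per-thread wait over the full group separates the harmless case from the costly one.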
Also, in a significant share of throughput codes you will have your a-b-c instance inside a loop, and there might be thousands of instances. In these conditions you might be interested in metrics aggregated by region, and probably in the divergence between instances, so that you can concentrate on the outliers.
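A rough sketch of that kind of per-region aggregation, assuming the wall time of each a-b-c instance has already been collected into a list. The name `summarize_instances`, the 1.5x-median outlier rule and the sample values are all assumptions chosen for illustration, not something a profiler prescribes.

```python
from statistics import median

def summarize_instances(instance_wall_ms, factor=1.5):
    """Aggregate region instances and flag those slower than factor * median."""
    med = median(instance_wall_ms)
    total = sum(instance_wall_ms)
    outliers = [(i, t) for i, t in enumerate(instance_wall_ms)
                if t > factor * med]
    return total, med, outliers

# Hypothetical wall time (ms) of six consecutive a-b-c instances.
instances = [10.0, 10.0, 11.0, 9.0, 10.0, 25.0]

total, med, outliers = summarize_instances(instances)
print(f"total={total} ms, median={med} ms, outliers={outliers}")
```

The median is used as the reference instead of the mean so that a few very slow instances do not shift the threshold they are compared against.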
Both of the things above are achievable only with code markup, where you know the fork and join points. This is what the frame API is for. Since you instrument global points (not per-thread ones), this will be cheap enough even if you have regions of a couple of microseconds. (Note: function enter/leave instrumentation to calculate function elapsed time is much more intrusive, since it is per-thread.) So this is not just blablabla; it is the thing that can really help you with parallel efficiency analysis for fork-join codes, and it lets you get wall-time metrics.
By the way, if an a-b-c instance is big enough to be statistically representative in terms of sampling, you can dive into a particular instance and, by applying the Function/Thread grouping, see how a particular a, b or c was executed on a particular thread if needed.
Thanks & Regards, Dmitry