Measure Speedup with VTune not possible?

Constantin_Christman
I am using VTune Amplifier XE for Windows to support the parallelization of a given program.
VTune is a good help in showing me the hotspots of the program. However, I am curious how it can help me measure the improvement after parallelization.
For example: I have a function A which is identified as a hotspot. After parallelization it is executed concurrently on multiple processors, which speeds everything up. What the VTune analysis then shows me is the CPU Time summed over all busy processors, which is more or less the same as in the sequential case. This is not a surprise, as the actual work was not reduced by parallelization.
I guess measuring the (inclusive/exclusive) time of a given function is just not possible with sampling... am I right here?
One more thing: in your VTune tutorial (https://wiki.engr.illinois.edu/download/attachments/114688007/amplifier_xe_linux.pdf?version=1&modificationDate=1296056455000) on page 27 the author mentions two options for improving the code:
* sequential tuning
* parallelization
and in the tutorial they choose the first option. This leaves the impression that you could also have used VTune to support the second option, which seems not to be true, as I have described above.
Or did I miss something, and you can use VTune to measure the speedup of a function after parallelization?
Constantin
Peter_W_Intel
Employee

VTune Amplifier XE can help to identify the hotspots, and there are usually two kinds of improvement:
1. The workload of the hot function can be parallelized (which you have done), so that it makes the best use of the multi-core system and, as a result, the execution time is reduced. You can use Concurrency Analysis to see whether the concurrency level gets better. You are right: the total workload is not reduced but parallelized, so it is the execution time of the program that is reduced in the Summary report. You might review the bottom-up report using the grouping "Thread / Function / Call stack" to see the parallel workload in each thread. Check whether it is imbalanced and whether the algorithm needs adjusting again.

2. After completing the parallelization work, we can step into microarchitecture-level tuning, such as branch misprediction issues, cache misses, etc. You adjust your code or use the Intel C++ compiler's advanced optimization options. As a result, the execution time of the hot functions will be reduced, which is quite different from parallelism optimization.
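
The per-thread check in point 1 can be sketched in a few lines. This is only an illustration: it assumes you have copied the per-thread CPU times of the hot function out of the bottom-up report by hand. The numbers and the helper name `imbalance` are made up; they are not VTune output or a VTune API.

```python
from statistics import mean

def imbalance(cpu_seconds_per_thread):
    """Return (total CPU time, max/mean ratio) for one hot function.

    A ratio near 1.0 means the threads did about the same amount of work;
    a larger ratio means one thread carries more than its share.
    """
    total = sum(cpu_seconds_per_thread)
    ratio = max(cpu_seconds_per_thread) / mean(cpu_seconds_per_thread)
    return total, ratio

# Hypothetical CPU times (seconds) of function A in four threads.
balanced = [2.5, 2.5, 2.5, 2.5]
skewed = [6.0, 2.0, 1.0, 1.0]

print(imbalance(balanced))  # total 10.0, ratio 1.0
print(imbalance(skewed))    # total 10.0, ratio 2.4
```

Both cases have the same total CPU time, which is why the summed CPU Time alone says nothing about speedup; only the distribution across threads differs.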

Regards, Peter

Constantin_Christman
Hi Peter,
thanks for your reply!
To sum it up: there is no way to figure out the speedup of a hotspot after parallelization with VTune - is this correct?
Peter_W_Intel
Employee

Yes. The workload does not change, so there is no direct indicator to compare them; that is why I suggested using the concurrency level (CL) and the execution time.
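
One way to get the execution time of a single function is to time it in the code itself and compare the serial and parallel versions. A minimal sketch, assuming you can wrap the call site; `timed` and `speedup` are hypothetical helpers and not part of VTune.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def speedup(serial_seconds, parallel_seconds):
    """Speedup = serial elapsed time / parallel elapsed time."""
    if parallel_seconds <= 0:
        raise ValueError("parallel_seconds must be positive")
    return serial_seconds / parallel_seconds

# Stand-in workload; replace it with the serial and parallel versions
# of the hot function and time each one.
_, elapsed = timed(sum, range(1_000_000))
print(f"elapsed: {elapsed:.4f} s")

# Hypothetical measurements: 8 s serial, 2 s parallel.
print(speedup(8.0, 2.0))  # 4.0
```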

Sometimes you can use Critical Path (CP) data to compare with the serial result. I assume that you have reassigned the work to different threads and start them at almost the same time stamp, so the work in each thread finishes at a different time:
T1 T2 T3 T4 T5
w1
w2
w3

So CP = T4 (the time at which the last work item finishes); compare this with the serial result.
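
The comparison can be written out as a small calculation. This is a sketch under the assumption stated above: all work items start at the same time stamp, so the critical path is simply the latest finish time. The finish times and the serial time are made-up numbers.

```python
def critical_path(finish_times, start=0.0):
    """Elapsed time of the parallel region when all work starts at `start`."""
    return max(finish_times.values()) - start

# Hypothetical finish times (seconds) of the three work items.
finish = {"w1": 2.0, "w2": 4.0, "w3": 3.0}
serial_time = 9.0  # hypothetical elapsed time of the serial version

cp = critical_path(finish)
print(cp)                # 4.0
print(serial_time / cp)  # 2.25
```

If the threads do not start together, the latest finish time minus the earliest start time would be the number to compare instead.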

Thanks, Peter

Constantin_Christman
Thanks for your help, Peter!