Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

Profiling time-consuming loops of a program

Prasanth_G_
Beginner
398 Views

How can VTune help profile a program that has many loops/look-ups that consume CPU time? It may or may not catch them as hotspots, given the 1 ms sampling interval. I recently got a significant performance improvement by choosing functions that have relatively fewer iterations or loops in them. However, none of these functions are listed in the basic or advanced hotspots analysis. I looked at General Exploration too, but saw no sign that the code I took out to gain the large improvement was a bottleneck.

Questions:

1. Are there any suggestions for catching short loops in a C program that are called many times and reduce overall performance?

2. How can VTune help profile software if it cannot catch bottlenecks within the 1 ms sampling interval? For example, suppose VTune misses a function that runs for less than 1 ms but is called many times, because a sample is collected only once every 1 ms. General Exploration may also show no issue with the bottleneck function in its front-end, back-end, bad-speculation, or retiring categories.

Any help is greatly appreciated.

Thanks in advance

2 Replies
Peter_W_Intel
Employee

I hope I understand your requirement correctly: the function you are interested in has few loops or iterations, and its execution time is less than 1 ms. My points are:

1. Such a function is not the first priority for optimization, unless

2. That function is called frequently by others. So you need a caller that calls this function many times (>= 100, for example, I suppose). VTune(TM) Amplifier XE is not guaranteed to capture a sample every time that function is called; the number of samples that land in it might be 70, 80, or some other number.

3. To capture more samples that fall in the function, you need to decrease the sampling interval (that is, sample more often). For example, if your CPU frequency is 2 GHz, VTune will use a sample-after value of SAV = 2,000,000,000 / 1000 = 2,000,000 (that means 1000 samples per second, or a 1 ms sampling interval). You may use SAV = 200,000 so that the sampling interval is 0.1 ms. For example:

>amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.THREAD:sa=200000,INST_RETIRED.ANY:sa=200000 -- application parameters

Of course, overhead will increase because the sampling interrupt handler runs more often.

Note: the sampling interval of advanced-hotspots is [0.01-1000] ms (default is 1 ms); the sampling interval of basic hotspots is [1-1000] ms (default is 10 ms), so don't use basic hotspots in your case.
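
To make the SAV arithmetic in point 3 concrete, here is a small Python sketch. It is only an illustration: the helper names `sample_after_value` and `event_config` are hypothetical (they are not part of VTune), and it assumes the clock-cycle event ticks at the nominal CPU frequency, which turbo or power-saving states can change in practice.

```python
def sample_after_value(cpu_hz, interval_us):
    """Clock cycles between two samples for the requested interval.

    Assumes the cycle counter advances at the nominal frequency cpu_hz.
    Integer arithmetic keeps the result exact.
    """
    return cpu_hz * interval_us // 1_000_000


def event_config(sav, events=("CPU_CLK_UNHALTED.THREAD", "INST_RETIRED.ANY")):
    """Format an event-config value in the style of the command above."""
    return ",".join(f"{name}:sa={sav}" for name in events)


cpu_hz = 2_000_000_000                          # 2 GHz, as in the example
default_sav = sample_after_value(cpu_hz, 1000)  # 1 ms interval
faster_sav = sample_after_value(cpu_hz, 100)    # 0.1 ms interval

print(default_sav, faster_sav)
print(event_config(faster_sav))
```

Dividing the SAV by 10 gives roughly 10 times as many samples per second, and a correspondingly higher interrupt overhead.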

 

 

 

McCalpinJohn
Honored Contributor III

As a general rule, if the program runs long enough then the short loops should be captured if they actually contribute significantly to the overall run-time. 
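
A toy simulation can illustrate this rule. The sketch below is not how VTune works internally; it is a simplified model with made-up numbers: a 20 µs function called every 200 µs (10% of run time), sampled once per 1000 µs with a uniformly random offset inside each interval. The function name `sampled_share` and all parameters are hypothetical.

```python
import random


def sampled_share(call_period_us, call_duration_us, sample_interval_us,
                  n_samples, seed=0):
    """Fraction of samples that land inside the short function."""
    rng = random.Random(seed)
    hits = 0
    for i in range(n_samples):
        # Assumption: each sample is jittered uniformly within its interval,
        # so the sampler cannot lock onto the program's own period.
        t = i * sample_interval_us + rng.uniform(0, sample_interval_us)
        # The function runs during the first call_duration_us of each period.
        if t % call_period_us < call_duration_us:
            hits += 1
    return hits / n_samples


share = sampled_share(call_period_us=200, call_duration_us=20,
                      sample_interval_us=1000, n_samples=10_000)
print(f"sampled share: {share:.3f}, true share: {20 / 200:.3f}")
```

Even though each call is 50 times shorter than the sampling interval, the fraction of samples landing in the function should come out close to its true 10% share of the run time.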

There are a couple of things to watch out for:

  • If the code's execution is perfectly repeatable and exactly synchronized with the timer used for sampling, you could miss important sections of code.   Some tools randomize the sampling interval to attempt to avoid this -- I don't know if VTune uses such an approach.
  • It is hard to tell how long "long enough" is, but if you avoid the synchronization issue mentioned above, a 1000-sample run should be very reliable at finding any piece of code (no matter how short) that accounts for 10% of total execution time.
  • Sometimes a change to a piece of code that is not a significant contributor to the overall run-time will result in a significant benefit by allowing the compiler to generate different (better) code for other loops or functions.  Such cases are usually associated with the compiler's aliasing analysis, and may require a great deal of expertise (or luck) to find.
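
The first two points can be sketched numerically. This is a simplified model under stated assumptions, not a description of VTune's sampler: `lockstep_hits` (hypothetical name) samples at a fixed interval with no jitter, and `sample_count_stats` (also hypothetical) treats samples as independent draws, which gives a binomial estimate of how many samples to expect in a piece of code.

```python
import math


def lockstep_hits(call_period_us, call_duration_us, sample_interval_us,
                  phase_us, n_samples):
    """Samples that land in the function when sampling is strictly periodic."""
    hits = 0
    for i in range(n_samples):
        t = phase_us + i * sample_interval_us
        # The function occupies the first call_duration_us of each period.
        if t % call_period_us < call_duration_us:
            hits += 1
    return hits


def sample_count_stats(n_samples, share):
    """Mean, standard deviation, and P(zero samples), assuming independence."""
    mean = n_samples * share
    std = math.sqrt(n_samples * share * (1 - share))
    p_zero = (1 - share) ** n_samples
    return mean, std, p_zero


# A 20 us function every 200 us, sampled every 1000 us: the sampling interval
# is an exact multiple of the call period, so the phase decides everything.
print(lockstep_hits(200, 20, 1000, phase_us=50, n_samples=1000))  # never hit
print(lockstep_hits(200, 20, 1000, phase_us=10, n_samples=1000))  # always hit

mean, std, p_zero = sample_count_stats(n_samples=1000, share=0.10)
print(f"expected samples: {mean:.0f} +/- {std:.1f}, P(no samples) = {p_zero:.1e}")
```

In the lockstep case the profile is either blind to the function or wildly overstates it, depending on phase. Without that synchronization, a 1000-sample run puts about 100 samples in code that takes 10% of the time, with a spread of roughly 9 to 10 samples, so missing it entirely is essentially impossible.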