How can VTune help profile a program that spends its CPU time in many small loops and look-ups? With the 1 ms sample collection interval, it may or may not catch them as hotspots. I recently got a significant performance improvement by switching to functions that contain fewer iterations or loops. However, none of the functions I replaced were listed in the basic or advanced hotspots analysis. I looked at General Exploration too, but saw no sign of a bottleneck in the piece of code that was taken out to get the big improvement.
1. Are there any suggestions for catching short loops in a C program that are called many times and reduce overall performance?
2. How can VTune help profile software if it cannot catch bottlenecks that are shorter than the 1 ms sampling interval? For example, suppose a function runs for less than 1 ms per call but is called many times, and VTune misses it when it collects a sample every 1 ms. General Exploration may also show no issue for the bottleneck function in its front-end, back-end, bad-speculation or retiring categories.
Any help is greatly appreciated.
Thanks in advance
Hopefully I understand your requirement correctly: the function you are interested in has few loops or iterations, and its execution time is less than 1 ms. My points are:
1. Such a function is not the first priority for optimization, unless
2. That function is called frequently by others. So you need a caller that calls this function many times (>=100, for example). VTune(TM) Amplifier XE is not guaranteed to capture a sample every time the function is called; samples might land in 70 or 80 of those calls, or some other number.
3. In order to capture more samples that fall inside the function, you need to reduce the sample interval (that is, sample more often). For example, if your CPU frequency is 2 GHz, VTune will use SAV = 2,000,000,000/1000 = 2,000,000 (that means 1000 samples per second, or a 1 ms sample interval). You may use SAV = 200,000 so that the sample interval is 0.1 ms. For example:
>amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.THREAD:sa=200000,INST_RETIRED.ANY:sa=200000 -- application parameters
Of course, overhead will increase because the sampling interrupt handler runs more often.
Note: the sample interval range for advanced-hotspots is [0.01-1000] ms (default is 1 ms); the sample interval range for basic hotspots is [1-1000] ms (default is 10 ms), so don't use basic hotspots in your case.
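To make the SAV arithmetic in point 3 concrete, here is a small Python sketch. The helper name sample_after_value is made up for illustration and is not part of VTune; it only reproduces the division above and prints the event-config string in the same form as the command line.

```python
def sample_after_value(cpu_hz: int, interval_us: int) -> int:
    """Clock ticks between two samples for a given interval in microseconds.

    Hypothetical helper: it only mirrors the arithmetic from the post,
    SAV = CPU frequency * sample interval.
    """
    return cpu_hz * interval_us // 1_000_000


cpu_hz = 2_000_000_000  # 2 GHz, the frequency assumed in the example above

for interval_us in (1000, 100):  # 1 ms and 0.1 ms
    sav = sample_after_value(cpu_hz, interval_us)
    events = f"CPU_CLK_UNHALTED.THREAD:sa={sav},INST_RETIRED.ANY:sa={sav}"
    print(f"{interval_us / 1000} ms -> event-config={events}")
```

For a CPU with a different frequency, change cpu_hz and the sa= values scale accordingly.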
As a general rule, if the program runs long enough then the short loops should be captured if they actually contribute significantly to the overall run-time.
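A rough way to see why: the number of samples that land in a function depends on its share of total CPU time, not on how long one call takes. The Python sketch below is a toy model, not anything VTune does internally. It assumes a 20 us function called once in every 200 us window at a random offset (10% of CPU time), sampled every 1 ms for 10 seconds; all names and numbers are made up for illustration.

```python
import random


def expected_samples(cpu_fraction: float, run_seconds: float, interval_ms: float) -> float:
    """Expected number of samples inside a function that uses cpu_fraction of the run."""
    return cpu_fraction * run_seconds * 1000.0 / interval_ms


def simulated_hits(run_seconds: int, interval_us: int, call_us: int,
                   period_us: int, seed: int = 0) -> int:
    """Count periodic samples that land inside the short function.

    Toy model: one call of length call_us per window of period_us, starting
    at a random offset (wrapping inside the window for simplicity).
    """
    rng = random.Random(seed)
    starts = {}  # window index -> random start offset of the call
    hits = 0
    for t in range(0, run_seconds * 1_000_000, interval_us):
        window = t // period_us
        if window not in starts:
            starts[window] = rng.randrange(period_us)
        # The sample is a hit if it falls within call_us after the call start.
        if (t - starts[window]) % period_us < call_us:
            hits += 1
    return hits


print("expected :", expected_samples(0.10, 10, 1.0))
print("simulated:", simulated_hits(10, 1000, 20, 200))
```

Even though each call is 50 times shorter than the sample interval, the simulated count should come out close to the expected one, because roughly one sample in ten falls inside the function. A function that takes only a tiny share of the run will get few samples either way.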
There are a couple of things to watch out for: