Sampling and VTune's disadvantages - Page 2

Roman_Oderov · ‎07-04-2012

Hi, everyone!

Im a third-year student of Saint-Petersburg State University, the faculty of Mathematics and Mechanics, Software Engineering department. Not very long ago I decided to make a little research in application profiling area: to analyze, how sampling works. So I decided to publish my results here and I will be glad to know expert opinion.
*must say that I used HotSpots or LightWeightHS analyses

1. Empty Function Test

The first test program. Implements only one almost empty function.

[cpp]int main(int argc, char **argv)

{

int a = 0;

for (int i = 0; i < 10; i++) a += i;

return a; }

[/cpp]

There are few iterations so VTune said No data to show. The cause is that minimum sampling interval is 1 ms and the program work less than 1 ms. Thats why profiler VTune suppose that the test didnt work.

2. Simple Asm Test

To check if there is any additional information in analysis results.

Code:

[bash]

mov rcx, 10000000000

l1:

mov rax,rbp

loop l1

[/bash]
Result:

So, all is OK. Nothing unnecessary
3. Function Stack Test
Implements functionchain of function calls.
Goal: does VTune can show us all nested functions and does sampling have any mistakes?

Scheme of the program:

main

--> ExternalFunc

------> MiddleFunc

---------> InternalFunc

The nested functions are called. Bodies of all functions are identical so we can expect that runtime of all functions will be identical too.

When we compiled the program with compilers optimization, VTune showed us only main and Middle functions. But when optimization was turned off weve got expected result.

So VTune has done its work well

4. Create Process Test

Goal: how Intel VTune Amplifier XE 2011 and sampling profiling method can work with multiprocess application. The program creates N new processes Each of them execute code from the first test (EmptyFunction, i.e. do almost nothing).

[java]

NewProcessCreateAndExit(){

//...

CreateProcess(EmptyFunction.exe,...);

//...

}

//...

void Main()

{

for (i = 0; i < N; i++)

NewProcessCreateAndExit();

}

[/java]

Results

All is well, processes are created in consecutive order but overhead is monstrous: when N equals 500 the program worked about 15 seconds. Intel VTune Amplifier XE Hotspots Analysis runtime about 1000 seconds. Moreover, data collecting lasted about 30 minutes.

5. Counter Losses Test

Goal: estimate how many function calls VTune (sampling) loses.

Idea: call (in different places of a program P1 our test-program) function IncCounter(), measure function runtime with/without profiler, estimate overhead and count of calls lost. Function IncCounter() increments a general counter and this counter will inform us how many calls of IncCounter() we had.

So we have an opportunity to measure the number of function calls and the percentage of losses. We have to calculate how long IncCounter worked without profiling. So I wrote another program P2 to calculate this. To increase accuracy I carried out a great amount of IncCounter calls in programs P1 and P2.

Result:

===================================================================

Analysis | HotSpots (HS) | Lightweight HS |

---------------------------------------+------------------------------------------------+--------------------------------------------

IncCounter in P2 | 575 054 ms | 25 980 ms |

(iterations 75*10⁸) | | |

---------------------------------------+-------------------------------------------------+--------------------------------------------

All program P2 | 628 673 ms | 652 538 ms |

---------------------------------------+-------------------------------------------------+--------------------------------------------

IncCounter in P1 | 24 708 ms | 1 158 ms |

(iterations 3*10⁸) | |

---------------------------------------+-------------------------------------------------+--------------------------------------------

All program P2 | 27 121 ms | 27 893 ms |

====================================================================

Note:

Sampling interval 10 ms

IncCounter() without profiling 21585 ms (iterations 3*10⁸)

Definitions:

It' iterations of IncCounter that were identified by profilier

It general number of iterations, i.e. all iterations executed (get from incremented general counter)

T₁ runtime of P1 without profiling

T₂ runtime of P1 with profiling

T₀ runtime of 3*10⁸ iterations of IncCounter() without profiling

T_x runtime of 3*10⁸ iterations of IncCounter() with profiling

K = T2/T1 slowdown coefficient of the program

T₂.IncCounter IncCounter() runtime in P1 with profiling.

I suggest two hypotheses (formulae to calculate percentage of losses):

It'/It = T₂.IncCounter/T_x

It'/It = T₂.IncCounter/(T₀ * K)

Result:

T₂.IncCounter/T_xT₂.IncCounter/(T₀ * K) K

Lightweight HS 1.1143 0.048084 1.11572

HS 1.0742 1.055163 1.08484

6. Saw Test

Idea: construct a saw from the call stack. Sampling is an approximate method and theoretically we could fit a program whose stack looks like this (here orange circles are samples of VTune; 0-level is main function):
Horizontal - time
Vertical - stack depth

Program scheme:

[java]

int main(){

for(experiment = 10000; experiment >= 0; experiment--){

//...

GoRandDeep(int rand);//function calls itself rand times indepth. Rand generates randomly

//...

f1();//executed every 75 iteration

//...

f1();//executed every 125 iteration

}

// here f1 calls f2, f2 calls f3,, f9 calls f10

// each function f1..f9 doesnt do anything except next function call. f10 performs some actions and returns.[/java]

Result:

i.e. VTune didnt find the last function on the top of the stack only main function.

Sometimes there was GoRandDeep in final analyses but it happened rarely. Sometimes sample got in the middle of the call stack and then there were f (i < 10) in results but it was also rarely.

Its obvious that this test is scalable depends on sampling interval and the result can be the same. So we could fit a program containing a function that executes about 25% (or more) of runtime but this function wont be found by profiler with sampling method. Its very serious shortcoming.

Please, leave Your comments and notes. And I'll be glad to seeany critics and answer Your questions.

Bernard · ‎05-08-2013

What is the frequency of clock interrupt?

Peter_W_Intel · ‎05-09-2013

Default SAV value = CPU main frequency / 1,000

Bernard · ‎05-09-2013

Thanks Peter.

SergeyKostrov · ‎05-09-2013

>>...Default SAV value = CPU main frequency / 1,000 Does it means that it can not measure time intervals smaller than 1000 CPU clocks? Thanks in advance.

Bernard · ‎05-09-2013

SAV value looks like crude approximation of the older RTC operating frequency.