>>PMU event-based sampling

Roman_Oderov · ‎07-04-2012

Hi, everyone!

Im a third-year student of Saint-Petersburg State University, the faculty of Mathematics and Mechanics, Software Engineering department. Not very long ago I decided to make a little research in application profiling area: to analyze, how sampling works. So I decided to publish my results here and I will be glad to know expert opinion.
*must say that I used HotSpots or LightWeightHS analyses

1. Empty Function Test

The first test program. Implements only one almost empty function.

[cpp]int main(int argc, char **argv)

{

int a = 0;

for (int i = 0; i < 10; i++) a += i;

return a; }

[/cpp]

There are few iterations so VTune said No data to show. The cause is that minimum sampling interval is 1 ms and the program work less than 1 ms. Thats why profiler VTune suppose that the test didnt work.

2. Simple Asm Test

To check if there is any additional information in analysis results.

Code:

[bash]

mov rcx, 10000000000

l1:

mov rax,rbp

loop l1

[/bash]
Result:

So, all is OK. Nothing unnecessary
3. Function Stack Test
Implements functionchain of function calls.
Goal: does VTune can show us all nested functions and does sampling have any mistakes?

Scheme of the program:

main

--> ExternalFunc

------> MiddleFunc

---------> InternalFunc

The nested functions are called. Bodies of all functions are identical so we can expect that runtime of all functions will be identical too.

When we compiled the program with compilers optimization, VTune showed us only main and Middle functions. But when optimization was turned off weve got expected result.

So VTune has done its work well

4. Create Process Test

Goal: how Intel VTune Amplifier XE 2011 and sampling profiling method can work with multiprocess application. The program creates N new processes Each of them execute code from the first test (EmptyFunction, i.e. do almost nothing).

[java]

NewProcessCreateAndExit(){

//...

CreateProcess(EmptyFunction.exe,...);

//...

}

//...

void Main()

{

for (i = 0; i < N; i++)

NewProcessCreateAndExit();

}

[/java]

Results

All is well, processes are created in consecutive order but overhead is monstrous: when N equals 500 the program worked about 15 seconds. Intel VTune Amplifier XE Hotspots Analysis runtime about 1000 seconds. Moreover, data collecting lasted about 30 minutes.

5. Counter Losses Test

Goal: estimate how many function calls VTune (sampling) loses.

Idea: call (in different places of a program P1 our test-program) function IncCounter(), measure function runtime with/without profiler, estimate overhead and count of calls lost. Function IncCounter() increments a general counter and this counter will inform us how many calls of IncCounter() we had.

So we have an opportunity to measure the number of function calls and the percentage of losses. We have to calculate how long IncCounter worked without profiling. So I wrote another program P2 to calculate this. To increase accuracy I carried out a great amount of IncCounter calls in programs P1 and P2.

Result:

===================================================================

Analysis | HotSpots (HS) | Lightweight HS |

---------------------------------------+------------------------------------------------+--------------------------------------------

IncCounter in P2 | 575 054 ms | 25 980 ms |

(iterations 75*10⁸) | | |

---------------------------------------+-------------------------------------------------+--------------------------------------------

All program P2 | 628 673 ms | 652 538 ms |

---------------------------------------+-------------------------------------------------+--------------------------------------------

IncCounter in P1 | 24 708 ms | 1 158 ms |

(iterations 3*10⁸) | |

---------------------------------------+-------------------------------------------------+--------------------------------------------

All program P2 | 27 121 ms | 27 893 ms |

====================================================================

Note:

Sampling interval 10 ms

IncCounter() without profiling 21585 ms (iterations 3*10⁸)

Definitions:

It' iterations of IncCounter that were identified by profilier

It general number of iterations, i.e. all iterations executed (get from incremented general counter)

T₁ runtime of P1 without profiling

T₂ runtime of P1 with profiling

T₀ runtime of 3*10⁸ iterations of IncCounter() without profiling

T_x runtime of 3*10⁸ iterations of IncCounter() with profiling

K = T2/T1 slowdown coefficient of the program

T₂.IncCounter IncCounter() runtime in P1 with profiling.

I suggest two hypotheses (formulae to calculate percentage of losses):

It'/It = T₂.IncCounter/T_x

It'/It = T₂.IncCounter/(T₀ * K)

Result:

T₂.IncCounter/T_xT₂.IncCounter/(T₀ * K) K

Lightweight HS 1.1143 0.048084 1.11572

HS 1.0742 1.055163 1.08484

6. Saw Test

Idea: construct a saw from the call stack. Sampling is an approximate method and theoretically we could fit a program whose stack looks like this (here orange circles are samples of VTune; 0-level is main function):
Horizontal - time
Vertical - stack depth

Program scheme:

[java]

int main(){

for(experiment = 10000; experiment >= 0; experiment--){

//...

GoRandDeep(int rand);//function calls itself rand times indepth. Rand generates randomly

//...

f1();//executed every 75 iteration

//...

f1();//executed every 125 iteration

}

// here f1 calls f2, f2 calls f3,, f9 calls f10

// each function f1..f9 doesnt do anything except next function call. f10 performs some actions and returns.[/java]

Result:

i.e. VTune didnt find the last function on the top of the stack only main function.

Sometimes there was GoRandDeep in final analyses but it happened rarely. Sometimes sample got in the middle of the call stack and then there were f (i < 10) in results but it was also rarely.

Its obvious that this test is scalable depends on sampling interval and the result can be the same. So we could fit a program containing a function that executes about 25% (or more) of runtime but this function wont be found by profiler with sampling method. Its very serious shortcoming.

Please, leave Your comments and notes. And I'll be glad to seeany critics and answer Your questions.

Peter_W_Intel · ‎07-06-2012

Here are my comments:

1. No result in Empty Function Test

If the function runs shortly (<10ms), HS collector can't capture sample(s) in this function.

The user can copy from (predefined) HS collector, to create a new analysis type and modify "interval"

2. Simple Asm Test

HS collect will capture samples in these code area (dead-loop), the user can specify "-duration" of collector or use "amplxe-cl -command stop -r result-dir"

3. Function Stack Test - miss function(s)

The reason is that some functions are compiled as "inlined" functions, when use compiler's optimization switches, see Top-down report; Yes, if the user disabled optimization switches, there is no such problem; However the user can readthis articleto know about displaying inlined functions when use optimization switches.

4. Create Process Test

The user-mode sampling doesn't support multi-process application. Please use PMU event-based sampling which can monitor all processed in system wide.

5. Counter Losses Test

Seemed that IncCounter of HS result consumed bigger CPU time than LW-HS result. I suspect that You only showed user-function in report, not for user-function/system-function. Thus, all CPU time consumed in system function was counted in IncCounter.

6. Probaly the reason is same as 1

Roman_Oderov · ‎07-06-2012

Thanks, Peter!

I know some of the facts you posted. May be you're right about #5.

About #1 and #6: but the disadvantage of VTune is 1 ms sampling interval and it's too large interval from the point of view of a processor.If a function is a hotspot in our system (i.e. it will run about 20% of runtime) but one iteration will take less than 1 ms, then the profiler could hardly observe it. Am I right?

Peter_W_Intel · ‎07-06-2012

Use LW HS instead of HS, you can set SAV value flexible- for example 2GHz CPU frequency, you can set SAV=2,000. That means 1s generates 2,000,000,000/2,000 = 1,000,000 samples, 1 sample is generated by1us

Roman_Oderov · ‎07-06-2012

OK, I'll try it.

Kai_M_ · ‎05-07-2013

Hi,

I recently compare VTune and -finstrumental-functions of GCC. I found VTune is missing functions.

I am using Ubuntu 11.10 32bit with Core i5-2500 @3.3GHz

Ex.

compile SPEC hmmer benchmark.

When I run VTune

amplxe-cl -c snb-general-exploration benchmark.run

and then

amplxe-cl -report hotspots

I got 31 functions.

However, when I use -finstrument-funciton and hook the _enter ane _exit function, I got

70 functions have been called.

How can I get all the functions in a program with VTune?

Best wishes

Peter_W_Intel · ‎05-07-2013

As I replied you in message - "-finstrumental-function" instruments all functions but VTune(TM) Amplifier XE uses periodical sampling collections with call stack info, some "tiny" functions will not be profiled since PMU event-based sampling doesn't capture samples in these functions (they consumed little CPU time).

Bernard · ‎05-08-2013

>>>ome "tiny" functions will not be profiled since PMU event-based sampling doesn't capture samples in these functions (they consumed little CPU time)>>>

What is the lowest limit as perceived by VTune for sampling of short time running function?

Thanks in advance.

Bernard · ‎05-08-2013

>>>All is well, processes are created in consecutive order but overhead is monstrous: when N equals 500 the program worked about 15 seconds.>>>

It is expected because process creation task is quite long multi-staged and complicated.Large portion of the time is spent in kernel mode and performing I/O operation on behalf of newly created process and its first runnable thread.

Peter_W_Intel · ‎05-08-2013

iliyapolak wrote:

>>>ome "tiny" functions will not be profiled since PMU event-based sampling doesn't capture samples in these functions (they consumed little CPU time)>>>

What is the lowest limit as perceived by VTune for sampling of short time running function?

Thanks in advance.

1000 samples per second, default SAV value. It means if your function ran less than 1ms, it will not be in hot function list.

Bernard · ‎05-08-2013

I thought that the value is ~3ms.

Anyway thanks for the answer.

SergeyKostrov · ‎05-08-2013

>>...1000 samples per second, default SAV value. It means if your function ran less than 1ms, it will not be in hot function list. Peter, I wonder if it could be changed to 10,000 samples per second, for example, or even higher? 1000 samples per second looks like a strange limitation because I could measure a performance of some for-loop with accuracy of several tens of nano-seconds ( a couple of tens of clocks ) using RDTSC instruction without VTune. If I use VTune ( by the way, an expensive software product ) it can not do precise measurements for me.

Bernard · ‎05-08-2013

Resolution of system timer is around ~3ms you can check it with clockres tool.In order to use RDTSC VTune must be able to inject code into measured process address space.Other option is to use sampling by clock level IRQL , but it is very costly operation and it is used with thread scheduling.

Peter_W_Intel · ‎05-08-2013

PMU event-based sampling collection uses hardware to interrupt to gather performance data, via vtune driver. (System mode)

User-mode sampling collection uses system timer to interrupt to gather performance data. (User mode)

SergeyKostrov · ‎05-08-2013

>>PMU event-based sampling collection uses hardware to interrupt to gather performance data, via vtune driver. (System mode) >> >>User-mode sampling collection uses system timer to interrupt to gather performance data. (User mode) What recommendations would you give for measuring very fast ( let's say less then 0.1ms ) blocks of codes with VTune?

Bernard · ‎05-08-2013

Peter,

In case of PMU based sampling is it done by rise edge of clock interrupt on Windows?

Peter_W_Intel · ‎05-08-2013

Let me show the example to use 0.1ms as sample interval.

The system is 2GHz CPU frequency and SAV of event CPU_CLK_UNHALTED.THREAD = 2,000,000 set by VTune as default.

Copy/create your analysis type and change SAV of CPU_CLK_UNHALTED.THREAD = 200,000 - which uses sample interval 0.1ms

>> In case of PMU based sampling is it done by rise edge of clock interrupt on Windows?

Will set SAV value in event counter register first, decrease when clock elapsing, interrupt will occur (capture a sample) when event counter register has to decrease but the value is zero.

SergeyKostrov · ‎05-08-2013

>>Let me show the example to use 0.1ms as sample interval. >> >>The system is 2GHz CPU frequency and SAV of event CPU_CLK_UNHALTED.THREAD = 2,000,000 set by VTune as default. >> >>Copy/create your analysis type and change SAV of CPU_CLK_UNHALTED.THREAD = 200,000 - which uses sample interval 0.1ms Thank you, Peter. Is there any lower limit for CPU_CLK_UNHALTED.THREAD event?

Peter_W_Intel · ‎05-08-2013

>> Is there any lower limit for CPU_CLK_UNHALTED.THREAD event?

No specific limit value, but SAV of CPU_CLK_UNHALTED.THREAD should be greater than 10,000, at least. Notice: if you work on "fast" sampling, overhead will be increased obviously.

SergeyKostrov · ‎05-08-2013

>>>> Is there any lower limit for CPU_CLK_UNHALTED.THREAD event? >> >>No specific limit value, but SAV of CPU_CLK_UNHALTED.THREAD should be greater than 10,000, at least. Thank you, Peter. This is exactly what I wanted to understand.

Bernard · ‎05-08-2013

sorry Peter but what SAV stands for?

Sampling and VTune's disadvantages