- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi, everyone!
Im a third-year student of Saint-Petersburg State University, the faculty of Mathematics and Mechanics, Software Engineering department. Not very long ago I decided to make a little research in application profiling area: to analyze, how sampling works. So I decided to publish my results here and I will be glad to know expert opinion.
*must say that I used HotSpots or LightWeightHS analyses
1. Empty Function Test
The first test program. Implements only one almost empty function.
[cpp]int main(int argc, char **argv)
{
int a = 0;
for (int i = 0; i < 10; i++) a += i;
return a; }
[/cpp]
There are few iterations so VTune said No data to show. The cause is that minimum sampling interval is 1 ms and the program work less than 1 ms. Thats why profiler VTune suppose that the test didnt work.
2. Simple Asm Test
To check if there is any additional information in analysis results.
Code:
[bash]
mov rcx, 10000000000
l1:
mov rax,rbp
loop l1
[/bash]
Result:
So, all is OK. Nothing unnecessary
3. Function Stack Test
Implements functionchain of function calls.
Goal: does VTune can show us all nested functions and does sampling have any mistakes?
Scheme of the program:
main
--> ExternalFunc
------> MiddleFunc
---------> InternalFunc
The nested functions are called. Bodies of all functions are identical so we can expect that runtime of all functions will be identical too.
When we compiled the program with compilers optimization, VTune showed us only main and Middle functions. But when optimization was turned off weve got expected result.
So VTune has done its work well
4. Create Process Test
Goal: how Intel VTune Amplifier XE 2011 and sampling profiling method can work with multiprocess application. The program creates N new processes Each of them execute code from the first test (EmptyFunction, i.e. do almost nothing).
[java]
NewProcessCreateAndExit(){
//...
CreateProcess(EmptyFunction.exe,...);
//...
}
//...
void Main()
{
for (i = 0; i < N; i++)
NewProcessCreateAndExit();
}
[/java]
Results
All is well, processes are created in consecutive order but overhead is monstrous: when N equals 500 the program worked about 15 seconds. Intel VTune Amplifier XE Hotspots Analysis runtime about 1000 seconds. Moreover, data collecting lasted about 30 minutes.
5. Counter Losses Test
Goal: estimate how many function calls VTune (sampling) loses.
Idea: call (in different places of a program P1 our test-program) function IncCounter(), measure function runtime with/without profiler, estimate overhead and count of calls lost. Function IncCounter() increments a general counter and this counter will inform us how many calls of IncCounter() we had.
So we have an opportunity to measure the number of function calls and the percentage of losses. We have to calculate how long IncCounter worked without profiling. So I wrote another program P2 to calculate this. To increase accuracy I carried out a great amount of IncCounter calls in programs P1 and P2.
Result:
===================================================================
Analysis | HotSpots (HS) | Lightweight HS |
---------------------------------------+------------------------------------------------+--------------------------------------------
IncCounter in P2 | 575 054 ms | 25 980 ms |
(iterations 75*108) | | |
---------------------------------------+-------------------------------------------------+--------------------------------------------
All program P2 | 628 673 ms | 652 538 ms |
---------------------------------------+-------------------------------------------------+--------------------------------------------
IncCounter in P1 | 24 708 ms | 1 158 ms |
(iterations 3*108) | |
---------------------------------------+-------------------------------------------------+--------------------------------------------
All program P2 | 27 121 ms | 27 893 ms |
====================================================================
Note:
Sampling interval 10 ms
IncCounter() without profiling 21585 ms (iterations 3*108)
Definitions:
It' iterations of IncCounter that were identified by profilier
It general number of iterations, i.e. all iterations executed (get from incremented general counter)
T1 runtime of P1 without profiling
T2 runtime of P1 with profiling
T0 runtime of 3*108 iterations of IncCounter() without profiling
Tx runtime of 3*108 iterations of IncCounter() with profiling
K = T2/T1 slowdown coefficient of the program
T2.IncCounter IncCounter() runtime in P1 with profiling.
I suggest two hypotheses (formulae to calculate percentage of losses):
It'/It = T2.IncCounter/Tx
It'/It = T2.IncCounter/(T0 * K)
Result:
T2.IncCounter/Tx T2.IncCounter/(T0 * K) K
Lightweight HS 1.1143 0.048084 1.11572
HS 1.0742 1.055163 1.08484
6. Saw Test
Idea: construct a saw from the call stack. Sampling is an approximate method and theoretically we could fit a program whose stack looks like this (here orange circles are samples of VTune; 0-level is main function):
Horizontal - time
Vertical - stack depth
Program scheme:
[java]
int main(){
for(experiment = 10000; experiment >= 0; experiment--){
//...
GoRandDeep(int rand);//function calls itself rand times indepth. Rand generates randomly
//...
f1();//executed every 75 iteration
//...
f1();//executed every 125 iteration
}
}
// here f1 calls f2, f2 calls f3,, f9 calls f10
// each function f1..f9 doesnt do anything except next function call. f10 performs some actions and returns.[/java]
Result:
i.e. VTune didnt find the last function on the top of the stack only main function.
Sometimes there was GoRandDeep in final analyses but it happened rarely. Sometimes sample got in the middle of the call stack and then there were f (i < 10) in results but it was also rarely.
Its obvious that this test is scalable depends on sampling interval and the result can be the same. So we could fit a program containing a function that executes about 25% (or more) of runtime but this function wont be found by profiler with sampling method. Its very serious shortcoming.
Please, leave Your comments and notes. And I'll be glad to seeany critics and answer Your questions.
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I recently compare VTune and -finstrumental-functions of GCC. I found VTune is missing functions.
I am using Ubuntu 11.10 32bit with Core i5-2500 @3.3GHz
Ex.
compile SPEC hmmer benchmark.
When I run VTune
amplxe-cl -c snb-general-exploration benchmark.run
and then
amplxe-cl -report hotspots
I got 31 functions.
However, when I use -finstrument-funciton and hook the _enter ane _exit function, I got
70 functions have been called.
How can I get all the functions in a program with VTune?
Best wishes
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
As I replied you in message - "-finstrumental-function" instruments all functions but VTune(TM) Amplifier XE uses periodical sampling collections with call stack info, some "tiny" functions will not be profiled since PMU event-based sampling doesn't capture samples in these functions (they consumed little CPU time).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>>ome "tiny" functions will not be profiled since PMU event-based sampling doesn't capture samples in these functions (they consumed little CPU time)>>>
What is the lowest limit as perceived by VTune for sampling of short time running function?
Thanks in advance.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>>All is well, processes are created in consecutive order but overhead is monstrous: when N equals 500 the program worked about 15 seconds.>>>
It is expected because process creation task is quite long multi-staged and complicated.Large portion of the time is spent in kernel mode and performing I/O operation on behalf of newly created process and its first runnable thread.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
iliyapolak wrote:
>>>ome "tiny" functions will not be profiled since PMU event-based sampling doesn't capture samples in these functions (they consumed little CPU time)>>>
What is the lowest limit as perceived by VTune for sampling of short time running function?
Thanks in advance.
1000 samples per second, default SAV value. It means if your function ran less than 1ms, it will not be in hot function list.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I thought that the value is ~3ms.
Anyway thanks for the answer.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Resolution of system timer is around ~3ms you can check it with clockres tool.In order to use RDTSC VTune must be able to inject code into measured process address space.Other option is to use sampling by clock level IRQL , but it is very costly operation and it is used with thread scheduling.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
PMU event-based sampling collection uses hardware to interrupt to gather performance data, via vtune driver. (System mode)
User-mode sampling collection uses system timer to interrupt to gather performance data. (User mode)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Peter,
In case of PMU based sampling is it done by rise edge of clock interrupt on Windows?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Let me show the example to use 0.1ms as sample interval.
The system is 2GHz CPU frequency and SAV of event CPU_CLK_UNHALTED.THREAD = 2,000,000 set by VTune as default.
Copy/create your analysis type and change SAV of CPU_CLK_UNHALTED.THREAD = 200,000 - which uses sample interval 0.1ms
>> In case of PMU based sampling is it done by rise edge of clock interrupt on Windows?
Will set SAV value in event counter register first, decrease when clock elapsing, interrupt will occur (capture a sample) when event counter register has to decrease but the value is zero.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>> Is there any lower limit for CPU_CLK_UNHALTED.THREAD event?
No specific limit value, but SAV of CPU_CLK_UNHALTED.THREAD should be greater than 10,000, at least. Notice: if you work on "fast" sampling, overhead will be increased obviously.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
sorry Peter but what SAV stands for?
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page