VTune uses kernel mode

Xia_Z_ · ‎04-27-2013

Hello, everyone

recently, I am using Vtune to test my BSDE code in hotspot mode. I have found some insteresting things.

int a,a1,a2,a3;

float trans[4];

_mm_store_ps(trans,a_sse);

below are four lines of code

a= (int)*(trans);
a1= (int)*(trans+1);
a2= (int)(trans[2]);
a3 = (int )trans[3];

compile using gcc with -O0 Optimize optimization, the time each line costs increase as below

a= (int)*(trans); 9.319s
a1= (int)*(trans+1); 1.970s
a2= (int)(trans[2]); 1.020s
a3 = (int )trans[3]; 2.130s

inorder to find the hotspot, I open the asm code, take line1's asm and line2's asm as an example

0x4055d2 1 movq -0xb8(%rbp), %rax 1.361s
0x4055d9 1 movssl (%rax), %xmm0
0x4055dd 1 cvttss2si %xmm0, %eax 4.238s
0x4055e1 1 movl %eax, -0xcc(%rbp) 3.720s

0x4055e7 2 movq -0xb8(%rbp), %rax 1.000s
0x4055ee 2 add $0x4, %rax
0x4055f2 2 movssl (%rax), %xmm0
0x4055f6 2 cvttss2si %xmm0, %eax 0.100s
0x4055fa 2 movl %eax, -0xd0(%rbp) 0.870s

if the cost time of first asm line "0x4055d2 1 movq -0xb8(%rbp), %rax 1.361s" is much larger than "0x4055e7 2 movq -0xb8(%rbp), %rax 1.000s", maybe I can understand it, the first line cause the cache miss which can benefit the second one. But, as you can see, the mainly different is "cvttss2si" and "movl". I don't know what caused the big difference ?

If I change the order of "a=*" and "a3=*" in the C code, "a3=*" cost more time than "a=*".

To make things more interesting, I compile the code with -O3 optimization, here is the time each line cost

a= (int)*(trans); 1.080s
a1= (int)*(trans+1); 1.191s
a2= (int)(trans[2]); 0.520s
a3 = (int )trans[3]; 5.900s

The last line cost the most time now. I compared asm code of the last two line below.

Address Line Assembly CPU Time
0x4036b0 3 movssl 0x8(%rcx), %xmm4 0.200s
0x4036b5 4 movssl 0xc(%rcx), %xmm5 2.900s
0x4036ba 3 cvttss2si %xmm4, %edi 0.320s
0x4036be 4 cvttss2si %xmm5, %r10d 3.000s

With -O3 optimization, gcc put the asm code of line3,4 in the front of line1,2. Why did gcc believe this can save time? I know we put two "movssl" together to speed up the pipeline, But why does the the second "movssl" and "cvttss2si" cost more time than the first? How does Vtune record the asm code's cpu time? is it correct?

Thankyou for your help!!

Bernard · ‎04-29-2013

VTune uses kernel mode driver to read MSR registers and maybe HEPT timer also.Regarding measuring machine code execution time maybe some kind of instrumentation code(like rdtsc instructions) is injected in profiled application address space.

David_A_Intel1 · ‎04-30-2013

Yeah, so the issue is that VTune Amplifier XE doesn't actually "record the asm code's cpu time." What it does is sample where the code is executing and based on the number of samples estimates the % of time spent on that line or instruction. With Hotspots, a timer is used, with Event-based sampling, the processor is programmed to interrupt execution and record the location of execution.

You might also review the documention for "event skid", which can impact how you read the results of EBS data wrt asm code.

Bernard · ‎05-01-2013

Hi MrAnderson,

you explained it in your post that VTune actually samples the percentage of time spent on executing code.I suppose that for hotspots analysis it could be possible to inject machine code instructions like rdtsc in thread's address space with CreateRemoteThread and WriteProcessMemory functions , but I have never tested it.

Xia_Z_ · ‎05-01-2013

I don't think Vtune use rdtsc to complete this function, I think it uses some registers which could be read through MSR in linux. You can write evnet such as CPU_CLK_UNHALTED.REF, and the register would count the unhalted instructions used the reference clock cycles

Bernard · ‎05-01-2013

Xia Z. wrote:

I don't think Vtune use rdtsc to complete this function, I think it uses some registers which could be read through MSR in linux. You can write evnet such as CPU_CLK_UNHALTED.REF, and the register would count the unhalted instructions used the reference clock cycles

Yes I know this.I simply thought about the some kind of code injection into profiled process,but it seems that there is to much programming overhead

The better option is to use CPU_CLK_UNHALTED.REF divided by reference clock cycles to track time spent in various portions of code.

Xia_Z_ · ‎05-05-2013

I can understant vtune use registers to record clcok cycles between some lines of code, record the old_value at the begin of the code segment, record the new_vaule at the end of the code segment, we can use (new_value - old_vaule) to get the result, but how can vtune record cycles per asm line cost? I am confused by it.

Bernard · ‎05-06-2013

>>>but how can vtune record cycles per asm line cost? I am confused by it.>>>

By using CPU_CLK_UNHALTED.REF.

how can Vtune record the asm code's cpu time