- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello, everyone
recently, I am using Vtune to test my BSDE code in hotspot mode. I have found some insteresting things.
int a,a1,a2,a3;
float trans[4];
_mm_store_ps(trans,a_sse);
below are four lines of code
- a= (int)*(trans);
- a1= (int)*(trans+1);
- a2= (int)(trans[2]);
- a3 = (int )trans[3];
compile using gcc with -O0 Optimize optimization, the time each line costs increase as below
- a= (int)*(trans); 9.319s
- a1= (int)*(trans+1); 1.970s
- a2= (int)(trans[2]); 1.020s
- a3 = (int )trans[3]; 2.130s
inorder to find the hotspot, I open the asm code, take line1's asm and line2's asm as an example
0x4055d2 1 movq -0xb8(%rbp), %rax 1.361s
0x4055d9 1 movssl (%rax), %xmm0
0x4055dd 1 cvttss2si %xmm0, %eax 4.238s
0x4055e1 1 movl %eax, -0xcc(%rbp) 3.720s
0x4055e7 2 movq -0xb8(%rbp), %rax 1.000s
0x4055ee 2 add $0x4, %rax
0x4055f2 2 movssl (%rax), %xmm0
0x4055f6 2 cvttss2si %xmm0, %eax 0.100s
0x4055fa 2 movl %eax, -0xd0(%rbp) 0.870s
if the cost time of first asm line "0x4055d2 1 movq -0xb8(%rbp), %rax 1.361s" is much larger than "0x4055e7 2 movq -0xb8(%rbp), %rax 1.000s", maybe I can understand it, the first line cause the cache miss which can benefit the second one. But, as you can see, the mainly different is "cvttss2si" and "movl". I don't know what caused the big difference ?
If I change the order of "a=*" and "a3=*" in the C code, "a3=*" cost more time than "a=*".
To make things more interesting, I compile the code with -O3 optimization, here is the time each line cost
- a= (int)*(trans); 1.080s
- a1= (int)*(trans+1); 1.191s
- a2= (int)(trans[2]); 0.520s
- a3 = (int )trans[3]; 5.900s
The last line cost the most time now. I compared asm code of the last two line below.
Address Line Assembly CPU Time
0x4036b0 3 movssl 0x8(%rcx), %xmm4 0.200s
0x4036b5 4 movssl 0xc(%rcx), %xmm5 2.900s
0x4036ba 3 cvttss2si %xmm4, %edi 0.320s
0x4036be 4 cvttss2si %xmm5, %r10d 3.000s
With -O3 optimization, gcc put the asm code of line3,4 in the front of line1,2. Why did gcc believe this can save time? I know we put two "movssl" together to speed up the pipeline, But why does the the second "movssl" and "cvttss2si" cost more time than the first? How does Vtune record the asm code's cpu time? is it correct?
Thankyou for your help!!
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
VTune uses kernel mode driver to read MSR registers and maybe HEPT timer also.Regarding measuring machine code execution time maybe some kind of instrumentation code(like rdtsc instructions) is injected in profiled application address space.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yeah, so the issue is that VTune Amplifier XE doesn't actually "record the asm code's cpu time." What it does is sample where the code is executing and based on the number of samples estimates the % of time spent on that line or instruction. With Hotspots, a timer is used, with Event-based sampling, the processor is programmed to interrupt execution and record the location of execution.
You might also review the documention for "event skid", which can impact how you read the results of EBS data wrt asm code.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi MrAnderson,
you explained it in your post that VTune actually samples the percentage of time spent on executing code.I suppose that for hotspots analysis it could be possible to inject machine code instructions like rdtsc in thread's address space with CreateRemoteThread and WriteProcessMemory functions , but I have never tested it.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I don't think Vtune use rdtsc to complete this function, I think it uses some registers which could be read through MSR in linux. You can write evnet such as CPU_CLK_UNHALTED.REF, and the register would count the unhalted instructions used the reference clock cycles
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Xia Z. wrote:
I don't think Vtune use rdtsc to complete this function, I think it uses some registers which could be read through MSR in linux. You can write evnet such as CPU_CLK_UNHALTED.REF, and the register would count the unhalted instructions used the reference clock cycles
Yes I know this.I simply thought about the some kind of code injection into profiled process,but it seems that there is to much programming overhead
The better option is to use CPU_CLK_UNHALTED.REF divided by reference clock cycles to track time spent in various portions of code.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I can understant vtune use registers to record clcok cycles between some lines of code, record the old_value at the begin of the code segment, record the new_vaule at the end of the code segment, we can use (new_value - old_vaule) to get the result, but how can vtune record cycles per asm line cost? I am confused by it.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>>but how can vtune record cycles per asm line cost? I am confused by it.>>>
By using CPU_CLK_UNHALTED.REF.
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page