I am using VTune Amplifier to profile a code and noticed something rather unusual (perhaps only to me, not to the experts here). There are two back-to-back statements in the code:
Source Line Source Effective Time by Utilization Spin Time Overhead Time Instructions Retired
590 gi=gi*dd7/ri 133.138s 0s 0s 1,471,982,200,000
591 gj=gj*dd8/rj 1320.961s 0s 0s 6,402,068,400,000
The two lines of code have very different execution times, almost a factor of 10, as reported by VTune Amplifier. The number of instructions retired for the two lines is also quite different: 1,471,982,200,000 for one and 6,402,068,400,000 for the other.
Could somebody explain what causes these differences and what optimizations are possible?
Thank you!
Zhiyong
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Are these statements involving arrays?
Jim Dempsey
Hi Jim,
These are not arrays.
Zhiyong
When you rely on the non-"precise" event counters, you can't directly attribute the event counts to a single instruction or source line. In this case, it appears that the second group of operations spends much of its reported time waiting for the first to complete. This is particularly likely on CPU models with high division latencies, including the time during which the FPU pipeline is blocked.
Little can be said about possible optimizations without looking at the context: the possibility of vectorization, replacement of division by multiplication, and so on.
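As an illustration of the vectorization option mentioned above, here is a minimal, self-contained sketch (not the poster's code; all values are placeholders): because the gi and gj updates are independent, packing them into two-element arrays lets the compiler generate a packed divide (divpd) that performs both divisions at once instead of two serialized scalar divides.

program pack_demo
  implicit none
  real(8) :: gi, gj, dd7, dd8, ri, rj
  real(8) :: g(2), d(2), r(2)
  ! placeholder values standing in for the real data
  gi = 1.0d0; gj = 2.0d0
  dd7 = 3.0d0; dd8 = 4.0d0
  ri = 5.0d0; rj = 6.0d0
  ! pack the two independent updates into arrays
  g = (/ gi, gj /)
  d = (/ dd7, dd8 /)
  r = (/ ri, rj /)
  ! one array statement: eligible for SIMD, i.e. one packed multiply and one packed divide
  g = g*d/r
  gi = g(1)
  gj = g(2)
  print *, gi, gj
end program pack_demo

Whether the compiler actually emits packed instructions here depends on the target and options, so this is only a sketch of the idea.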
Tim P. has a point about the non-precise (with respect to instruction location) event counters. You can confirm this to some extent yourself by making a test run after swapping those two statements (in source, both appear to have the same computational complexity). The efficiency of the code may differ, though, in that one or more of the variables in the faster statement may be registerized while those in the slower statement are not. A secondary (but potentially primary) cause can be that the code preceding the two statements in question tends to pre-load the L1 and/or L2 cache. Some of the VTune counters should be able to provide information on cache misses and/or memory stalls.
Jim Dempsey
As Tim hinted, there may be some kind of interdependency, perhaps at the uop level, which slows down the computation. Looking at your short code sample, I cannot see any dependency between those two statements: variable "gi" is not used to compute the value of variable "gj", at least not in the code snippet.
I have copied the source lines and the corresponding assembly code for a "whole block" below. The few simple assignment statements are crossed through because no instructions were generated for them. The two source lines 590 and 591 and their assembly instructions are highlighted in italics.
The assembly instructions for the two source lines are back to back and consist of a move, a multiply, and a divide each. There are no dependencies between the two, and their operand patterns are identical in register use and memory access. The memory addresses are -0x228 and -0x230, and the memory addresses accessed in the whole block range from -0x1d0 to -0x238. Please note that the address -0x238 is associated with source line 583, which also does a division just like the two lines at issue here, yet its reported time is only 45 s compared with 133 s and 1320 s for the other two lines.
With this information, can we conclude that the differences in time for these three lines, 583, 590, and 591, are due to cache and memory misses?
Are there other, more explicit ways to see whether cache/memory misses are indeed occurring?
Why are the instructions-retired counts for these three lines so different, by as much as an order of magnitude?
It seems that division and multiplication take about the same time in this particular block of code?
If one just wants to optimize the performance of this particular block of code, is that possible, and what would be the best way to go about it? Do I need to rearrange the storage of the variables so that they all fall within the same memory access? Does reordering some of the operations help at all? Does the execution interleave the memory accesses and the computation?
Thanks!
Zhiyong
Source Line Source Effective Time by Utilization Spin Time Overhead Time Instructions Retired
571 if(ideriv.gt.0) then 9.355s 0s 0s 102,596,000,000
572 gp=pc
573 gu=pu
574 guu=puu
575 gi=ppi
576 gii=pii
577 gj=pj
578 gjj=pjj
579 gui=pui
580 guj=puj
581
582 guu=guu*dd1*dd1+gu*dd2 67.918s 0s 0s 826,953,400,000
583 gu=gu*dd1/rij 45.721s 0s 0s 536,416,400,000
584
585 gui=gui*dd1*dd7 32.883s 0s 0s 375,057,800,000
586 guj=guj*dd1*dd8 0.228s 0s 0s 1,957,800,000
587
588 gii=gii*dd7*dd7+gi*dd9 93.586s 0s 0s 1,034,625,800,000
589 gjj=gjj*dd8*dd8+gj*dd10 586.273s 0s 0s 6,776,933,800,000
590 gi=gi*dd7/ri 133.138s 0s 0s 1,471,982,200,000
591 gj=gj*dd8/rj 1320.961s 0s 0s 6,402,068,400,000
592
593 !!!! een for periodic systems WAS
594 if(icutjasc .gt. 0 .or. iperiodic .ne. 0) then 626.434s 0s 0s 963,357,200,000
Assembly code:
Address Source Line Assembly Effective Time by Utilization Spin Time Overhead Time Instructions Retired
0x8b36bf 571 jle 0x8b3da6 <Block 269>
0x8b36c5 Block 255:
0x8b36c5 582 movsdq -0x210(%rbp), %xmm2 5.979s 0s 0s 70,899,400,000
0x8b36cd 583 movaps %xmm11, %xmm1 14.281s 0s 0s 173,641,000,000
0x8b36d1 588 movsdq -0x208(%rbp), %xmm0 25.805s 0s 0s 297,830,000,000
0x8b36d9 582 mulsdq -0x1e8(%rbp), %xmm9 25.008s 0s 0s 315,564,600,000
0x8b36e2 588 mulsdq -0x1e0(%rbp), %xmm7 18.367s 0s 0s 165,534,200,000
0x8b36ea 582 mulsd %xmm11, %xmm2 17.322s 0s 0s 222,445,600,000
0x8b36ef 588 mulsd %xmm5, %xmm0 24.540s 0s 0s 281,242,000,000
0x8b36f3 583 mulsdq -0x1d0(%rbp), %xmm1 25.527s 0s 0s 291,636,800,000
0x8b36fb 582 addsd %xmm9, %xmm2 12.469s 0s 0s 137,586,800,000
0x8b3700 589 mulsdq -0x200(%rbp), %xmm10 25.294s 0s 0s 259,220,000,000
0x8b3709 588 addsd %xmm7, %xmm0 20.213s 0s 0s 237,200,600,000
0x8b370d 585 mulsdq -0x1f0(%rbp), %xmm12 26.725s 0s 0s 305,502,600,000
0x8b3716 583 divsdq -0x238(%rbp), %xmm1 5.913s 0s 0s 71,138,600,000
0x8b371e 589 movsdq -0x218(%rbp), %xmm9 554.227s 0s 0s 6,440,345,600,000
0x8b3727 590 movsdq -0x1c8(%rbp), %xmm7 1.367s 0s 0s 15,813,200,000
0x8b372f 591 movsdq -0x1d8(%rbp), %xmm3 3.675s 0s 0s 47,980,400,000
0x8b3737 589 mulsd %xmm8, %xmm9 0.006s 0s 0s 15,600,000
0x8b373c 590 mulsd %xmm5, %xmm7 65.909s 0s 0s 735,233,200,000
0x8b3740 591 mulsd %xmm8, %xmm3 1.284s 0s 0s 15,758,600,000
0x8b3745 589 addsd %xmm10, %xmm9 3.635s 0s 0s 44,421,000,000
0x8b374a 586 mulsdq -0x1f8(%rbp), %xmm6 0.005s 0s 0s 36,400,000
0x8b3752 590 divsdq -0x228(%rbp), %xmm7 65.862s 0s 0s 720,935,800,000
0x8b375a 591 divsdq -0x230(%rbp), %xmm3 1316.002s 0s 0s 6,338,329,400,000
0x8b3762 594 cmpl $0x0, -0x220(%rbp) 591.793s 0s 0s 907,816,000,000
0x8b3769 594 jle 0x8b6ca3 <Block 376> 0.028s 0s 0s 5,200,000
The two movsdq's load from the same memory-addressable cache line, with the second instruction getting charged more. This illustrates what Tim P was talking about: the overhead time appears to be billed to a different or following instruction. Your interpretation of VTune's counters has to take this into consideration.
The two mulsd's, with the first taking more time, likely reflect that the first instruction, dependent on xmm7, was charged the memory fetch time into the same cache line from which xmm3 will get its data. The smaller time for the second mulsd reflects that no memory (or L3 or L2) stall occurred in getting the data into xmm3.
The divsdq's (my interpretation) reflect a similar memory latency on the fetch into the same cache line (holding -0x228 and -0x230 off rbp), with both first instructions (of the mulsd pair and the divsdq pair) at around 65.9s, but the SSE FPU can only perform one division at a time, so the second instruction had to wait.
I project that by swapping these two statements around, you will observe the second statement taking longer.
Jim Dempsey
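For concreteness, a minimal sketch of the control experiment Jim proposes (placeholder values, not the original code): run once with the original statement order and once with the order swapped. If the inflated time follows whichever divide comes second, the cost is divider serialization plus event skid rather than anything specific to the gj statement.

program swap_test
  implicit none
  real(8) :: gi, gj, dd7, dd8, ri, rj
  ! placeholder values standing in for the real data
  gi = 1.0d0; gj = 2.0d0
  dd7 = 3.0d0; dd8 = 4.0d0
  ri = 5.0d0; rj = 6.0d0
  ! swapped order (the original code updates gi first, then gj)
  gj = gj*dd8/rj
  gi = gi*dd7/ri
  print *, gi, gj
end program swap_test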
When I read Jim's answer I realized that there is a dependency on the FPU divider latency, which is around ~15-20 cycles. I suppose the divider is not fully pipelined, so the next division uop can only be scheduled for execution after ~15-20 cycles, which is probably why you are seeing slower performance for the second line of code. Looking at the assembly snippet, I suppose that ri and rj are constants (I can be wrong here), so why not try multiplying by their inverses?
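A minimal sketch of that suggestion, assuming ri and rj are invariant across an enclosing loop (which the snippet does not confirm): compute their reciprocals once outside the loop and multiply inside it, so each iteration pays for two multiplies instead of two divides. Names, loop bounds, and values are placeholders, not the original code, and replacing a divide with a multiply by a reciprocal can change the result in the last bits, so accuracy should be checked.

program recip_hoist
  implicit none
  integer :: k
  real(8) :: gi, gj, dd7, dd8, ri, rj, ri_inv, rj_inv
  ! placeholder values standing in for the real data
  gi = 1.0d0; gj = 1.0d0
  dd7 = 0.9d0; dd8 = 1.1d0
  ri = 5.0d0; rj = 6.0d0
  ! one divide each, hoisted out of the loop
  ri_inv = 1.0d0/ri
  rj_inv = 1.0d0/rj
  do k = 1, 1000
    gi = gi*dd7*ri_inv   ! multiply replaces the per-iteration divide
    gj = gj*dd8*rj_inv
  end do
  print *, gi, gj
end program recip_hoist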