I am using VTune Amplifier to profile a code and noticed something rather unusual (perhaps only to me, not to the experts here). There are two back-to-back statements in the code:
Source Line Source Effective Time by Utilization Spin Time Overhead Time Instructions Retired
590 gi=gi*dd7/ri 133.138s 0s 0s 1,471,982,200,000
591 gj=gj*dd8/rj 1320.961s 0s 0s 6,402,068,400,000
The two lines of code have very different execution times, almost a factor of 10, as reported by VTune Amplifier. The number of instructions retired for the two lines is also quite different: 1,471,982,200,000 for one and 6,402,068,400,000 for the other.
Could somebody explain what causes these differences and what optimizations are possible?
Thank you!
Zhiyong
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Are these statements involving arrays?
Jim Dempsey
Hi Jim,
These are not arrays.
Zhiyong
When you rely on the non-"precise" event counters, you can't directly attribute the event counts to a single instruction or source line. In this case, it appears that the second group of operations spends much of its reported time waiting for the first to complete. This is particularly likely on CPU models with high division latencies, including the time during which the FPU pipeline is blocked.
Little can be said about possible optimizations without looking at the context: the possibility of vectorization, replacement of division by multiplication, and so on.
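As an illustration of the vectorization option mentioned above, here is a minimal, self-contained sketch (not the poster's code; all values are placeholders): because the gi and gj updates are independent, packing them into two-element arrays lets the compiler generate a packed divide (divpd) that performs both divisions at once instead of two serialized scalar divides.

program pack_demo
  implicit none
  real(8) :: gi, gj, dd7, dd8, ri, rj
  real(8) :: g(2), d(2), r(2)
  ! placeholder values standing in for the real data
  gi = 1.0d0; gj = 2.0d0
  dd7 = 3.0d0; dd8 = 4.0d0
  ri = 5.0d0; rj = 6.0d0
  ! pack the two independent updates into arrays
  g = (/ gi, gj /)
  d = (/ dd7, dd8 /)
  r = (/ ri, rj /)
  ! one array statement: eligible for SIMD, i.e. one packed multiply and one packed divide
  g = g*d/r
  gi = g(1)
  gj = g(2)
  print *, gi, gj
end program pack_demo

Whether the compiler actually emits packed instructions here depends on the target and options, so this is only a sketch of the idea.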
Tim P. has a point about the non-precise (with respect to instruction location) event counters. You can confirm this to some extent yourself by making a test run after swapping those two statements (in source, both appear to have the same computational complexity). The efficiency of the code may differ, though, in that one or more of the variables in the faster statement may be registerized while those in the slower statement are not. A secondary (but potentially primary) cause can be that the code preceding the two statements in question tends to pre-load the L1 and/or L2 cache. Some of the VTune counters should be able to provide information on cache misses and/or memory stalls.
Jim Dempsey
As Tim hinted, there may be some kind of interdependency, perhaps at the uop level, which slows down the computation. Looking at your short code sample, I cannot see any dependency between those two statements: variable "gi" is not used to compute the value of variable "gj", at least not in the code snippet.
I have copied the source lines and the corresponding assembly code for a "whole block" below. The few simple assignment statements are crossed through because no instructions were generated for them. The two source lines 590 and 591 and their assembly instructions are highlighted in italics.
The assembly instructions for the two source lines are back to back and consist of a move, a multiply, and a divide each. There are no dependencies between the two, and their operand patterns are identical in register use and memory access. The memory addresses are -0x228 and -0x230, and the memory addresses accessed in the whole block range from -0x1d0 to -0x238. Please note that the address -0x238 is associated with source line 583, which also does a division just like the two lines at issue here, yet its reported time is only 45 s compared with 133 s and 1320 s for the other two lines.
With this information, can we conclude that the differences in time for these three lines, 583, 590, and 591, are due to cache and memory misses?
Are there other, more explicit ways to see whether cache/memory misses are indeed occurring?
Why are the instructions-retired counts for these three lines so different, by as much as an order of magnitude?
It seems that division and multiplication take about the same time in this particular block of code?
If one just wants to optimize the performance of this particular block of code, is that possible, and what would be the best way to go about it? Do I need to rearrange the storage of the variables so that they all fall within the same memory access? Does reordering some of the operations help at all? Does the execution interleave the memory accesses and the computation?
Thanks!
Zhiyong
Source Line Source Effective Time by Utilization Spin Time Overhead Time Instructions Retired
571 if(ideriv.gt.0) then 9.355s 0s 0s 102,596,000,000
572 gp=pc
573 gu=pu
574 guu=puu
575 gi=ppi
576 gii=pii
577 gj=pj
578 gjj=pjj
579 gui=pui
580 guj=puj
581
582 guu=guu*dd1*dd1+gu*dd2 67.918s 0s 0s 826,953,400,000
583 gu=gu*dd1/rij 45.721s 0s 0s 536,416,400,000
584
585 gui=gui*dd1*dd7 32.883s 0s 0s 375,057,800,000
586 guj=guj*dd1*dd8 0.228s 0s 0s 1,957,800,000
587
588 gii=gii*dd7*dd7+gi*dd9 93.586s 0s 0s 1,034,625,800,000
589 gjj=gjj*dd8*dd8+gj*dd10 586.273s 0s 0s 6,776,933,800,000
590 gi=gi*dd7/ri 133.138s 0s 0s 1,471,982,200,000
591 gj=gj*dd8/rj 1320.961s 0s 0s 6,402,068,400,000
592
593 !!!! een for periodic systems WAS
594 if(icutjasc .gt. 0 .or. iperiodic .ne. 0) then 626.434s 0s 0s 963,357,200,000
Assembly code:
Address Source Line Assembly Effective Time by Utilization Spin Time Overhead Time Instructions Retired
0x8b36bf 571 jle 0x8b3da6 <Block 269>
0x8b36c5 Block 255:
0x8b36c5 582 movsdq -0x210(%rbp), %xmm2 5.979s 0s 0s 70,899,400,000
0x8b36cd 583 movaps %xmm11, %xmm1 14.281s 0s 0s 173,641,000,000
0x8b36d1 588 movsdq -0x208(%rbp), %xmm0 25.805s 0s 0s 297,830,000,000
0x8b36d9 582 mulsdq -0x1e8(%rbp), %xmm9 25.008s 0s 0s 315,564,600,000
0x8b36e2 588 mulsdq -0x1e0(%rbp), %xmm7 18.367s 0s 0s 165,534,200,000
0x8b36ea 582 mulsd %xmm11, %xmm2 17.322s 0s 0s 222,445,600,000
0x8b36ef 588 mulsd %xmm5, %xmm0 24.540s 0s 0s 281,242,000,000
0x8b36f3 583 mulsdq -0x1d0(%rbp), %xmm1 25.527s 0s 0s 291,636,800,000
0x8b36fb 582 addsd %xmm9, %xmm2 12.469s 0s 0s 137,586,800,000
0x8b3700 589 mulsdq -0x200(%rbp), %xmm10 25.294s 0s 0s 259,220,000,000
0x8b3709 588 addsd %xmm7, %xmm0 20.213s 0s 0s 237,200,600,000
0x8b370d 585 mulsdq -0x1f0(%rbp), %xmm12 26.725s 0s 0s 305,502,600,000
0x8b3716 583 divsdq -0x238(%rbp), %xmm1 5.913s 0s 0s 71,138,600,000
0x8b371e 589 movsdq -0x218(%rbp), %xmm9 554.227s 0s 0s 6,440,345,600,000
0x8b3727 590 movsdq -0x1c8(%rbp), %xmm7 1.367s 0s 0s 15,813,200,000
0x8b372f 591 movsdq -0x1d8(%rbp), %xmm3 3.675s 0s 0s 47,980,400,000
0x8b3737 589 mulsd %xmm8, %xmm9 0.006s 0s 0s 15,600,000
0x8b373c 590 mulsd %xmm5, %xmm7 65.909s 0s 0s 735,233,200,000
0x8b3740 591 mulsd %xmm8, %xmm3 1.284s 0s 0s 15,758,600,000
0x8b3745 589 addsd %xmm10, %xmm9 3.635s 0s 0s 44,421,000,000
0x8b374a 586 mulsdq -0x1f8(%rbp), %xmm6 0.005s 0s 0s 36,400,000
0x8b3752 590 divsdq -0x228(%rbp), %xmm7 65.862s 0s 0s 720,935,800,000
0x8b375a 591 divsdq -0x230(%rbp), %xmm3 1316.002s 0s 0s 6,338,329,400,000
0x8b3762 594 cmpl $0x0, -0x220(%rbp) 591.793s 0s 0s 907,816,000,000
0x8b3769 594 jle 0x8b6ca3 <Block 376> 0.028s 0s 0s 5,200,000
The two movsdq's load from the same memory-addressable cache line, with the second instruction getting charged more. This illustrates what Tim P was talking about: the overhead time appears to be billed to a different or following instruction. Your interpretation of VTune's counters has to take this into consideration.
The two mulsd's, with the first taking more time, likely reflect that the first instruction, dependent on xmm7, was charged the memory fetch time into the same cache line from which xmm3 will get its data. The smaller time for the second mulsd reflects that no memory (or L3 or L2) stall occurred in getting the data into xmm3.
The divsdq's (my interpretation) reflect a similar memory latency on the fetch into the same cache line (holding -0x228 and -0x230 off rbp), with both first instructions (of the mulsd pair and the divsdq pair) at around 65.9s, but the SSE FPU can only perform one division at a time, so the second instruction had to wait.
I project that by swapping these two statements around, you will observe the second statement taking longer.
Jim Dempsey
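For concreteness, a minimal sketch of the control experiment Jim proposes (placeholder values, not the original code): run once with the original statement order and once with the order swapped. If the inflated time follows whichever divide comes second, the cost is divider serialization plus event skid rather than anything specific to the gj statement.

program swap_test
  implicit none
  real(8) :: gi, gj, dd7, dd8, ri, rj
  ! placeholder values standing in for the real data
  gi = 1.0d0; gj = 2.0d0
  dd7 = 3.0d0; dd8 = 4.0d0
  ri = 5.0d0; rj = 6.0d0
  ! swapped order (the original code updates gi first, then gj)
  gj = gj*dd8/rj
  gi = gi*dd7/ri
  print *, gi, gj
end program swap_test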
When I read Jim's answer I realized that there is a dependency on the FPU divider latency, which is around ~15-20 cycles. I suppose the divider is not fully pipelined, so the next division uop can only be scheduled for execution after ~15-20 cycles, which is probably why you are seeing slower performance for the second line of code. Looking at the assembly snippet, I suppose that ri and rj are constants (I can be wrong here), so why not try multiplying by their inverses?
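A minimal sketch of that suggestion, assuming ri and rj are invariant across an enclosing loop (which the snippet does not confirm): compute their reciprocals once outside the loop and multiply inside it, so each iteration pays for two multiplies instead of two divides. Names, loop bounds, and values are placeholders, not the original code, and replacing a divide with a multiply by a reciprocal can change the result in the last bits, so accuracy should be checked.

program recip_hoist
  implicit none
  integer :: k
  real(8) :: gi, gj, dd7, dd8, ri, rj, ri_inv, rj_inv
  ! placeholder values standing in for the real data
  gi = 1.0d0; gj = 1.0d0
  dd7 = 0.9d0; dd8 = 1.1d0
  ri = 5.0d0; rj = 6.0d0
  ! one divide each, hoisted out of the loop
  ri_inv = 1.0d0/ri
  rj_inv = 1.0d0/rj
  do k = 1, 1000
    gi = gi*dd7*ri_inv   ! multiply replaces the per-iteration divide
    gj = gj*dd8*rj_inv
  end do
  print *, gi, gj
end program recip_hoist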