Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.

code optimization and vtune

Zhiyong_Z_
Beginner
1,573 Views

I am using vtune amplifier to profile a code and noticed something rather unusual (perhaps to me but not to the experts here). There are two back to back statements in the code: 

Source Line    Source    Effective Time by Utilization    Spin Time    Overhead Time    Instructions Retired
590                      gi=gi*dd7/ri    133.138s    0s    0s    1,471,982,200,000
591                      gj=gj*dd8/rj    1320.961s    0s    0s    6,402,068,400,000

The two lines of code have very different execution time, almost a factor of 10, as reported by vtune amplifier. Also the number of instructions retired for the two lines of code are quite different also, one at  1,471,982,200,000 and the other at  6,402,068,400,000.

Could somebody explain what resulted in the differences and what optimizations are possible?

Thank you!

Zhiyong

0 Kudos
8 Replies
jimdempseyatthecove
Honored Contributor III
1,573 Views

Are these statements involving arrays?

Jim Dempsey

0 Kudos
Zhiyong_Z_
Beginner
1,573 Views

Hi Jim,

These are not arrays.

Zhiyong

0 Kudos
TimP
Honored Contributor III
1,573 Views

Where you rely on the non- "precise" event counters, you can't directly attribute the event counts to a single instruction or source line.  In this case, it looks apparent that the second group of operations spends much of its quoted time waiting for the first to complete.  This is particularly likely on the CPU models which have high latencies for division, particularly including the time during which the fpu pipeline is blocked.

Little can be said about possible optimization without looking at the context, possibility of vectorization, replacement of division by multiplication, ...

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,573 Views

Tim P. has a point about the non-precise (with respect to location of instruction) event counters. You can confirm this to some extent yourself by making a test run after swapping those two statements (both in source appear to have the same computational complexity). The efficiency of the code though may differ in that one or more of the variables in the faster statement may be registerized and not be in the slower statement. A secondary (but can be primary) cause can be if the code preceding the two statements in question, has a tendency to pre-load the L1 and/or L2 cache. Some of the VTune counters should be able to provide the information relating to cache miss and/or memory stalls.

Jim Dempsey

0 Kudos
Bernard
Valued Contributor I
1,573 Views

As Tim hinted there may be present some kind of interdependency maybe at the uops level which slows down the computation. When looking at your short code sample I cannot see any present dependency between those two statements. Variable "gi"  is not used as an argument to compute the value of variable "gj" at least not in the code snippet.

0 Kudos
Zhiyong_Z_
Beginner
1,573 Views

I have copied the source lines and the corresponding assembly codes for a "whole block" below. The few assignment source statements are crossed through as they are not assembled. The two source lines 590 and 591 and their assembly codes are highlighted and in italic. 

The assembly codes for the two source lines are always back to back and consist of move, mult, and div. There are no dependencies among the two and their operands are identical in register use and "memory access". The address in memory are  -0x228 and  -0x230 and the memory address accessed for the whole block ranges from -0x1d0 to -0x238. Please note that the memory address -0x238 is associated with source line 583, the other line that does a division just as the other lines at issue here but the reported time is only 45 sec. compared with 133 and 1320 for the other two lines respectively. 

With this information, can we conclude that the differences in time for these there lines, 583, 590, and 591, are due to cache and memory misses?

Are there any other more explicit ways to see if there are indeed cache/memory misses?

Why the instructions retired for these three lines are different, as much as order of magnitude? 

Seems like that division and multiplication takes about the same time as shown in this particular block of code?

If one just wants to optimize the performance of this particular block of code, is it possible and what would be the best way to go about it? Do I need to arrange the storage of the variables used in this case and make sure that all the variables are accessed in the same memory access? Does reordering the order of some of the executions help at all? Does the execution interleave the memory access and computation? 

Thanks!

Zhiyong

Source Line    Source    Effective Time by Utilization    Spin Time    Overhead Time    Instructions Retired
571                    if(ideriv.gt.0) then    9.355s    0s    0s    102,596,000,000
572                      gp=pc                
573                      gu=pu                
574                      guu=puu                
575                      gi=ppi                
576                      gii=pii                
577                      gj=pj                
578                      gjj=pjj                
579                      gui=pui                
580                      guj=puj                

581                    
582                      guu=guu*dd1*dd1+gu*dd2    67.918s    0s    0s    826,953,400,000
583                      gu=gu*dd1/rij    45.721s    0s    0s    536,416,400,000
584                    
585                      gui=gui*dd1*dd7    32.883s    0s    0s    375,057,800,000
586                      guj=guj*dd1*dd8    0.228s    0s    0s    1,957,800,000
587                    
588                      gii=gii*dd7*dd7+gi*dd9    93.586s    0s    0s    1,034,625,800,000
589                      gjj=gjj*dd8*dd8+gj*dd10    586.273s    0s    0s    6,776,933,800,000
590                      gi=gi*dd7/ri    133.138s    0s    0s    1,471,982,200,000
591                      gj=gj*dd8/rj    1320.961s    0s    0s    6,402,068,400,000

592                    
593    !!!!  een for periodic systems         WAS                
594                      if(icutjasc .gt. 0 .or. iperiodic .ne. 0) then    626.434s    0s    0s    963,357,200,000

Assembly code:

Address    Source Line    Assembly    Effective Time by Utilization    Spin Time    Overhead Time    Instructions Retired
0x8b36bf    571    jle 0x8b3da6 <Block 269>                
0x8b36c5        Block 255:                
0x8b36c5    582    movsdq  -0x210(%rbp), %xmm2    5.979s    0s    0s    70,899,400,000
0x8b36cd    583    movaps %xmm11, %xmm1    14.281s    0s    0s    173,641,000,000
0x8b36d1    588    movsdq  -0x208(%rbp), %xmm0    25.805s    0s    0s    297,830,000,000
0x8b36d9    582    mulsdq  -0x1e8(%rbp), %xmm9    25.008s    0s    0s    315,564,600,000
0x8b36e2    588    mulsdq  -0x1e0(%rbp), %xmm7    18.367s    0s    0s    165,534,200,000
0x8b36ea    582    mulsd %xmm11, %xmm2    17.322s    0s    0s    222,445,600,000
0x8b36ef    588    mulsd %xmm5, %xmm0    24.540s    0s    0s    281,242,000,000
0x8b36f3    583    mulsdq  -0x1d0(%rbp), %xmm1    25.527s    0s    0s    291,636,800,000
0x8b36fb    582    addsd %xmm9, %xmm2    12.469s    0s    0s    137,586,800,000
0x8b3700    589    mulsdq  -0x200(%rbp), %xmm10    25.294s    0s    0s    259,220,000,000
0x8b3709    588    addsd %xmm7, %xmm0    20.213s    0s    0s    237,200,600,000
0x8b370d    585    mulsdq  -0x1f0(%rbp), %xmm12    26.725s    0s    0s    305,502,600,000
0x8b3716    583    divsdq  -0x238(%rbp), %xmm1    5.913s    0s    0s    71,138,600,000
0x8b371e    589    movsdq  -0x218(%rbp), %xmm9    554.227s    0s    0s    6,440,345,600,000
0x8b3727    590    movsdq  -0x1c8(%rbp), %xmm7    1.367s    0s    0s    15,813,200,000
0x8b372f    591    movsdq  -0x1d8(%rbp), %xmm3    3.675s    0s    0s    47,980,400,000
0x8b3737    589    mulsd %xmm8, %xmm9    0.006s    0s    0s    15,600,000
0x8b373c    590    mulsd %xmm5, %xmm7    65.909s    0s    0s    735,233,200,000
0x8b3740    591    mulsd %xmm8, %xmm3    1.284s    0s    0s    15,758,600,000

0x8b3745    589    addsd %xmm10, %xmm9    3.635s    0s    0s    44,421,000,000
0x8b374a    586    mulsdq  -0x1f8(%rbp), %xmm6    0.005s    0s    0s    36,400,000
0x8b3752    590    divsdq  -0x228(%rbp), %xmm7    65.862s    0s    0s    720,935,800,000
0x8b375a    591    divsdq  -0x230(%rbp), %xmm3    1316.002s    0s    0s    6,338,329,400,000

0x8b3762    594    cmpl  $0x0, -0x220(%rbp)    591.793s    0s    0s    907,816,000,000
0x8b3769    594    jle 0x8b6ca3 <Block 376>    0.028s    0s    0s    5,200,000

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,573 Views

The two movsdq's are from the same memory addressable cache line, and the second instruction getting charged more. This illustrates what Tim P was talking about where the overhead time appears to be billed to a different or following instruction. Your interpretation of VTune's counters has to take this into consideration.

The two mulsd's with first taking more time, likely reflect that the first instruction dependent on xmm7 was assessed the memory fetch time into the same cache line where xmm3 will get its data from. The lesser time for the second mulsd reflects no memory (or L3 or L2) stall occurred in getting the data into xmm3.

The divsq's (my interpretation) reflect a similar memory latency on the fetch into same cache line (holding -0x228 and 0x230 off rbp) with both first instructions (mulsd pair and divsdq pair) at around 65.9s, but that the SSE FPU can only perform one division at a time and the second  instruction had to wait.

I project that by swapping these two statements around, that you will observe the second statement taking longer.

Jim Dempsey

0 Kudos
Bernard
Valued Contributor I
1,573 Views

When I read Jim's answer I realized that I there is dependency on FPU Divider latency which is around ~15-20 cycles. I suppose that Divider is pipelined so the next division uop(s) will be scheduled for execution after ~15-20 cycles, hence probably you are seeing slower performance for the second line of code. By looking at the assembly code snippet I suppose that ri and rj are constants(I can be wrong here) so why do not try to multiply by their inverse?

0 Kudos
Reply