Speedup shown in the optimization report, but no speedup observed during execution

Vijay_D_ · 07-15-2015

Hi,

I had a for loop containing some branches, which kept it from being a candidate for vectorization; I confirmed this from the optimization report. I removed these branches using masking, and now the same for loop satisfies all the necessary requirements for a loop to be vectorized, such as:

1. No Branches or Jumps.

2. No Dependencies, etc.
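
To illustrate the kind of transformation described above, here is a minimal sketch. It is not the code from sample.c: the function names, the arrays a and b, and the threshold parameter are all hypothetical. The first loop has an if/else in its body; the second computes the same result by turning the condition into a 0/1 mask and folding it into the arithmetic.

```c
#include <stddef.h>

/* Hypothetical branchy loop: the if/else in the body is the kind of
   control flow that can keep a compiler from vectorizing the loop. */
void scale_branchy(const float *a, float *b, size_t n, float threshold)
{
    for (size_t i = 0; i < n; ++i) {
        if (a[i] > threshold)
            b[i] = a[i] * 2.0f;
        else
            b[i] = a[i];
    }
}

/* Same result without a branch: the comparison yields 0 or 1, which is
   converted to a float mask and folded into the multiplication. */
void scale_masked(const float *a, float *b, size_t n, float threshold)
{
    for (size_t i = 0; i < n; ++i) {
        float mask = (float)(a[i] > threshold);  /* 1.0f or 0.0f */
        b[i] = a[i] * (1.0f + mask);             /* a*2 if set, a*1 if not */
    }
}
```

Note that the masked version always performs the multiplication, so it does more arithmetic per element than the branchy version does on its "else" path. Whether that trade pays off depends on the loop and the data.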

Below is the output from the optimization report with option -vec-report5. As the vectorization report shows, the loop was split: part 1 (chunk1) was supposed to give a speedup of 2.140 and part 2 (chunk2) a speedup of 1.480, but when I actually run the code I cannot find any speedup at all; the time is almost equal to the original. I also checked the assembly for the same loop, and the compiler does seem to generate machine instructions that use xmm registers.

LOOP BEGIN at ../../../sample.c(2635,5)
<Distributed chunk1>
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2637,9) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2638,9) ]
remark #15301: PARTIAL LOOP WAS VECTORIZED
remark #15449: unmasked aligned unit stride stores: 2
remark #15460: masked strided loads: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 12
remark #15477: vector loop cost: 5.500
remark #15478: estimated potential speedup: 2.140
remark #15479: lightweight vector operations: 7
remark #15481: heavy-overhead vector operations: 1
remark #15487: type converts: 2
remark #15488: --- end vector loop cost summary ---
LOOP END

LOOP BEGIN at ../../../sample.c(2635,5)
<Remainder, Distributed chunk1>
LOOP END

LOOP BEGIN at ../../../sample.c(2635,5)
<Distributed chunk2>
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2641,9) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2641,9) ]
remark #15388: vectorization support: reference mask5 has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference bi has aligned access [ ../../../sample.c(2661,2) ]
remark #15388: vectorization support: reference ai has aligned access [ ../../../sample.c(2661,2) ]
remark #15399: vectorization support: unroll factor set to 2
remark #15301: PARTIAL LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 10
remark #15449: unmasked aligned unit stride stores: 1
remark #15458: masked indexed (or gather) loads: 10
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 72
remark #15477: vector loop cost: 47.370
remark #15478: estimated potential speedup: 1.480
remark #15479: lightweight vector operations: 67
remark #15480: medium-overhead vector operations: 1
remark #15488: --- end vector loop cost summary ---
LOOP END

LOOP BEGIN at ../../../sample.c(2635,5)
<Remainder, Distributed chunk2>
LOOP END

My question: is it possible that, although the optimization report shows an opportunity for speedup and the machine code does use the 128-bit xmm registers rather than scalar code, there is still no speedup at run time? Or is it because of the heavy-overhead vector operations shown in the optimization report (though I think the compiler takes those into account during compilation)? If I am wrong or missing something, any pointer in the right direction would be appreciated.

TimP · 07-16-2015

The quoted speedup appears to be an ideal based on possibly contradictory conditions:

large enough loop length to make remainders and startup and wind down negligible

no impact of cache misses (gather loads would seem to warn of problems here)

With partial vectorization, the quoted figures presumably apply to just the vectorized parts, and there may be losses due to splitting a loop into vector and non-vector code portions. Some of these questions may be analyzed in the Advisor beta test version.

I suppose those warnings about type conversions can't be assessed without looking into the code to see whether they fall in inner or vectorized loops.
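
One rough way to see how much the access pattern matters, independent of the compiler's estimate, is to time a unit-stride loop against an indexed one over the same data. The sketch below is only an illustration under assumptions: the function names and the idx array are hypothetical and not taken from sample.c, and clock() gives coarse CPU time, so the arrays need to be much larger than the last-level cache and the runs repeated for the numbers to mean anything.

```c
#include <stddef.h>
#include <time.h>

/* Unit-stride read: consecutive elements, friendly to cache lines
   and hardware prefetch. */
double sum_unit_stride(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Indexed read: a[idx[i]] is the pattern that shows up as an indexed
   (gather) load once the loop is vectorized. */
double sum_indexed(const double *a, const size_t *idx, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[idx[i]];
    return s;
}

/* CPU seconds elapsed since 'start', a value returned earlier by clock(). */
double elapsed_seconds(clock_t start)
{
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

To use it, call clock() before each summation and elapsed_seconds() after it, once with idx holding 0..n-1 in order and once with a shuffled idx. If the shuffled run is much slower, cache misses rather than arithmetic are likely dominating the loop time.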

jimdempseyatthecove · 07-20-2015

Use VTune to check for LLC misses within this loop. You may find that both the scalar and vector codes are memory latency bound. If this is the case, and if this is a significant bottleneck in your program, then examine the code to see if reorganizing the data can improve the cache hit ratio.
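
As an example of what reorganizing the data can look like, the sketch below contrasts an array-of-structures layout with a structure-of-arrays layout. This is an assumption about a common case, not a description of the data in sample.c; the struct and field names are hypothetical. With the first layout, a loop that reads only x steps through memory with a stride of one whole struct; with the second, the same loop reads x with unit stride.

```c
#include <stddef.h>

/* Array of structures (AoS): consecutive x values sit
   sizeof(struct particle_aos) bytes apart in memory. */
struct particle_aos {
    float x, y, z, mass;
};

float sum_x_aos(const struct particle_aos *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p[i].x;            /* strided access */
    return s;
}

/* Structure of arrays (SoA): each field is its own contiguous array,
   so a loop over x touches only x, with unit stride. */
struct particles_soa {
    const float *x;
    const float *y;
    const float *z;
    const float *mass;
};

float sum_x_soa(const struct particles_soa *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p->x[i];           /* unit-stride access */
    return s;
}
```

In the AoS version every cache line fetched also carries y, z, and mass values that the loop never uses, so only a quarter of the memory traffic is useful; the SoA version uses every byte it loads.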

Jim Dempsey