memmove vectorization opt-report

Patrice_l_ · 12-29-2014

Hi all,

I am looking at the vectorization report for the statement a(ipos+1:m+1)=a(ipos:m), where a is a character(20) array. The optimization report looks like this:

LOOP BEGIN at 
   remark #15382: vectorization support: call to function for_cpystr cannot be vectorized
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed FLOW dependence between uids line 84 and uids line 84
   remark #15346: vector dependence: assumed ANTI dependence between uids line 84 and uids line 84
LOOP END

LOOP BEGIN at 
   remark #15382: vectorization support: call to function memmove cannot be vectorized
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed FLOW dependence between uids line 84 and uids line 84
   remark #15346: vector dependence: assumed ANTI dependence between uids line 84 and uids line 84
LOOP END

Report from: Code generation optimizations [cg]
remark #34014: optimization advice for memmove: increase the destination's alignment to 16 (and use __assume_aligned) to speed up library implementation
remark #34014: optimization advice for memmove: increase the source's alignment to 16 (and use __assume_aligned) to speed up library implementation
remark #34026: call to memmove implemented as a call to optimized library version

I understand the flow and anti dependences, and hence the use of memmove. The advice says the memory alignment needs to be 16 bytes. Does that mean the statement will be vectorized if the array section is aligned on 16 bytes? And should I increase the element length to character(32), so that I don't have to make sure the size of the array section is a multiple of 16?
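For readers less familiar with the dependence remarks, here is a minimal sketch of why the overlap matters. It is an illustration only: the program name, the sizes, and the `rec` labels are made up, and the loops show the hazard in plain Fortran rather than what the compiler actually generates. The array-section assignment must behave as if the right-hand side were read in full before anything is stored, which is what memmove provides. A naive forward element loop breaks that guarantee, while a backward loop preserves it.

```fortran
program shift_demo
  implicit none
  integer, parameter :: n = 8          ! small size chosen for the demo
  character(20) :: a(n), b(n), c(n)
  integer :: i, ipos, m

  ipos = 3
  m = 6
  do i = 1, n
     write (a(i), '(a,i0)') 'rec', i   ! fill with rec1, rec2, ...
  end do
  b = a
  c = a

  ! Section assignment from the question: the right-hand side is
  ! evaluated as a whole before any element is stored.
  a(ipos+1:m+1) = a(ipos:m)

  ! Forward loop: b(i) was already overwritten by the previous
  ! iteration (flow dependence), so the first record gets smeared.
  do i = ipos, m
     b(i+1) = b(i)
  end do

  ! Backward loop: every source element is read before it is overwritten.
  do i = m, ipos, -1
     c(i+1) = c(i)
  end do

  print '(a,l1)', 'backward loop matches section assignment: ', all(c == a)
  print '(a,l1)', 'forward loop matches section assignment:  ', all(b == a)
end program shift_demo
```

The backward loop reproduces the section assignment and the forward loop does not, which is the overlap hazard behind the FLOW and ANTI remarks.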

If so, I am using the -align array16byte compiler option. Do I still need to use __assume_aligned, or will the compiler deduce the alignment automatically?

Thanks.

Pat.

TimP · 12-29-2014

Too many questions here to answer without a working example. If the object is aligned but the compiler doesn't recognize it due to separate compilation, assume_aligned should help.
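As a rough sketch of the separate-compilation case, assuming Intel Fortran's `!dir$ attributes align` and `!dir$ assume_aligned` directives (other compilers treat them as plain comments): the module, the `shift_up` helper, and the sizes below are hypothetical, and whether the directives change the generated code for a character array would have to be checked in the opt-report. Note also that asserting the alignment of the start of `a` says nothing about the section itself. With 20-byte records, a(ipos) starts on a 16-byte boundary only when ipos-1 is a multiple of 4, whereas with 32-byte records every element would, given an aligned base.

```fortran
module shift_mod
  implicit none
contains
  ! Hypothetical helper: shifts records ipos..m up by one slot.
  subroutine shift_up(a, ipos, m)
    character(20), intent(inout) :: a(*)
    integer, intent(in) :: ipos, m
    ! Intel-specific directive: asserts that the first byte of a is
    ! 16-byte aligned. If that is false at run time, behavior is undefined.
    !dir$ assume_aligned a:16
    a(ipos+1:m+1) = a(ipos:m)
  end subroutine shift_up
end module shift_mod

program align_demo
  use shift_mod
  implicit none
  character(20) :: a(6000)
  ! Intel-specific directive requesting 16-byte alignment for a.
  !dir$ attributes align: 16 :: a
  integer :: i, ipos

  do i = 1, size(a)
     write (a(i), '(a,i0)') 'rec', i
  end do
  call shift_up(a, 3, 10)
  print '(a)', trim(a(4))   ! record 3 now also sits in slot 4

  ! Byte offset of a(ipos) from the start of a, modulo 16,
  ! for 20-byte and 32-byte records.
  do ipos = 1, 5
     print '(a,i0,2(a,i0))', 'ipos=', ipos, &
          '  len=20 offset mod 16: ', mod(20*(ipos-1), 16), &
          '  len=32 offset mod 16: ', mod(32*(ipos-1), 16)
  end do
end program align_demo
```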

I doubt that lengthening to 32 could help unless alignment is recognized. I also doubt that, if there is a known reason for memmove, the statement could be vectorized while that issue remains.

None of this should matter unless it is a critical hotspot, e.g. in a tight loop.

Patrice_l_ · 12-29-2014

Actually, after some tests, the difference is very small. Maybe that is because of remark #34026, which says the optimized library version is used?

That is a little confusing after reading the two previous remarks. This was mostly out of curiosity; I need to sort a big list of records.

Sometimes I get this:

remark #34000: call to memcpy implemented inline with loads and stores with proven source (alignment, offset): (32, 0), and destination (alignment, offset): (1, 0)

What does that mean?

Thanks.

TimP · 01-03-2015

Apparently, the compiler is able to recognize alignment in those cases and skip run time alignment adjustment. It might be interesting if the compiler team could furnish a more complete discussion.

Skipping adjustments should overcome some past performance deficits with memcpy on short operands. I'd still like more sanity in the methods for replacing memcpy with vector code, such as where ifort requires !dir$ simd in place of the !$omp simd used by others. The appearance of memcpy is often a symptom of an unnecessary temporary array; this consumes stack, cache, and memory bandwidth.

jimdempseyatthecove · 01-03-2015

>>a(ipos+1:m+1)=a(ipos:m)

a(ipos:m+1)=a(ipos:ipos) // a(ipos:m)

or

a(ipos:m+1)= ' ' // a(ipos:m)

The above will create a temporary and remove the vector dependency.

Note, a has dimension of 20, therefore m-ipos+1 must be small. If the temporary is created on stack then this should be relatively fast.

Jim Dempsey

Patrice_l_ · 01-05-2015

Hi,

Thanks for the insights. So the gain from having those statements vectorized might be offset by the creation of a temporary array. Jim, a does not have dimension 20; it is declared as character(20),dimension(6000) :: a. I'll try it and see whether creating a temporary is faster for a large dataset.
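A minimal sketch of that experiment, assuming the declaration above: the direct overlapping assignment is compared against an explicit temporary, since for an array of records the // operator concatenates the strings element by element rather than the array itself. The names `tmp` and `temp_demo` and the choice of ipos and m are hypothetical. The program only checks that both variants give the same result; any timing would need to be added around each variant and repeated many times to be meaningful.

```fortran
program temp_demo
  implicit none
  integer, parameter :: n = 6000
  character(20) :: a(n), b(n)
  character(20), allocatable :: tmp(:)
  integer :: i, ipos, m

  ipos = 10        ! arbitrary insertion point for the demo
  m = n - 1        ! last occupied slot, leaving room to shift by one
  do i = 1, n
     write (a(i), '(a,i0)') 'rec', i
  end do
  b = a

  ! Variant 1: direct overlapping section assignment
  ! (the opt-report in this thread shows it becoming a memmove call).
  a(ipos+1:m+1) = a(ipos:m)

  ! Variant 2: explicit temporary. Neither statement overlaps itself,
  ! at the cost of an extra copy of the shifted records.
  allocate (tmp(m - ipos + 1))
  tmp = b(ipos:m)
  b(ipos+1:m+1) = tmp

  if (any(a /= b)) error stop 'results differ'
  print '(a)', 'direct and temporary versions agree'
end program temp_demo
```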

Thanks.

Pat.