Hi all,
I am looking at the vectorization report for the statement a(ipos+1:m+1)=a(ipos:m), where a is a character(20) array. The optimization report looks like this:
LOOP BEGIN at
   remark #15382: vectorization support: call to function for_cpystr cannot be vectorized
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed FLOW dependence between uids line 84 and uids line 84
   remark #15346: vector dependence: assumed ANTI dependence between uids line 84 and uids line 84
LOOP END
LOOP BEGIN at
   remark #15382: vectorization support: call to function memmove cannot be vectorized
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed FLOW dependence between uids line 84 and uids line 84
   remark #15346: vector dependence: assumed ANTI dependence between uids line 84 and uids line 84
LOOP END
Report from: Code generation optimizations [cg]
remark #34014: optimization advice for memmove: increase the destination's alignment to 16 (and use __assume_aligned) to speed up library implementation
remark #34014: optimization advice for memmove: increase the source's alignment to 16 (and use __assume_aligned) to speed up library implementation
remark #34026: call to memmove implemented as a call to optimized library version
I understand the flow and anti dependences, and hence the use of memmove. The advice says the memory alignment needs to be 16 bytes. Does that mean the statement will be vectorized if the array section is aligned on a 16-byte boundary? And should I increase the element length to character(32), so that I don't have to make sure the size of the array section is a multiple of 16?
If so, I am using the -align array16byte compiler option. Do I still need to use __assume_aligned, or will the compiler deduce the alignment automatically?
Thanks.
Pat.
There are too many questions here to answer without a working example. If the object is aligned but the compiler doesn't recognize it because of separate compilation, assume_aligned should help.
I doubt that lengthening to 32 could help unless the alignment is recognized. I also doubt that, if there is a known reason for the memmove, the statement could be vectorized while that issue remains.
None of this should matter unless it is a critical hotspot, e.g. in a tight loop.
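To make the alignment point concrete, here is a minimal sketch, not a tested fix. The module, the helper shift_right, and the values of ipos and m are hypothetical names chosen for illustration. The two !dir$ lines are Intel-specific directives (other compilers read them as comments), and whether they change the optimization report for this statement has not been verified here. The loop at the end only prints where each element starts relative to a 16-byte boundary: with 20-byte elements the offsets cycle through 0, 4, 8, 12, so a section starting at an arbitrary ipos is generally not 16-byte aligned even when the array itself is, whereas with 32-byte elements every element would start on a 16-byte boundary.

```fortran
module shift_mod
  implicit none
contains
  ! Hypothetical helper: move a(ipos:m) up by one element.
  subroutine shift_right(a, n, ipos, m)
    integer, intent(in) :: n, ipos, m
    character(20), intent(inout) :: a(n)
    ! Intel-specific directive: a promise (not a check) that the first
    ! element of a is 16-byte aligned. Other compilers ignore this line.
    !dir$ assume_aligned a: 16
    a(ipos+1:m+1) = a(ipos:m)
  end subroutine shift_right
end module shift_mod

program align_demo
  use shift_mod
  implicit none
  character(20) :: a(6000)
  ! Intel-specific directive requesting 16-byte alignment for a.
  !dir$ attributes align: 16 :: a
  integer :: i

  a = ' '
  call shift_right(a, size(a), 3, 100)   ! ipos = 3, m = 100 are made-up values

  ! Byte offset of element i from the start of the array, modulo 16,
  ! for 20-byte and for 32-byte elements.
  do i = 1, 5
    print '(a,i0,a,i0,a,i0)', 'element ', i, &
          ': len 20 -> ', mod(20*(i-1), 16), &
          ', len 32 -> ', mod(32*(i-1), 16)
  end do
end program align_demo
```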
Actually, after some tests, the difference is very small. Maybe that is because of remark #34026, which says the optimized library version is used?
That remark is a little confusing when read after the two previous ones. This was just out of curiosity; I need to sort a big list of records.
Sometimes I get this:
remark #34000: call to memcpy implemented inline with loads and stores with proven source (alignment, offset): (32, 0), and destination (alignment, offset): (1, 0)
What does that mean?
Thanks.
Apparently, the compiler is able to recognize alignment in those cases and skip the run-time alignment adjustment. It might be interesting if the compiler team could furnish a more complete discussion.
Skipping adjustments should overcome some past performance deficits with memcpy on short operands. I'd still like more sanity in the methods for replacing memcpy with vector code, such as where ifort requires !dir$ simd in place of the !$omp simd used by others. The appearance of memcpy is often a symptom of an unnecessary temporary array, which consumes stack, cache, and memory bandwidth.
>>a(ipos+1:m+1)=a(ipos:m)
a(ipos:m+1)=a(ipos:ipos) // a(ipos:m)
or
a(ipos:m+1)= ' ' // a(ipos:m)
The above will create a temporary and remove the vector dependency.
Note, a has dimension of 20, therefore m-ipos+1 must be small. If the temporary is created on stack then this should be relatively fast.
Jim Dempsey
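Read with a as a single character(20) string, so that a(ipos:m) is a substring, the second form above can be tried with a small program like this one. It is only an illustrative sketch: the string contents and the values of ipos and m are made up, and whether the compiler actually builds a temporary for the concatenation is its own decision.

```fortran
program string_shift
  implicit none
  character(20) :: a
  integer :: ipos, m

  a = 'abcdefgh'   ! made-up contents, blank-padded to 20 characters
  ipos = 3
  m = 8

  ! The concatenation on the right-hand side is evaluated in full
  ! before the assignment, so the overlap with the left-hand side
  ! is harmless. A blank is opened up at position ipos.
  a(ipos:m+1) = ' ' // a(ipos:m)

  print '(3a)', '[', a, ']'
end program string_shift
```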
Hi,
Thanks for the insights. So the gain from having those statements vectorized might be offset by the creation of a temporary array. Jim, a does not have dimension 20; it is declared as character(20),dimension(6000) :: a. I'll try and see whether creating a temporary is faster for a large dataset.
Thanks.
Pat.
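For the array case, a comparison along these lines could be a starting point. This is a sketch under assumptions, not a benchmark: the record contents and the values of ipos and m are invented, tmp and b are hypothetical helper arrays, and no timing is included. Variant 1 copies the section through an explicit temporary; variant 2 moves the elements from the top down, which needs no temporary because each element is read before it is overwritten.

```fortran
program array_shift
  implicit none
  integer, parameter :: n = 6000
  character(20) :: a(n), b(n)
  character(20), allocatable :: tmp(:)
  integer :: i, ipos, m

  ! Fill the array with made-up record names and keep a second copy.
  do i = 1, n
    write (a(i), '(a,i0)') 'rec', i
  end do
  b = a

  ipos = 10      ! made-up insertion position
  m = n - 1      ! last element currently in use

  ! Variant 1: explicit temporary, then a copy without overlap.
  allocate (tmp(m - ipos + 1))
  tmp = a(ipos:m)
  a(ipos+1:m+1) = tmp

  ! Variant 2: backward loop, no temporary needed.
  do i = m, ipos, -1
    b(i+1) = b(i)
  end do

  ! Both variants should leave the same contents.
  if (all(a == b)) then
    print '(a)', 'both variants agree'
  else
    error stop 'variants differ'
  end if
end program array_shift
```

Timing each variant separately on the real data would show whether the temporary costs more than it saves.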
