jeff_keasler
Beginner
167 Views

How to optimize for loop bounds that are multiples of 2 or 4?

Hi,


I have many loops in my code interacting on multiple arrays, all aligned on 32 byte boundaries. Here is a trivial example for illustration:

void foo (double * __restrict__ x, double * __restrict__ y, int low, int high)
{
   __assume_aligned(x, 32) ;
   __assume_aligned(y, 32) ;

   for (int i=low; i<high; ++i) {
      x[i] = y[i]*y[i] ;
   }
}

The above code generates the code below on an AVX supporting architecture (just a snippet here for core kernel, no prologue or epilogue shown):

..B1.69: # Preds ..B1.60 ..B1.58 ..B1.69 # Infreq
vmovupd 32(%r8,%rcx,8), %ymm1 #517.7
vmovupd (%r8,%rcx,8), %ymm0 #517.7
vmulpd %ymm1, %ymm1, %ymm3 #517.7
vmulpd %ymm0, %ymm0, %ymm2 #517.7
vmovupd %xmm2, (%rdi,%rcx,8) #517.7
vmovupd %xmm3, 32(%rdi,%rcx,8) #517.7
vextractf128 $1, %ymm2, 16(%rdi,%rcx,8) #517.7
vextractf128 $1, %ymm3, 48(%rdi,%rcx,8) #517.7
vmovupd 96(%r8,%rcx,8), %ymm5 #517.7
vmovupd 64(%r8,%rcx,8), %ymm4 #517.7
vmulpd %ymm5, %ymm5, %ymm7 #517.7
vmulpd %ymm4, %ymm4, %ymm6 #517.7
vmovupd %xmm6, 64(%rdi,%rcx,8) #517.7
vmovupd %xmm7, 96(%rdi,%rcx,8) #517.7
vextractf128 $1, %ymm6, 80(%rdi,%rcx,8) #517.7
vextractf128 $1, %ymm7, 112(%rdi,%rcx,8) #517.7
addq $16, %rcx #517.7
cmpq %rdx, %rcx #517.7
jb ..B1.69 # Prob 82% #517.7



In my actual code, the loop bound values 'low' and 'high' are only known at runtime, but I know they are a multiple of the SIMD vector size (2 for SSE, 4 for current AVX).

I want to avoid ugly/bad opcodes such as the vextractf128 on AVX, and I prefer to have movaps opcodes generated in place of movups (or worse) for the other architectures I use.

Since all my arrays are aligned, I am hoping to avoid all prologue and epilogue code generation by using better directives. I also want to avoid multiple variants of the loop being generated due to the fact that some arrays may be aligned differently than others. Basically, I want to guarantee exactly one variant of the loop be generated with no conditional logic around the core loop body (other than the principal iteration comparison and conditional jump).

Question 1: Are there directives or attributes to specify that the 'low' and 'high' variables in the loops above are multiples of the SIMD width?

Question 2: Are the optimizations in icpc currently powerful enough to take these directives and keywords into account to generate the tightest code possible?

I consider these optimizations extremely important going forward.

Thank you,
-Jeff Keasler
JenniferJ
Moderator
167 Views
Looks like you should use "#pragma simd vectorlength(x)", and icl will decide if the epilogue is necessary.

More info on "pragma simd" is here:
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/mac/cref_cls/common/...

There are also other pragmas to try alongside "pragma vector", such as "#pragma loop_count": http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/cref_cl...

Jennifer
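
As an illustration of the pragmas named above, here is a minimal sketch of how they might be attached to the loop from the original question. The function name square_range and the loop_count values are hypothetical, and the pragma spellings follow the Intel compiler documentation of that era as best I recall it; whether the compiler then drops the remainder loop depends on the compiler version. The pragmas are guarded so that other compilers simply skip them.

```c
/* Hypothetical kernel: squares y into x over the half-open range [low, high).
 * Assumption (from the original question): x and y are 32-byte aligned and
 * low/high are multiples of the SIMD width. Nothing here enforces that. */
void square_range(double *restrict x, const double *restrict y,
                  int low, int high)
{
#if defined(__INTEL_COMPILER)
    /* Intel-specific hints; the values 4 and 64 are illustrative only. */
    #pragma loop_count min(4), avg(64)
    #pragma simd vectorlength(4)
#endif
    for (int i = low; i < high; ++i) {
        x[i] = y[i] * y[i];
    }
}
```

With any other compiler the function behaves as a plain scalar loop, so the hints can be tried without forking the source.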


10 Replies
styc
Beginner
167 Views

"#pragma vector aligned" should be sufficient performance-wise. Some extra code will still be generated but not executed anyway. To remove such extra code you can manually strip-mine the loop by a factor of four. But that seems to interfere with loop unrolling. At the end of the day you can always use intrinsics.

Om_S_Intel
Employee
167 Views
You may refer to the Intel C++ Compiler User's Guide sections on "ivdep pragma" and "vector pragma". You can set suitable pragmas to direct the compiler to optimize for the trip count, along with other directives for vectorization.
jeff_keasler
Beginner
167 Views
Quoting styc

"#pragma vector aligned" should be sufficient performance-wise. Some extra code will still be generated but not executed anyway. To remove such extra code you can manually strip-mine the loop by a factor of four. But that seems to interfere with loop unrolling. At the end of the day you can always use intrinsics.


Hi,

This got rid of the vextractf128 instructions. Thanks! Now if we could just suppress the prologue and epilogue, we would be set (all the prologue and epilogue checking opcodes do is disrupt the hardware prefetch, add more branches, clutter the icache, increase bus traffic, and reduce the potential size of the algorithm that will fit entirely in icache).

Thanks again!
-Jeff
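
For reference, the manual strip-mining that styc suggested could look roughly like the sketch below. The names square_strip_mined and VEC_WIDTH are hypothetical. The inner loop has a compile-time trip count of four, so the compiler has no reason to emit a remainder loop; the catch, as styc notes, is that this shape may interfere with the compiler's own unrolling.

```c
enum { VEC_WIDTH = 4 };  /* doubles per 256-bit AVX register */

/* Assumption: low and high are both multiples of VEC_WIDTH. If they are not,
 * the inner loop reads and writes past high, so the caller must guarantee it. */
void square_strip_mined(double *restrict x, const double *restrict y,
                        int low, int high)
{
    for (int i = low; i < high; i += VEC_WIDTH) {
        /* Fixed-length inner loop: one full SIMD vector per outer iteration. */
        for (int j = 0; j < VEC_WIDTH; ++j) {
            x[i + j] = y[i + j] * y[i + j];
        }
    }
}
```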
styc
Beginner
167 Views

Seriously, a handful of extra bytes of code can cause what you put in the brackets?

SergeyKostrov
Valued Contributor II
167 Views
Quoting jeff_keasler
...
Now if we could just suppress the prologue and epilogue, we would be set (all the prologue and epilogue checking opcodes do is disrupt the hardware prefetch, add more branches, clutter the icache, increase bus traffic, and reduce the potential size of the algorithm that will fit entirely in icache).

Thanks again!
-Jeff


1. Did you try the '__fastcall' calling convention (supported on Windows only; option /Gr)?

2. I recommend you look at the MSDN article "Considerations for Writing Prolog/Epilog Code".

3. I use the __declspec( naked ) attribute on declarations of some functions in order to make them as small as possible, but this is Microsoft-specific.

Best regards,
Sergey

Georg_Z_Intel
Employee
167 Views
Hello,

Another, more generic and platform-independent strategy to remove the call overhead, and hence the function prologue/epilogue, would be to force inlining of foo(...).

Just in case, a starting point regarding inlining #pragmas:
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/cref_...

Best regards,

Georg Zitzlsberger
jeff_keasler
Beginner
167 Views

From styc: Seriously, a handful of extra bytes of code can cause what you put in the brackets?


On a bunch of back-to-back loops working on arrays containing multiple disjoint segments of 64 doubles, especially in the presence of a large number of cores on the same die? Absolutely!!! It is not as big a problem now, but by the time there are 32 or more cores per die, this is a lot of bus traffic if the cores are not working in lock step.

-Jeff
jeff_keasler
Beginner
167 Views
Quoting jeff_keasler
...
Now if we could just suppress the prologue and epilogue, we would be set (all the prologue and epilogue checking opcodes do is disrupt the hardware prefetch, add more branches, clutter the icache, increase bus traffic, and reduce the potential size of the algorithm that will fit entirely in icache).

Thanks again!
-Jeff


1. Did you try the '__fastcall' calling convention (supported on Windows only; option /Gr)?

2. I recommend you look at the MSDN article "Considerations for Writing Prolog/Epilog Code".

3. I use the __declspec( naked ) attribute on declarations of some functions in order to make them as small as possible, but this is Microsoft-specific.

Best regards,
Sergey

Sorry, I was referring to the vectorization prologue/epilogue on each loop rather than the function prologue/epilogue. I can actually get rid of the vectorization prologue with proper alignment directives and attributes, but I can't get rid of the vectorization epilogue, no matter what I do. A vectorization epilogue is a code fragment that handles the case where fewer than a SIMD vector's worth of floating-point operations remain at the tail end of my array segment. Since my lower and upper loop bounds are multiples of the SIMD vector length, it is impossible for this condition to occur.

Thanks,

-Jeff
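
To make the term concrete, the split a vectorizer typically produces can be written out by hand as in the sketch below: a main loop over whole vectors followed by a scalar remainder loop. This is an illustration of the general shape only, not the code icpc actually emits; the names square_with_remainder and SIMD_WIDTH are hypothetical, and the function returns the number of remainder iterations purely so the effect can be observed.

```c
enum { SIMD_WIDTH = 4 };  /* doubles per AVX vector */

/* Squares y into x over [low, high) and returns how many elements were
 * handled by the scalar remainder ("epilogue") loop. */
int square_with_remainder(double *restrict x, const double *restrict y,
                          int low, int high)
{
    int n = high - low;
    int main_end = low + (n / SIMD_WIDTH) * SIMD_WIDTH;
    int i;

    /* Main loop: whole vectors only. */
    for (i = low; i < main_end; i += SIMD_WIDTH) {
        for (int j = 0; j < SIMD_WIDTH; ++j) {
            x[i + j] = y[i + j] * y[i + j];
        }
    }

    /* Remainder loop: runs only when (high - low) is not a multiple of
     * SIMD_WIDTH. */
    int tail = 0;
    for (; i < high; ++i) {
        x[i] = y[i] * y[i];
        ++tail;
    }
    return tail;
}
```

When low and high are both multiples of the vector width, the remainder loop executes zero iterations, which is why it is dead code in the situation described above.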

jeff_keasler
Beginner
167 Views
Jennifer,

Thank you.

-Jeff