I have many loops in my code interacting on multiple arrays, all aligned on 32 byte boundaries. Here is a trivial example for illustration:
void foo (double * __restrict__ x, double * __restrict__ y, int low, int high)
{
__assume_aligned(x, 32) ;
__assume_aligned(y, 32) ;
for (int i=low; i
}
}
The above code generates the code below on an AVX supporting architecture (just a snippet here for core kernel, no prologue or epilogue shown):
..B1.69: # Preds ..B1.60 ..B1.58 ..B1.69 # Infreq
vmovupd 32(%r8,%rcx,8), %ymm1 #517.7
vmovupd (%r8,%rcx,8), %ymm0 #517.7
vmulpd %ymm1, %ymm1, %ymm3 #517.7
vmulpd %ymm0, %ymm0, %ymm2 #517.7
vmovupd %xmm2, (%rdi,%rcx,8) #517.7
vmovupd %xmm3, 32(%rdi,%rcx,8) #517.7
vextractf128 $1, %ymm2, 16(%rdi,%rcx,8) #517.7
vextractf128 $1, %ymm3, 48(%rdi,%rcx,8) #517.7
vmovupd 96(%r8,%rcx,8), %ymm5 #517.7
vmovupd 64(%r8,%rcx,8), %ymm4 #517.7
vmulpd %ymm5, %ymm5, %ymm7 #517.7
vmulpd %ymm4, %ymm4, %ymm6 #517.7
vmovupd %xmm6, 64(%rdi,%rcx,8) #517.7
vmovupd %xmm7, 96(%rdi,%rcx,8) #517.7
vextractf128 $1, %ymm6, 80(%rdi,%rcx,8) #517.7
vextractf128 $1, %ymm7, 112(%rdi,%rcx,8) #517.7
addq $16, %rcx #517.7
cmpq %rdx, %rcx #517.7
jb ..B1.69 # Prob 82% #517.7
In my actual code, the loop bound values 'low' and 'high' are only known at runtime, but I know they are a multiple of the SIMD vector size (2 for SSE, 4 for current AVX).
I want to avoid ugly/bad opcodes such as the vextractf128 on AVX, and I prefer to have movaps opcodes generated in place of movups (or worse) for the other architectures I use.
Since all my arrays are aligned, I am hoping to avoid all prologue and epilogue code generation by using better directives. I also want to avoid multiple variants of the loop being generated due to the fact that some arrays may be aligned differently than others. Basically, I want to guarantee exactly one variant of the loop be generated with no conditional logic around the core loop body (other than the principal iteration comparison and conditional jump).
Question 1: Are there directives or attributes to specify that the 'low' and 'high' variables in the loops above are multiples of the SIMD width?
Question 2: Are the optimizations in icpc currently powerful enough to take these directives and keywords into account to generate the tightest code possible?
I consider these optimizations extremely important going forward.
Thank you,
-Jeff Keasler
More info on "pragma simd" is here:
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/mac/cref_cls/common/cppref_pragma_simd.htm
There are also other pragmas to try besides "pragma vector", such as "#pragma loop_count": http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/cref_cls/common/cppref_bk_pragmas_intel_specific_ref.htm
Jennifer
"#pragma vector aligned" should be sufficient performance-wise. Some extra code will still be generated but not executed anyway. To remove such extra code you can manually strip-mine the loop by a factor of four. But that seems to interfere with loop unrolling. At the end of the day you can always use intrinsics.
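For readers who have not seen manual strip-mining, here is a minimal sketch of what it might look like for a loop of this kind. The kernel body (squaring each element), the function name `square_stripmined`, and the constant `VLEN` are assumptions made for illustration, not the original poster's code. Whether a given compiler then emits a single aligned vector operation per outer iteration has to be checked in the generated assembly.

```c
#include <assert.h>

enum { VLEN = 4 };  /* doubles per 256-bit AVX register (assumed width) */

/* Hypothetical kernel: y[i] = x[i] * x[i] for low <= i < high.
   Assumes low and high are multiples of VLEN, as stated in the question. */
void square_stripmined(const double * __restrict__ x, double * __restrict__ y,
                       int low, int high)
{
    assert(low % VLEN == 0 && high % VLEN == 0);

    /* Outer loop advances one whole vector at a time, so no scalar
       remainder iterations can occur when the precondition holds. */
    for (int i = low; i < high; i += VLEN) {
        /* Inner loop has a compile-time trip count of VLEN, which a
           vectorizer may turn into one vector load, multiply and store. */
        for (int j = 0; j < VLEN; ++j) {
            y[i + j] = x[i + j] * x[i + j];
        }
    }
}
```

With the Intel compiler, "#pragma vector aligned" would go directly in front of the loop being vectorized; it is left out of the sketch so that the code stays portable.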
Hi,
This got rid of the vextractf128 instructions. Thanks! Now if we could just suppress the prologue and epilogue, we would be set (all the prologue and epilogue checking opcodes do is disrupt the hardware prefetch, add more branches, clutter the icache, increase bus traffic, and reduce the potential size of the algorithm that will fit entirely in icache).
Thanks again!
-Jeff
Seriously, a handful of extra bytes of code can cause everything you put in the brackets?
1. Did you try the '__fastcall' calling convention (supported on Windows only; option /Gr)?
2. I recommend looking at the MSDN article "Considerations for Writing Prolog/Epilog Code".
3. I use a _declspec( naked ) attribute on declarations of some functions in order to make them as small as possible, but this is Microsoft-specific.
Best regards,
Sergey
Another, more generic and platform-independent strategy to remove the call overhead, and hence the function prologue/epilogue overhead, would be to force inlining of foo(...).
Just in case, a starting point regarding inlining #pragmas:
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/cref_cls/common/cppref_pragma_std.htm
Best regards,
Georg Zitzlsberger
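As a rough illustration of forced inlining, the sketch below uses the GCC/Clang `always_inline` attribute rather than an Intel-specific pragma; that substitution, the helper name `square_range`, its squaring body, and the caller `sum_of_squares` are all assumptions made for the example. Note that this only addresses function call overhead and the function-level prologue/epilogue, not any extra loops the vectorizer generates.

```c
#include <assert.h>

/* Hypothetical helper. always_inline asks GCC/Clang to inline the body
   at every call site, so no call/return sequence is emitted for it. */
static inline __attribute__((always_inline))
void square_range(const double * __restrict__ x, double * __restrict__ y,
                  int low, int high)
{
    assert(low <= high);
    for (int i = low; i < high; ++i) {
        y[i] = x[i] * x[i];
    }
}

/* Hypothetical caller: the loop from square_range is expanded in place
   here. x and scratch must not overlap. */
double sum_of_squares(const double *x, double *scratch, int n)
{
    square_range(x, scratch, 0, n);

    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += scratch[i];
    }
    return sum;
}
```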
From styc: Seriously, a handful of extra bytes of code can cause everything you put in the brackets?
On a bunch of back-to-back loops working on arrays containing multiple disjoint segments of 64 doubles, especially in the presence of a large number of cores on the same die? Absolutely!!! It is not as big a problem now, but by the time there are 32 or more cores per die, this is a lot of bus traffic if the cores are not working in lock step.
-Jeff
Sorry, I was referring to the vectorization prologue/epilogue on each loop rather than the function prologue/epilogue. I can actually get rid of the prologue with proper alignment directives and attributes, but I can't get rid of the vectorization epilogue, no matter what I do. A vectorization epilogue is a code fragment that handles the case where fewer than a SIMD vector's worth of floating point operations remain at the tail end of my array segment. Since my lower and upper loop bounds are multiples of the SIMD vector length, it is impossible for this condition to occur.
Thanks,
-Jeff
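One portable way to make the "trip count is a multiple of the vector length" fact visible to an optimizer is to mask the low bits of the trip count before the loop. The sketch below shows the idea; the function name `square_whole_vectors`, the constant `VLEN`, and the squaring body are assumptions for illustration. It is not a documented Intel directive, and whether a particular compiler actually drops the remainder loop as a result can only be confirmed by inspecting the assembly.

```c
#include <assert.h>

enum { VLEN = 4 };  /* doubles per 256-bit AVX register (assumed width) */

/* Hypothetical kernel: y[i] = x[i] * x[i] for low <= i < high,
   where (high - low) is known by the caller to be a multiple of VLEN. */
void square_whole_vectors(const double * __restrict__ x,
                          double * __restrict__ y, int low, int high)
{
    assert(low <= high && (high - low) % VLEN == 0);

    /* Clearing the low bits does not change the value when the
       precondition holds, but it lets the compiler prove that n is a
       multiple of VLEN (VLEN must be a power of two for this mask). */
    const int n = (high - low) & ~(VLEN - 1);

    const double *xs = x + low;
    double *ys = y + low;
    for (int i = 0; i < n; ++i) {
        ys[i] = xs[i] * xs[i];
    }
}
```

If the precondition is ever violated, the mask silently skips the trailing elements, so the assert is worth keeping in debug builds.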
Thank you.
-Jeff
