I'm seeking clarification on how (and when) to properly use __intel_simd_lane(). I'm trying to understand how best to write SIMD functions that are portable across different Intel architectures (i.e., different hardware SIMD lengths). Here is my toy example (retyped, so there might be small errors):
#include <omp.h>
#include <stdio.h>
#include <string.h>

#define M 16

#pragma omp declare simd uniform(x,j) linear(lane)
unsigned int add2a(unsigned int *x, unsigned int lane, unsigned int j)
{
    return x[lane] += j;
}

#pragma omp declare simd uniform(x,j)
unsigned int add2b(unsigned int *x, unsigned int j)
{
    return x[__intel_simd_lane()] += j;
}

#pragma omp declare simd linear(x) uniform(j)
unsigned int add2c(unsigned int *x, unsigned int j)
{
    return *x += j;
}

int main(int argc, char *argv[])
{
    unsigned int x[M] = {0};
    unsigned int y[M];

    memcpy(y, x, M * sizeof(y[0]));
    #pragma omp simd
    for (int j = 0; j < M; j++)
        add2a(y, j, 1);
    for (int j = 0; j < M; j++)
        printf("%d ", y[j]);
    printf("\n");

    memcpy(y, x, M * sizeof(y[0]));
    #pragma omp simd
    for (int j = 0; j < M; j++)
        add2b(y, 1);
    for (int j = 0; j < M; j++)
        printf("%d ", y[j]);
    printf("\n");
}
add2a is taken from Example 10 in https://software.intel.com/en-us/intel-parallel-universe-magazine (issue 22), but I don't like the idea of having to modify the argument list to make a function SIMD-capable. add2b is based on the C compiler 16.0 documentation for __intel_simd_lane(). (add2c is how I probably would have written it.) The issue is that these do not all produce the same result. Compile with icc -std=c99 -O3 -xHost simdlane.c -qopenmp and you get the following (on an E5-2690 Sandy Bridge):
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1   <- add2a
4 4 4 4 0 0 0 0 0 0 0 0 0 0 0 0   <- add2b
If you add simdlen(M) to the add2b loop pragma, things work correctly for M=2,4,8,16,32,64. For any other value (non-power-of-2, or M>64), only the first 4 elements are set (and incorrectly). I think I understand what's going on, but there is a lot of subtle behavior I would need to take into account to use this in portable code, and that would make my code more complicated, not less.
So, have I misunderstood __intel_simd_lane()? Is it useful for portable code? When is it the right thing to use and how should it be used?
Thanks.
Craig, sorry for the late response. I was notified about this post yesterday.
The behavior you are seeing is expected from __intel_simd_lane().
In this test case, the only thing you should expect is that sum(y[:]) is 16.
[Given that the SIMD loop trip count is the constant 16, the vector length
for the loop will not exceed 16, so memory references will not go out of
bounds. Otherwise, you would want to specify simdlen() explicitly on
#pragma omp simd (see the OpenMP 4.1 draft at www.openmp.org) so as to
limit the vector length to the allocated memory buffer size. The Cilk Plus
simd directive also has a vectorlength clause.]
1) The omp declare simd functions are not marked forceinline or noinline;
you are leaving the decision to the compiler. Therefore, do not assume the
vector length determined by "omp declare simd" (via the vector function ABI)
applies inside add2a()/add2b(). For inlined call sites, "omp declare simd"
has no effect. See the IPO optimization report for inlining results.
2) If the compiler does not inline add2a()/add2b(), it should choose a
vector length of 4 according to the vector function ABI.
A vectorized call to add2b() --- _ZGVxN4uu_add2b() after name mangling ---
would access y[0], y[1], y[2], y[3].
A scalar call to add2b(), if any is made (e.g., in a scalar remainder loop
from vectorization), would access y[0].
3) If the compiler inlines add2b() into the caller loop, the vector length
used in interpreting __intel_simd_lane() is left to the compiler (see the
top of this reply for how to specify the vector length explicitly); it will
follow the vector length decided for the caller loop. For obvious reasons,
the vectorizer will not use a vector length of 1.
We impose practical limits on the supported vector length based on our
target ISA. The current upper limit of 64 matches the number of 8-bit
(e.g., char) elements that fit in one ZMM register.
[The AVX-512F ISA extension defines the ZMM registers.]
4) Here are some examples of valid execution scenarios, for illustration
purposes only. It would take some effort to actually reproduce some of
these behaviors --- very carefully contrived code may be needed.
a) Serial execution: y[0] becomes 16 and all other elements are zero.
b) The compiler does not peel and uses a vector length of 4. In this case, y[0]=y[1]=y[2]=y[3]=4. All other elements are zero.
c) The compiler peels off one iteration, uses a vector length of 4, and executes 3 iterations in the remainder.
In this case, y[0] becomes 7 and y[1]=y[2]=y[3]=3. All other elements are zero. [Note: the compiler may peel up to one element less than
one full vector register's worth.]
d) The compiler does not peel and uses a vector length of 8. In this case, y[0] through y[7] each become 2. All other elements are zero.
e) The compiler peels off one iteration, uses a vector length of 8, and executes 7 iterations in the remainder.
In this case, y[0] becomes 9 and y[1] through y[7] each become 1. All other elements are zero.
5) To make it easier to retarget to a different vector length, some people
define a VLEN macro in a header file and use it consistently wherever
vector-length consistency is required.
An A[__intel_simd_lane()] memory reference is useful when a programmer
wants to reduce a large number of elements down to vector-length-many
(implementation dependent or programmer specified) elements. By default,
the actual distribution within the reduced array is left to the compiler
and runtime execution. If the programmer requires an even distribution,
careful coding is needed to enforce that evenness. __intel_simd_lane()
provides the evenness within a vectorized context (i.e., the main vector
loop, a vectorized peel loop, a vectorized remainder loop, or a vectorized
function) but not in a scalar context (where it always returns zero).
I do not know whether this is a useful feature for your particular application, but we know there are people
who are using it effectively.
Thanks. Hope this helps.
Hideki
@Hideki: Thanks for the detailed write-up in response to Craig's question.
Hi Craig,
I discussed your question with Hideki (our vectorization expert) so that your important question on this new feature would get an appropriate answer, and Hideki took the initiative to respond directly as well. Hope this helps, and feel free to let us know if you need any further clarification.
_Kittur
