Craig, sorry for the late

CFR · ‎10-10-2015

I'm seeking clarification on how (and when) to properly use __intel_simd_lane(). I'm trying to understand how best to write SIMD functions that are portable across different intel architectures (i.e. different hardware SIMD length). Here is my toy example (retyped so there might be small errors):

#include <omp.h>
#define M 16

#pragma omp declare simd uniform(x,j), linear(lane)
unsigned int
add2a(unsigned int *x, unsigned int lane, unsigned in j)
{
  return x[lane] += j;
}

#pragma omp declare simd uniform(x,j)
unsigned int
add2b(unsigned int *x, unsigned int j)
{
  return x[__intel_simd_lane()] += j;
}

#pragma omp declare simd linear(x) uniform(j)
unsigned int
add2c(unsigned int *x, unsigned int j)
{
  return *x += j;
}

#include <stdio.h>
#include <string.h>
int
main(int argc, char *argv[])
{
  unsigned int x = {0};
  unsigned int y;

  memcpy(y,x,M*(sizeof(y[0]);
#pragma omp simd
  for (int j=0; j<M; j++) add2a(y,j,1);
  for (int j=0; j<M; j++) printf("%d ", y);
  printf("\n");

  memcpy(y,x,M*(sizeof(y[0]);
#pragma omp simd
  for (int j=0; j<M; j++) add2b(y,1);
  for (int j=0; j<M; j++) printf("%d ", y);
  printf("\n");

}

add2a is taken from Example 10 in https://software.intel.com/en-us/intel-parallel-universe-magazine (issue 22) but I don't like the idea of having to modify the argument list to make a SIMD function. add2b is based on the C compiler 16.0 documentation for __intel_simd_lane(). (add2c is the way I probably would have written it.) So, the issue is that these are not all the same. Compile with icc -std=c99 -O3 -xHost simdlane.c -qopenmp and you get the following (on an E5-2690 Sandy Bridge).

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1   <- add2a
4 4 4 4 0 0 0 0 0 0 0 0 0 0 0 0   <- add2b

If you add simdlen(M) to the add2b loop pragma, things start working right for M=2,4,8,16,32,64. Any other values (non power of 2, M>64) and only the first 4 elements are set (and incorrect). I guess I understand what's going on, but it seems there's a lot of subtle behavior that I would need to take into account to use it in portable code and that would make my code more complicated, not less.

So, have I misunderstood __intel_simd_lane()? Is it useful for portable code? When is it the right thing to use and how should it be used?

Thanks.

Hideki_I_Intel · ‎10-23-2015

Craig, sorry for the late response. I was notified about this post yesterday.

The behavior you are seeing is expected from __intel_simd_lane().
In this test case, the only thing you should expect is sum(y[:]) to be 16.
[Given that the SIMD loop trip count is constant value 16, vector length
for the loop will not exceed 16 and as such memory references will not
go out of bounds. Otherwise, you'd want to explicitly specify simdlen()
to #pragma omp simd (See OpenMP4.1 draft at www.openmp.org)
so as to limit the vectorlength within the allocated memory buffer size.
CilkPlus simd directive also has vectorlength clause.]

1) The omp declare simd functions are not marked force inline or noinline.
    You are leaving the decision to the compiler ---- therefore do not
    assume the vectorlength determined by "omp declare simd" (due to
    vector function ABI) is applicable inside add2a()/add2b. For inlined
    call sites, "omp declare simd" does not have any effects. See IPO
    optimization report for inlining results.
2) If the compiler does not inline add2a()/add2b(), compiler should
    choose vectorlength of 4 according to vector function ABI.
    Vectorized call to add2b() --- _ZGVxN4uu_add2b() after name-mangling ---
    would access y[0], y[1], y[2], y[3].
    Scalar call to add2b(), if any is made (e.g. in scalar remainder loop from vectorization),
    would access y[0].
3) If the compiler inlines add2b() to the caller loop, vector length decision to be used
    in interpreting __intel_simd_lane() is left to the compiler (see the top part of this reply
    on how to explicitly specify vectorlength). It would follow the vector length decided for the
    caller loop. For obvious reasons, vectorizer will not use vector length of 1.
    We have practical limits imposed on the supported vectorlength based on our target ISA.
    Current upper limit of 64 matches the number of 8bit (e.g., char) data that can fit in one ZMM register.
    [AVX512F ISA extension defines ZMM register.]
4) Here are some examples of valid execution scenarios, for illustration purposes only. It would take
    a bit of efforts to actually reproduce some of these behaviors, though --- very careful contrived code
    writing may be needed.
    a) serial execution. y[0] becomes 16 and all other elements are zero.
    b) compiler decides not to peel, uses vector length of 4. In this case, y[0]=y[1]=y[2]=y[4]=4. All other elements are zero.
    c) compiler decides to peel off one iteration, uses vector length of 4 and 3 iterations executed in remainder.
        In this case, y[0] becomes 7. y[1]=y[2]=y[3]=3. All other elements are zero. [Note: compiler may peel up to one element less than
        one full vector register worth.]
    d) compiler decides not to peel, uses vector length of 8. In this case, y=2 (k=0 to 7). All other elements are zero.
    e) compiler decides to peel off one iteration, uses vector length of 8 and 7 iterations executed in remainder.
        In this case, y[0] becomes 9. y=1 (k=1 to 7). All other elements are zero.
5) To make it easier to retarget to different vectorlength, some people define VLEN macro in a header file
     and consistently use it wherever vectorlength consistency is required.

A[__intel_simd_lane()] memory reference is useful when a programmer wants to reduce many number of elements to
vectorlength (implementation dependent or programmer specified) number of elements. By default, actual distribution within
the reduced array is left to the compiler and runtime execution. If the programmer requires even distribution,
careful coding is needed to enforce such evenness. __intel_simd_lane() provides the evenness within vectorized context
(i.e., main vector loop, vectorized peel loop, vectorized remainder loop, and vectorized function) but not in scalar context
(always return zero).

I do not know whether this is a useful feature for your particular application, but we know there are people
who are using it effectively.

Thanks. Hope this helps.
Hideki

KitturGanesh · ‎10-23-2015

@Hideki: Thanks for the detailed write up to Craig's question, Hideki.

Hi Craig,
I discussed your question with Hideki (our vectorization expert) so your important question (on this new feature) is responded to appropriately and Hideki took the initiative to also respond as well! Hope this helps and feel free to let us know if you need any further clarification as well, appreciate much.
_Kittur

semantics and use of __intel_simd_lane w/ SIMD functions?