YZhan69
Beginner
237 Views

crash using GCC style vector types

I'm seeing a strange compiler crash when using GCC-style vector types. I'm using vector types for loads and stores because loading with _mm_loadu_si128 seems to ignore the restrict qualifier, which causes constant values to be reloaded from memory. I'll make another post for that.

My function calculates the mean and standard deviation of an array. The clever part is reusing the loop body to process the remainder, which reduces the cache footprint. But it seems to be this complex control flow that causes the crash: if I comment out the goto handleRemainder or the innermost do-while loop, it compiles. And of course, using __m128i instead of vector types makes the crash go away.

The crash happens in both ICC 14 & the latest ICC 17. I'd appreciate a reasonable workaround, explanation, or patch.

#include <immintrin.h>
#include <stdint.h>
#include <unistd.h>
#include <math.h>
#include <algorithm>
using namespace std;

#define CAST_VSHORT(x) x
#define ROUND_DOWN(a, b) (a & (~(b - 1)))
#define MAX_INTENSITY 4096
#define FORCE_INLINE inline __attribute__ ((always_inline))

#if 1
// crashes
typedef int32_t __attribute__((vector_size(16))) VINT;
typedef int16_t __attribute__((vector_size(16))) VSHORT;
typedef int16_t __attribute__((vector_size(16), aligned(1))) UNALIGNED_VSHORT;
#else
typedef __m128i VINT;
typedef __m128i VSHORT;
#endif

FORCE_INLINE __m128i PartialVectorMask(ssize_t n)
{
  return _mm_set1_epi16(0xffff);   // incomplete for brevity
}

FORCE_INLINE int64_t VectorSum(VINT x)
{
  __m128i lo = _mm_cvtepi32_epi64(x),
          hi = _mm_cvtepi32_epi64(_mm_srli_si128(x, 8));
  __m128i sum = _mm_add_epi64(lo, hi);
  return _mm_extract_epi64(_mm_add_epi64(sum, _mm_srli_si128(sum, 8)), 0);
}

void CalculateMeanAndStdev(float &mean, float &stdev,
                           int16_t *in, ssize_t size)
{
  ssize_t i;
  double sum = 0, squareSum = 0;
  VINT zero = _mm_set1_epi32(0),
    vSquareSum = zero,
    vSum = zero;
    VSHORT data;
    ssize_t blockEnd;
    const ssize_t VECTOR_WIDTH = 8;
    // elements you can accumulate before square sum can overflow
    const ssize_t BLOCK_SIZE = ROUND_DOWN((UINT32_MAX / ((MAX_INTENSITY - 1) * (MAX_INTENSITY - 1))) * 4, VECTOR_WIDTH);
    ssize_t roundedSize = ROUND_DOWN(size, VECTOR_WIDTH);
    for (i = 0; i <= size - VECTOR_WIDTH; )
    {
      blockEnd = min(i + BLOCK_SIZE, roundedSize);
      // process a block whose size is a multiple of 8, except when processing the SIMD remainder
      do
      {
          data = _mm_loadu_si128((__m128i *)&in[i]);
          //data = *(UNALIGNED_VSHORT *)&in[i];
      handleRemainder:
          VINT unpacked0 = _mm_srai_epi32(_mm_unpacklo_epi16(data, data), 16),
               unpacked1 = _mm_srai_epi32(_mm_unpackhi_epi16(data, data), 16);

          vSquareSum = _mm_add_epi32(_mm_madd_epi16(data, data), vSquareSum);
          vSum = _mm_add_epi32(unpacked0, vSum);
          vSum = _mm_add_epi32(unpacked1, vSum);
          i += VECTOR_WIDTH;
      } while (i < blockEnd);

      squareSum += VectorSum(vSquareSum);
      sum += VectorSum(vSum);
      vSum = zero;
      vSquareSum = zero;
  }
  if (i < size)
  {
      // handle remainder by setting invalid elements to 0
      data = _mm_and_si128(_mm_loadu_si128((__m128i *)&in[i]), PartialVectorMask((size % VECTOR_WIDTH) * sizeof(int16_t)));
      blockEnd = size;
      goto handleRemainder;     // share code to reduce machine code size
  }
  mean = sum / size;
  stdev = sqrtf((squareSum - sum * sum / size) / (size - 1));
}

int main()
{
  const size_t N = 4096;
  int16_t __attribute__((aligned(16))) image[N];
  float mean, stdev;
  for (int i = 0; i < 1000000; ++i)
  {
    CalculateMeanAndStdev(mean, stdev, image, N);
  }
  return mean;
}
6 Replies
jimdempseyatthecove
Black Belt

While the compiler shouldn't crash, you might take a look at:

https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

Try using:

#if 0
// crashes
typedef int32_t __attribute__((vector_size(16))) VINT;
typedef int16_t __attribute__((vector_size(16))) VSHORT;
typedef int16_t __attribute__((vector_size(16), aligned(1))) UNALIGNED_VSHORT;
#else
// haven't tried this, you can
typedef int32_t VINT __attribute__((vector_size(16)));
typedef int16_t VSHORT __attribute__((vector_size(16)));
typedef int16_t UNALIGNED_VSHORT __attribute__((vector_size(16), aligned(1)));
#endif

In the format shown in the GCC documentation, the attribute follows the new type name.

Jim Dempsey

YZhan69
Beginner

Jim, I tried your suggestion, but it still crashes :( I think the __attribute__ specifier can come either before or after the new type name.
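
As a quick check of the claim about attribute placement, here is a minimal sketch that declares the same vector type both ways and asks the compiler whether they match. The names VintPrefix and VintSuffix are made up for this illustration, and the sketch assumes a compiler that implements the GCC vector extensions (GCC or Clang); it says nothing about how ICC handles either spelling.

```cpp
#include <cstdint>
#include <type_traits>

// Attribute placed before the typedef name, as in the original post.
typedef int32_t __attribute__((vector_size(16))) VintPrefix;

// Attribute placed after the typedef name, as in the GCC manual's examples.
typedef int32_t VintSuffix __attribute__((vector_size(16)));

// Both should be 16-byte vectors of four 32-bit lanes.
static_assert(sizeof(VintPrefix) == 16, "expected a 16-byte vector");
static_assert(sizeof(VintSuffix) == 16, "expected a 16-byte vector");

// Assumption being tested: the two spellings name the same type.
static_assert(std::is_same<VintPrefix, VintSuffix>::value,
              "both spellings should name the same vector type");
```

If the static_asserts pass, the placement of the attribute is unlikely to be what triggers the crash, which matches the result reported above.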

 

Yuan_C_Intel
Employee

Hi, Joe

Could you report the issue in cloud service at: http://www.intel.com/supporttickets ?

Thanks.

SergeyKostrov
Valued Contributor II

>> ...And of course, using __m128i instead of vector types makes the crash go away. The crash happens in both ICC 14 & the latest ICC 17...

What compiler options did you use?
YZhan69
Beginner

I just compiled with the default (no arguments) ICC options. To compile it with GCC (no crash), I need -msse4 -flax-vector-conversions.
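
A likely reason GCC wants -flax-vector-conversions here is that the code assigns __m128i values (which GCC's headers define as a vector of two long long) directly to vectors with a different element type. The sketch below shows the explicit-cast alternative using only the generic vector extensions; V2x64, V8x16, AsShorts and SameBits are hypothetical names, not part of the original code, and the snippet assumes GCC or Clang.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins: V2x64 mirrors the element layout GCC uses for
// __m128i, V8x16 mirrors the VSHORT type from the post.
typedef long long V2x64 __attribute__((vector_size(16)));
typedef int16_t   V8x16 __attribute__((vector_size(16)));

// An explicit cast between same-sized vector types reinterprets the bits
// and does not rely on -flax-vector-conversions.
inline V8x16 AsShorts(V2x64 v)
{
    return (V8x16)v;
}

// Helper for checking that the cast did not change any of the 128 bits.
inline bool SameBits(V2x64 a, V8x16 b)
{
    return std::memcmp(&a, &b, sizeof a) == 0;
}
```

With casts like this at each boundary between intrinsics and the typed vectors, the GCC build should only need -msse4; whether that changes anything for ICC is untested.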

 

YZhan69
Beginner

I've submitted the problem to the compiler team and it was confirmed that there was a code generation bug and that it will be fixed in the next service pack.

"Compiler optimization is broken here as it is assuming the loop containing vector intrinsics has only one entrance."

Glad this bug was fixed.
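
For anyone stuck on an unpatched compiler: given the explanation quoted above, one possible workaround is to give the loop a single entrance by moving the shared body into a helper and calling it from both the main loop and the remainder path, instead of jumping into the loop with goto. The following is a minimal scalar sketch of that control flow only; Sums, AccumulateLanes and MeanAndStdev are hypothetical names, the SIMD body and overflow blocking from the original are left out, and it has not been tried against the affected ICC versions.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical accumulator; the original keeps these in vector registers.
struct Sums { double sum = 0, squareSum = 0; };

const std::size_t kLanes = 8;  // int16_t elements per 128-bit vector

// Shared loop body: accumulate one vector's worth of samples.
static inline void AccumulateLanes(const int16_t *lanes, Sums &s)
{
    for (std::size_t k = 0; k < kLanes; ++k) {
        s.sum += lanes[k];
        s.squareSum += double(lanes[k]) * lanes[k];
    }
}

void MeanAndStdev(float &mean, float &stdev, const int16_t *in, std::size_t size)
{
    Sums s;
    std::size_t i = 0;
    // Single-entry loop over all complete vectors.
    for (; i + kLanes <= size; i += kLanes)
        AccumulateLanes(in + i, s);
    // Remainder: zero-pad into a temporary and run the same body once.
    if (i < size) {
        int16_t tail[kLanes] = {0};
        std::memcpy(tail, in + i, (size - i) * sizeof(int16_t));
        AccumulateLanes(tail, s);
    }
    mean = float(s.sum / size);
    stdev = std::sqrt(float((s.squareSum - s.sum * s.sum / size) / (size - 1)));
}
```

The trade-off is that an inlined helper duplicates the body, so the code-size saving that motivated the goto is lost unless the helper is kept out of line, at the cost of a call per vector.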

 
