Solved: icpc memory alignment - unexpected output

Massimiliano_B_3 · ‎09-16-2016

Hi all,

I would like to understand the behavior of this small piece of code that I have extracted from a bigger application that makes use of vectorization and simd instructions.
Please disregard the design: it is inherited from my original code and I am keeping it as is to reproduce the anomaly, although I agree that it makes little sense in this small context. I'm following the guidelines described here about the alignment.

I have the following Dummy class.
dummy.h

#ifdef __INTEL_COMPILER
  typedef double * __restrict__ Real_ptr __attribute__((align_value(32)));
  typedef const double * const __restrict__ ConstReal_ptr __attribute__((align_value(32)));
#else
  typedef double * __restrict__ Real_ptr __attribute__((aligned(32)));
  typedef const double * const __restrict__ ConstReal_ptr __attribute__((aligned(32)));
#endif

class Dummy {
public:
   virtual void calculate( const unsigned int n, ConstReal_ptr x, ConstReal_ptr y, Real_ptr z ) const;
private:
   double computeSingleValue( const double x, const double y ) const;
};

dummy.cpp

#include "dummy.h"
#include <algorithm>

static const double K = 10.0;

void Dummy::calculate( const unsigned int n, ConstReal_ptr x, ConstReal_ptr y, Real_ptr z ) const
{
   for( unsigned int i = 0; i < n; ++i)
   {
    z[i] = computeSingleValue( x[i], y[i] );
   }
}

double Dummy::computeSingleValue( const double x, const double y ) const
{
   return std::max(K, (x >= y) ? x : y);
}

The main function tests the calculate method and prints a message if the output differs from the expected values. The main.cpp is the following:

#include "dummy.h"
#include <cassert>
#include <cmath>
#include <iostream>
#include <stdlib.h>

int main()
{
   const unsigned int N = 4;
   
   Real_ptr x;
   assert( 0 == posix_memalign ( (void **)&x, 32, sizeof ( double ) * N ) );
   x[0] = 0.0;
   x[1] = 10.0;
   x[2] = 100.0;
   x[3] = 1000.0;
   
   Real_ptr y;
   assert( 0 == posix_memalign ( (void **)&y, 32, sizeof ( double ) * N ) );
   y[0] = 0.0;
   y[1] = 10.0;
   y[2] = 100.0;
   y[3] = 1000.0;
   
   Real_ptr z;
   assert( 0 == posix_memalign ( (void **)&z, 32, sizeof ( double ) * N ) );
   z[0] = 0.0;
   z[1] = 0.0;
   z[2] = 0.0;
   z[3] = 0.0;

   Dummy obj;
   obj.calculate( N, x, y, z );
   if( std::abs(10.0   - z[0])> 1.0E-18 ) { std::cout << "FAIL 0: z = " << z[0] << std::endl; };
   if( std::abs(10.0   - z[1])> 1.0E-18 ) { std::cout << "FAIL 1: z = " << z[1] << std::endl; };
   if( std::abs(100.0  - z[2])> 1.0E-18 ) { std::cout << "FAIL 2: z = " << z[2] << std::endl; };
   if( std::abs(1000.0 - z[3])> 1.0E-18 ) { std::cout << "FAIL 3: z = " << z[3] << std::endl; };

   free(x);
   free(y);
   free(z);
}

Now, I'm trying to compile it with -O2 and the following compilers:

g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)
icpc (ICC) 16.0.3 20160415

With GCC everything works fine and the result is as expected, while with the Intel compiler the values in the last two elements of the z array are wrong and the output of the program is

FAIL 2: z = 10
FAIL 3: z = 10

The thing that puzzles me, apart from the compiler dependency, is that if I do one of the following things I can get the correct output:

• decrease the optimization to -O1 or -O0
• move all the source code into a single translation unit
• replace z[i] = computeSingleValue( x[i], y[i] ); with z[i] = std::max(K, (x[i] >= y[i]) ? x[i] : y[i]); in dummy.cpp
• add a std::cout << std::endl; in the body of computeSingleValue in dummy.cpp
• remove the __restrict__ keyword from the ConstReal_ptr typedef

I'm probably doing something wrong, but I don't get it. Any help would be really appreciated.

Thanks in advance and regards,

Massi

Yuan_C_Intel · ‎09-18-2016

Hi, Massimiliano

Thank you for raising the issue with a reproducer!

I have reproduced the issue you reported and entered it in our problem tracking system for a resolution.

Sorry for any inconvenience. I will let you know when I have an update on this issue.

Thanks.

SergeyKostrov · ‎09-18-2016

It could take some time before a fix is released, so I think you should consider a workaround for your main application based on __builtin_assume_aligned. Here is an example:

template < class T > _RTINLINE RTvoid TestFunction( T ** _RTRESTRICT pptA )
{
   ...
   _RTALIGNED T **pptA2 = ( T ** )__builtin_assume_aligned( pptA, _RTDEFAULT_ALIGNMENT );
   ...
}

Also, try to use the intrinsic functions _mm_malloc / _mm_free (if possible) instead of the POSIX functions posix_memalign / free.
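For readers who do not have the _RT* macros used above, here is a minimal sketch of the same idea with plain types. The helper names scaleAligned and allocAligned32 are made up for illustration and are not from the thread; the sketch assumes a compiler that provides __builtin_assume_aligned (GCC, Clang and icpc do) and a POSIX system for posix_memalign. It only shows how the alignment promise is expressed, not a fix for the reported issue.

```cpp
#include <cassert>
#include <cstdint>    // std::uintptr_t
#include <stdlib.h>   // posix_memalign, free

// Hypothetical helper: multiplies n doubles in place and promises the
// compiler that the buffer starts on a 32-byte boundary.
void scaleAligned(double* data, unsigned int n, double factor)
{
    // Debug-only check that the promise below actually holds.
    assert(reinterpret_cast<std::uintptr_t>(data) % 32 == 0);

    // The builtin returns its argument; the compiler may assume the
    // returned pointer is 32-byte aligned. Lying here is undefined behavior.
    double* p = static_cast<double*>(__builtin_assume_aligned(data, 32));
    for (unsigned int i = 0; i < n; ++i)
        p[i] *= factor;
}

// Hypothetical helper: allocates n doubles on a 32-byte boundary.
// Returns nullptr on failure; release the memory with free().
double* allocAligned32(unsigned int n)
{
    void* raw = nullptr;
    if (posix_memalign(&raw, 32, sizeof(double) * n) != 0)
        return nullptr;
    return static_cast<double*>(raw);
}
```

The allocation result is checked with an ordinary if rather than inside assert(), so the call is not compiled out when NDEBUG is defined.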

Massimiliano_B_3 · ‎09-18-2016

Thank you for your replies,

as suggested, I have tried also other memory allocators:

mkl_malloc/mkl_free
In this case the result is even different, though still wrong

FAIL 0: z = 1000
FAIL 1: z = 1000
FAIL 2: z = 1000

but again, if I get rid of the __restrict__ keyword in the ConstReal_ptr typedef, it goes back to normal.

_mm_malloc/_mm_free
In this case the behavior is the same as with posix_memalign

FAIL 2: z = 10
FAIL 3: z = 10

and again removing the __restrict__ apparently solves the issue.

Speaking of workarounds, I have tried to instruct the compiler with the alignment in the following two ways (for simplicity I have dropped the constness of x and y, but this doesn't affect the anomaly)

void Dummy::calculate( const unsigned int n, Real_ptr x, Real_ptr y, Real_ptr z ) const
{
   __assume_aligned(x, 32);
   __assume_aligned(y, 32);
   __assume_aligned(z, 32);
   for( unsigned int i = 0; i < n; ++i)
   {
    z[i] = computeSingleValue( x[i], y[i] );
   }
}

void Dummy::calculate( const unsigned int n, Real_ptr x, Real_ptr y, Real_ptr z ) const
{
   x = (Real_ptr)__builtin_assume_aligned(x,32);
   y = (Real_ptr)__builtin_assume_aligned(y,32);
   z = (Real_ptr)__builtin_assume_aligned(z,32);
   for( unsigned int i = 0; i < n; ++i)
   {
    z[i] = computeSingleValue( x[i], y[i] );
   }
}

In both cases the code behaves as in my first post, so the issue is still there.

Is it possible that the problem is related to the __restrict__ keyword in the typedef rather than the alignment? Maybe I'm doing some silly mistake trying to do something which is not allowed by the language...

Thank you all again,

Massi

SergeyKostrov · ‎09-18-2016

>>Is it possible that the problem is related to the __restrict__ keyword in the typedef rather than the alignment?

It looks like it, since, as you have shown, there are no problems when you do not use the __restrict__ keyword.

>>Maybe I'm doing some silly mistake trying to do something which is not allowed by the language...

This is an absolutely legal application of the __restrict__ keyword. The problem is even bigger because the indirect indexing technique you are using is very common and is used in many well-known algorithms, such as histogram algorithms in DSP / image processing and the high-performance pigeonhole sorting algorithm for integer data types.

Massimiliano_B_3 · ‎09-18-2016

Minor update,
I can reproduce the issue also with an older version of the Intel compiler: icpc (ICC) 14.0.1 20131008

Massimiliano_B_3 · ‎09-18-2016

And I confirm that the alignment is not the problem. I simplified the code a bit more removing the alignment:

// dummy.cpp
#include <algorithm>
#include <iostream>

double computeSingleValue( const double x, const double y );

void calculate( 
  const unsigned int n,
  const double * const __restrict__ x,
  const double * const __restrict__ y,
  double * __restrict__ z )
{
   for( unsigned int i = 0; i < n; ++i)
   {
    z[i] = computeSingleValue( x[i], y[i] );
   }
}

double computeSingleValue( const double x, const double y )
{
   return std::max(10.0, (x >= y) ? x : y);
}

and

// main.cpp
#include <cmath>
#include <iostream>

void calculate(
  const unsigned int n,
  const double * const __restrict__ x,
  const double * const __restrict__ y,
  double * __restrict__ z );

int main()
{
   const unsigned int N = 4;
   
   double * x = new double[N];
   x[0] = 0.0;
   x[1] = 10.0;
   x[2] = 100.0;
   x[3] = 1000.0;
   
   double * y = new double[N];
   y[0] = 0.0;
   y[1] = 10.0;
   y[2] = 100.0;
   y[3] = 1000.0;
   
   double * z = new double[N];
   z[0] = 0.0;
   z[1] = 0.0;
   z[2] = 0.0;
   z[3] = 0.0;

   calculate( N, x, y, z );
   if( std::abs(10.0   - z[0])> 1.0E-18 ) std::cout << "FAIL 0: z = " << z[0] << std::endl;
   if( std::abs(10.0   - z[1])> 1.0E-18 ) std::cout << "FAIL 1: z = " << z[1] << std::endl;
   if( std::abs(100.0  - z[2])> 1.0E-18 ) std::cout << "FAIL 2: z = " << z[2] << std::endl;
   if( std::abs(1000.0 - z[3])> 1.0E-18 ) std::cout << "FAIL 3: z = " << z[3] << std::endl;

   delete [] x;
   delete [] y;
   delete [] z;
}

and the issue is still there.

SergeyKostrov · ‎09-18-2016

>>The thing that puzzles me, apart from the compiler dependency, is that if I do one of the following things I can get the correct output:
>>
>>• decrease the optimization to -O1 or -O0

I think this is the best workaround. However, you do not need to do that globally with -O0; optimizations only need to be disabled for your processing function void calculate( ... ):

...
void calculate( ... );
...
#pragma optimize ( "", off )
void calculate( ... )
{
   ...
}
...

Since your reproducer demonstrates a problem that occurs in the main application, disabling optimizations globally does not look like a good option.
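A slightly fuller sketch of the per-function approach, applied to the simplified reproducer, could look like the following. The preprocessor guards are an addition that is not in the thread: #pragma optimize is the form documented for the Intel compiler, and the Clang and GCC branches use those compilers' own pragmas so that the file also builds elsewhere. Whether this actually avoids the wrong results with icpc has not been verified here; it only shows where the pragmas would go.

```cpp
#include <algorithm>

// Assumption: each branch uses the pragma documented for that compiler.
// Compilers not listed here simply get no pragma at all.
#if defined(__INTEL_COMPILER)
  #pragma optimize("", off)
#elif defined(__clang__)
  #pragma clang optimize off
#elif defined(__GNUC__)
  #pragma GCC push_options
  #pragma GCC optimize("O0")
#endif

// Both functions sit inside the unoptimized region, so the helper is
// not optimized together with the loop that calls it.
double computeSingleValue(const double x, const double y)
{
    return std::max(10.0, (x >= y) ? x : y);
}

void calculate(const unsigned int n,
               const double* const __restrict__ x,
               const double* const __restrict__ y,
               double* __restrict__ z)
{
    for (unsigned int i = 0; i < n; ++i)
        z[i] = computeSingleValue(x[i], y[i]);
}

// Restore the command-line optimization level for the rest of the file.
#if defined(__INTEL_COMPILER)
  #pragma optimize("", on)
#elif defined(__clang__)
  #pragma clang optimize on
#elif defined(__GNUC__)
  #pragma GCC pop_options
#endif
```

The rest of the translation unit keeps the optimization level given on the command line, which is the point of doing this per function rather than with -O0.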

Massimiliano_B_3 · ‎09-21-2016

Dear Yuan,

do you have any update or suggestions on this?

Kind regards

Yuan_C_Intel · ‎09-21-2016

Hi, Massimiliano

According to the engineer, this is caused by a bad interaction between two different C++ language features: __restrict__ and std::max.

The pointers x and y are declared __restrict__, while the values x[i] and y[i] are passed indirectly to std::max.

Unfortunately, std::max takes reference parameters, so the compiler creates references that are aliases of x[i] and y[i].

Because of the non-aliasing property of x and y, the compiler concludes that the references cannot point to x[i] and y[i], which is definitely wrong.

We have given this issue high priority and will resolve it soon.

Another workaround is to avoid std::max and use a max macro instead, like the following:

 #define max(a,b) \
   ({ __typeof__ (a) _a = (a); \
       __typeof__ (b) _b = (b); \
     _a > _b ? _a : _b; })

Is it helpful?

Thanks.
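The macro above relies on GNU extensions (statement expressions and __typeof__). A sketch of the same by-value idea as an ordinary function is shown below, applied to the simplified reproducer from earlier in the thread. The names maxByValue and runReproducer are made up for illustration, and it is an assumption, not something confirmed in this thread, that a by-value function avoids the problem in the same way as the macro; that would need to be checked with the affected icpc versions.

```cpp
#include <cassert>

// Hypothetical helper (not from the thread): both arguments are taken by
// value, so no reference to x[i] or y[i] is formed, unlike with std::max.
static inline double maxByValue(const double a, const double b)
{
    return (a > b) ? a : b;
}

double computeSingleValue(const double x, const double y)
{
    return maxByValue(10.0, (x >= y) ? x : y);
}

void calculate(const unsigned int n,
               const double* const __restrict__ x,
               const double* const __restrict__ y,
               double* __restrict__ z)
{
    for (unsigned int i = 0; i < n; ++i)
        z[i] = computeSingleValue(x[i], y[i]);
}

// Runs the input data from the thread and asserts the expected results.
void runReproducer()
{
    const double x[4] = {0.0, 10.0, 100.0, 1000.0};
    const double y[4] = {0.0, 10.0, 100.0, 1000.0};
    double z[4] = {0.0, 0.0, 0.0, 0.0};

    calculate(4, x, y, z);

    assert(z[0] == 10.0);
    assert(z[1] == 10.0);
    assert(z[2] == 100.0);
    assert(z[3] == 1000.0);
}
```

Unlike the macro, the function is type-checked and does not shadow other uses of the name max.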

Massimiliano_B_3 · ‎09-21-2016

Thank you for the reply and the proposed workaround,

>>According to the engineer, this relates to bad interaction between two different C++ language features: __restrict__ and std::max
>>The pointers x and y are declared __restrict__, while values x and y are passed indirectly to std::max.
>>Unfortunately, std::max takes reference parameters. The compiler creates references that are aliases of x and y.
>>Because of the non-aliasing property of x and y, the compiler thinks that the references cannot point to x and y, which is definitely wrong.

Just for my understanding: with the same code design as in this reproducer, should I (in general) expect this issue with every function that indirectly takes reference inputs from pointers declared __restrict__?

Kind regards,

Massi