Designing an aligned vector

velvia · ‎08-20-2015

Hi,

I would like to build my own version of a std::vector that keeps its memory aligned. The following code works as expected:

template <typename T>
class Vector {
 private:
  T* begin_;
  int size_;

 public:
  Vector(T* p, int n) : begin_{p}, size_{n} {}
  int size() const { return size_; }
  const T& operator[](int k) const {
    __assume_aligned(begin_, 32);
    return begin_[k];
  }
  T& operator[](int k) {
    __assume_aligned(begin_, 32);
    return begin_[k];
  }
};

double f(const Vector<double>& v) {
  double sum = 0.0;
  for (int i = 0; i < v.size(); ++i) {
    sum += v[i];
  }
  return sum;
}

When compiled with

icpc -c -std=c++11 -O3 -xHost -ansi-alias -opt-report=5 f.cpp -o f.o

on OS X with icpc 15.0.2, the optimization report shows no sign of loop peeling, which shows that __assume_aligned works as expected.

Unfortunately, with such a design, Vector<int> is not as efficient as Vector<double>: pointer aliasing (an int element could in principle alias size_) prevents the compiler from hoisting v.size() out of the loop. The trip count is therefore not known at loop entry, and the loop is not vectorized. The classic solution to this problem is to store a pointer T* end_ so that the size of the vector is end_ - begin_. Unfortunately, the following code:

template <typename T>
class Vector {
 private:
  T* begin_;
  T* end_;

 public:
  Vector(T* p, int n) : begin_{p}, end_{p + n} {}
  int size() const { return static_cast<int>(end_ - begin_); }
  const T& operator[](int k) const {
    __assume_aligned(begin_, 32);
    return begin_[k];
  }
  T& operator[](int k) {
    __assume_aligned(begin_, 32);
    return begin_[k];
  }
};

double f(const Vector<double>& v) {
  double sum = 0.0;
  for (int i = 0; i < v.size(); ++i) {
    sum += v[i];
  }
  return sum;
}

is not vectorized anymore. Removing the __assume_aligned fixes this problem, though. It seems that the compiler acts as if __assume_aligned(begin_, 32) might modify begin_. So far I have found the following workaround:

template <typename T>
class Vector {
 private:
  T* begin_;
  T* begin_copy_;
  T* size_;

 public:
  Vector(T* p, int n) : begin_{p}, begin_copy_{p}, size_{p + n} {}
  int size() const { return static_cast<int>(size_ - begin_copy_); }
  const T& operator[](int k) const {
    __assume_aligned(begin_, 32);
    return begin_[k];
  }
  T& operator[](int k) {
    __assume_aligned(begin_, 32);
    return begin_[k];
  }
};

double f(const Vector<double>& v) {
  double sum = 0.0;
  for (int i = 0; i < v.size(); ++i) {
    sum += v[i];
  }
  return sum;
}

but it would be nice if such hacks could be avoided.
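One possible way to avoid the extra member is to read the pointer and the size into locals before the loop, so that neither has to be reloaded from the object inside it. The sketch below is only an illustration and has not been checked against icpc 15: data(), sum_hoisted and ASSUME_ALIGNED_32 are hypothetical names, and the GCC/Clang builtin __builtin_assume_aligned stands in for the Intel-specific __assume_aligned.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: use the GCC/Clang builtin where available,
// otherwise pass the pointer through without an alignment hint.
#if defined(__GNUC__) || defined(__clang__)
#define ASSUME_ALIGNED_32(p) \
  static_cast<decltype(p)>(__builtin_assume_aligned((p), 32))
#else
#define ASSUME_ALIGNED_32(p) (p)
#endif

template <typename T>
class Vector {
 private:
  T* begin_;
  T* end_;

 public:
  Vector(T* p, int n) : begin_{p}, end_{p + n} {
    // The alignment promise is only valid if the caller really passes
    // a 32-byte aligned pointer, so check it in debug builds.
    assert(reinterpret_cast<std::uintptr_t>(p) % 32 == 0);
  }
  int size() const { return static_cast<int>(end_ - begin_); }
  // Returns the data pointer with the alignment hint attached.
  const T* data() const { return ASSUME_ALIGNED_32(begin_); }
};

double sum_hoisted(const Vector<double>& v) {
  // Both values are loaded once, before the loop, into locals that
  // nothing in the loop body can alias.
  const double* p = v.data();
  const int n = v.size();
  double sum = 0.0;
  for (int i = 0; i < n; ++i) {
    sum += p[i];
  }
  return sum;
}
```

The cost is that the caller has to write the loop against a raw pointer instead of operator[], so this is a trade-off rather than a fix for the behaviour described above.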

Also, can I expect __assume_aligned to keep working in the future when it is placed in a getter/setter?

KitturGanesh · ‎08-20-2015

Hi, you make a very good point and I've passed your feedback to the product team and will keep you updated.

_Kittur

KitturGanesh · ‎08-21-2015

Great, I'll pass on this information as well. I'll keep you updated as soon as I get any additional input from developer after investigation, thx.
_Kittur

KitturGanesh · ‎08-25-2015

@velvia
Hi, very good feedback, which I've passed on to the product team. I'll post any updates as I get them. Much appreciated.
_Kittur

Vegan · ‎08-25-2015

A while ago I built a high-performance library for vector and matrix algebra, using OpenMP to leverage CPU cores more effectively.

I used C++ templates.

std::vector is already available, it can be processed in parallel, and depending on T it should be aligned automatically.
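How strongly std::vector's storage is aligned depends on its allocator. The default allocator only has to provide alignment suitable for T, which for double is typically less than the 32 bytes discussed in this thread. A minimal sketch of an allocator that requests a stronger alignment follows, assuming a C++17 compiler (it relies on the aligned forms of operator new and operator delete). AlignedAllocator is a hypothetical name, not a standard or Intel type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical allocator: every block it hands out is aligned to
// Alignment bytes (C++17 aligned operator new/delete).
template <typename T, std::size_t Alignment>
struct AlignedAllocator {
  static_assert((Alignment & (Alignment - 1)) == 0,
                "Alignment must be a power of two");
  static_assert(Alignment >= alignof(T),
                "Alignment must not be weaker than alignof(T)");

  using value_type = T;

  AlignedAllocator() noexcept = default;
  template <typename U>
  AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

  // rebind must be spelled out because of the non-type parameter.
  template <typename U>
  struct rebind {
    using other = AlignedAllocator<U, Alignment>;
  };

  T* allocate(std::size_t n) {
    void* p = ::operator new(n * sizeof(T),
                             static_cast<std::align_val_t>(Alignment));
    assert(reinterpret_cast<std::uintptr_t>(p) % Alignment == 0);
    return static_cast<T*>(p);
  }
  void deallocate(T* p, std::size_t) noexcept {
    ::operator delete(p, static_cast<std::align_val_t>(Alignment));
  }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&,
                const AlignedAllocator<U, A>&) noexcept {
  return true;
}
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&,
                const AlignedAllocator<U, A>&) noexcept {
  return false;
}
```

A vector declared as std::vector<double, AlignedAllocator<double, 32>> then stores its elements at a 32-byte boundary. The compiler still has to be told about that alignment (for example with __assume_aligned) before it can rely on it in a loop.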

KitturGanesh · ‎08-26-2015

@velvia
Hi,
The 16.0 product version was just released today. When I tried this version (16.0) on 64-bit Linux, all 3 test cases you mentioned were vectorized and aligned as well. I still need to try on Mac, which I haven't done yet. Also, no more updates are currently planned for the 15.0 version, so my request to you is to download the 16.0 version that was released today, give it a shot, and let me know.

Also, #pragma omp simd moves the upper-bound computation out of the loop, so explicitly vectorized loops should not need workarounds. The 16.0 version of the compiler does process the __assume as both offset and multiplier. You can test it out and let me know, thanks.
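For illustration, the explicit-vectorization variant of the second test case might look like the sketch below. It assumes OpenMP 4.0 SIMD support is enabled in the compiler; without it the pragma is ignored and the loop runs as ordinary scalar code. f_simd is a hypothetical name, and the class is reduced to what the loop needs.

```cpp
#include <cassert>
#include <cstdint>

template <typename T>
class Vector {
 private:
  T* begin_;
  T* end_;

 public:
  Vector(T* p, int n) : begin_{p}, end_{p + n} {
    // Debug-only check that the caller passed a 32-byte aligned pointer.
    assert(reinterpret_cast<std::uintptr_t>(p) % 32 == 0);
  }
  int size() const { return static_cast<int>(end_ - begin_); }
  const T& operator[](int k) const { return begin_[k]; }
};

double f_simd(const Vector<double>& v) {
  double sum = 0.0;
  // The pragma asks for vectorization explicitly; the reduction clause
  // tells the compiler that sum is accumulated across iterations.
#pragma omp simd reduction(+ : sum)
  for (int i = 0; i < v.size(); ++i) {
    sum += v[i];
  }
  return sum;
}
```

Because the reduction may be evaluated in a different order than the scalar loop, floating-point results can differ in the last bits for general inputs.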

The only other issue (if everything tests out OK as above) that's still unresolved is passing the alignment as a template parameter. Again, this is not supported by the ICC front-end, so that is something I'll file an issue against.

I'd appreciate it if you could test with the released 16.0 version (which you can download from the Intel Registration Center) and let us know.

Thanks,
_Kittur

KitturGanesh · ‎08-27-2015

@velvia - thanks much for your feedback. Yes, the following code involving the template parameter has to be fixed in the compiler, and I have passed it on to the team. Again, I appreciate you installing the 16.0 version and testing this out. I'll keep you updated when a release with a fix for the issue is out.

_kittur

jimdempseyatthecove · ‎08-29-2015

>>I can't find a way to design a 2 dimensional array container class such that the first element of each row is SIMD-aligned and convey that to the compiler

#include <cstddef>
#include <iostream>

template<typename T>
struct Array {
 T* a;
 Array() { a = NULL; }
 Array(__int64 n) {a = new T[n];}
 ~Array() { if(a) delete [] a; }
 T& operator[](__int64 i) { return a[i]; }
};
const __int64 dim1=3;
const __int64 dim2=266;
const __int64 dim3=436;    
const __int64 dim4=274;    
const __int64 dim5=2;    
Array<float> A(dim1);
Array<float[dim2]> B(dim1);
Array<float[dim2][dim3]> C(dim1);
Array<float[dim2][dim3][dim4]> D(dim1);
Array<float[dim2][dim3][dim4][dim5]> E(dim1);
int main(int argc, char* argv[])
{
  for(__int64 i=0; i<dim1; ++i) {
    A[i] = 0;
    for(__int64 j=0;j<dim2;++j) {
      B[i][j] = 0;
      for(__int64 k=0;k<dim3;++k){
        C[i][j][k]=0;
        for(__int64 l=0;l<dim4;++l){
          D[i][j][k][l]=0;
          for(__int64 m=0;m<dim5;++m){
            E[i][j][k][l][m]=0;
            if(k+l+m==0)
              std::cout<<i<<" "<<j<<" "<<k<<" "<<l<<" "<<m<<" "<<E[i][j][k][l][m]<<"\n";
          }
        }
      }
    }
  }
  return 0;
}

*** replace the "a = new T[n]" with your preferred aligned allocation (this depends on the compiler version and/or O/S).
*** and, if desired, replace the required dimension with the required dimension + padding

Additional note. The above technique eliminates the pointer arrays. In the case of array E with 5 subscripts, it will eliminate fetching 4 pointers to get access to the final array of floats.

Jim Dempsey

velvia · ‎08-29-2015

Hi Jim,

Thanks for your input. But both dimensions are only known at run time.

jimdempseyatthecove · ‎08-31-2015

Then you will want to use the suggestion in post #5 (or a variation thereof) to attain alignment and padding (if necessary) and to avoid the use of a pointer array. The other thing you should look at (research) is that C++ extensions now support declarations of SIMD functions. This may or may not be applicable to your situation.
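As a rough illustration of alignment plus padding without a pointer array when both dimensions are known only at run time, the sketch below stores the whole matrix in one aligned block and rounds the row stride up to a multiple of 32 bytes. It is not necessarily what post #5 proposed. Matrix2D, kAlign and round_up are hypothetical names, and the allocation assumes C++17 aligned operator new.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <type_traits>

// Hypothetical 2D container: one contiguous block, rows padded so that
// the first element of every row sits on a kAlign-byte boundary.
template <typename T>
class Matrix2D {
 public:
  static constexpr std::size_t kAlign = 32;
  static_assert(std::is_trivial<T>::value, "sketch supports trivial types only");
  static_assert(kAlign % sizeof(T) == 0, "sizeof(T) must divide the alignment");

  Matrix2D(std::size_t rows, std::size_t cols)
      : rows_(rows),
        cols_(cols),
        stride_(round_up(cols, kAlign / sizeof(T))),
        data_(static_cast<T*>(::operator new(rows_ * stride_ * sizeof(T),
                                             std::align_val_t(kAlign)))) {
    // Zero all elements, including the padding at the end of each row.
    for (std::size_t i = 0; i < rows_ * stride_; ++i) data_[i] = T();
  }
  ~Matrix2D() { ::operator delete(data_, std::align_val_t(kAlign)); }
  Matrix2D(const Matrix2D&) = delete;
  Matrix2D& operator=(const Matrix2D&) = delete;

  std::size_t rows() const { return rows_; }
  std::size_t cols() const { return cols_; }
  std::size_t stride() const { return stride_; }  // in elements, >= cols()

  // Row start is base + r * stride; both terms are multiples of kAlign bytes.
  T* row(std::size_t r) { return data_ + r * stride_; }
  const T* row(std::size_t r) const { return data_ + r * stride_; }

 private:
  // Smallest multiple of m that is >= n.
  static std::size_t round_up(std::size_t n, std::size_t m) {
    return (n + m - 1) / m * m;
  }

  std::size_t rows_;
  std::size_t cols_;
  std::size_t stride_;
  T* data_;
};
```

An inner loop would then take the pointer returned by row(r), pass it through the compiler's alignment hint (for example __assume_aligned with 32), and iterate over cols() elements.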

Jim Dempsey
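The post above does not say which extension is meant by SIMD functions. One standardized form is OpenMP 4.0's declare simd, sketched below on the assumption that OpenMP SIMD support is enabled (otherwise the pragmas are ignored and the code runs as scalar). scaled_square and apply are hypothetical names.

```cpp
#include <cassert>

// Asks the compiler to also generate a vector version of this function.
// uniform(scale) says that scale has the same value in every SIMD lane.
#pragma omp declare simd uniform(scale)
inline double scaled_square(double x, double scale) {
  return scale * x * x;
}

void apply(const double* in, double* out, int n, double scale) {
  // Inside a simd loop the compiler may call the vector version.
#pragma omp simd
  for (int i = 0; i < n; ++i) {
    out[i] = scaled_square(in[i], scale);
  }
}
```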