Intel® C++ Compiler

Should we align for SIMD on modern x86?

velvia
Beginner

Hi,

I've been investigating the benefit of aligning arrays to the SIMD width on modern x86 CPUs. I finally found a piece of code that shows a measurable difference on my computer (Core i7-4850HQ).

#include <iostream>
#include <chrono>
#include <mm_malloc.h>

int main(int argc, const char* argv[]) {
  const int n{8000};
  const int nb_loops{10000000};
  {
    char* a{new char[n]};
    char* b{new char[n]};
    char* c{new char[n]};
    for (int i = 0; i < n; ++i) {
      a[i] = 0;
      b[i] = 1;
      c[i] = 0;
    }

    auto start = std::chrono::high_resolution_clock::now();
    for (int k = 0; k < nb_loops; ++k) {
      for (int i = 0; i < n; ++i) {
        b[i] = a[i] + b[i] + c[i];
      }
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto time =
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start)
            .count();

    std::cout << "Time unaligned: " << time << " ns" << std::endl;

    delete[] c;
    delete[] b;
    delete[] a;
  }

  {
    char* a{static_cast<char*>(_mm_malloc(n * sizeof(char), 32))};
    char* b{static_cast<char*>(_mm_malloc(n * sizeof(char), 32))};
    char* c{static_cast<char*>(_mm_malloc(n * sizeof(char), 32))};
    for (int i = 0; i < n; ++i) {
      a[i] = 0;
      b[i] = 1;
      c[i] = 0;
    }

    auto start = std::chrono::high_resolution_clock::now();
    for (int k = 0; k < nb_loops; ++k) {
#pragma omp simd aligned(a, b, c : 32)
      for (int i = 0; i < n; ++i) {
        b[i] = a[i] + b[i] + c[i];
      }
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto time =
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start)
            .count();

    std::cout << "Time aligned: " << time << " ns" << std::endl;

    _mm_free(c);
    _mm_free(b);
    _mm_free(a);
  }

  return 0;
}

On my CPU, the unaligned version takes 1.57s whereas the aligned version takes 1.36s when compiled with

icpc -std=c++11 -O3 -xHost -ansi-alias -qopenmp main.cpp -o main

I would like to understand the reason for this difference. Here are the suspects:

1) Aligned SIMD loads and stores are faster than unaligned ones.

2) SIMD-aligned data never crosses a cache-line boundary, which makes the memory transfers faster.

3) The aligned version does not need loop peeling, so the generated code is much smaller, which makes the loop faster.

It seems that reason 1 is not valid on modern CPUs. Reason 2 does not seem right either, as the aligned version loses its advantage over the unaligned one as soon as the alignment hint is removed. That is why I strongly suspect reason 3. To confirm it, it would be nice to get a compiled version without loop peeling but with aligned loads. Is there a way to do that?
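
For what it is worth, the way I have been checking for the peel loop is to emit the assembly with

icpc -std=c++11 -O3 -xHost -ansi-alias -qopenmp -S main.cpp

and to search main.s for vmovdqu (unaligned moves) versus vmovdqa (aligned moves) and for an extra short loop before the main SIMD body; the vectorization report (-qopt-report=5 -qopt-report-phase=vec) should also say whether a peeled loop was generated. I may well be misreading the AVX2 mnemonics, though.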

If 3 is the reason for the better speed, why do we still have loop peeling on x86 hardware?

The Xeon Phi is another beast, as unaligned loads and stores are much slower there than aligned ones. Is that penalty expected to vanish in future generations of Xeon Phi?

TimP
Honored Contributor III

The alignment pragma should suppress peeling for alignment.  If it doesn't, try comparing #pragma vector aligned.  If that appears to generate better code, file an IPS question.  Your loop is long enough that time spent checking or peeling for alignment ought to be negligible.
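
For example, keeping your timing harness as it is, the comparison would look something like this (the pragma asserts that every array referenced in the loop really is aligned, so it is only valid when that holds):

    for (int k = 0; k < nb_loops; ++k) {
#pragma vector aligned
      for (int i = 0; i < n; ++i) {
        b[i] = a[i] + b[i] + c[i];
      }
    }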

Intel compilers from 12.0 don't use aligned moves when unaligned moves are expected to have the same performance on aligned data.   You could compare the performance of compilers which do use aligned moves, or modify the asm code yourself.  There may be an undocumented unsupported option to use aligned moves.
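
If you want to take the decision away from the compiler entirely, you could also write the kernel with intrinsics; a rough AVX2 sketch (untested, and it assumes n is a multiple of 32):

#include <immintrin.h>

// Forces aligned 256-bit moves; substitute _mm256_loadu_si256 and
// _mm256_storeu_si256 to get the unaligned versions for comparison.
void add_aligned(const char* a, char* b, const char* c, int n) {
  for (int i = 0; i < n; i += 32) {
    __m256i va = _mm256_load_si256(reinterpret_cast<const __m256i*>(a + i));
    __m256i vb = _mm256_load_si256(reinterpret_cast<const __m256i*>(b + i));
    __m256i vc = _mm256_load_si256(reinterpret_cast<const __m256i*>(c + i));
    _mm256_store_si256(reinterpret_cast<__m256i*>(b + i),
                       _mm256_add_epi8(_mm256_add_epi8(va, vb), vc));
  }
}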

If the generated code doesn't align the loop body, changes which affect code alignment may affect performance.  You might also take the precaution of specifying the same unroll in your comparisons.

velvia
Beginner

Hi Tim,

I have figured out that the aligned loop in fact needs 64-byte alignment to perform well. If I specifically ask for an address that is 32-byte aligned but not 64-byte aligned, the performance drops to 1.52 s even though the loop peeling has disappeared.

    char* a{static_cast<char*>(_mm_malloc((n + 32) * sizeof(char), 64))};
    char* b{static_cast<char*>(_mm_malloc((n + 32) * sizeof(char), 64))};
    char* c{static_cast<char*>(_mm_malloc((n + 32) * sizeof(char), 64))};
    int shift{32};
    a += shift;
    b += shift;
    c += shift;
    for (int i = 0; i < n; ++i) {
      a[i] = 0;
      b[i] = 1;
      c[i] = 0;
    }

    auto start = std::chrono::high_resolution_clock::now();
    for (int k = 0; k < nb_loops; ++k) {
#pragma omp simd aligned(a, b, c : 32)
      for (int i = 0; i < n; ++i) {
        b[i] = a[i] + b[i] + c[i];
      }
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto time =
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start)
            .count();

    std::cout << "Time: " << time << " ns" << std::endl;

    a -= shift;
    b -= shift;
    c -= shift;
    _mm_free(c);
    _mm_free(b);
    _mm_free(a);

So it seems that the speedup is mostly related to alignment to the cache line rather than to the SIMD width.
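
For reference, this is the check I used to convince myself of the actual alignment of the shifted pointers (it assumes <cstdint> is included for std::uintptr_t):

    // With the base pointers from _mm_malloc(..., 64) and shift = 32, this
    // prints a % 32 = 0 and a % 64 = 32, i.e. 32-byte but not 64-byte aligned.
    std::cout << "a % 32 = " << reinterpret_cast<std::uintptr_t>(a) % 32
              << ", a % 64 = " << reinterpret_cast<std::uintptr_t>(a) % 64
              << std::endl;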

TimP
Honored Contributor III

AVX2 usually performs better with the default 16-byte alignment than SSE2 did on the first Core i7 platforms, even though it may involve peeling, but of course 32-byte alignment is recommended, and 64-byte alignment could be an improvement in some cases.

From your original title, it seemed you preferred not to align, which might hinder vectorization of some more complicated cases, particularly those with conditional assignments or assignment to multiple arrays.  For arrays this large, the storage wasted by 64-byte alignment is relatively insignificant.

velvia
Beginner

Hi,

Maybe it would be better to state clearly what I am looking for: an example where aligning memory makes a clear performance difference on a recent Intel CPU (Core i7 or Xeon). It is for a course on SIMD that I have to give.

As you mentioned, I've tried the following loop:

unsigned char* a = ...;
unsigned char* b = ...;
// initialize a such that a[i] = 0 for i even and a[i] = 255 for i odd
// initialize b to 0

for (int i = 0; i < n; ++i) {
  if (a[i] > 128) {
    b[i] = 255;
  }
}

The Intel compiler does not vectorize this loop on its own because it estimates that vectorization is not worthwhile. If I align a and b to a 64-byte boundary and tell the loop so using #pragma omp simd aligned(a, b : 64), I get a 15% speedup. If I take the original unaligned loop and force vectorization with a plain #pragma omp simd, I get a 10% speedup. In the end, the difference between the aligned and the unaligned versions is <= 5%, which is not much.
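
Concretely, the aligned variant I timed looks roughly like this (a and b come from _mm_malloc with a 64-byte alignment and are released with _mm_free):

unsigned char* a = static_cast<unsigned char*>(_mm_malloc(n, 64));
unsigned char* b = static_cast<unsigned char*>(_mm_malloc(n, 64));
// ... same initialization as above ...

#pragma omp simd aligned(a, b : 64)
for (int i = 0; i < n; ++i) {
  if (a[i] > 128) {
    b[i] = 255;
  }
}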

Is it possible to find an example where the difference is larger?

Can you explain what you mean by assigning to multiple arrays?

Thanks for your help.

 

McCalpinJohn
Honored Contributor III

There is an example on Agner's performance blog (http://www.agner.org/optimize/blog/read.php?i=415#423) that shows a significant (almost 2x) difference in performance when attempting to perform two 256-bit loads per cycle from L1 cache.   With 32 Byte alignment, the test case reports ~52.5 Bytes/cycle (but correcting for timer overhead brings this to about 58.5 Bytes/cycle, or 91% of peak), while for any lesser alignment the test reports much lower values -- typically less than 32 Bytes/cycle.  For the non-32-Byte-aligned case, there is an unexplained increase to ~40 bytes/cycle (after correcting for timer overhead) if the loads are performed in reverse order.

The author also notes that the impact of alignment is much smaller for L2-contained data, and is almost zero for L3-contained or memory-contained data.  These observations make sense -- once you leave the L1 cache, all data transfers are by full cache line and once you drop to (no more than) one load per cycle from the L1 Data Cache there is little opportunity for conflict there.
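
To make the kind of kernel involved concrete -- this is not Agner's actual test code, just a minimal sketch in the same spirit -- something like the following issues two 256-bit loads per iteration from a buffer small enough to stay in L1, so misaligning the input pointer should expose the effect described above:

#include <immintrin.h>

// Sums a small buffer of 32-bit integers (keep it under ~32 KB so it stays
// in L1) using two 256-bit loads per iteration; n is assumed to be a
// multiple of 16. Time many passes over the buffer and compare a
// 32-byte-aligned pointer against an offset one (e.g. data + 1).
int sum_l1(const int* data, int n) {
  __m256i acc0 = _mm256_setzero_si256();
  __m256i acc1 = _mm256_setzero_si256();
  for (int i = 0; i < n; i += 16) {
    acc0 = _mm256_add_epi32(
        acc0, _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i)));
    acc1 = _mm256_add_epi32(
        acc1, _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i + 8)));
  }
  int tmp[8];
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(tmp),
                      _mm256_add_epi32(acc0, acc1));
  return tmp[0] + tmp[1] + tmp[2] + tmp[3] + tmp[4] + tmp[5] + tmp[6] + tmp[7];
}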

velvia
Beginner

Hi John,

Thanks for your kind answer. I need some time to digest it, though, and I had problems compiling the code from Agner Fog's website, which is full of intrinsics. I am still working on it.

Thanks.
