Alignment of returned address from malloc()

Sunwoo_L_ · ‎01-03-2016

Hi, guys,

I am using icc 15.0.2, which is compatible with gcc 4.4.7. Whenever I allocate memory with malloc, the returned address is aligned to 16 bytes. I know that gcc's malloc provides this alignment on 64-bit processors. Does malloc under icc guarantee the same alignment? I think it affects the quality of vectorization, so I need to make sure that malloc under icc provides it as well.

TimP · ‎01-04-2016

Default 16-byte alignment in malloc is specified in the x86_64 ABI. If you have a case where it is not so, it may be a reportable bug.
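A quick way to check this on a given system is to test the returned address directly. The sketch below is only an illustration; the helper name is_aligned is hypothetical and not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p is a multiple of `alignment` bytes.
   `alignment` must be a nonzero power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

Calling is_aligned(malloc(n), 16) for a range of sizes n, and then is_aligned(..., 32), shows which guarantee the linked malloc actually provides.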

When the compiler can see that alignment is inherited from malloc, it is entitled to assume that alignment. 16-byte alignment will not be sufficient for full AVX optimization.

For a time, gcc had situations, not shared by icc, where stack objects weren't aligned. I think that was corrected before gcc 4.4.7, which has itself become outdated. It's reasonable to expect icc to perform equal or better alignment than gcc.

Judith_W_Intel · ‎01-04-2016

Intel does not provide its own C or C++ runtime libraries so the version of malloc you link in should be the same as GNU's.

You can use memalign or posix_memalign if you want to ensure a specific alignment.
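As a minimal sketch of the posix_memalign route, assuming a POSIX system; the wrapper name alloc_aligned_doubles is hypothetical and only there for illustration:

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper: returns space for n doubles aligned to `alignment`
   bytes, or NULL on failure. posix_memalign requires `alignment` to be a
   power of two and a multiple of sizeof(void *). */
static double *alloc_aligned_doubles(size_t n, size_t alignment)
{
    void *p = NULL;
    if (posix_memalign(&p, alignment, n * sizeof(double)) != 0)
        return NULL;
    assert((uintptr_t)p % alignment == 0);
    return p;
}
```

Memory obtained this way is released with the ordinary free().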

velvia · ‎01-08-2016

Hi Sunwoo,

You don't need to align your data to benefit from vectorization. For instance, suppose that you have an array v of n = 1000 double-precision floating-point values and you want to run the following code:

for (int i = 0; i < n; ++i) {
  v[i] = 2 * v[i];
}

Most compilers, including the Intel compiler, will vectorize the code even though v is not 32-byte aligned (I assume that your CPU has a 256-bit vector length, which is the case for modern Intel CPUs). Suppose that the address of v is 32 * k + 16. Since a double takes 8 bytes, v + 2 is 32-byte aligned. The compiler will do the following:

- Treat the loop iterations i = 0 and i = 1 sequentially (loop peeling)

- Then treat i = 2, i = 3, i = 4, i = 5 with one vector instruction.

- Use vector instructions up to the last vector instruction for i = 994, i = 995, i = 996, i = 997

- Treat the loop iterations i = 998, i = 999 sequentially (remainder)
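The steps above can be written out by hand to make the structure visible. This is only an illustration of what the compiler generates on its own, assuming 32-byte vectors (4 doubles); the function name scale_by_two is hypothetical, and the inner 4-element loop stands in for a single vector instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hand-written peel / vector body / remainder for v[i] = 2 * v[i]. */
static void scale_by_two(double *v, size_t n)
{
    assert(v != NULL || n == 0);
    size_t i = 0;

    /* Peel: scalar iterations until &v[i] is 32-byte aligned. */
    while (i < n && ((uintptr_t)&v[i] % 32) != 0) {
        v[i] = 2 * v[i];
        ++i;
    }

    /* Main body: aligned groups of 4 doubles, one vector operation each. */
    for (; i + 4 <= n; i += 4) {
        for (size_t j = 0; j < 4; ++j)
            v[i + j] = 2 * v[i + j];
    }

    /* Remainder: the leftover scalar iterations. */
    for (; i < n; ++i)
        v[i] = 2 * v[i];
}
```

With v at address 32 * k + 16 and n = 1000, the peel loop runs for i = 0 and i = 1, the main body covers i = 2 to i = 997, and the remainder handles i = 998 and i = 999.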

So, except for the very beginning and the very end of the loop, your code will get vectorized. You'll get a slight overhead for the loop peeling and the remainder, but with n = 1000, you won't feel anything.

The problem comes when n is small enough that you can't neglect the loop peeling and the remainder. The problem also arises when a loop walks two arrays at the same time, such as:

for (int i = 0; i < n; ++i) {
  v[i] = 2 * w[i];
}

If v and w are not aligned (and their offsets from a 32-byte boundary differ), peeling can align at most one of them, so there is no way to have aligned accesses for both v[i], v[i + 1], v[i + 2], v[i + 3] and w[i], w[i + 1], w[i + 2], w[i + 3]. Therefore, some of the accesses have to be unaligned, which *might* degrade performance. With a modern CPU, most likely, you won't feel it (maybe a few percent slower, but it will most likely be in the noise of a basic timer measurement).

So aligning for vectorization is not a must. It is something that should be done in some special cases when a profiler shows that it is needed. Intel Advisor is the only profiler that I know of that can do those things. When you have identified the loops that might get some speedup from alignment, you need to:

- Align the memory: you might use _mm_malloc

- Tell the compiler that the pointer you are going to use is aligned: you might use OpenMP 4 (#pragma omp simd aligned(p : 32)) or the Intel-specific extension __assume_aligned

Aligning the memory without telling the compiler is useless.
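Both steps together might look like the sketch below. It assumes an x86 compiler that provides _mm_malloc and _mm_free through xmmintrin.h; the function name scale_into is hypothetical. The pragma only takes effect when OpenMP SIMD support is enabled in the compiler, and is ignored otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_malloc, _mm_free (assumed available, x86) */

/* v[i] = 2 * w[i]; the caller promises both pointers are 32-byte aligned. */
static void scale_into(double *v, const double *w, size_t n)
{
    /* Debug check of the promise made to the compiler below. */
    assert((uintptr_t)v % 32 == 0 && (uintptr_t)w % 32 == 0);

    /* Step 2: tell the compiler about the alignment. */
    #pragma omp simd aligned(v, w : 32)
    for (size_t i = 0; i < n; ++i)
        v[i] = 2 * w[i];
}
```

Step 1 is then the allocation, for example double *v = _mm_malloc(n * sizeof(double), 32); for each array, with a matching _mm_free(v) at the end. Memory from _mm_malloc must not be passed to free().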

In a nutshell:

1) Profile with Intel Advisor

2) Align your memory where needed AND tell the compiler you've done it