vectorization support: reference x has unaligned access

Dario_I_ · ‎12-13-2013

The following code

[cpp]

#include <sys/time.h>

#include<iostream>

#include<omp.h>

#include <stdlib.h> /* srand, rand */

#include <malloc.h>

int main () {

const unsigned int n = 10000000;

double *x, *s, *c, *tof;

x = (double*)_mm_malloc(n * sizeof(double), 16 );

s = (double*)_mm_malloc(n * sizeof(double), 16 );

c = (double*)_mm_malloc(n * sizeof(double), 16 );

tof = (double*)_mm_malloc(n * sizeof(double), 16 );

for (unsigned int i=0; i < n; ++i) {

s[i] = (double)rand() / RAND_MAX;

c[i] = ( (double)rand() / RAND_MAX ) * s[i]/2;

tof[i] = (double)rand() / RAND_MAX;

}

#pragma omp parallel for

#pragma simd

for (unsigned int i=0; i < n; ++i) {

x[i] = s[i]+c[i]+tof[i];

}

_mm_free(x);

_mm_free(s);

_mm_free(c);

_mm_free(tof);

return 0;
}

[/cpp]

compiled with:

icpc -c -xAVX -vec_report6 -openmp main.cpp

produces this output:

prova.cpp(27): (col. 20) remark: vectorization support: call to function rand cannot be vectorized
prova.cpp(29): (col. 22) remark: vectorization support: call to function rand cannot be vectorized
prova.cpp(31): (col. 22) remark: vectorization support: call to function rand cannot be vectorized
prova.cpp(25): (col. 5) remark: loop was not vectorized: existence of vector dependence
prova.cpp(31): (col. 22) remark: vector dependence: assumed OUTPUT dependence between line 31 and line 27
prova.cpp(27): (col. 20) remark: vector dependence: assumed OUTPUT dependence between line 27 and line 31
prova.cpp(31): (col. 22) remark: vector dependence: assumed OUTPUT dependence between line 31 and line 29
prova.cpp(29): (col. 22) remark: vector dependence: assumed OUTPUT dependence between line 29 and line 31
prova.cpp(29): (col. 22) remark: vector dependence: assumed OUTPUT dependence between line 29 and line 27
prova.cpp(27): (col. 20) remark: vector dependence: assumed OUTPUT dependence between line 27 and line 29
prova.cpp(29): (col. 22) remark: vector dependence: assumed OUTPUT dependence between line 29 and line 31
prova.cpp(31): (col. 22) remark: vector dependence: assumed OUTPUT dependence between line 31 and line 29
prova.cpp(27): (col. 20) remark: vector dependence: assumed OUTPUT dependence between line 27 and line 29
prova.cpp(29): (col. 22) remark: vector dependence: assumed OUTPUT dependence between line 29 and line 27
prova.cpp(27): (col. 20) remark: vector dependence: assumed OUTPUT dependence between line 27 and line 31
prova.cpp(31): (col. 22) remark: vector dependence: assumed OUTPUT dependence between line 31 and line 27
prova.cpp(41): (col. 9) remark: vectorization support: reference (unknown) has aligned access
prova.cpp(41): (col. 9) remark: vectorization support: reference (unknown) has unaligned access
prova.cpp(41): (col. 9) remark: vectorization support: reference (unknown) has unaligned access
prova.cpp(41): (col. 9) remark: vectorization support: reference (unknown) has unaligned access
prova.cpp(41): (col. 9) remark: vectorization support: unaligned access used inside loop body
prova.cpp(39): (col. 5) remark: vectorization support: unroll factor set to 4
prova.cpp(39): (col. 5) remark: SIMD LOOP WAS VECTORIZED

I am worried about the unaligned access. I am using _mm_malloc, so I do not see what the problem is.

Dario

Leonardo_B_Intel · ‎12-13-2013

Hello Dario,

Allow me to make a couple of initial comments:

1. Since the compilation targets 256-bit AVX registers, I’d recommend enforcing 32-byte alignment in the allocation calls (not 16-byte alignment).
2. The code above attempts vectorization and parallelization of the same loop. This will require some rework of the code.

If you follow recommendation (1) and remove the OpenMP pragma, I am sure the “unaligned” messages will disappear.
When parallelization and vectorization happen on the same loop, each thread starts at its own lower bound of the arrays. These bounds are determined at runtime based on the OMP scheduling, so the initial addresses are unknown to the compiler. Here’s a very helpful writeup about alternatives to tackle this: http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization
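
To illustrate why the per-thread start addresses matter, here is a small C++ sketch. It assumes the simplest possible static partition (contiguous chunks of ceil(n / nthreads) iterations); a real OpenMP runtime is free to split the loop differently. `chunk_start` and `all_chunks_aligned` are hypothetical helper names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: first index handled by thread t when n iterations are
// split into contiguous chunks of ceil(n / nthreads) iterations each.
// ASSUMPTION: this mimics a plain static schedule; runtimes may differ.
std::size_t chunk_start(std::size_t n, std::size_t nthreads, std::size_t t) {
    std::size_t chunk = (n + nthreads - 1) / nthreads;
    return t * chunk;
}

// True if every non-empty chunk of a double array starts on an `align`-byte
// boundary, given that the array base itself is `align`-byte aligned.
bool all_chunks_aligned(std::size_t n, std::size_t nthreads, std::size_t align) {
    for (std::size_t t = 0; t < nthreads; ++t) {
        std::size_t start = chunk_start(n, nthreads, t);
        if (start >= n) break;  // remaining threads have no iterations
        if ((start * sizeof(double)) % align != 0) return false;
    }
    return true;
}
```

With n = 10000000 doubles, four threads start at multiples of 2500000 elements, which are all 32-byte aligned. Three threads put the second thread's start at index 3333334, which is 16 bytes past a 32-byte boundary, so the compiler cannot prove alignment for an arbitrary thread count.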

Best,

Leo.

jimdempseyatthecove · ‎12-13-2013

Also consider

#if defined(__MIC__)
#define VEC_ALIGN 64
#else
#define VEC_ALIGN 32
#endif

You could also check for AVX and fall back to an alignment of 16 for non-MIC, non-AVX targets.
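
A minimal sketch of that three-way check, assuming the compiler defines `__AVX__` when AVX code generation is enabled (GCC, Clang and the Intel compiler do). The helpers `alloc_doubles` and `is_aligned` are hypothetical names, and `posix_memalign` is used here only as a portable stand-in; in the code from the original post the equivalent call would be `_mm_malloc(n * sizeof(double), VEC_ALIGN)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

#if defined(__MIC__)
#define VEC_ALIGN 64   // 512-bit vectors
#elif defined(__AVX__)
#define VEC_ALIGN 32   // 256-bit vectors
#else
#define VEC_ALIGN 16   // 128-bit SSE vectors
#endif

// Hypothetical helper: allocate n doubles on a VEC_ALIGN-byte boundary.
// Returns nullptr on failure; release the memory with free().
double* alloc_doubles(std::size_t n) {
    void* p = nullptr;
    if (posix_memalign(&p, VEC_ALIGN, n * sizeof(double)) != 0) return nullptr;
    return static_cast<double*>(p);
}

// Hypothetical helper: check a pointer against a byte alignment.
bool is_aligned(const void* p, std::size_t align) {
    return reinterpret_cast<std::uintptr_t>(p) % align == 0;
}
```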

The article link provided by Leo is a good reference.

OpenMP 4.0 has support for declaring loops to use simd (#pragma omp parallel for simd) to assure that loop iteration slicing occurs in a granularity that is a multiple of the simd vector size. IOW, if the first reference of the first thread's partition is aligned, then the first references of the remaining threads will be aligned as well. You will need to carefully read the OpenMP 4.0 spec, as this section is non-trivial.
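
As a rough sketch, the loop from the original post could be written with the combined construct like this. The function name `add3` is made up for illustration, and the code needs a compiler with OpenMP 4.0 support and its OpenMP switch enabled; without it the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>

// Element-wise x[i] = s[i] + c[i] + tof[i].
// The combined construct asks the compiler to split the iterations across
// threads and to vectorize each thread's share.
void add3(double* x, const double* s, const double* c, const double* tof, int n) {
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i) {
        x[i] = s[i] + c[i] + tof[i];
    }
}
```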

Jim Dempsey

TimP · ‎12-13-2013

As Leo said, you need 32-byte alignment to qualify as aligned for AVX. The compiler won't be able to assume alignment in the omp parallel loop even with the arrays allocated with alignment, as it doesn't know whether you will set a number of threads which divides evenly into 250000 (if it even tries to make that analysis). Thus Leo's comment about easier achievement of alignment without omp parallel.

On such a large data set, the performance penalty you will see due to unaligned data should be small.

The aligned clause in omp simd aligned, as far as I know, is an assertion, which probably applies only to the origin of the array, not the individual OpenMP chunks. An Intel compiler developer told me the omp simd aligned is equivalent to the corresponding Intel proprietary assume_aligned assertions.
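
For reference, a minimal sketch of the aligned clause on a loop without thread-level parallelism. The function name `add3_aligned` is hypothetical. The clause is a promise from the programmer, not a check: if any of the pointers is not actually 32-byte aligned, the behavior is undefined.

```cpp
#include <cassert>

// Element-wise x[i] = s[i] + c[i] + tof[i].
// ASSUMPTION: the caller guarantees that x, s, c and tof each point to
// 32-byte aligned storage; the clause only asserts this to the compiler.
void add3_aligned(double* x, const double* s, const double* c,
                  const double* tof, int n) {
    #pragma omp simd aligned(x, s, c, tof : 32)
    for (int i = 0; i < n; ++i) {
        x[i] = s[i] + c[i] + tof[i];
    }
}
```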

Anyway, I'm not understanding why the greater concern for alignment than for the more basic question of why #pragma simd isn't taken as an assertion of non-overlapping data. Is unsigned int required? If you have a choice, does 32- vs. 64-bit mode make a difference to vectorizability with unsigned int? Choosing signed ints (even 64-bit ones in 64-bit mode) is much more often practical than choosing an array size which makes the OpenMP chunks come out evenly aligned.