unaligned access

luiceur · ‎11-22-2018

I am trying to understand a few things here. But first let's see a sample of the code:

#include <stdlib.h>   /* aligned_alloc, rand */

#define LEN 10000000

int main(){

  double* a = aligned_alloc(32,LEN*sizeof(double));
  double* b = aligned_alloc(32,LEN*sizeof(double));
  double* c = aligned_alloc(32,LEN*sizeof(double));

  int k;
  for(k = 0; k < LEN; k++){
    a[k] = rand();
    b[k] = rand();
  }

  for(k = 0; k < LEN; k++)
    c[k] = a[k] * b[k];

  free(a);
  free(b);
  free(c);
  return 0;
}

The vectorization report gives the following (icc -xAVX -O2 vec.c -o vect -qopt-report-phase=vec -qopt-report=5)

LOOP BEGIN at vec.c(27,3)
   remark #15388: vectorization support: reference c has aligned access   [ vec.c(28,5) ]
   remark #15389: vectorization support: reference a has unaligned access   [ vec.c(28,12) ]
   remark #15389: vectorization support: reference b has unaligned access   [ vec.c(28,19) ]
   remark #15381: vectorization support: unaligned access used inside loop body

However, if I use the following instead:

 double* a = _mm_malloc(LEN*sizeof(double),32); 
 double* b = _mm_malloc(LEN*sizeof(double),32); 
 double* c = _mm_malloc(LEN*sizeof(double),32);

the report becomes:

LOOP BEGIN at vec.c(27,3)
   remark #15388: vectorization support: reference c has aligned access   [ vec.c(28,5) ]
   remark #15388: vectorization support: reference a has aligned access   [ vec.c(28,12) ]
   remark #15388: vectorization support: reference b has aligned access   [ vec.c(28,19) ]

1- Why does it happen? What is the difference?

2- How do I know I am using the best alignment possible for my architecture? I am testing this on my desktop (Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz).

TimP · ‎11-23-2018

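Apparently, the compiler doesn't recognize aligned_alloc() as anything special, whereas it does know that _mm_malloc() returns memory with the requested alignment. You should be able to get identical results by adding assume_aligned assertions (for example, __assume_aligned(a, 32) for each pointer before the loop).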
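To make the assume_aligned suggestion concrete, here is a minimal sketch. The function name multiply_aligned and the macro ASSUME_ALIGNED_32 are made up for illustration; __assume_aligned is the Intel compiler form, and the fallback to __builtin_assume_aligned for GCC/Clang is my assumption about what you would use elsewhere, not something from the original post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper: tell the compiler that p is 32-byte aligned.
   __assume_aligned is the Intel compiler spelling; other compilers
   (GCC, Clang) offer __builtin_assume_aligned, which returns the pointer. */
#if defined(__INTEL_COMPILER)
#  define ASSUME_ALIGNED_32(p) __assume_aligned((p), 32)
#else
#  define ASSUME_ALIGNED_32(p) ((p) = __builtin_assume_aligned((p), 32))
#endif

/* Element-wise product c[k] = a[k] * b[k] for arrays that the caller
   guarantees are 32-byte aligned (e.g. from aligned_alloc(32, ...)). */
static void multiply_aligned(double *a, double *b, double *c, size_t n)
{
    /* The assertion to the compiler is a promise, so check it in debug builds. */
    assert((uintptr_t)a % 32 == 0);
    assert((uintptr_t)b % 32 == 0);
    assert((uintptr_t)c % 32 == 0);

    ASSUME_ALIGNED_32(a);
    ASSUME_ALIGNED_32(b);
    ASSUME_ALIGNED_32(c);

    for (size_t k = 0; k < n; k++)
        c[k] = a[k] * b[k];
}
```

Calling multiply_aligned(a, b, c, LEN) with the aligned_alloc() pointers from the original post should then let the report show aligned access for all three references, though I have not verified the exact remarks. If the promise is false (a pointer is not actually 32-byte aligned), the behavior is undefined, which is why the sketch keeps the asserts.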

You shouldn't see any difference in performance, as all AVX CPUs have eliminated the performance deficit for unaligned read instructions operating on aligned data, so the compiler can use unaligned read instructions when targeting the AVX ISA, regardless of expected alignment. Your loop is long enough that the alignment adjustment (checking at run time and adjusting c[] alignment) won't take measurable additional time. As long as the arrays have consistent alignment (as you have assured they do), the adjustment will give you full performance in the main body of the loop.

With full diagnostic options (my preference would be -qopt-report=4, but the ones you chose should work), you should find out whether the compiler chose to generate a vector remainder loop. That means full unrolling with 256-bit instructions in the main loop, plus a loop without unrolling and with 128-bit instructions to handle the case where the loop count doesn't exactly match the unroll factor times the instruction width. The report should also show the unroll factor (I would expect 2 or 4).

If you have short loops where the compiler can see the expected length (using the loop count pragma if necessary), the unroll factor should be adjusted automatically so that the fully optimized loop is never skipped entirely and no useless vector remainder loop is generated. If you get no report of a vectorized remainder, it would mean that the compiler already recognized that you don't need the vector remainder.
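As a rough illustration of the arithmetic behind the alignment adjustment and the remainder loop, the sketch below splits a trip count into peel, main, and remainder parts. The struct loop_split and the function split_loop are hypothetical names for this post, and the splitting rule is a simplification, not the compiler's actual strategy. It assumes 256-bit vectors (4 doubles), an unroll factor you pass in, and a pointer that is at least 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of how a loop over n doubles might be split. */
typedef struct {
    size_t peel;        /* scalar iterations until p reaches a 32-byte boundary */
    size_t main_elems;  /* elements covered by the unrolled 256-bit main loop */
    size_t remainder;   /* leftover elements after the main loop */
} loop_split;

static loop_split split_loop(const double *p, size_t n, size_t unroll)
{
    const size_t vec_bytes = 32;                      /* one 256-bit AVX register */
    const size_t lanes = vec_bytes / sizeof(double);  /* 4 doubles per vector */
    const size_t misalign = (uintptr_t)p % vec_bytes;
    loop_split s;

    /* Peeling whole doubles can only fix misalignment in multiples of 8 bytes. */
    assert(misalign % sizeof(double) == 0);
    assert(unroll > 0);

    s.peel = misalign ? (vec_bytes - misalign) / sizeof(double) : 0;
    if (s.peel > n)
        s.peel = n;

    /* The main loop consumes unroll * lanes elements per iteration. */
    const size_t chunk = lanes * unroll;
    s.main_elems = ((n - s.peel) / chunk) * chunk;
    s.remainder = n - s.peel - s.main_elems;
    return s;
}
```

With a 32-byte aligned pointer, n = 10000000 and an unroll factor of 4, this gives no peel and no remainder, because 10000000 is a multiple of 16. Starting one double past the boundary instead gives a peel of 3 and a remainder of 13.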

If you have a context (short loops) where recognition of aligned_alloc() is of concern, you might file a ticket with the Online Service Center explaining how it makes a difference to your application. If you're lucky, you may get a response on whether they might consider making a future compiler recognize aligned_alloc().