Very basic question about SIMD and memory alignment

I'm starting to learn about compiler auto-vectorization.
The documentation says that memory alignment is required for SIMD execution through SSE/AVX/etc.
I ran the following test:
I allocate a buffer that is cache-line aligned (64B).
Then I use a for loop to walk a contiguous chunk of the array, where the starting address is deliberately not aligned to the vector register size, for example:
data = memalign(CACHE_LINE_SIZE, buff_size);
ptr = (char *)data + align * sizeof(int); //align could be 0, 1, 2, 3, ......
data_int = (int *)ptr;
for (j = 0; j < ITER_NUM; j++)
    for (i = 0; i < size_in_int; i++)
        data_int[i] += i & 0x3;
The array size ranges from 512K to 16384K; I hoped sizes this large would defeat L1/L2 caching and make any alignment effect more obvious.
The icc compiler reports that this loop is vectorized.
When I run this program and vary the value of the variable "align", I see no difference between the aligned and unaligned cases.
My question is: for the unaligned cases, can the auto-vectorizer vectorize most of the array and execute only the unaligned head/tail portions in non-SIMD (scalar) code?
Thank you for reading my question,
3 Replies
Black Belt

You're correct: when the compiler generates code for the unaligned case, there is a short scalar loop to process the head of the array, up to a point of alignment. The alignment-adjustment code at the head of the loop may be suppressed by #pragma vector aligned (but then the code would fault if the data are not in fact aligned). Similarly, unless the compiler can determine it's not needed, there is a remainder loop for the unaligned data at the end.
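Conceptually, the peel/body/remainder structure described above looks like the following. This is a hand-written sketch with assumed names, not actual icc output:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of how a vectorizing compiler handles an unaligned start:
 * a scalar "peel" loop runs until the pointer is 16B-aligned, the
 * main body then processes 4 ints per step (where the compiler
 * would emit paddd/movdqa on aligned addresses), and a scalar
 * remainder loop finishes the tail. */
void add_pattern(int *p, size_t n)
{
    size_t i = 0;

    /* Peel: scalar until p + i is 16-byte aligned (or we run out). */
    while (i < n && ((uintptr_t)(p + i) & 15) != 0) {
        p[i] += (int)(i & 0x3);
        i++;
    }

    /* Body: 4 ints per iteration, standing in for one SSE vector op. */
    for (; i + 4 <= n; i += 4) {
        p[i]     += (int)(i       & 0x3);
        p[i + 1] += (int)((i + 1) & 0x3);
        p[i + 2] += (int)((i + 2) & 0x3);
        p[i + 3] += (int)((i + 3) & 0x3);
    }

    /* Remainder: scalar tail. */
    for (; i < n; i++)
        p[i] += (int)(i & 0x3);
}
```

Every element receives the same update regardless of where the peel/body boundaries fall, which is why the compiler is free to pick them based on alignment alone.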

Thank you very much for your reply. I tested an aligned/unaligned array-access loop and got both versions in the same .s file:
data_int += align; //data_int is 64B aligned before this
for (j = 0; j < ITER_NUM; j++)
    for (i = 0; i < size_in_int; i++)
        data_int[i] += i & 0x3;
..B3.20: # Preds ..B3.20 ..B3.19
movaps %xmm2, %xmm3 #34.23
paddd %xmm1, %xmm2 #32.3
pand %xmm0, %xmm3 #34.23
paddd (%rbp,%rdx,4), %xmm3 #34.4
movdqa %xmm3, (%rbp,%rdx,4) #24.2
addq $4, %rdx #32.3
cmpq %rcx, %rdx #32.3
jb ..B3.20 # Prob 82% #32.3
..B3.24: # Preds ..B3.22 ..B3.24
movl %ecx, %eax #34.4
andl $3, %eax #34.23
addl %eax, (%rbp,%rcx,4) #34.4
incq %rcx #32.3
cmpq %r15, %rcx #32.3
jb ..B3.24 # Prob 82% #32.3
May I ask one more simple question?
For the following code (another aligned vs. unaligned case), even though the compiler tells me it is vectorized, it still uses different instructions that might have different latency, right? It seems to me that it does not generate multiple loop versions to deal with an unaligned head/tail here:
struct A {
    float x;
    float y;
    float z;
    float w;
    float a;
    float b;
    float c;
    float d;
};

void func(float factor, struct A *array) //array is 64B aligned
{
    int i;
    for (i = 0; i < 1024; i++) {
        array[i].y *= factor;
        array[i].z *= factor;
        array[i].w *= factor;
        array[i].a *= factor;
    }
}
I got the following .s code when the address is not aligned:
..B2.1: # Preds ..B2.0
..___tag_value__Z12func_susan_1fP1A.15: #18.1
addq $4, %rdi #
shufps $0, %xmm0, %xmm0 #17.6
xorl %eax, %eax #21.2
# LOE rax rbx rbp rdi r12 r13 r14 r15 xmm0
..B2.2: # Preds ..B2.2 ..B2.1
movups (%rdi), %xmm1 #23.3
mulps %xmm0, %xmm1 #23.3
incq %rax #21.2
movups %xmm1, (%rdi) #23.3
addq $32, %rdi #21.2
cmpq $1024, %rax #21.2
jb ..B2.2 # Prob 99% #21.
It is not movaps here. I hope to understand:
0. Is there any way (other than reading the .s file) for a programmer to know how the compiler decides which approach to use (multiple loop versions vs. unaligned instructions)?
1. Is there any performance difference I should care about, or are Intel chips designed well enough that this doesn't matter much?
Thank you very much for your precious time.
Black Belt

On the current Intel CPUs, the last movups is expected to exhibit more latency when the data are misaligned. However, the code you show is not a typical vectorization case; it's not possible to align the block of struct components by a scalar remainder loop, so it looks like the compiler has done an excellent job with the task you set. I would have thought the compiler might report BLOCK VECTORIZED in such a case.
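One common way around this, if the data layout can be changed (my suggestion, not from the thread): a structure-of-arrays layout gives each field its own contiguous, independently alignable array, so the y/z/w/a updates become unit-stride loops that can use aligned loads. A minimal sketch, with assumed names:

```c
/* Structure-of-arrays alternative to struct A: only the fields the
 * loop touches are shown. Each array can be 16B/64B-aligned on its
 * own, so a vectorizer can emit movaps instead of movups. */
struct A_soa {
    float *y, *z, *w, *a;
};

void func_soa(float factor, struct A_soa *s, int n)
{
    /* Four simple unit-stride loops; each is trivially vectorizable. */
    for (int i = 0; i < n; i++) s->y[i] *= factor;
    for (int i = 0; i < n; i++) s->z[i] *= factor;
    for (int i = 0; i < n; i++) s->w[i] *= factor;
    for (int i = 0; i < n; i++) s->a[i] *= factor;
}
```

Whether the layout change pays off depends on access patterns elsewhere in the program; for the 1024-element loop in the question it trades one strided loop for four dense ones.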