You're correct: when the compiler generates code for the unaligned case, it emits a short scalar loop to process the head of the array, up to the first aligned address. This alignment adjustment at the head of the loop may be suppressed by #pragma vector aligned (but then the code would fault if the data were not aligned). Similarly, unless the compiler can determine that it's not needed, there is a scalar remainder loop for the leftover data at the end.
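To make the head/body/remainder split concrete, here is a rough sketch in plain C of the shape the compiler's generated code takes for a loop like data[i] += i & 0x3. It is an illustration only, not the compiler's actual output: the names peel_count and add_low_bits are made up for this example, and it assumes 4-byte int and 16-byte SSE vectors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many leading ints must be handled in scalar code
   before data + peel sits on a 16-byte boundary (capped at n).
   Assumes sizeof(int) == 4, i.e. four ints per 16-byte vector. */
static size_t peel_count(const int *data, size_t n)
{
    /* Assumption: data has the natural alignment of int; otherwise no
       number of peeled elements can ever reach a 16-byte boundary. */
    assert((uintptr_t)data % sizeof(int) == 0);
    size_t misalign = (size_t)((uintptr_t)data & 15u);
    size_t peel = misalign ? (16u - misalign) / sizeof(int) : 0;
    return peel < n ? peel : n;
}

/* Same arithmetic as data[i] += i & 0x3, split into three phases. */
static void add_low_bits(int *data, size_t n)
{
    size_t i = 0;
    size_t peel = peel_count(data, n);

    /* 1. Scalar head (peel) loop: runs until data + i is 16-byte aligned. */
    for (; i < peel; i++)
        data[i] += (int)(i & 0x3);

    /* 2. Main body: four ints per trip. A vectorizer would turn each trip
          into one aligned load, one add, and one aligned store. */
    for (; i + 4 <= n; i += 4)
        for (size_t k = 0; k < 4; k++)
            data[i + k] += (int)((i + k) & 0x3);

    /* 3. Scalar remainder loop: the fewer-than-four elements left over. */
    for (; i < n; i++)
        data[i] += (int)(i & 0x3);
}
```

Calling add_low_bits(buf + 1, 20) on a 16-byte-aligned buf would peel three elements, run four full trips of the main body, and finish one element in the remainder loop.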
Thank you very much for your reply. I tested a loop that visits an aligned/unaligned array, and got two versions of the loop in the same .s file:
data_int += align; //data_int is 64B aligned
for(j = 0; j < ITER_NUM; j++)
    for(i = 0; i < size_in_int; i++)
        data_int[i] += i & 0x3;
..B3.20: # Preds ..B3.20 ..B3.19
movaps %xmm2, %xmm3 #34.23
paddd %xmm1, %xmm2 #32.3
pand %xmm0, %xmm3 #34.23
paddd (%rbp,%rdx,4), %xmm3 #34.4
movdqa %xmm3, (%rbp,%rdx,4) #24.2
addq $4, %rdx #32.3
cmpq %rcx, %rdx #32.3
jb ..B3.20 # Prob 82% #32.3
..B3.24: # Preds ..B3.22 ..B3.24
movl %ecx, %eax #34.4
andl $3, %eax #34.23
addl %eax, (%rbp,%rcx,4) #34.4
incq %rcx #32.3
cmpq %r15, %rcx #32.3
jb ..B3.24 # Prob 82% #32.3
May I ask one more simple question?
For the following code, with another pair of aligned and non-aligned cases, the compiler tells me that the loop is vectorized, but it still uses different instructions that might have different latency, right? It seems to me that it does not generate several versions to deal with the non-aligned head/tail parts.
void func(float factor, struct A * array) //A is 64B aligned
{
    for(i = 0; i < 1024; i++)
    {
        array[i].y *= factor;
        array[i].z *= factor;
        array[i].w *= factor;
        array[i].a *= factor;
    }
}
I got the following .s code when the address is not aligned:
..B2.1: # Preds ..B2.0
addq $4, %rdi #
shufps $0, %xmm0, %xmm0 #17.6
xorl %eax, %eax #21.2
# LOE rax rbx rbp rdi r12 r13 r14 r15 xmm0
..B2.2: # Preds ..B2.2 ..B2.1
movups (%rdi), %xmm1 #23.3
mulps %xmm0, %xmm1 #23.3
incq %rax #21.2
movups %xmm1, (%rdi) #23.3
addq $32, %rdi #21.2
cmpq $1024, %rax #21.2
jb ..B2.2 # Prob 99% #21.
It does not use movaps here. I hope to understand:
0. Is there any clue (other than reading the .s file) that tells the programmer which approach the compiler chose (multiple versions, or unaligned instructions)?
1. Is there any performance difference that I should care about? Or are Intel chips designed so that this doesn't matter much at all?
On the current Intel CPUs, the last movups is expected to exhibit more latency when the data are misaligned. However, the code you show is not a typical vectorization case; it's not possible to align the block of struct components by a scalar remainder loop, so it looks like the compiler has done an excellent job with the task you set. I would have thought the compiler might report BLOCK VECTORIZED in such a case.
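To illustrate why peeling scalar iterations cannot help in the struct case, here is a small sketch. The layout of struct A was not posted, so the definition below is an assumption inferred from the assembly (addq $4 before the loop, a 16-byte movups, and a 32-byte stride per iteration); the field x, the padding, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout (not from the original post): 32 bytes per element,
   with the four scaled floats starting at byte offset 4. */
struct A {
    float x;          /* hypothetical leading field at offset 0 */
    float y, z, w, a; /* the 16-byte block loaded by movups     */
    float pad[3];     /* hypothetical padding up to 32 bytes    */
};

_Static_assert(sizeof(struct A) == 32, "sketch assumes a 32-byte stride");
_Static_assert(offsetof(struct A, y) == 4, "sketch assumes y at offset 4");

/* Byte distance of &array[i].y past the previous 16-byte boundary. */
static size_t y_block_misalignment(const struct A *array, size_t i)
{
    return (size_t)((uintptr_t)&array[i].y & 15u);
}

/* Scalar equivalent of the posted loop, for n elements. */
static void scale_block(float factor, struct A *array, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        array[i].y *= factor;
        array[i].z *= factor;
        array[i].w *= factor;
        array[i].a *= factor;
    }
}
```

Under these assumptions, if array itself is 64-byte aligned, the address of array[i].y is 4 bytes past a 16-byte boundary for every i, because the 32-byte stride is a multiple of 16. Skipping any number of leading elements leaves that misalignment unchanged, so unaligned loads and stores are the only way to process y, z, w, a as one vector.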