I am trying to understand how aligned code is generated, so I wrote the following code snippet:
subroutine add(A, B, C, N)
  implicit none
  integer,intent(in) :: N
  real*8, intent(in), dimension(N) :: A,B
  real*8, intent(out), dimension(N) :: C
  !dir$ assume_aligned A:32, B:32, C:32
  !dir$ vector aligned
  C = A+B
  return
end subroutine add
which I am compiling with the following two commands:
ifort -align array32byte -S align.f90 -xavx
ifort -align array32byte -S align.f90
I am aligning the arrays to 32 bytes (for AVX), explicitly adding a directive that tells the compiler to assume the arrays are aligned, and even adding a !dir$ vector aligned directive (which shouldn't be necessary if I use assume_aligned; please correct me if I'm wrong). I would therefore expect the assembler code to contain aligned loads and stores. However, I am seeing something interesting:
For the AVX code:
vmovupd   (%rdi,%rax,8), %ymm0                 #10.5
vmovupd   32(%rdi,%rax,8), %ymm2               #10.5
vmovupd   64(%rdi,%rax,8), %ymm4               #10.5
vmovupd   96(%rdi,%rax,8), %ymm6               #10.5
vaddpd    (%rsi,%rax,8), %ymm0, %ymm1          #10.5
vaddpd    32(%rsi,%rax,8), %ymm2, %ymm3        #10.5
vaddpd    64(%rsi,%rax,8), %ymm4, %ymm5        #10.5
vaddpd    96(%rsi,%rax,8), %ymm6, %ymm7        #10.5
vmovupd   %ymm1, (%rbx,%rax,8)                 #10.5
vmovupd   %ymm3, 32(%rbx,%rax,8)               #10.5
vmovupd   %ymm5, 64(%rbx,%rax,8)               #10.5
vmovupd   %ymm7, 96(%rbx,%rax,8)               #10.5
addq      $16, %rax                            #10.5
cmpq      %rcx, %rax                           #10.5
jb        ..B1.4        # Prob 82%             #10.5
For the SSE2 code:
movaps    (%rdi,%rax,8), %xmm0                 #10.5
movaps    16(%rdi,%rax,8), %xmm1               #10.5
movaps    32(%rdi,%rax,8), %xmm2               #10.5
movaps    48(%rdi,%rax,8), %xmm3               #10.5
addpd     (%r8,%rax,8), %xmm0                  #10.5
addpd     16(%r8,%rax,8), %xmm1                #10.5
addpd     32(%r8,%rax,8), %xmm2                #10.5
addpd     48(%r8,%rax,8), %xmm3                #10.5
movaps    %xmm0, (%rdx,%rax,8)                 #10.5
movaps    %xmm1, 16(%rdx,%rax,8)               #10.5
movaps    %xmm2, 32(%rdx,%rax,8)               #10.5
movaps    %xmm3, 48(%rdx,%rax,8)               #10.5
addq      $8, %rax                             #10.5
cmpq      %rsi, %rax                           #10.5
jb        ..B1.4        # Prob 82%             #10.5
Apparently the AVX version does unaligned packed loads/stores, whereas I would have expected it to use an aligned move such as vmovaps. In fact, I am unable to generate any example where I see vmovaps at all. The generated code isn't multiversioned either, so it's not that I have overlooked something in the assembly. Also, why does the SSE2 code use aligned loads (which I expect) but the single-precision form of the instruction? Does that make any sense?
I have read this awesome article (https://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization) and was sure I had the right idea of what the compiler's assembly output would look like. Unfortunately, the results I'm getting show different behavior. Back to the example: I would expect the code to contain movapd/vmovapd instructions for SSE and AVX, respectively. What is my misconception? I am using ifort version 15.0.2.
For the AVX ISA, aligned and unaligned load instructions perform the same when the data are actually aligned, so the compiler (beginning with 12.0) always chooses the unaligned form. With opt-report4 you get confirmation about which data are recognized as aligned.
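The SSE single-precision load (movaps) is at least potentially faster than the double-precision one (movapd) because its encoding is shorter: movapd needs an extra 0x66 prefix byte, while both instructions move the same 128 bits.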
The vector aligned directive suppresses peeling for alignment, so it can improve performance of moderate-length loops, and it also assures the compiler that all operands have compatible alignment. For your example, -align array32byte should have been sufficient.
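To see at a glance which packed move instructions ended up in a listing, a small shell helper along these lines can tally them. This is only a sketch: count_moves is a hypothetical name (not part of any Intel tool), and it assumes the listing that -S produced from align.f90 is named align.s and has one instruction per line.

```shell
#!/bin/sh
# count_moves FILE: tally aligned vs. unaligned packed-move mnemonics in an
# assembly listing. Hypothetical helper for illustration only.
count_moves() {
    for op in vmovapd vmovaps vmovupd vmovups movapd movaps movupd movups; do
        # -w matches whole words, so "movaps" does not also count "vmovaps".
        # -c counts matching lines; "|| true" covers grep's non-zero exit
        # status when the count is 0.
        n=$(grep -cw -- "$op" "$1" || true)
        printf '%-8s %s\n' "$op" "$n"
    done
}

# Usage (assumed file name):
#   count_moves align.s
```

A count of zero for the aligned mnemonics in the AVX listing does not by itself mean the data were treated as unaligned; the optimization report is the place to confirm that.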
That's interesting. Indeed, the vectorization report is telling me that all three arrays are being loaded in an aligned fashion.
I was trying to find information about this behavior (i.e. aligned and unaligned loads having the same performance), but I can't find anything official from Intel. I looked into Agner Fog's tables (http://www.agner.org/optimize/instruction_tables.pdf), where vmov[au]pd share one entry, i.e. the same latency, throughput, etc. (this is probably what you meant). Do you have a link to official documentation for the AVX ISA extension? (I'm only able to find one for AVX512 on the Intel website.)
Also, why is movapd necessary at all if we can use movaps instead?