Intel® Fortran Compiler

Aligned/Unaligned load instructions generated for SSE2/AVX

Andre_M_1
Beginner

I am trying to understand how aligned code is generated, so I have created the following code snippet:

subroutine add(A, B, C, N)
    implicit none

    integer,intent(in)                :: N
    real*8, intent(in),  dimension(N) :: A,B
    real*8, intent(out), dimension(N) :: C
    !dir$ assume_aligned A:32, B:32, C:32

    !dir$ vector aligned
    C = A+B
    return
end subroutine add

which I am compiling with the following two commands:

ifort -align array32byte -S align.f90 -xavx

and

ifort -align array32byte -S align.f90

Since I am aligning the arrays to 32 bytes (for AVX), explicitly adding a directive telling the compiler to assume that the arrays are aligned, and even adding a !dir$ vector aligned directive (which shouldn't be necessary given assume_aligned; please correct me if I'm wrong), I would expect the assembly code to contain aligned loads and stores. However, I am seeing something interesting:

For the AVX code:

        vmovupd   (%rdi,%rax,8), %ymm0                          #10.5
        vmovupd   32(%rdi,%rax,8), %ymm2                        #10.5
        vmovupd   64(%rdi,%rax,8), %ymm4                        #10.5
        vmovupd   96(%rdi,%rax,8), %ymm6                        #10.5
        vaddpd    (%rsi,%rax,8), %ymm0, %ymm1                   #10.5
        vaddpd    32(%rsi,%rax,8), %ymm2, %ymm3                 #10.5
        vaddpd    64(%rsi,%rax,8), %ymm4, %ymm5                 #10.5
        vaddpd    96(%rsi,%rax,8), %ymm6, %ymm7                 #10.5
        vmovupd   %ymm1, (%rbx,%rax,8)                          #10.5
        vmovupd   %ymm3, 32(%rbx,%rax,8)                        #10.5
        vmovupd   %ymm5, 64(%rbx,%rax,8)                        #10.5
        vmovupd   %ymm7, 96(%rbx,%rax,8)                        #10.5
        addq      $16, %rax                                     #10.5
        cmpq      %rcx, %rax                                    #10.5
        jb        ..B1.4        # Prob 82%                      #10.5

For the SSE2 code:

        movaps    (%rdi,%rax,8), %xmm0                          #10.5
        movaps    16(%rdi,%rax,8), %xmm1                        #10.5
        movaps    32(%rdi,%rax,8), %xmm2                        #10.5
        movaps    48(%rdi,%rax,8), %xmm3                        #10.5
        addpd     (%r8,%rax,8), %xmm0                           #10.5
        addpd     16(%r8,%rax,8), %xmm1                         #10.5
        addpd     32(%r8,%rax,8), %xmm2                         #10.5
        addpd     48(%r8,%rax,8), %xmm3                         #10.5
        movaps    %xmm0, (%rdx,%rax,8)                          #10.5
        movaps    %xmm1, 16(%rdx,%rax,8)                        #10.5
        movaps    %xmm2, 32(%rdx,%rax,8)                        #10.5
        movaps    %xmm3, 48(%rdx,%rax,8)                        #10.5
        addq      $8, %rax                                      #10.5
        cmpq      %rsi, %rax                                    #10.5
        jb        ..B1.4        # Prob 82%                      #10.5

Apparently the AVX version uses unaligned packed loads/stores, even though I would expect an aligned move (vmovapd) instead. In fact, I am unable to generate an example where I see vmovapd or vmovaps at all. The generated code isn't multiversioned either, so it's not that I've overlooked something in the assembly. Also, why does the SSE2 code use aligned load instructions (which I expect), but the single-precision form (movaps rather than movapd)? Does that make any sense?

I have read this awesome article (https://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization) and was sure I had the right idea of what the compiler's assembly output would look like, but unfortunately the results I'm getting show different behavior. Back to the example: I would expect the code to generate movapd/vmovapd instructions for SSE and AVX, respectively. What is my misconception? I am using ifort version 15.0.2.
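
For reference, the kind of call site I have in mind is something like the sketch below (the driver name and array size are only for illustration; as far as I understand, -align array32byte should already put local arrays on 32-byte boundaries, so the attributes directive is just there to make the assumption explicit):

program test_add
    implicit none
    integer, parameter :: n = 1024
    real*8  :: a(n), b(n), c(n)
    ! Illustrative only: request 32-byte alignment explicitly so that
    ! the assume_aligned directive inside add() actually holds.
    !dir$ attributes align : 32 :: a, b, c

    a = 1.0d0
    b = 2.0d0
    call add(a, b, c, n)
    print *, c(1), c(n)
end program test_add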

Thank you!

TimP
Honored Contributor III

For the AVX ISA, the performance of aligned and unaligned loads on aligned data is the same, so the compiler (beginning with 12.0) always chooses the unaligned forms.  With opt-report4 you get confirmation of which data are recognized as aligned.
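
For example, with 15.0 something along these lines should show the alignment analysis for the loop (the -qopt-report spelling is the newer form; the older -opt-report4 spelling should also still be accepted):

ifort -align array32byte -xavx -S -qopt-report=4 -qopt-report-phase=vec align.f90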

The SSE single-precision load (movaps) is at least potentially faster than the double-precision form (movapd) due to its shorter encoding.

The vector aligned directive suppresses peeling for alignment, so it can improve performance of moderate-length loops, as well as assuring the compiler that all operands have compatible alignment.  For your example, -align array32byte should have been sufficient.
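
To see the peeling that the directive suppresses, you can compare against a directive-free variant like the sketch below (hypothetical name add_plain; since the alignment of the dummy arguments is not known inside the routine, the compiler typically peels iterations or multiversions the loop to reach an aligned main loop):

subroutine add_plain(A, B, C, N)
    implicit none

    integer,intent(in)                :: N
    real*8, intent(in),  dimension(N) :: A,B
    real*8, intent(out), dimension(N) :: C

    ! No alignment directives: the vectorizer does not know how the
    ! dummy arguments are aligned, so expect a peel and/or remainder
    ! loop around the vectorized kernel in the generated assembly.
    C = A+B
    return
end subroutine add_plain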

Andre_M_1
Beginner

Thanks Tim!

That's interesting. Indeed, the vectorization report is telling me that all three arrays are being loaded in an aligned fashion.

I was trying to find information about this behavior (i.e., aligned and unaligned loads having the same performance on aligned data), but I can't find anything official from Intel. I looked into Agner Fog's tables (http://www.agner.org/optimize/instruction_tables.pdf), where vmov[au]pd share one entry, i.e., the same latency, throughput, etc. (this is probably what you meant). Do you have a link to official documentation of the AVX ISA extension? I'm only able to find one for AVX-512 on the Intel website.

Also, why is movapd necessary at all if we can use movaps instead?
