Compound Assignment Improving Vectorization?

Kevin_M_4 · ‎01-21-2016

Hi

I was attempting to write a Fortran equivalent to the code provided in this vectorization tutorial

https://cvw.cac.cornell.edu/vector/exercise_1

My attempt at a Fortran vectorization equivalent is here:

#define ARRAY_SIZE 1024
#define NUMBER_OF_TRIALS 1000000

module array
    !dir$ attributes align:64 :: a, b, c
    double precision, dimension(ARRAY_SIZE) ::  a, b, c
end module

program main

    use array
    implicit none
    integer i, t

    double precision :: m = 1.0001

    ! Populate A and B arrays
    do i=1, ARRAY_SIZE
        b(i) = i
        a(i) = i + 1
    end do

    ! Perform an operation a number of times
    do t = 1, NUMBER_OF_TRIALS;
        do i = 1, ARRAY_SIZE
            c(i) = c(i) + m*a(i) + b(i)
        end do

        ! I have also tried a version with this line
        ! c = c + m * a  + b
    end do

end program

My questions(s):

0) Is this code as near-equivalent to the version 'simple.c' as in the link above?

1) Any further optimization recommendations?

2) The c version appears to always be faster, and I believe to be related to compound assignment (with '+='). Here's some reasoning http://programmers.stackexchange.com/a/134136/69046

c +=  m*a + b;

Am I off base in thinking that this is the primary difference between the C and Fortran versions?

jimdempseyatthecove · ‎01-21-2016

One issue you may have is the compiler optimization may notice that the results placed into array c[:] are not used following the loop. The optimization may result in elimination of the array c and the loops (initialization and timed loop). To correct for this, after the timed loop insert something innocuous (outside timed section), such as

IF(ISNAN(SUM(A))) PRINT *,"Found NaN"

If your input arrays are "orderly" the IF test won't print. If a NaN is in array c, then fix the input data such that it does not generate NaN's.

Jim Dempsey

Kevin_M_4 · ‎01-21-2016

Thanks Jim,

I inserted the extra line and it does not impact the compiler optimization.

Furthermore (for all), I should note that my timings are about identical between c and Fortran when I do use

c = c +  m*a + b;

C (no compound assignment)/Fortran

simple_no_vec: 0.88
simple 0.35
simple_avx 0.23

C (compound assignment) only:

simple_no_vec: 0.69
simple 0.16
simple_avx 0.08

TimP · ‎01-21-2016

If you expect an advantage for the order of operations implied by the += operator, you can do the same thing with parentheses in Fortran. Both intel Fortran and c will ignore the specified order including the one in += until you set protect-parens option (included in ifort -standard-semantics).

As Jim pointed out, you can't draw any useful conclusions from your example.

In a realistic similar situation, you might have a point due to shortage of pointer registers in 32 bit mode.

jimdempseyatthecove · ‎01-22-2016

The C program (Intel C++) should have produced the same assembly code for

c +=  m*a + b;

and

c = c +  m*a + b;

The fact that it differs "may" imply one of the two used FMA and the other did not. Though other instruction sequences may have been used.
A second factor may be that the loop unrolling may have differed. You will have to look at the disassembly (VTune) to assure what the compiler did or did not do. (I assume the C program performed an aligned allocation to 64 bytes as did your Fortran example)

I caution you at making any lasting inference of any compiler optimization missed opportunities since these are subject to change with any revision.

If you have a simple reproducer, please send it to premier support inclusive of compiler version and command line options.

Jim Dempsey

TimP · ‎01-22-2016

Those 2 c versions are required to differ if standards compliance options are set. It may be reasonable to assume no standards compliance, but then it is totally unreasonable to make generalizations about how Fortran and c compilers will act in general, particularly with respect to dead code.

jimdempseyatthecove · ‎01-22-2016

Tim,

>>Those 2 C versions are required to differ if standards compliance options are set.

Other than for when FMA may or may not be used. What in the standards indicate that the two statements are not to be implemented equivalently?

Note, excepting for when array c[:] is an array of atomics or volatiles, or where i is volatile or atomic "c +=" and "c = c +" should be implementable equivalently.

Jim

JVanB · ‎01-22-2016

Is there some reason this couldn't have been written in Fortran? I translated to something that would compile:

module array
   implicit none
   integer, parameter :: ARRAY_SIZE = 1024
   integer, parameter :: NUMBER_OF_TRIALS = 1000000
    !dir$ attributes align:64 :: a, b, c
    double precision, dimension(ARRAY_SIZE) ::  a, b, c
end module

program main

    use array
    implicit none
    integer i, t

    double precision :: m = 1.0001d0

    ! Populate A and B arrays
    do i=1, ARRAY_SIZE
        b(i) = i
        a(i) = i + 1
    end do

    ! Perform an operation a number of times
    do t = 1, NUMBER_OF_TRIALS;
        do i = 1, ARRAY_SIZE
            c(i) = c(i) + m*a(i) + b(i)
        end do

        ! I have also tried a version with this line
        ! c = c + m * a  + b
    end do

end program

The inner loop looked like:

.B1.6::                         ; Preds .B1.6 .B1.5
        vmovupd   ymm1, YMMWORD PTR [imagerel(ARRAY_mp_A)+rcx+rdx*8] ;26.29
        lea       r8, QWORD PTR [rcx+rdx*8]                     ;26.13
        vmovupd   ymm3, YMMWORD PTR [imagerel(ARRAY_mp_A)+32+rcx+rdx*8] ;26.29
        vmovupd   ymm5, YMMWORD PTR [imagerel(ARRAY_mp_A)+64+rcx+rdx*8] ;26.29
        vfmadd213pd ymm1, ymm0, YMMWORD PTR [imagerel(ARRAY_mp_C)+rcx+rdx*8] ;26.25
        vfmadd213pd ymm3, ymm0, YMMWORD PTR [imagerel(ARRAY_mp_C)+32+rcx+rdx*8] ;26.25
        vfmadd213pd ymm5, ymm0, YMMWORD PTR [imagerel(ARRAY_mp_C)+64+rcx+rdx*8] ;26.25
        vaddpd    ymm2, ymm1, YMMWORD PTR [imagerel(ARRAY_mp_B)+rcx+rdx*8] ;26.13
        vaddpd    ymm4, ymm3, YMMWORD PTR [imagerel(ARRAY_mp_B)+32+rcx+rdx*8] ;26.13
        vaddpd    ymm1, ymm5, YMMWORD PTR [imagerel(ARRAY_mp_B)+64+rcx+rdx*8] ;26.13
        vmovupd   YMMWORD PTR [imagerel(ARRAY_mp_C)+r8], ymm2   ;26.13
        vmovupd   ymm2, YMMWORD PTR [imagerel(ARRAY_mp_A)+96+rcx+rdx*8] ;26.29
        vmovupd   YMMWORD PTR [imagerel(ARRAY_mp_C)+32+r8], ymm4 ;26.13
        vmovupd   YMMWORD PTR [imagerel(ARRAY_mp_C)+64+r8], ymm1 ;26.13
        vfmadd213pd ymm2, ymm0, YMMWORD PTR [imagerel(ARRAY_mp_C)+96+rcx+rdx*8] ;26.25
        vaddpd    ymm3, ymm2, YMMWORD PTR [imagerel(ARRAY_mp_B)+96+rcx+rdx*8] ;26.13
        add       rdx, 16                                       ;25.9
        vmovupd   YMMWORD PTR [imagerel(ARRAY_mp_C)+96+r8], ymm3 ;26.13
        cmp       rdx, 1024                                     ;25.9
        jb        .B1.6         ; Prob 99%                      ;25.9

(Compiled with /QxCORE-AVX2). Looks normal, with m = ymm0 and indeed an FMA. What does this inner loop look like in your C compiler?

Kevin_M_4 · ‎01-22-2016

compile options -O3 -S -fsource-asm -c # mark_description "Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.3.187 Build 2"

###     ! Perform an operation a number of times
###     do t = 1, NUMBER_OF_TRIALS;
###         do i = 1, ARRAY_SIZE
###             c(i) = c(i) + m*a(i) + b(i)

        movaps    .L_2il0floatpacket.0(%rip), %xmm0             #27.20
        xorl      %eax, %eax                                    #25.5
                                # LOE rbx r12 r13 r14 r15 eax xmm0
..B1.5:                         # Preds ..B1.7 ..B1.4
        xorl      %edx, %edx                                    #26.9
                                # LOE rdx rbx r12 r13 r14 r15 eax xmm0
..B1.6:                         # Preds ..B1.6 ..B1.5
        movaps    array_mp_a_(,%rdx,8), %xmm1                   #27.29
        movaps    16+array_mp_a_(,%rdx,8), %xmm2                #27.29
        movaps    32+array_mp_a_(,%rdx,8), %xmm3                #27.29
        movaps    48+array_mp_a_(,%rdx,8), %xmm4                #27.29
        mulpd     %xmm0, %xmm1                                  #27.28
        mulpd     %xmm0, %xmm2                                  #27.28
        mulpd     %xmm0, %xmm3                                  #27.28
        mulpd     %xmm0, %xmm4                                  #27.28
        addpd     array_mp_c_(,%rdx,8), %xmm1                   #27.25
        addpd     16+array_mp_c_(,%rdx,8), %xmm2                #27.25
        addpd     32+array_mp_c_(,%rdx,8), %xmm3                #27.25
        addpd     48+array_mp_c_(,%rdx,8), %xmm4                #27.25
        addpd     array_mp_b_(,%rdx,8), %xmm1                   #27.13
        addpd     16+array_mp_b_(,%rdx,8), %xmm2                #27.13
        addpd     32+array_mp_b_(,%rdx,8), %xmm3                #27.13
        addpd     48+array_mp_b_(,%rdx,8), %xmm4                #27.13
        movaps    %xmm1, array_mp_c_(,%rdx,8)                   #27.13
        movaps    %xmm2, 16+array_mp_c_(,%rdx,8)                #27.13
        movaps    %xmm3, 32+array_mp_c_(,%rdx,8)                #27.13
        movaps    %xmm4, 48+array_mp_c_(,%rdx,8)                #27.13
        addq      $8, %rdx                                      #26.9
        cmpq      $1024, %rdx                                   #26.9
        jb        ..B1.6        # Prob 99%                      #26.9
                                # LOE rdx rbx r12 r13 r14 r15 eax xmm0
..B1.7:                         # Preds ..B1.6
        incl      %eax                                          #25.5
        cmpl      $1000000, %eax                                #25.5
        jb        ..B1.5        # Prob 99%                      #25.5
                                # LOE rbx r12 r13 r14 r15 eax xmm0
..B1.8:                         # Preds ..B1.7

###         end do
###         !c = c + (m * a  + b)

With the parentheses:

###     ! Perform an operation a number of times
###     do t = 1, NUMBER_OF_TRIALS;
###         do i = 1, ARRAY_SIZE
###             c(i) = c(i) + (m*a(i) + b(i))

        movsd     .L_2il0floatpacket.0(%rip), %xmm0             #27.20
        pxor      %xmm1, %xmm1                                  #27.13
                                # LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.5:                         # Preds ..B1.7 ..B1.4
        movsd     array_mp_a_(,%rax,8), %xmm2                   #27.30
        movaps    %xmm1, %xmm9                                  #27.13
        mulsd     %xmm0, %xmm2                                  #27.29
        pxor      %xmm10, %xmm10                                #27.13
        movaps    %xmm1, %xmm8                                  #27.13
        movaps    %xmm1, %xmm7                                  #27.13
        movaps    %xmm1, %xmm6                                  #27.13
        movaps    %xmm1, %xmm5                                  #27.13
        movaps    %xmm1, %xmm4                                  #27.13
        movaps    %xmm1, %xmm3                                  #27.13
        xorl      %edx, %edx                                    #25.5
        addsd     array_mp_b_(,%rax,8), %xmm2                   #27.35
        unpcklpd  %xmm2, %xmm2                                  #27.35
                                # LOE rax rbx r12 r13 r14 r15 edx xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9 xmm10
..B1.6:                         # Preds ..B1.6 ..B1.5
        addl      $16, %edx                                     #25.5
        addpd     %xmm2, %xmm10                                 #27.13
        addpd     %xmm2, %xmm9                                  #27.13
        addpd     %xmm2, %xmm8                                  #27.13
        addpd     %xmm2, %xmm7                                  #27.13
        addpd     %xmm2, %xmm6                                  #27.13
        addpd     %xmm2, %xmm5                                  #27.13
        addpd     %xmm2, %xmm4                                  #27.13
        addpd     %xmm2, %xmm3                                  #27.13
        cmpl      $1000000, %edx                                #25.5
        jb        ..B1.6        # Prob 99%                      #25.5
                                # LOE rax rbx r12 r13 r14 r15 edx xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9 xmm10
..B1.7:                         # Preds ..B1.6
        addpd     %xmm9, %xmm10                                 #27.13
        addpd     %xmm7, %xmm8                                  #27.13
        addpd     %xmm5, %xmm6                                  #27.13
        addpd     %xmm3, %xmm4                                  #27.13
        addpd     %xmm8, %xmm10                                 #27.13
        addpd     %xmm4, %xmm6                                  #27.13
        addpd     %xmm6, %xmm10                                 #27.13
        movaps    %xmm10, %xmm2                                 #27.13
        unpckhpd  %xmm10, %xmm2                                 #27.13
        addsd     %xmm2, %xmm10                                 #27.13
        addsd     array_mp_c_(,%rax,8), %xmm10                  #27.13
        movsd     %xmm10, array_mp_c_(,%rax,8)                  #27.13
        incq      %rax                                          #26.9
        cmpq      $1024, %rax                                   #26.9
        jb        ..B1.5        # Prob 99%                      #26.9
                                # LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.8:                         # Preds ..B1.7

###         end do
###         !c = c + (m * a  + b)
###     end do

I don't know how to interpret this - can someone spell this out further?

Kevin_M_4 · ‎01-22-2016

C version (the 'compound assignment' or rather, different Order of Operations?):

###     /* Perform an operation a number of times */
###     for (t=0; t < NUMBER_OF_TRIALS; t++) {
###             for (i=0; i < ARRAY_SIZE; i++) {
###                     c +=  m*a + b;

        pxor      %xmm1, %xmm1                                  #27.4
                                # LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.4:                         # Preds ..B1.6 ..B1.3
        movsd     a(,%rax,8), %xmm2                             #27.15
        movaps    %xmm1, %xmm9                                  #27.4
        mulsd     %xmm0, %xmm2                                  #27.15
        pxor      %xmm10, %xmm10                                #27.4
        movaps    %xmm1, %xmm8                                  #27.4
        movaps    %xmm1, %xmm7                                  #27.4
        movaps    %xmm1, %xmm6                                  #27.4
        movaps    %xmm1, %xmm5                                  #27.4
        movaps    %xmm1, %xmm4                                  #27.4
        movaps    %xmm1, %xmm3                                  #27.4
        xorl      %edx, %edx                                    #25.5
        addsd     b(,%rax,8), %xmm2                             #27.22
        unpcklpd  %xmm2, %xmm2                                  #27.22
                                # LOE rax rbx r12 r13 r14 r15 edx xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9 xmm10
..B1.5:                         # Preds ..B1.5 ..B1.4
        addl      $16, %edx                                     #25.5
        addpd     %xmm2, %xmm10                                 #27.4
        addpd     %xmm2, %xmm9                                  #27.4
        addpd     %xmm2, %xmm8                                  #27.4
        addpd     %xmm2, %xmm7                                  #27.4
        addpd     %xmm2, %xmm6                                  #27.4
        addpd     %xmm2, %xmm5                                  #27.4
        addpd     %xmm2, %xmm4                                  #27.4
        addpd     %xmm2, %xmm3                                  #27.4
        cmpl      $1000000, %edx                                #25.5
        jb        ..B1.5        # Prob 99%                      #25.5
                                # LOE rax rbx r12 r13 r14 r15 edx xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9 xmm10
..B1.6:                         # Preds ..B1.5
        addpd     %xmm9, %xmm10                                 #27.4
        addpd     %xmm7, %xmm8                                  #27.4
        addpd     %xmm5, %xmm6                                  #27.4
        addpd     %xmm3, %xmm4                                  #27.4
        addpd     %xmm8, %xmm10                                 #27.4
        addpd     %xmm4, %xmm6                                  #27.4
        addpd     %xmm6, %xmm10                                 #27.4
        movaps    %xmm10, %xmm2                                 #27.4
        unpckhpd  %xmm10, %xmm2                                 #27.4
        addsd     %xmm2, %xmm10                                 #27.4
        addsd     c(,%rax,8), %xmm10                            #27.4
        movsd     %xmm10, c(,%rax,8)                            #27.4
        incq      %rax                                          #26.3
        cmpq      $1024, %rax                                   #26.3
        jb        ..B1.4        # Prob 99%                      #26.3
                                # LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.7:                         # Preds ..B1.6

###             }
###     }
###

JVanB · ‎01-22-2016

The compiler is gaming your benchmark in the last two versions. It sees that it can switch the order of the loops and so avoids a lot of loads and stores. Try somehow reading or writing a random element of array c in the outer loop or even call a no-nothing subroutine with c as the actual argument (and make sure that the compiler can't see that the subroutine actually does nothing, as by dynamically linking it.)

Kevin_M_4 · ‎01-22-2016

I did that (call a dynamically linked library containing 'dummy' subroutine with c as argument), and it prevented the loop interchange. I also learned to set the value of qopt-report>=2 so that I could see the diagnostic messages for loop interchange.

Thanks

JVanB · ‎01-22-2016

So did that make the performance results more consistent between the C and Fortran versions? Do you really have an old computer that doesn't have avx, let alone FMA?

jimdempseyatthecove · ‎01-23-2016

The parenthesis do not (abstractly) require a change in order of operation, however, in Fortran, the addition of the parenthesis around the sub expression require the use of a temporary (in this case, it added overhead the elieded vectorization).

Jim Dempsey

Kevin_M_4 · ‎01-26-2016

For this test, I worked on Xeon E5-2670.

JVanB · ‎01-26-2016

Looks like your processor has AVX but not FMA. You should be looking for a compiler switch that changes those %xmm* registers to %ymm* registers because that should give you roughly 2X the throughput since your arrays fit in cache.

Kevin_M_4 · ‎01-27-2016

Thanks, I've primarily been getting used to using '-xHost' for compiles. It appears that changes the registers to %ymm. Do I need to explicitly call out '-fma' as a compiler option?

JVanB · ‎01-27-2016

You can try -fma, but my guess would be that it should cause an undefined opcode fault at runtime on your machine.