Re: Missed Vectorization issues with AVX and AVX2

jimdempseyatthecove · ‎09-02-2020

In the following code, subroutine foo is an exemplar of a small section of a simulation program. The simulation program uses a large number of small multi-dimensional arrays (3x3, 3x4, ...) that do not lend themselves to efficient vectorization with AVX, AVX2, or AVX512. While some degree of vectorization is attained in places, it is usually limited to 2 or 3 lanes, very seldom reaches 4 lanes (AVX/AVX2), and never reaches 8 lanes (AVX512) of REAL(8) data.

As a solution to this, the data structure is reorganized into cache line wide elements:

real(8) :: array(8,3,4) ! eight of the former array(3,4)
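
A minimal sketch of that reorganization, assuming the original data is eight independent 3x4 arrays. The names mod_pack_sketch, pack_lanes and lanes are hypothetical and do not come from the simulation program; the sketch only illustrates how the left-most index becomes the lane index.

```fortran
module mod_pack_sketch
    implicit none
    ! Hypothetical name: number of REAL(8) values in one 64-byte cache line
    integer, parameter :: lanes = 8
contains
    ! Gather eight separate 3x4 arrays into one array whose
    ! left-most index selects the lane (one cache line per (i,j) pair).
    subroutine pack_lanes(src, packed)
        real(8), intent(in)  :: src(3, 4, lanes)     ! src(:,:,k) is the k-th former array(3,4)
        real(8), intent(out) :: packed(lanes, 3, 4)  ! packed(:,i,j) is one cache-line-wide slice
        integer :: i, j, k
        do j = 1, 4
            do i = 1, 3
                do k = 1, lanes
                    packed(k, i, j) = src(i, j, k)
                end do
            end do
        end do
    end subroutine pack_lanes
end module mod_pack_sketch
```

With this layout, packed(:,i,j) holds element (i,j) of all eight former arrays contiguously, which is the slice the rest of this post operates on.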

When generating for AVX512, the left-most slice can be manipulated using single AVX512 registers. The compiler optimization (19.0 and later) does an excellent job of reducing the statements into efficient machine code.

On systems with AVX and AVX2, the registers cannot hold the full width of the left-most slice in a single register, but each can hold half of the slice, so two registers hold the whole slice.

The compiler, when optimizing non-looped code, does in fact generate fully vectorized code (twice, once for each half of the slice).

The problem arises when the statements are enclosed within a loop or a doubly nested loop.

With a single loop level, depending on the code in the loop, the compiler sometimes manages to vectorize. In most cases it does not.

Using !dir$ ivdep helps in some places but not all. Using !dir$ vector always seems to have no effect.

In the following code:

subroutine foo should vectorize on systems with AVX, AVX2, and AVX512 (it does with AVX512).

subroutine foo2 is a workaround that explicitly subdivides slice(1:8) into slice(1:4) and slice(5:8) and performs the same statement(s) twice. This works some of the time but not all of the time. In the example below, it does not work.

subroutine foo3, with a contained function Hack, does attain vectorization on systems with AVX. While it will attain 256-bit wide vectorization on AVX512, it is not as efficient as using the intended code without the Hack.

I hope this reproducer will help the Intel software development team identify an optimization opportunity (and hopefully fix it in an upcoming release).

!  Console2.f90 
module mod_foo
    integer, parameter :: CacheLineSize = 64
    integer, parameter :: VectorWidth = CacheLineSize / sizeof(0.0d0)
    real(8) :: Output(VectorWidth, 3, 4)
    real(8) :: A(VectorWidth, 3, 4)
    real(8) :: B(VectorWidth, 3, 3)
    real(8) :: C(VectorWidth)
    !dir$ attributes align : CacheLineSize :: Output, A, B, C
end module mod_foo
    
program tst
    use mod_foo
    call RANDOM_NUMBER(A)
    call RANDOM_NUMBER(B)
    call RANDOM_NUMBER(C)
    call foo(Output, A, B, C)
    ! not interested in output
    ! print is here to assure optimization does not remove all code
    print *,Output
    call foo2(Output, A, B, C)
    ! not interested in output
    ! print is here to assure optimization does not remove all code
    print *,Output
    call foo3(Output, A, B, C)
    ! not interested in output
    ! print is here to assure optimization does not remove all code
    print *,Output
    
    end program

subroutine foo(output, A, B, C)
    integer, parameter :: CacheLineSize = 64
    integer, parameter :: VectorWidth = CacheLineSize / sizeof(0.0d0)
    real(8), intent(out) :: Output(VectorWidth, 3, 4)
    real(8), intent(in) :: A(VectorWidth, 3, 4)
    real(8), intent(in) :: B(VectorWidth, 3, 3)
    real(8), intent(in) :: C(VectorWidth)
    
    integer :: i, j
    !dir$ assume_aligned output:64, A:64, B:64, C:64
    do j=1,4
        !dir$ ivdep
        !dir$ vector always
        do i=1,3
            Output(:,i,j) = &
                ( A(:,1,j) * B(:,1,i) &
                + A(:,2,j) * B(:,2,i) &
                + A(:,3,j) * B(:,3,i) ) * C(:)
        end do
    end do
end subroutine foo
    
    
subroutine foo2(output, A, B, C)
    integer, parameter :: CacheLineSize = 64
    integer, parameter :: VectorWidth = CacheLineSize / sizeof(0.0d0)
    real(8), intent(out) :: Output(VectorWidth, 3, 4)
    real(8), intent(in) :: A(VectorWidth, 3, 4)
    real(8), intent(in) :: B(VectorWidth, 3, 3)
    real(8), intent(in) :: C(VectorWidth)
    
    integer :: i, j
    !dir$ assume_aligned output:64, A:64, B:64, C:64
    do j=1,4
        !dir$ ivdep
        !dir$ vector always
        do i=1,3
            Output(1:4,i,j) = &
                ( A(1:4,1,j) * B(1:4,1,i) &
                + A(1:4,2,j) * B(1:4,2,i) &
                + A(1:4,3,j) * B(1:4,3,i) ) * C(1:4)
        end do
        !dir$ ivdep
        !dir$ vector always
        do i=1,3
            Output(5:8,i,j) = &
                ( A(5:8,1,j) * B(5:8,1,i) &
                + A(5:8,2,j) * B(5:8,2,i) &
                + A(5:8,3,j) * B(5:8,3,i) ) * C(5:8)
        end do
    end do
end subroutine foo2
    
    
subroutine foo3(output, A, B, C)
    integer, parameter :: CacheLineSize = 64
    integer, parameter :: VectorWidth = CacheLineSize / sizeof(0.0d0)
    real(8), intent(out) :: Output(VectorWidth, 3, 4)
    real(8), intent(in) :: A(VectorWidth, 3, 4)
    real(8), intent(in) :: B(VectorWidth, 3, 3)
    real(8), intent(in) :: C(VectorWidth)
    
    integer :: i, j
    !dir$ assume_aligned output:64, A:64, B:64, C:64
    do j=1,4
      !dir$ ivdep
      !dir$ vector always
      do i=1,3
        Output(1:4,i,j) = &
          Hack( A(1:4,:,j), B(1:4,:,i), C(1:4))
        Output(5:8,i,j) = &
          Hack( A(5:8,:,j), B(5:8,:,i), C(5:8))
      end do
    end do
    contains
    function Hack(A, B, C)
    real(8) :: Hack(4), A(4,3), B(4,3), C(4)
    !dir$ assume_aligned Hack(1):64, A:64, B:64, C:64
    Hack(:) = &
        ( A(:,1) * B(:,1) &
        + A(:,2) * B(:,2) &
        + A(:,3) * B(:,3) ) * C(:)
    end function Hack
end subroutine foo3

Jim Dempsey

JohnNichols · ‎09-02-2020

In the beginning was the Word......

Jim: That statement actually makes more sense than trying to understand your code. It is just way too deep.

JMN

jimdempseyatthecove · ‎09-02-2020

John,

The code wasn't intended for you. It was intended for Intel compiler developers, and possibly for FortranFan, Mecej, etc., who are looking for a workaround to a vectorization issue.

For mere mortals like you, it is gobbly **bleep**.

Jim Dempsey

JohnNichols · ‎09-04-2020

Jim:

Thank you for clarifying that -- I will never make the Titan class.

John

Bernard · ‎09-04-2020

Hi Jim,

Did you try annotating the nested "Hack" function with a vector attribute? For example:

!dir$ attributes vector : vectorlength(8) :: Hack

P.S.

A quick look at the ifort 19 assembly in Godbolt Compiler Explorer revealed that the "Hack" function was (as expected) inlined at its call site. Unfortunately, the machine code (AVX2) uses a mixture of XMM and YMM registers. At no point was I able to force the generation of AVX512 code, even when using the "-qopt-zmm-usage=high" compiler option. I presume that either the data layout of the Hack arguments was not optimal for 8-lane code, or the cost model judged that switching on the additional 512-bit circuitry needed for heavy AVX512 code was prohibitive given the small data size.

jimdempseyatthecove · ‎09-04-2020

The Hack is not required on AVX512 builds.

The point was not to show the call overhead of the Hack, rather to show that the do j=; do i= loops did not vectorize code that is clearly vectorizable.

An alternative to the Hack is, in the innermost loop, to use associate(component=>slice(:,i,j)) for each component in the statement, then use the (1:4) indexes of the components, and then the (5:8) indexes of the components.

When that doesn't work, use associate(component=>slice(1:4,i,j)... statement, end associate, followed by associate(component=>slice(5:8,i,j)... statement, end associate. While these work, it is a royal PIA.
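
A minimal sketch of the second form (one associate block per half slice), applied to the statement from foo. The names mod_assoc_sketch, foo_assoc and the associate names (o, a1, b1, c4, ...) are hypothetical, and the directives are left out. Whether the compiler actually vectorizes this depends on the compiler version and target, so treat it as an illustration of the pattern rather than a guaranteed fix.

```fortran
module mod_assoc_sketch
    implicit none
    integer, parameter :: VectorWidth = 8   ! 64-byte cache line / 8-byte REAL(8)
contains
    subroutine foo_assoc(Output, A, B, C)
        real(8), intent(out) :: Output(VectorWidth, 3, 4)
        real(8), intent(in)  :: A(VectorWidth, 3, 4)
        real(8), intent(in)  :: B(VectorWidth, 3, 3)
        real(8), intent(in)  :: C(VectorWidth)
        integer :: i, j

        do j = 1, 4
            do i = 1, 3
                ! First half of the slice: lanes 1:4
                associate (o  => Output(1:4,i,j), &
                           a1 => A(1:4,1,j), a2 => A(1:4,2,j), a3 => A(1:4,3,j), &
                           b1 => B(1:4,1,i), b2 => B(1:4,2,i), b3 => B(1:4,3,i), &
                           c4 => C(1:4))
                    o = (a1*b1 + a2*b2 + a3*b3) * c4
                end associate
                ! Second half of the slice: lanes 5:8
                associate (o  => Output(5:8,i,j), &
                           a1 => A(5:8,1,j), a2 => A(5:8,2,j), a3 => A(5:8,3,j), &
                           b1 => B(5:8,1,i), b2 => B(5:8,2,i), b3 => B(5:8,3,i), &
                           c4 => C(5:8))
                    o = (a1*b1 + a2*b2 + a3*b3) * c4
                end associate
            end do
        end do
    end subroutine foo_assoc
end module mod_assoc_sketch
```

The repetition of the selector list for each half is what makes this approach tedious in a large code base.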

Jim Dempsey

Bernard · ‎09-06-2020

I suppose you may already know this.

In subroutine 'foo', indexing the arrays by the induction variable "i" did enable AVX2 vectorization. Unfortunately, the directive "!dir$ attributes align" did not have any effect: the compiler generates unaligned loads/stores. I wonder why indexing by the sequence (1,2,3) produced scalar AVX2 code while indexing by the induction variable "i" vectorized the code.

Godbolt link