weird performance degradation (slice notation is faster)

Aldo_H_ · ‎02-23-2021

I am confused by the performance effects of small differences in the following code, which adds the Kronecker product of two matrices to another matrix.

The arrays `c` and copy_c` are identical, and both are simply contiguous. I see some weird behavior: each of the following modifications result in 2x slower performance:

calling `add_kronprod` with the actual argument `copy_c` instead of `c`;
writing `c = copy_c` instead of `c(:,:) = copy_c`;
giving `c` the `target` attribute;
using a simply contiguous pointer (i.e., declared with contiguous attribute) `ptr_to_c` as the actual argument.

Points 1, 2, and 4 can be countered by adding the `contiguous` attribute to the dummy argument, which restores normal performance.

I would guess that there is some mistake in the compiler, where it fails to recognize that non-pointer local arrays are always contiguous. But that does not explain why situation 3 above cannot be improved with a contiguous attribute for the dummy variable

Can anyone explain what's going on here? Are these glitches in the compiler?

I compile with `ifort -O3 -ipo kronprod.f90`.
CPU: AMD A8-3870 APU
OS: Ubuntu 20.04.2 LTS (Focal Fossa)

program test_kronprod
   use, intrinsic :: iso_fortran_env
   implicit none
   
   integer, parameter :: wp = real64
   real(wp), allocatable, dimension(:,:) :: a, b, c, copy_c
   real(real64) :: tic, toc
   
   integer :: ma, mb, na, nb
   integer :: i, niter

   real(wp), pointer, contiguous :: ptr_to_c(:,:)
   ! UNCOMMENTING THIS MAKES THE CODE 2X SLOWER:
   ! target :: c

   ma = 10
   na = 10
   mb = 4
   nb = 4
   niter = 10**6

   allocate(a(ma, na))
   allocate(b(mb, nb))
   allocate(c(ma*mb, na*nb))
   call random_number(a)
   call random_number(b)
   call random_number(c)
   allocate(copy_c, source = c)
   ! ptr_to_c => c
   
   ! REMOVING THE SLICE NOTATION MAKES THE CODE 2X SLOWER:
   c(:,:) = copy_c         ! c = copy_c
   call cpu_time(tic)
   do i = 1, niter
      call add_kronprod(c, a, b)
   enddo
   call cpu_time(toc)
   print *, toc - tic

contains
   subroutine add_kronprod(C, A, B)
      ! C += kronprod(A, B)
      ! 

      ! UNCOMMENTING THIS FIXES THE PERFORMANCE IN *SOME* CASES
      ! contiguous :: C

      real(wp), intent(inout) :: C(:,:)
      real(wp), intent(in) :: A(:,:), B(:,:)

      integer :: Ci, Cj, Ai, Aj, Bi, Bj

      do Aj = 1, size(A,2)
         do Bj = 1, size(B,2)
            do Ai = 1, size(A,1)
               do Bi = 1, size(B,1)
                  Ci = (Ai - 1) * size(B,1) + Bi
                  Cj = (Aj - 1) * size(B,2) + Bj
                  C(Ci, Cj) = C(Ci, Cj) + A(Ai, Aj) * B(Bi, Bj)
               enddo
            enddo
         enddo
      enddo
   end subroutine
end program

Ron_Green · ‎02-23-2021

first, use -qopt-report 5

that will help with insights.

Next, READ THIS

you said: "I would guess that there is some mistake in the compiler, where it fails to recognize that non-pointer local arrays are always contiguous." The bold does not apply to arguments. Read the article, section 3, assumedshape array arguments.
the main program can

call add_kronprod(c(::2,::2), a(::2,::2), b(::2,::2))

In compiling the subroutine it has to assume it may be non-contiguous. This is why we have the CONTIGUOUS attribute. The compiler is not human and does not look at all the calling location and parameters - besides, after you write the program you may hire an intern that changes the code to do the above. Compiler has to assume the worse and generate safe code.

Use the opt-report to examine other performance issues.

Aldo_H_ · ‎02-24-2021

I checked the output of `-qopt-report=5` for all cases. The function `add_kronprod` always gets inlined. Therefore what matters is that the actual argument is simply contiguous.

Based on the first two cases below, I suspect that the compiler fails to recognize that local, non-pointer variables are simply contiguous.

The following is fast, and the optimization report shows that "reference C(ci,(aj-1)*4+bj) has aligned access" in the inner loop:
```
c(:,:) = copy_c
do i = 1, niter
   call add_kronprod(c, a, b)
enddo
```
This is 2x slower, and the optimization report shows "unaligned access" in the inner loop:
```
c = copy_c
do i = 1, niter
   call add_kronprod(c, a, b)
enddo
```
But I can regain normal performance by adding the compiler option `-assume norealloc_lhs`.
This is 3x slower than the first case, even though the optimization reports are entirely identical (!):
```
target :: c
...

c(:,:) = copy_c
do i = 1, niter
   call add_kronprod(c, a, b)
enddo
```
Maybe this last case is different, and it may have something to do with the compiler not figuring out that there is no argument aliasing.

Ron_Green · ‎02-24-2021

next experiment:

compile with

-align array64byte -qopenmp-simd

and make this small code addition

      !$omp simd aligned( A,B,C : 64 )
      do Aj = 1, size(A,2)
         do Bj = 1, size(B,2)
            do Ai = 1, size(A,1)
               do Bi = 1, size(B,1)
                  Ci = (Ai - 1) * size(B,1) + Bi
                  Cj = (Aj - 1) * size(B,2) + Bj
                  C(Ci, Cj) = C(Ci, Cj) + A(Ai, Aj) * B(Bi, Bj)
               enddo
            enddo
         enddo
      enddo
      !$omp end simd

Ron_Green · ‎02-24-2021

a few parting notes as I disengage from this thread. it looks like a somewhat interesting access pattern however.

These arrays are rather small. Is this typical of what you want to work with, 10x10s and 40x40s? YOu had to iterate a million times to even get the timer above noise level.

Your #rows and # columns is 10 which is not a multiple of your vector size. I'd suggest making #rows and #columns multiples of your vector length. this will help the compiler from creating remainder loops. it may not match whatever physical problem you are trying to solve - I suspect you have a reason these are 10x10 and not 16x16 or some other vector multiple. Or perhaps you are trying to make it hard for the vectorizer; that can be an interesting study as well, throw it something that is not optimal for the architecture so you can contrast that with a GPU or other arch that is not reliant on vectorization.

Anyhow, with -qopt-report 5 you might now look at -qopt-report-phase=vec,loop OR use the Vector Advisor for more insight. Looks like a fun little study. Enjoy.

Aldo_H_ · ‎02-25-2021

The array sizes are typical of my application, and they're in an inner loop somewhere. (I found this problem when I "corrected" the my code by following Steve Lionel's recommendation of removing the slice in `c(:,:) = ...`)

Thank you for the alignment suggestion. I'll try that some day, possibly on a different case.

It seems unrelated to the difference in performance between `c(:,:) = ...` and `c = ...`. I think that's surprising behavior from the compiler. It shouldn't think that whether "C(ci,(aj-1)*4+bj) has aligned access" can depend on how `c` is allocated.