Hi guys,
I was looking at some code and noticed an interesting construction:
S(in1:in2,jn1:jn2) = transpose(S(jn1:jn2,in1:in2))
where `S` is some 2D matrix.
Unfortunately, when the matrix is large (for example, 2000x2000), this construction causes trouble. With the following code:
program main
  integer, parameter :: N = 2000
  real(8), allocatable :: S(:,:)
  integer(8) :: indexes(4)
  allocate(S(N,N))
  indexes = (/ 1, N, 1, N /)
  call test(S, indexes)
  print *, S(1,1)
contains
  subroutine test(S, indexes)
    real(8), intent(inout) :: S(:,:)
    integer(8), intent(in) :: indexes(4)
    integer(8) :: in1, in2, jn1, jn2
    in1 = indexes(1)
    in2 = indexes(2)
    jn1 = indexes(3)
    jn2 = indexes(4)
    S(in1:in2,jn1:jn2) = transpose(S(jn1:jn2,in1:in2))
  end subroutine test
end program main
one gets a segfault if it is compiled without the `-heap-arrays` flag.
I suppose it happens because of a hidden temporary array that the compiler allocates to hold the transposed block of `S`, or something like that.
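As an illustration of that guess, here is a sketch of what the compiler effectively has to do for the assignment inside `test` (this is my assumption, with the hidden temporary made explicit; without `-heap-arrays` the temporary may be placed on the stack, and a 2000x2000 block of real(8) is about 32 MB, far beyond a typical default stack size):
block
  real(8), allocatable :: tmp(:,:)        ! explicit stand-in for the compiler's hidden temporary
  tmp = transpose(S(jn1:jn2,in1:in2))     ! full copy of the transposed block
  S(in1:in2,jn1:jn2) = tmp                ! write it back
end block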
One possible solution is to do the transpose in memory (in place), using only O(1) additional storage. A simple procedure to do that (for square matrices only) is:
subroutine transpose_v1(S)
  real(8), intent(inout) :: S(:,:)
  integer :: i, j
  real(8) :: tmp
  if (size(S, dim=1) /= size(S, dim=2)) error stop "Only for square matrices"
  do i = 1, size(S, dim=1)
    do j = i + 1, size(S, dim=2)
      tmp = S(i,j)
      S(i,j) = S(j,i)
      S(j,i) = tmp
    end do
  end do
end subroutine transpose_v1
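With it, a full in-place transpose of a square matrix is just a call (a usage sketch):
call transpose_v1(S)   ! instead of S = transpose(S), which may allocate a temporary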
Returning to the initial example, a sub-block of the 2D matrix was being transposed there, so the code above has to be modified to handle that case.
The simplest way is to call `transpose_v1` on a submatrix:
subroutine transpose_v2(S, I0, I1, J0, J1)
  real(8), intent(inout) :: S(:,:)
  integer(8), intent(in) :: I0, I1, J0, J1
  ! copy elements
  S(I0:I1,J0:J1) = S(J0:J1,I0:I1)
  ! transpose them
  call transpose_v1(S(I0:I1,J0:J1))
end subroutine transpose_v2
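With this version, the original construction can be written as (a usage sketch; like `transpose_v1`, it still assumes the sub-block is square):
call transpose_v2(S, in1, in2, jn1, jn2)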
However, `transpose_v2` performs two memory-copying passes: one in the array assignment and a second one inside `transpose_v1`. We can merge these two passes.
In the case of
S(in1:in2,jn1:jn2) = transpose(S(jn1:jn2,in1:in2))
there are three cases:
- the sub-blocks overlap completely, i.e. in1 == jn1 and in2 == jn2
- the sub-blocks overlap partially, i.e. in1 <= jn1 <= in2 or jn1 <= in1 <= jn2
- the sub-blocks do not overlap at all
Note that the overlapping region is always a square sitting on the main diagonal (for example, with in1=1, in2=100, jn1=51, jn2=150, the blocks S(1:100,51:150) and S(51:150,1:100) share the square 51:100 x 51:100).
Keeping the same subroutine interface as for `transpose_v2`, case (1) is just an in-place transpose of a square block; here we use the O(1)-memory algorithm:
do concurrent (i=I0:I1, j=I0:I1, i<j)
  tmp = S(i,j)
  S(i,j) = S(j,i)
  S(j,i) = tmp
end do
For (2), we need to introduce two new variables:
integer(8) :: imax, jmax
imax = max(I0,J0)
jmax = min(I1,J1)
This case occurs when `jmax > imax`. For it, the algorithm is the following:
do concurrent (i=I0:I1, j=J0:J1)
  if (.not.(i >= imax .and. i <= jmax .and. j >= imax .and. j <= jmax)) then ! non-overlapped space, just memory copy
    S(i,j) = S(j,i)
  else if (i < j) then ! overlapped space, we need to swap elements
    tmp = S(i,j)
    S(i,j) = S(j,i)
    S(j,i) = tmp
  end if
end do
For (3), we have memory copying:
do concurrent (i=I0:I1, j=J0:J1)
  S(i,j) = S(j,i)
end do
Then, our final procedure is the following:
subroutine transpose_inmemory(S, i0, i1, j0, j1)
  real(8), contiguous, intent(inout) :: S(:,:)
  integer(8), intent(in) :: i0, i1, j0, j1
  integer(8) :: imax, jmax, i, j
  real(8) :: tmp
  imax = max(I0,J0)
  jmax = min(I1,J1)
  if (I0 == J0 .and. I1 == J1) then
    ! case (1): the sub-blocks overlap completely
    do concurrent (i=I0:I1, j=I0:I1, i<j)
      tmp = S(i,j)
      S(i,j) = S(j,i)
      S(j,i) = tmp
    end do
  else if (jmax > imax) then
    ! case (2): the sub-blocks overlap partially
    do concurrent (i=I0:I1, j=J0:J1)
      if (.not.(i >= imax .and. i <= jmax .and. j >= imax .and. j <= jmax)) then
        S(i,j) = S(j,i)
      else if (i < j) then
        tmp = S(i,j)
        S(i,j) = S(j,i)
        S(j,i) = tmp
      end if
    end do
  else
    ! case (3): the sub-blocks do not overlap
    do concurrent (i=I0:I1, j=J0:J1)
      S(i,j) = S(j,i)
    end do
  end if
end subroutine transpose_inmemory
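With this routine, the problematic assignment in the first example can be replaced by a single call (a usage sketch for the square sub-blocks used there):
call transpose_inmemory(S, in1, in2, jn1, jn2)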
Benchmarking with a 2D matrix `S(2000,2000)` (ifx 2023.2.0, `-Ofast -heap-arrays -flto`):
- For (1), when the overlapped region grows from 10x10 to 1000x1000, the speed-up goes from 1.5x to 3x. With larger matrices, the speed-up is larger.
- For (2), when the overlapped region covers about 90% of 50x50 to 1800x1800 sub-matrices, the speed-up ranges from 0.4x to 3x, and only for overlapped regions larger than 1000x1000 do I see an actual speed-up.
- For (3), when the sub-matrix size changes from 50x50 to 800x800, the speed-up goes from 1.03x to 2.5x.
So the current code provides an improvement in 2 of the 3 cases. Let's replace the slow second case with C code. First, introduce an interface to a new C routine:
interface
  subroutine transpose_case2(S, N, M, I0, I1, J0, J1) bind(C, name="transpose_case2")
    real(8), intent(inout) :: S(*)
    integer(8), value, intent(in) :: N, M
    integer(8), value, intent(in) :: I0, I1, J0, J1
  end subroutine
end interface
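On my machine `real(8)` and `integer(8)` happen to match `c_double` and `c_int64_t`, so the interface above works; a variant that makes the correspondence explicit through `iso_c_binding` (equivalent, just a bit more portable) would look like this:
interface
  subroutine transpose_case2(S, N, M, I0, I1, J0, J1) bind(C, name="transpose_case2")
    use, intrinsic :: iso_c_binding, only: c_double, c_int64_t
    real(c_double), intent(inout) :: S(*)
    integer(c_int64_t), value, intent(in) :: N, M
    integer(c_int64_t), value, intent(in) :: I0, I1, J0, J1
  end subroutine
end interface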
Then replace the current `else if (jmax > imax)` branch with:
else if (jmax > imax) then ! case (2): the sub-blocks overlap partially
  call transpose_case2(S, size(S, dim=1, kind=8), size(S, dim=2, kind=8), I0, I1, J0, J1)
And implement the loops in C:
#include <stdint.h>

#define MAX(x, y) (((x) > (y)) ? (x) : (y))
#define MIN(x, y) (((x) < (y)) ? (x) : (y))

/* S is the Fortran array in column-major order with leading dimension N;
   the 1-based Fortran element S(i,j) maps to S[(j-1)*N + (i-1)]. */
void transpose_case2(double *restrict S, const int64_t N, const int64_t M,
                     const int64_t I0, const int64_t I1, const int64_t J0, const int64_t J1) {
  int64_t imax = MAX(I0, J0);
  int64_t jmax = MIN(I1, J1);
  int64_t i, j;
  double tmp;
  for (i = I0; i <= I1; ++i) {
    for (j = J0; j <= J1; ++j) {
      if (!(i >= imax && i <= jmax && j >= imax && j <= jmax)) {
        /* non-overlapped space: plain copy S(i,j) = S(j,i) */
        S[(j-1)*N + (i-1)] = S[(i-1)*N + (j-1)];
      } else if (i < j) {
        /* overlapped space: swap S(i,j) and S(j,i) */
        tmp = S[(i-1)*N + (j-1)];
        S[(i-1)*N + (j-1)] = S[(j-1)*N + (i-1)];
        S[(j-1)*N + (i-1)] = tmp;
      }
    }
  }
}
With these replacements, when the overlapped region is about 90% of 50x50 to 1000x1000 sub-matrices, the speed-up is from 0.9x to 3x. Unfortunately, larger overlaps again lead to a significant increase in computation time.
Fortunately, I noticed that GCC 13.1 generates more efficient code for the C routine: https://godbolt.org/z/7h85vvjns
Compiling the C code with GCC 11.3 (`-O3`) and linking it with the rest of my application, I get a speed-up for matrices of any size (from close to 1x up to 3x).
In the end, I got a speed-up of 1x to 3x for these specific transpose operations without using any additional memory. Unfortunately, the code could not be written in Fortran alone, nor built with a single compiler.
The whole code of the transpose routine is here (without testing application): https://godbolt.org/z/e7P6vExM3
Hi guys,
I found a generic algorithm for in-memory (in-place) matrix transposition, i.e. the following transformation can be done in memory (the shapes still have to satisfy the usual index restrictions of a transpose):
S(i0:i1,j0:j1) = transpose(S(k0:k1,l0:l1))
The algorithm has the following steps:
1) check whether the remaining source and destination blocks are the same
1a) if so, transpose the block in memory using transpose_block; exit
2) check whether the regions being transposed do not overlap
2a) if so, copy the memory using copy_block; exit
3) find a non-overlapping region in the left-hand-side (destination) block
4) copy the corresponding block from the right-hand-side (source) block via copy_block
5) exclude the block that has just been handled
6) goto 1
In a fairly slow Fortran implementation, it gives me the same speed as the previous approach (a speed-up of up to 3x), but, like the previous Fortran-only algorithm, it is rather slow for small matrices (about 0.8x of the initial results; this can be improved by rewriting `copy_block` and `transpose_block` in C and compiling them with GCC).
The usage of the new algorithm for the previous example is the following:
block
  use transpose_mod
  type(block_t) :: block_to, block_from
  block_from = new_block((/ k0, k1, l0, l1 /))
  block_to   = new_block((/ i0, i1, j0, j1 /))
  call inmemory_transpose(S, block_to, block_from)
end block
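For readers without the attachment, a minimal sketch of what `block_t` and `new_block` could look like (my assumption based on the usage above; the actual definitions are in the attached `transpose_mod`):
type :: block_t
  integer(8) :: i0, i1, j0, j1              ! row and column bounds of the block
end type block_t

pure function new_block(bounds) result(b)
  integer(8), intent(in) :: bounds(4)       ! (/ i0, i1, j0, j1 /)
  type(block_t) :: b
  b%i0 = bounds(1); b%i1 = bounds(2)
  b%j0 = bounds(3); b%j1 = bounds(4)
end function new_block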
The source code of the algorithm is presented in the attachment.
Hi Foxtran,
have you tried the corresponding MKL routine?
https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2023-2/mkl-imatcopy.html
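For a full in-place transpose of a square double-precision matrix, the call would look roughly like this (sketched from memory, so please check the exact argument kinds in the documentation):
call mkl_dimatcopy('C', 'T', n, n, 1.0d0, S, n, n)
Here 'C' selects column-major ordering, 'T' requests the transpose, 1.0d0 is the scaling factor, and the last two arguments are the leading dimensions before and after the operation.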