- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi guys,

I was looking at some code and I noticed that there is one interesting construction:

`S(in1:in2,jn1:jn2) = transpose(S(jn1:jn2,in1:in2))`

where `S` is some 2D matrix.

Unfortunately, when the matrix is large (for example, its size is 2000x2000), with the following code:

```
program main
integer, parameter :: N = 2000
real(8), allocatable :: S(:,:)
integer(8) :: indexes(4)
allocate(S(N,N))
indexes = (/ 1, N, 1, N /)
call test(S, indexes)
print *, S(1,1)
contains
subroutine test(S, indexes)
real(8), intent(inout) :: S(:,:)
integer(8), intent(in) :: indexes(4)
integer(8) :: in1, in2, jn1, jn2
in1 = indexes(1)
in2 = indexes(2)
jn1 = indexes(3)
jn2 = indexes(4)
S(in1:in2,jn1:jn2) = transpose(S(jn1:jn2,in1:in2))
end subroutine test
end program main
```

one gets SegFault if it is compiled without the `-heap-arrays` flag.

I suppose it happens due to the hidden allocation of the 2D matrix for saving transposed `S` or something like that. One possible solution is to generate an in-memory transpose that requires O(1) memory in this place.

A simple procedure to do that is (only for square matrixes):

```
subroutine transpose_v1(S)
real(8), intent(inout) :: S(:,:)
if (size(S, dim=1) /= size(S, dim=2)) error stop "Only for square matrixes"
do i = 1, size(S, dim=1)
do j = i + 1, size(S, dim=2)
tmp = S(i,j)
S(i,j) = S(j,i)
S(j,i) = tmp
end do
end do
end subroutine transpose_v1
```

Returning to the initial example, there was transposing of sub-space in a 2D matrix. So, the above-presented code should be somehow modified to catch this case.

The simplest way is to call `transpose_v1` over some submatrix:

```
subroutine transpose_v2(S, I0,I1,J0,J1)
real(8), intent(inout) :: S(:,:)
integer(8), intent(in) :: I0,I1,J0,J1
! copy elements
S(I0:I1,J0:J1) = S(J0:J1,I0:I1)
! transpose them
call transpose_v1(S(I0:I1,J0:J1))
end subroutine transpose_v2
```

However, here, we have two memory copying operations: the first one is in the array assignment, second one is in the `transpose_v1` subroutine. We can merge these two copying operations.

In the case of

`S(in1:in2,jn1:jn2) = transpose(S(jn1:jn2,in1:in2))`

there are three cases:

- sub-blocks totally overlapped, i.e. in1 == jn1 and in2 == jn2
- sub-blocks partially overlapped, i.e. in1 <= jn1 <= jn2 or jn1 <= in1 <= jn2
- sub-block did not overlap

Note, here, the overlapping is always square.

Having subroutine definition the same as for `transpose_v2`, for (1), we need to call `transpose`. Here, we will use O(1)-memory algorithm:

```
do concurrent (i=I0:I1, j=I0:I1, i<j)
tmp = S(i,j)
S(i,j) = S(j,i)
S(j,i) = tmp
end do
```

For (2), we need to introduce two new variables:

```
integer(8) :: imax, jmax
imax = max(I0,J0)
jmax = min(I1,J1)
```

and, when `jmax > imax` this case happens. For this case, our algorithm is the following:

```
do concurrent (i=I0:I1, j=J0:J1)
if (.not.(i >= imax .and. i <= jmax .and. j >= imax .and. j <= jmax)) then ! non-overlapped space, just memory copy
S(i,j) = S(j,i)
else if (i < j) then ! overlapped space, we need to swap elements
tmp = S(i,j)
S(i,j) = S(j,i)
S(j,i) = tmp
end if
end do
```

For (3), we have memory copying:

```
do concurrent (i=I0:I1, j=J0:J1)
S(i,j) = S(j,i)
end do
```

Then, our final procedure is the following:

```
subroutine transpose_inmemory(S, i0,i1,j0,j1)
real(8), contiguous, intent(inout) :: S(:,:)
integer(8), intent(in) :: i0,i1,j0,j1
integer(8) :: imax, jmax, i, j
real(8) :: tmp
imax = max(I0,J0)
jmax = min(I1,J1)
if (I0 == J0 .and. I1 == J1) then
do concurrent (i=I0:I1, j=I0:I1, i<j)
tmp = S(i,j)
S(i,j) = S(j,i)
S(j,i) = tmp
end do
else if (jmax > imax) then ! if overlap is
do concurrent (i=I0:I1, j=J0:J1)
if (.not.(i >= imax .and. i <= jmax .and. j >= imax .and. j <= jmax)) then
S(i,j) = S(j,i)
else if (i < j) then
tmp = S(i,j)
S(i,j) = S(j,i)
S(j,i) = tmp
end if
end do
else
do concurrent (i=I0:I1, j=J0:J1)
S(i,j) = S(j,i)
end do
end if
end subroutine transpose_inmemory
```

Having 2D matrix `S(2000,2000)` (ifx-2023.2.0 -Ofast -heap-arrays -flto):

- For (1), when the overlapped region changes from 10x10 to 1000x1000, the speed up is from 1.5x to 3x. With larger matrixes, the speed-up is larger.
- For (2), when the overlapped region is about 90% of 50x50-1800x1800 matrixes, the speed-up is from 0.4x to 3x. And only for overlapped regions larger than 1000x1000, I can see speed-up.
- For (3), when the sub-matrix size is changing from 50x50 to 800x800, the speed-up is from 1.03x to 2.5x.

So, the current code provides an improvement in 2 of 3 cases. Let's replace the slow second case with C code. Introduce C-API to a new C routine:

```
interface
subroutine transpose_case2(S, N,M, I0,I1,J0,J1) bind(C, name="transpose_case2")
real(8), intent(inout) :: S(*)
integer(8), value, intent(in) :: N,M
integer(8), value, intent(in) :: I0,I1,J0,J1
end subroutine
end interface
```

Replace current `if (jmax > imax)` with:

```
else if (jmax > imax) then ! if overlap is
call transpose_case2(S, size(S, dim=1, kind=8), size(S, dim=2, kind=8), I0, I1, J0, J1)
```

And implement do-cycles in C:

```
#include <stdint.h>
#define MAX(x, y) (((x) > (y)) ? (x) : (y))
#define MIN(x, y) (((x) < (y)) ? (x) : (y))
void transpose_case2(double *restrict S, const int64_t N, const int64_t M, const int64_t I0, const int64_t I1, const int64_t J0, const int64_t J1) {
int64_t imax = MAX(I0,J0);
int64_t jmax = MIN(I1,J1);
int64_t i, j;
double tmp;
for(i = I0; i <= I1; ++i)
{
for(j = J0; j <= J1; ++j)
{
if (!(i >= imax && i <= jmax && j >= imax && j <= jmax)) {
S[(j-1)*N + (i-1)] = S[(i-1)*N + (j-1)];
} else if (i < j) {
tmp = S[(i-1)*N + (j-1)];
S[(i-1)*N + (j-1)] = S[(j-1)*N + (i-1)];
S[(j-1)*N + (i-1)] = tmp;
}
}
}
}
```

With these replacements, if the overlapped region is about 90% of 50x50-1000x1000 matrixes, the speed-up is from 0.9x to 3x. Unfortunately, higher overlapping leads to a significant increase in computational time, again.

Fortunately, I noticed that GCC 13.1 provides more effective code for C: https://godbolt.org/z/7h85vvjns

Compiling C code with GCC 11.3 (-O3) and linking it with other parts of my application, I got speed-up for any matrixes (from close to 1x up to 3x times).

Finally, I got the **speed-up** for some specific transpose operations from 1x to **3x** times **without** using **additional memory**. Unfortunately, the code could not be written only in Fortran and compiled by only one compiler.

The whole code of the transpose routine is here (without testing application): https://godbolt.org/z/e7P6vExM3

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi guys,

I found a generic algorithm for in-memory matrix transposing. I.e. the following transformation can be done in memory (but shapes need to satisfy transpose's restriction about indexes):

`S(i0:i1,j0:j1) = transpose(S(k0:k1,l0:l1))`

The algorithm has the following steps:

1) check that rest matrixes are the same

1a) if true, transpose them in memory using transpose_block; exit

2) check that transposing regions are not overlapped

2a) if true, copy memory using copy_block; exit

3) find non-overlapped region in left-side part of matrix

4) copy related block from right-side part of matrix via copy_block

5) exclude transposed block

6) goto 1

In quite slow Fortran implementation, it gives me the same speed like previous one (speed-up is up to 3x), but it is quite slow for small matrixes like it was for previous Fortran-written algorithm (about 0.8x of initial results, but it can be improved with rewriting `copy_block` and `transpose_block` in C and compiling it with GCC).

The usage of new algorithm for previous example is the following:

```
block
use transpose_mod
type(block_t) :: block_to, block_from
block_from = new_block((/ k0,k1, l0,l1 /)
block_to = new_block((/ i0,i1, j0,j1 /)
call inmemory_transpose(S, block_to, block_from)
end block
```

In the attachment, the source code of algorithm is presented.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi Foxtran,

have you tried the corresponding MKL routine?

https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2023-2/mkl-imatcopy.html

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page