Hi guys,
I was looking at some code and noticed an interesting construction:
S(in1:in2,jn1:jn2) = transpose(S(jn1:jn2,in1:in2))
where `S` is some 2D matrix.
Unfortunately, when the matrix is large (for example, 2000x2000), this construction causes trouble. With the following code:
program main
  integer, parameter :: N = 2000
  real(8), allocatable :: S(:,:)
  integer(8) :: indexes(4)
  allocate(S(N,N))
  indexes = (/ 1, N, 1, N /)
  call test(S, indexes)
  print *, S(1,1)
contains
  subroutine test(S, indexes)
    real(8), intent(inout) :: S(:,:)
    integer(8), intent(in) :: indexes(4)
    integer(8) :: in1, in2, jn1, jn2
    in1 = indexes(1)
    in2 = indexes(2)
    jn1 = indexes(3)
    jn2 = indexes(4)
    S(in1:in2,jn1:jn2) = transpose(S(jn1:jn2,in1:in2))
  end subroutine test
end program main
one gets a segfault if it is compiled without the `-heap-arrays` flag.
I suppose it happens because of a hidden temporary array that the compiler allocates to hold the transposed block of `S`, or something like that.
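As an illustration of that guess, here is a sketch of what the compiler effectively has to do for the assignment inside `test` (this is my assumption, with the hidden temporary made explicit; without `-heap-arrays` the temporary may be placed on the stack, and a 2000x2000 block of real(8) is about 32 MB, far beyond a typical default stack size):
block
  real(8), allocatable :: tmp(:,:)        ! explicit stand-in for the compiler's hidden temporary
  tmp = transpose(S(jn1:jn2,in1:in2))     ! full copy of the transposed block
  S(in1:in2,jn1:jn2) = tmp                ! write it back
end block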
One possible solution is to do the transpose in memory (in place), using only O(1) additional storage. A simple procedure to do that (for square matrices only) is:
subroutine transpose_v1(S)
  real(8), intent(inout) :: S(:,:)
  integer :: i, j
  real(8) :: tmp
  if (size(S, dim=1) /= size(S, dim=2)) error stop "Only for square matrices"
  do i = 1, size(S, dim=1)
    do j = i + 1, size(S, dim=2)
      tmp = S(i,j)
      S(i,j) = S(j,i)
      S(j,i) = tmp
    end do
  end do
end subroutine transpose_v1
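With it, a full in-place transpose of a square matrix is just a call (a usage sketch):
call transpose_v1(S)   ! instead of S = transpose(S), which may allocate a temporary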
Returning to the initial example, a sub-block of the 2D matrix was being transposed there, so the code above has to be modified to handle that case.
The simplest way is to call `transpose_v1` on a submatrix:
subroutine transpose_v2(S, I0, I1, J0, J1)
  real(8), intent(inout) :: S(:,:)
  integer(8), intent(in) :: I0, I1, J0, J1
  ! copy elements
  S(I0:I1,J0:J1) = S(J0:J1,I0:I1)
  ! transpose them
  call transpose_v1(S(I0:I1,J0:J1))
end subroutine transpose_v2
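With this version, the original construction can be written as (a usage sketch; like `transpose_v1`, it still assumes the sub-block is square):
call transpose_v2(S, in1, in2, jn1, jn2)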
However, `transpose_v2` performs two memory-copying passes: one in the array assignment and a second one inside `transpose_v1`. We can merge these two passes.
In the case of
S(in1:in2,jn1:jn2) = transpose(S(jn1:jn2,in1:in2))
there are three cases:
- the sub-blocks overlap completely, i.e. in1 == jn1 and in2 == jn2
- the sub-blocks overlap partially, i.e. in1 <= jn1 <= in2 or jn1 <= in1 <= jn2
- the sub-blocks do not overlap at all
Note that the overlapping region is always a square sitting on the main diagonal (for example, with in1=1, in2=100, jn1=51, jn2=150, the blocks S(1:100,51:150) and S(51:150,1:100) share the square 51:100 x 51:100).
Keeping the same subroutine interface as for `transpose_v2`, case (1) is just an in-place transpose of a square block; here we use the O(1)-memory algorithm:
do concurrent (i=I0:I1, j=I0:I1, i<j)
  tmp = S(i,j)
  S(i,j) = S(j,i)
  S(j,i) = tmp
end do
For (2), we need to introduce two new variables:
integer(8) :: imax, jmax
imax = max(I0,J0)
jmax = min(I1,J1)
This case occurs when `jmax > imax`. For it, the algorithm is the following:
do concurrent (i=I0:I1, j=J0:J1)
  if (.not.(i >= imax .and. i <= jmax .and. j >= imax .and. j <= jmax)) then ! non-overlapped space, just memory copy
    S(i,j) = S(j,i)
  else if (i < j) then ! overlapped space, we need to swap elements
    tmp = S(i,j)
    S(i,j) = S(j,i)
    S(j,i) = tmp
  end if
end do
For (3), we have memory copying:
do concurrent (i=I0:I1, j=J0:J1)
  S(i,j) = S(j,i)
end do
Then, our final procedure is the following:
subroutine transpose_inmemory(S, i0, i1, j0, j1)
  real(8), contiguous, intent(inout) :: S(:,:)
  integer(8), intent(in) :: i0, i1, j0, j1
  integer(8) :: imax, jmax, i, j
  real(8) :: tmp
  imax = max(I0,J0)
  jmax = min(I1,J1)
  if (I0 == J0 .and. I1 == J1) then
    ! case (1): the sub-blocks overlap completely
    do concurrent (i=I0:I1, j=I0:I1, i<j)
      tmp = S(i,j)
      S(i,j) = S(j,i)
      S(j,i) = tmp
    end do
  else if (jmax > imax) then
    ! case (2): the sub-blocks overlap partially
    do concurrent (i=I0:I1, j=J0:J1)
      if (.not.(i >= imax .and. i <= jmax .and. j >= imax .and. j <= jmax)) then
        S(i,j) = S(j,i)
      else if (i < j) then
        tmp = S(i,j)
        S(i,j) = S(j,i)
        S(j,i) = tmp
      end if
    end do
  else
    ! case (3): the sub-blocks do not overlap
    do concurrent (i=I0:I1, j=J0:J1)
      S(i,j) = S(j,i)
    end do
  end if
end subroutine transpose_inmemory
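With this routine, the problematic assignment in the first example can be replaced by a single call (a usage sketch for the square sub-blocks used there):
call transpose_inmemory(S, in1, in2, jn1, jn2)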
Benchmarking with a 2D matrix `S(2000,2000)` (ifx 2023.2.0, `-Ofast -heap-arrays -flto`):
- For (1), when the overlapped region grows from 10x10 to 1000x1000, the speed-up goes from 1.5x to 3x. With larger matrices, the speed-up is larger.
- For (2), when the overlapped region covers about 90% of 50x50 to 1800x1800 sub-matrices, the speed-up ranges from 0.4x to 3x, and only for overlapped regions larger than 1000x1000 do I see an actual speed-up.
- For (3), when the sub-matrix size changes from 50x50 to 800x800, the speed-up goes from 1.03x to 2.5x.
So the current code provides an improvement in 2 of the 3 cases. Let's replace the slow second case with C code. First, introduce an interface to a new C routine:
interface
  subroutine transpose_case2(S, N, M, I0, I1, J0, J1) bind(C, name="transpose_case2")
    real(8), intent(inout) :: S(*)
    integer(8), value, intent(in) :: N, M
    integer(8), value, intent(in) :: I0, I1, J0, J1
  end subroutine
end interface
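On my machine `real(8)` and `integer(8)` happen to match `c_double` and `c_int64_t`, so the interface above works; a variant that makes the correspondence explicit through `iso_c_binding` (equivalent, just a bit more portable) would look like this:
interface
  subroutine transpose_case2(S, N, M, I0, I1, J0, J1) bind(C, name="transpose_case2")
    use, intrinsic :: iso_c_binding, only: c_double, c_int64_t
    real(c_double), intent(inout) :: S(*)
    integer(c_int64_t), value, intent(in) :: N, M
    integer(c_int64_t), value, intent(in) :: I0, I1, J0, J1
  end subroutine
end interface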
Then replace the current `else if (jmax > imax)` branch with:
else if (jmax > imax) then ! case (2): the sub-blocks overlap partially
  call transpose_case2(S, size(S, dim=1, kind=8), size(S, dim=2, kind=8), I0, I1, J0, J1)
And implement the loops in C:
#include <stdint.h>

#define MAX(x, y) (((x) > (y)) ? (x) : (y))
#define MIN(x, y) (((x) < (y)) ? (x) : (y))

/* S is the Fortran array in column-major order with leading dimension N;
   the 1-based Fortran element S(i,j) maps to S[(j-1)*N + (i-1)]. */
void transpose_case2(double *restrict S, const int64_t N, const int64_t M,
                     const int64_t I0, const int64_t I1, const int64_t J0, const int64_t J1) {
  int64_t imax = MAX(I0, J0);
  int64_t jmax = MIN(I1, J1);
  int64_t i, j;
  double tmp;
  for (i = I0; i <= I1; ++i) {
    for (j = J0; j <= J1; ++j) {
      if (!(i >= imax && i <= jmax && j >= imax && j <= jmax)) {
        /* non-overlapped space: plain copy S(i,j) = S(j,i) */
        S[(j-1)*N + (i-1)] = S[(i-1)*N + (j-1)];
      } else if (i < j) {
        /* overlapped space: swap S(i,j) and S(j,i) */
        tmp = S[(i-1)*N + (j-1)];
        S[(i-1)*N + (j-1)] = S[(j-1)*N + (i-1)];
        S[(j-1)*N + (i-1)] = tmp;
      }
    }
  }
}
With these replacements, when the overlapped region is about 90% of 50x50 to 1000x1000 sub-matrices, the speed-up is from 0.9x to 3x. Unfortunately, larger overlaps again lead to a significant increase in computation time.
Fortunately, I noticed that GCC 13.1 generates more efficient code for the C routine: https://godbolt.org/z/7h85vvjns
Compiling the C code with GCC 11.3 (`-O3`) and linking it with the rest of my application, I get a speed-up for matrices of any size (from close to 1x up to 3x).
In the end, I got a speed-up of 1x to 3x for these specific transpose operations without using any additional memory. Unfortunately, the code could not be written in Fortran alone, nor built with a single compiler.
The whole code of the transpose routine is here (without testing application): https://godbolt.org/z/e7P6vExM3
Hi guys,
I found a generic algorithm for in-memory (in-place) matrix transposition, i.e. the following transformation can be done in memory (the shapes still have to satisfy the usual index restrictions of a transpose):
S(i0:i1,j0:j1) = transpose(S(k0:k1,l0:l1))
The algorithm has the following steps:
1) check whether the remaining source and destination blocks are the same
1a) if so, transpose the block in memory using transpose_block; exit
2) check whether the regions being transposed do not overlap
2a) if so, copy the memory using copy_block; exit
3) find a non-overlapping region in the left-hand-side (destination) block
4) copy the corresponding block from the right-hand-side (source) block via copy_block
5) exclude the block that has just been handled
6) goto 1
In a fairly slow Fortran implementation, it gives me the same speed as the previous approach (a speed-up of up to 3x), but, like the previous Fortran-only algorithm, it is rather slow for small matrices (about 0.8x of the initial results; this can be improved by rewriting `copy_block` and `transpose_block` in C and compiling them with GCC).
The usage of the new algorithm for the previous example is the following:
block
  use transpose_mod
  type(block_t) :: block_to, block_from
  block_from = new_block((/ k0, k1, l0, l1 /))
  block_to   = new_block((/ i0, i1, j0, j1 /))
  call inmemory_transpose(S, block_to, block_from)
end block
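For readers without the attachment, a minimal sketch of what `block_t` and `new_block` could look like (my assumption based on the usage above; the actual definitions are in the attached `transpose_mod`):
type :: block_t
  integer(8) :: i0, i1, j0, j1              ! row and column bounds of the block
end type block_t

pure function new_block(bounds) result(b)
  integer(8), intent(in) :: bounds(4)       ! (/ i0, i1, j0, j1 /)
  type(block_t) :: b
  b%i0 = bounds(1); b%i1 = bounds(2)
  b%j0 = bounds(3); b%j1 = bounds(4)
end function new_block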
The source code of the algorithm is presented in the attachment.
Hi Foxtran,
have you tried the corresponding MKL routine?
https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2023-2/mkl-imatcopy.html
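For a full in-place transpose of a square double-precision matrix, the call would look roughly like this (sketched from memory, so please check the exact argument kinds in the documentation):
call mkl_dimatcopy('C', 'T', n, n, 1.0d0, S, n, n)
Here 'C' selects column-major ordering, 'T' requests the transpose, 1.0d0 is the scaling factor, and the last two arguments are the leading dimensions before and after the operation.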