I have a code in which I have implemented "Stone's Implicit Procedure" (SIP) for solving a Poisson-like equation. With 'ifx' the performance of the code deteriorates severely.
I have narrowed the problem down to a few loops, and for the sake of simplicity, this is the worst loop:
SUBROUTINE sipiter2(kk, jj, ii, res, ue, un, ut)
    IMPLICIT NONE
    INTEGER, INTENT(in) :: kk, jj, ii
    REAL, INTENT(inout) :: res(kk, jj, ii)
    REAL, INTENT(in) :: ue(kk, jj, ii), un(kk, jj, ii), ut(kk, jj, ii)
    INTEGER :: k, j, i

    DO i = ii-2, 3, -1
        DO j = jj-2, 3, -1
            DO k = kk-2, 3, -1
                res(k, j, i) = res(k, j, i) - un(k, j, i)*res(k, j+1, i) &
                    - ut(k, j, i)*res(k+1, j, i) - ue(k, j, i)*res(k, j, i+1)
            END DO
        END DO
    END DO
END SUBROUTINE sipiter2
There are other loops that perform sub-optimally as well, but this is by far the worst. On my workstation this loop needs more than 3 times the runtime when compiled with ifx compared to ifort or gfortran.
As you might see, one of the 'features' of the SIP algorithm is the loop-carried dependencies. Therefore the loop(s) cannot be vectorized, so we can rule this out as a cause of the performance difference.
The code can be benchmarked by creating a dummy program that feeds the 'sipiter2' routine some working arrays, like this:
PROGRAM bench
    USE, INTRINSIC :: ISO_FORTRAN_ENV
    IMPLICIT NONE

    INTEGER :: kk, jj, ii
    REAL, ALLOCATABLE :: res(:), ue(:), un(:), ut(:)
    INTEGER :: n, iloop
    INTEGER(int64) :: rate, start_time, end_time
    REAL(real64) :: elapsed_time

    CALL SYSTEM_CLOCK(count_rate=rate)

    kk = 36
    jj = 36
    ii = 36
    n = kk*jj*ii
    ALLOCATE(res(n), ue(n), un(n), ut(n))
    res = 0.0
    ue = 0.0
    un = 0.0
    ut = 0.0

    CALL SYSTEM_CLOCK(count=start_time)
    DO iloop = 1, 10000
        CALL sipiter2(kk, jj, ii, res, ue, un, ut)
    END DO
    CALL SYSTEM_CLOCK(count=end_time)

    elapsed_time = REAL(end_time - start_time, real64)/REAL(rate, real64)
    WRITE(*,*) "Elapsed time: ", elapsed_time
END PROGRAM bench
In my experiments I compiled the source files with '-O3' only, as I found no other options that would improve the performance. On my (aging) workstation 'ifx' needs roughly 1.3 seconds and 'ifort' needs 0.4 seconds to execute the test.
Any comments on this problem? Is there anything I don't see? Any easy compiler flags for 'ifx' that will make it shine?
I'm currently using the latest 'ifx', version 2022.2.1.
>>As you might see, one of the 'features' of the SIP algorithm is the loop-carried dependencies. Therefore the loop(s) cannot be vectorized.
That is not completely true. It is difficult for a compiler to optimize this loop to use SIMD instructions.
This loop can be vectorized, to a limited extent, when written in C/C++ using the AVX** Intrinsics.
Jim Dempsey
Yes, I'm aware of that trick and we actually tried it out. But that is not the problem here, since neither GFortran nor ifort applies any vectorization to the loop in its present form, yet both manage to run it more than three times as fast as ifx. This is a serious performance setback.
To obtain some insight, run each using VTune and look at the disassembly.
You also might want to try !DIR$ NOVECTOR on the i and j loops (IOW, only attempt to vectorize the k loop).
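Placement-wise, that would look something like this (an untested sketch; only the directive placement is added, the loop body is unchanged from the original):
!DIR$ NOVECTOR
DO i = ii-2, 3, -1
    !DIR$ NOVECTOR
    DO j = jj-2, 3, -1
        DO k = kk-2, 3, -1
            res(k, j, i) = res(k, j, i) - un(k, j, i)*res(k, j+1, i) &
                - ut(k, j, i)*res(k+1, j, i) - ue(k, j, i)*res(k, j, i+1)
        END DO
    END DO
END DO
Whether ifx actually honors the directives here is something to check in the optimization report.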
Jim Dempsey
Try:
DO k = 3, kk-2
    res(k, j, i) = res(k, j, i) - un(k, j, i)*res(k, j+1, i) &
        - ue(k, j, i)*res(k, j, i+1)
END DO
! Keep the backward k sweep here: res(k+1, j, i) must already hold its final value.
DO k = kk-2, 3, -1
    res(k, j, i) = res(k, j, i) - ut(k, j, i)*res(k+1, j, i)
END DO
With kk relatively small, the k'th indexes of res and ut will be in L1 cache.
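Folded back into the full routine, the split might look roughly like this (an untested sketch; 'sipiter2_split' is just a placeholder name, and the ut loop keeps the backward k sweep so the recurrence matches the original routine):
SUBROUTINE sipiter2_split(kk, jj, ii, res, ue, un, ut)
    ! Hypothetical variant of sipiter2 with the k loop split in two passes
    IMPLICIT NONE
    INTEGER, INTENT(in) :: kk, jj, ii
    REAL, INTENT(inout) :: res(kk, jj, ii)
    REAL, INTENT(in) :: ue(kk, jj, ii), un(kk, jj, ii), ut(kk, jj, ii)
    INTEGER :: k, j, i

    DO i = ii-2, 3, -1
        DO j = jj-2, 3, -1
            ! No k dependency here: res(k, j+1, i) and res(k, j, i+1) belong to
            ! slabs already finished by the backward j and i loops.
            DO k = 3, kk-2
                res(k, j, i) = res(k, j, i) - un(k, j, i)*res(k, j+1, i) &
                    - ue(k, j, i)*res(k, j, i+1)
            END DO
            ! Backward sweep: res(k+1, j, i) is already fully updated.
            DO k = kk-2, 3, -1
                res(k, j, i) = res(k, j, i) - ut(k, j, i)*res(k+1, j, i)
            END DO
        END DO
    END DO
END SUBROUTINE sipiter2_split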
Jim Dempsey
Jim is on a good track with L1 cache use.
I compiled with "-O3 -xhost". ifort and ifx run faster; the times are closer between the two compilers.
The "-xhost" tidbit is documented in the Porting Guide for ifort Users to ifx.
According to the optrpt, neither compiler vectorizes the loops as noted.
I'll keep looking at this.
The performance is a bit better for ifx with the compiler that will be available around the end of the year. But not good enough. I filed a bug report (CMPLRLLVM-42126) to see what the compiler team can do to improve the ifx performance some more.
I'll let you know what they do.
Thanks, I'm already looking forward to a new version.
The code rewrite by jimdempseyatthecove above seems to improve performance for ifx 2022.2.1, but gives slightly reduced performance for ifort and gfortran...
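If both variants end up being useful, one thing I might try (just a sketch, assuming the usual predefined macros such as __INTEL_LLVM_COMPILER for ifx, and with preprocessing enabled via a .F90 file or -fpp) is selecting the inner loop at compile time:
#ifdef __INTEL_LLVM_COMPILER
    ! ifx: split k loops (faster with ifx 2022.2.1 in my tests)
    DO k = 3, kk-2
        res(k, j, i) = res(k, j, i) - un(k, j, i)*res(k, j+1, i) &
            - ue(k, j, i)*res(k, j, i+1)
    END DO
    DO k = kk-2, 3, -1
        res(k, j, i) = res(k, j, i) - ut(k, j, i)*res(k+1, j, i)
    END DO
#else
    ! ifort / gfortran: original fused loop
    DO k = kk-2, 3, -1
        res(k, j, i) = res(k, j, i) - un(k, j, i)*res(k, j+1, i) &
            - ut(k, j, i)*res(k+1, j, i) - ue(k, j, i)*res(k, j, i+1)
    END DO
#endif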