Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.
27525 Discussions

## ifx performance regression over ifort and gfortran

Beginner
287 Views

I have a code where I have implemented the "Stones Implicit Procedure" (SIP) for solving a Poisson-like equation. With 'ifx' the performance of the code is severely deteriorated.

I have narrowed the problem down to a few loops, and for the sake of simplicity, this is the worst loop:

``````SUBROUTINE sipiter2(kk, jj, ii, res, ue, un, ut)
IMPLICIT NONE

INTEGER, INTENT(in) :: kk, jj, ii
REAL, INTENT(inout) :: res(kk, jj, ii)
REAL, INTENT(in) :: ue(kk, jj, ii), un(kk, jj, ii), ut(kk, jj, ii)

INTEGER :: k, j, i

DO i = ii-2, 3, -1
DO j = jj-2, 3, -1
DO k = kk-2, 3, -1
res(k, j, i)= res(k, j, i) - un(k, j, i)*res(k, j+1, i) &
- ut(k, j, i)*res(k+1, j, i) - ue(k, j, i)*res(k, j, i+1)
END DO
END DO
END DO

END SUBROUTINE sipiter2``````

There are other loops that are performing non-optimal as well, but this is by far the worst. On my workstation this loop needs more than 3 times the runtime when compiled with ifx compared to ifort or gfortran.

You might see one of the 'features' of the SIP algorithm is the loop dependencies. Therefore, the loop(s) cannot be vectorized, and we can rule out this as a cause for the performance difference.

The code can be benchmarked by creating a dummy program that feed the 'sipiter2' routine some working arrays like this:

``````PROGRAM bench
USE, INTRINSIC :: ISO_FORTRAN_ENV

IMPLICIT NONE

INTEGER :: kk, jj, ii
REAL, ALLOCATABLE :: res(:), ue(:), un(:), ut(:)

INTEGER :: n, iloop

INTEGER(int64) :: rate, start_time, end_time
REAL(real64) :: elapsed_time
CALL SYSTEM_CLOCK(count_rate=rate)

kk = 36
jj = 36
ii = 36
n = kk*jj*ii
ALLOCATE(res(n), ue(n), un(n), ut(n))

res = 0.0
ue = 0.0
un = 0.0
ut = 0.0

CALL SYSTEM_CLOCK(count=start_time)
DO iloop = 1, 10000
CALL sipiter2(kk, jj, ii, res, ue, un, ut)
END DO
CALL SYSTEM_CLOCK(count=end_time)

elapsed_time = REAL(end_time - start_time, real64)/REAL(rate, real64)
WRITE(*,*) "Elapsed time: ", elapsed_time
ENDPROGRAM bench
``````

In my experiments I compiled the source files with '-O3' only, as I found no other options that would improve the performance. On my (aging) workstation 'ifx' needs roughly 1.3 seconds and 'ifort' needs 0.4 seconds to execute the test.

Any comments on this problem? Is there anything I don't see? Any easy compiler flags for 'ifx' that will make it shine?

I'm currently using the latest 'ifx', version 2022.2.1.

1 Solution
Moderator
229 Views

The performance is a bit better for ifx with the compiler that will be available around the end of the year. But not good enough. I filed a bug report (CMPLRLLVM-42126) to see what the compiler team can do to improve the ifx performance some more.

I'll let you know what they do.

7 Replies
Black Belt
263 Views

>>You might see one of the 'features' of the SIP algorithm is the loop dependencies. Therefore, the loop(s) cannot be vectorized.

That is not completely true. It is difficult for a compiler to optimize this loop to using SIMD instruction.

This loop can be vectorized, to a limited extent, when written in C/C++ using the AVX** Intrinsics.

Jim Dempsey

Beginner
252 Views

Yes, I'm aware of that trick and we actually tried it out. But that is not the problem here, since neither GFortran nor ifort apply any vectorization to the loop in the present form, yet manage to run it more than three times as fast as ifx. This is a serious performance setback.

Black Belt
247 Views

To obtain some insight, run each using VTune and look at the disassembly.

You also might want to try !DIR\$ NOVECTOR on the i and j loops (IOW only attempt to vector the k loop).

Jim Dempsey

Black Belt
245 Views

Try:

``````            DO k = 3, kk-2
res(k, j, i)= res(k, j, i) - un(k, j, i)*res(k, j+1, i) &
- ue(k, j, i)*res(k, j, i+1)
END DO
DO k = 3, kk-2
res(k, j, i)= res(k, j, i) - ut(k, j, i)*res(k+1, j, i)
END DO
``````

With kk relatively small, the k'th indexes of res and ut will be in L1 cache.

Jim Dempsey

Moderator
244 Views

Jim is on a good track with L1 cache use.

I compiled with "-O3 -xhost". ifort and ifx run faster; the times are closer between the two compilers.

The "-xhost" tidbit is documented in the Porting Guide for ifort Users to ifx.

According to the optrpt, neither compiler vectorizes the loops as noted.

I'll keep looking at this.

Moderator
230 Views

The performance is a bit better for ifx with the compiler that will be available around the end of the year. But not good enough. I filed a bug report (CMPLRLLVM-42126) to see what the compiler team can do to improve the ifx performance some more.

I'll let you know what they do.

Beginner
130 Views

Thanks, I'm already looking forward to a new version.

The code rewrite by jimdempseyatthecove above seems to improve performance for ifx 2022.2.1, but gives slightly reduced performance for ifort and gfortran...