Intel® Fortran Compiler

ifx performance regression over ifort and gfortran

hakostra1
New Contributor II
746 Views

I have a code where I have implemented Stone's "Strongly Implicit Procedure" (SIP) for solving a Poisson-like equation. With 'ifx' the performance of the code is severely degraded.

I have narrowed the problem down to a few loops, and for the sake of simplicity, this is the worst loop:

SUBROUTINE sipiter2(kk, jj, ii, res, ue, un, ut)
    IMPLICIT NONE

    INTEGER, INTENT(in) :: kk, jj, ii
    REAL, INTENT(inout) :: res(kk, jj, ii)
    REAL, INTENT(in) :: ue(kk, jj, ii), un(kk, jj, ii), ut(kk, jj, ii)

    INTEGER :: k, j, i

    DO i = ii-2, 3, -1
        DO j = jj-2, 3, -1
            DO k = kk-2, 3, -1
                res(k, j, i)= res(k, j, i) - un(k, j, i)*res(k, j+1, i) &
                    - ut(k, j, i)*res(k+1, j, i) - ue(k, j, i)*res(k, j, i+1)
            END DO
        END DO
    END DO

END SUBROUTINE sipiter2

There are other loops that perform suboptimally as well, but this is by far the worst. On my workstation this loop needs more than 3 times the runtime when compiled with ifx compared to ifort or gfortran.

As you might see, one of the 'features' of the SIP algorithm is its loop-carried dependencies: each iteration of the k loop reads res(k+1, j, i), which was updated in the previous iteration. The loop(s) therefore cannot be vectorized, so we can rule that out as a cause of the performance difference.

The code can be benchmarked by creating a dummy program that feeds the 'sipiter2' routine some working arrays, like this:

PROGRAM bench
    USE, INTRINSIC :: ISO_FORTRAN_ENV

    IMPLICIT NONE

    INTEGER :: kk, jj, ii
    REAL, ALLOCATABLE :: res(:), ue(:), un(:), ut(:)

    INTEGER :: n, iloop

    INTEGER(int64) :: rate, start_time, end_time
    REAL(real64) :: elapsed_time
    CALL SYSTEM_CLOCK(count_rate=rate)

    kk = 36
    jj = 36
    ii = 36
    n = kk*jj*ii
    ALLOCATE(res(n), ue(n), un(n), ut(n))

    res = 0.0
    ue = 0.0
    un = 0.0
    ut = 0.0

    CALL SYSTEM_CLOCK(count=start_time)
    DO iloop = 1, 10000
        CALL sipiter2(kk, jj, ii, res, ue, un, ut)
    END DO
    CALL SYSTEM_CLOCK(count=end_time)

    elapsed_time = REAL(end_time - start_time, real64)/REAL(rate, real64)
    WRITE(*,*) "Elapsed time: ", elapsed_time
ENDPROGRAM bench

In my experiments I compiled the source files with '-O3' only, as I found no other options that would improve the performance. On my (aging) workstation 'ifx' needs roughly 1.3 seconds and 'ifort' needs 0.4 seconds to execute the test.
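For reference, this is roughly how I build and run the test; the file names below are just placeholders for how I happen to split the source:

ifx -O3 sipiter2.f90 bench.f90 -o bench_ifx && ./bench_ifx
ifort -O3 sipiter2.f90 bench.f90 -o bench_ifort && ./bench_ifort
gfortran -O3 sipiter2.f90 bench.f90 -o bench_gfortran && ./bench_gfortran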

Any comments on this problem? Is there anything I don't see? Any easy compiler flags for 'ifx' that will make it shine?

I'm currently using the latest 'ifx', version 2022.2.1.

7 Replies
jimdempseyatthecove
Honored Contributor III
722 Views

>>You might see one of the 'features' of the SIP algorithm is the loop dependencies. Therefore, the loop(s) cannot be vectorized.

That is not completely true. It is difficult for a compiler to optimize this loop using SIMD instructions.

This loop can be vectorized, to a limited extent, when written in C/C++ using the AVX** Intrinsics. 

Jim Dempsey

 

hakostra1
New Contributor II
711 Views

Yes, I'm aware of that trick and we actually tried it out. But that is not the problem here, since neither GFortran nor ifort applies any vectorization to the loop in its present form, yet both manage to run it more than three times as fast as ifx. This is a serious performance setback.

jimdempseyatthecove
Honored Contributor III
706 Views

To obtain some insight, run each using VTune and look at the disassembly.

You also might want to try !DIR$ NOVECTOR on the i and j loops (IOW, only let the compiler attempt to vectorize the k loop).
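The placement is roughly this; the directive goes immediately before the DO loop it applies to:

!DIR$ NOVECTOR
DO i = ii-2, 3, -1
    !DIR$ NOVECTOR
    DO j = jj-2, 3, -1
        DO k = kk-2, 3, -1
            res(k, j, i) = res(k, j, i) - un(k, j, i)*res(k, j+1, i) &
                - ut(k, j, i)*res(k+1, j, i) - ue(k, j, i)*res(k, j, i+1)
        END DO
    END DO
END DO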

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III
704 Views

Try:

            ! j+1 and i+1 terms only: no dependence on k inside this loop
            DO k = 3, kk-2
                res(k, j, i) = res(k, j, i) - un(k, j, i)*res(k, j+1, i) &
                    - ue(k, j, i)*res(k, j, i+1)
            END DO
            ! k+1 term carries the recurrence; run this loop downward
            ! (kk-2, 3, -1) to reproduce the original update order exactly
            DO k = 3, kk-2
                res(k, j, i) = res(k, j, i) - ut(k, j, i)*res(k+1, j, i)
            END DO

With kk relatively small, the k'th indexes of res and ut will be in L1 cache.

 

Jim Dempsey

Barbara_P_Intel
Moderator
703 Views

Jim is on a good track with L1 cache use.

I compiled with "-O3 -xhost". ifort and ifx run faster; the times are closer between the two compilers.

The "-xhost" tidbit is documented in the Porting Guide for ifort Users to ifx.

According to the optimization reports (optrpt), neither compiler vectorizes the loops, as noted.
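For anyone who wants to reproduce the reports, adding the report option to the build is enough; the file names here are placeholders:

ifort -O3 -xhost -qopt-report sipiter2.f90 bench.f90 -o bench_ifort
ifx -O3 -xhost -qopt-report sipiter2.f90 bench.f90 -o bench_ifx

Each compile should leave a .optrpt file next to the corresponding object file.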

I'll keep looking at this.

 

 

Barbara_P_Intel
Moderator
689 Views

The performance is a bit better for ifx with the compiler that will be available around the end of the year. But not good enough. I filed a bug report (CMPLRLLVM-42126) to see what the compiler team can do to improve the ifx performance some more.

I'll let you know what they do.



hakostra1
New Contributor II
589 Views

Thanks, I'm already looking forward to a new version.

 

The code rewrite by jimdempseyatthecove above seems to improve performance for ifx 2022.2.1, but gives slightly reduced performance for ifort and gfortran...
