Title: Bug when using IFX with OpenMP SIMD directive
System: Windows 10 22H2 with VS2022 or Windows WSL2
CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
OneAPI version: 2024.1
I have a numerical computation program parallelized with OpenMP; it runs correctly when built with ifort or gfortran using any commonly used compile options. For the attached code, the correct output iteration count is 1259 (FP64 should be used). The iteration count is only one indicator that the result is correct: when the output is visualized, the solutions differ significantly whenever the iteration count differs.
Recently I have been trying to compile it with IFX, and I have made some odd observations.
For the 3 loops that I marked with the comment "! BUG with IFX" in the file "AWENO_solver.f90":
1. If I apply an OMP PARALLEL DO SIMD directive, or simply an OMP SIMD directive, to any of them, then: (1) compiled with ifort or gfortran, with any options, everything is fine; (2) compiled with IFX using "-O0 -qopenmp -r8" or "-On -qopenmp-stubs -r8" (n can be 1, 2, 3), everything is fine; (3) compiled with IFX using "-O2 -qopenmp -r8", the result is wrong (the iteration count becomes 1250 and the plotted solution is greatly different).
2. If I just use OMP PARALLEL DO on them, the result is always correct regardless of the compiler and options.
It seems there is something wrong with IFX + OpenMP SIMD? A minimal sketch of the directive variants I am comparing is below.
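For reference, this is how the variants look on the body of the first flagged loop from "AWENO_solver.f90" (the variables are those of the attached code; only the directives differ):
! Variant A: SIMD forms -- wrong result (1250 steps) when built with "ifx -O2 -qopenmp -r8"
!$omp parallel do simd
do i = 1-ste_r, Nx+ste_r
   sonics(i) = SQRT( gamma * abs( u_pri(3,i) / u_pri(1,i) ) )
end do
!$omp end parallel do simd
! (a plain "!$omp simd" / "!$omp end simd" pair around the same loop behaves the same way)
! Variant B: threading only, no SIMD -- always gives the correct 1259 steps
!$omp parallel do schedule(static)
do i = 1-ste_r, Nx+ste_r
   sonics(i) = SQRT( gamma * abs( u_pri(3,i) / u_pri(1,i) ) )
end do
!$omp end parallel do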
What version of ifx do you use?
I cannot reproduce the issue. I replaced the OMP directives in the 3 loops with SIMD directives as shown below.
I compiled with the 2024.2.0 compiler, which is the same as 2024.2.1; there was no change to the compiler between those two versions.
rm -Rf *.o *.mod a.out
ifx -what -V -O2 -r8 -qopenmp -c io.f90
ifx -what -V -O2 -r8 -qopenmp -c weno.f90
ifx -what -V -O2 -r8 -qopenmp -c Euler_PDE.f90
ifx -what -V -O2 -r8 -qopenmp -c Euler_nflux.f90
ifx -what -V -O2 -r8 -qopenmp -c AWENO_solver.f90
ifx -what -V -O2 -r8 -qopenmp -c problem.f90
ifx -what -V -O2 -r8 -qopenmp main.F90 io.o weno.o Euler_PDE.o Euler_nflux.o AWENO_solver.f90 problem.o
! BUG with IFX:
! If this loop uses OMP DO SIMD or simply OMP SIMD, and the program is compiled with IFX with args containing -qopenmp -O2, then the total num of iterations will be 1250, and the result is wrong!
! Under the above condition, if compiled with IFX with -O0 or compiled with IFORT or GFORTRAN, the total num of iterations will be 1259, and the result is correct.
!rwg !$omp parallel do schedule(static)
!$omp do simd
do i = 1-ste_r, Nx+ste_r
   sonics(i) = SQRT( gamma * abs( u_pri(3,i) / u_pri(1,i) ) ) ! stable implementation
end do
!$omp end do simd
!rwg !$omp end parallel do
if (disp_correction .or. use_flux_limiter) then
! compute the exact flux (can be and should be done from outside)
! BUG with IFX:
! If this loop uses OMP DO SIMD or simply OMP SIMD, and the program is compiled with IFX with args containing -qopenmp -O2, then the total num of iterations will be 1250, and the result is wrong!
! Under the above condition, if compiled with IFX with -O0 or compiled with IFORT or GFORTRAN, the total num of iterations will be 1259, and the result is correct.
!rwg !$omp parallel do schedule(static)
!$omp do simd
do i = 1-ste_r, Nx+ste_r
   FF(:,i) = Euler_advective_flux(u_con(:,i), u_pri(3,i), [1.0], 1)
end do
!$omp end do simd
!rwg !$omp end parallel do
end if
if (interp_method == CH_RI) then
! compute the Riemann invariants (can be and should be done from outside)
! BUG with IFX:
! If this loop uses OMP DO SIMD or simply OMP SIMD, and the program is compiled with IFX with args containing -qopenmp -O2, then the total num of iterations will be 1250, and the result is wrong!
! Under the above condition, if compiled with IFX with -O0 or compiled with IFORT or GFORTRAN, the total num of iterations will be 1259, and the result is correct.
!rwg !$omp parallel do schedule(static) firstprivate(gamma_coef)
!$omp do simd
do i = 1-ste_r, Nx+ste_r
   RIs(1,i) = u_pri(2,i) - gamma_coef * sonics(i)
   RIs(2,i) = sqrt(u_pri(3,i)**(1.0/gamma) / u_pri(1,i))
   RIs(3,i) = u_pri(2,i) + gamma_coef * sonics(i)
end do
!$omp end do simd
!rwg !$omp end parallel do
Results, trimmed down, are below. But I am on Red Hat Linux; I will try this on Windows when my server comes back up.
more /etc/redhat-release
Red Hat Enterprise Linux release 8.6 (Ootpa)
model name : Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz
t= 3.7989637457480699E-02
t= 3.7989637457480699E-02
Solving completed.
total number of time steps= 1259
cpu time= 1.6401E+00 s
OMP_num_threads= 8
Program terminates successfully.
My Windows server shows the same result, 1259 time steps, as expected.
Windows options: /O2 /real-size:64 /Qopenmp
Ran on 72 threads on a 2-processor Xeon Gold 6140 under Windows.
Hi Ron,
Thank you very much for your effort. I can now get the correct results using "-O3 -r8 -qopenmp -fpp". By the way, sorry that I forgot to include all of my compile options earlier. I tried several combinations and found that the problem appears to be with -ipo; my results are listed below. My current opinion is that the problem I encountered has NOTHING to do with OpenMP or SIMD, because the same error also appears when I compile without OpenMP.
Moreover, I suspect that this problem is caused by inlining, because when I test on Windows with "/Qipo /fpp /Qopenmp /real-size:64" (the equivalent of -ipo on Linux), I get correct results with the additional option /Ob0 and wrong results with /Ob2.
Interestingly, EVERY time I get the "wrong" result, the iteration count and the plotted solution are always the same, so it behaves like a deterministic bug.
Options | correct plot and correct iteration count (== 1259) |
-O1 -ipo -r8 -qopenmp -fpp | yes |
-O2 -ipo -r8 -qopenmp -fpp | NO |
-O3 -ipo -r8 -qopenmp -fpp | NO |
-O1 -r8 -qopenmp -fpp | yes |
-O2 -r8 -qopenmp -fpp | yes |
-O3 -r8 -qopenmp -fpp | yes |
-O1 -xHost -r8 -qopenmp -fpp | yes |
-O2 -xHost -r8 -qopenmp -fpp | yes |
-O3 -xHost -r8 -qopenmp -fpp | yes |
-O1 -ipo -r8 -fpp | yes |
-O2 -ipo -r8 -fpp | NO |
-O3 -ipo -r8 -fpp | NO |
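For completeness, the Windows check mentioned above looked roughly like this (same source files as in Ron's commands; /Ob0 disables inlining, /Ob2 enables aggressive inlining; 1250 is the same wrong step count as before):
ifx /Qipo /Ob0 /fpp /Qopenmp /real-size:64 io.f90 weno.f90 Euler_PDE.f90 Euler_nflux.f90 AWENO_solver.f90 problem.f90 main.F90
(gives the correct result, 1259 steps)
ifx /Qipo /Ob2 /fpp /Qopenmp /real-size:64 io.f90 weno.f90 Euler_PDE.f90 Euler_nflux.f90 AWENO_solver.f90 problem.f90 main.F90
(gives the wrong result, 1250 steps)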
This is great news. IPO changes the vectorization behavior through inlining. I can check the -qopt-report output to confirm, but I suspect ABS and maybe SQRT get inlined with the IPO option.
This code runs very quickly with OpenMP. Do you see any need for IPO? Maybe it could simply be avoided if the performance without it is good enough.
I will run some tests with IPO along with -fp-model options and timings to see whether we can keep IPO and maintain the same convergence time steps.
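Something along these lines should show which calls get inlined; I have not verified the exact report contents, and with -ipo I believe most of the interprocedural detail only appears in the report generated at the link step:
ifx -O2 -ipo -r8 -qopenmp -fpp -qopt-report=3 -c AWENO_solver.f90
ifx -O2 -ipo -r8 -qopenmp -fpp -qopt-report=3 main.F90 io.o weno.o Euler_PDE.o Euler_nflux.o AWENO_solver.o problem.o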
Yes. I am trying -ipo because this code is a building block of a future "big" computation code.
I think -fp-model is not crucial (I have actually tested different -fp-model settings). The reason is that if I compile with -r16 (i.e. /real-size:128) and disable OpenMP, then:
"-O3 -xHost -fpp" gives the correct plot and the correct iteration count;
"-O3 -ipo -xHost -fpp" gives an incorrect plot (almost the same as the incorrect ones obtained with -r8) and an incorrect iteration count.
My opinion is that some structural or logical error, not a floating-point one, occurs when -ipo is used together with -O2 or -O3.
Remark: this error does not happen with ifort.
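If it helps, one way to narrow this down further might be to exclude individual candidate routines (e.g. the flux function) from inlining and rebuild with -ipo. A minimal sketch on a stand-in routine, assuming ifx honors Intel's ATTRIBUTES NOINLINE directive the same way ifort does:
module demo_noinline_mod
   implicit none
contains
   ! Stand-in for a routine suspected of being mis-inlined under -ipo
   ! (in the real code this would be, e.g., Euler_advective_flux).
   ! The directive asks the compiler not to inline this procedure.
   function scaled_abs(x, s) result(y)
   !DIR$ ATTRIBUTES NOINLINE :: scaled_abs
      real, intent(in) :: x, s
      real             :: y
      y = s * abs(x)
   end function scaled_abs
end module demo_noinline_mod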
