Intel® Fortran Compiler

Bug when using IFX with OpenMP SIMD directive

Yue-Wu
Beginner

 

System: Windows 10 22H2 with VS2022 or Windows WSL2
CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
OneAPI version: 2024.1

I have a numerical computation program parallelized with OpenMP that runs correctly when built with ifort or gfortran under any common set of compiler options. For the attached code, the correct iteration count is 1259 (it should use FP64). The iteration count is only one indicator of correctness: when the output is visualized, the solution differs significantly whenever the iteration count differs.

 

Recently I have been trying to compile it with IFX, and I see some weird behavior.

For the 3 loops that I marked with the comment "! BUG with IFX " in the file "AWENO_solver.f90":

1. If I apply the OMP PARALLEL DO SIMD directive, or simply OMP SIMD, to any of them, then: (1) if compiled with ifort or gfortran with any options, everything is fine; (2) if compiled with IFX using "-O0 -qopenmp -r8" or "-On -qopenmp-stubs -r8" (n can be 1, 2, or 3), everything is fine; (3) if compiled with IFX using "-O2 -qopenmp -r8", the result is wrong (the iteration count is 1250, and the plotted solution is very different).

2. If I use just OMP PARALLEL DO on them, the result is always correct regardless of the compiler and compiler options.

 

It seems there is something wrong with IFX+OpenMP SIMD?

Ron_Green
Moderator

What version of ifx do you use?

Yue-Wu
Beginner

I am using ifx (IFX) 2024.2.1 20240711

Yue-Wu
Beginner

Hi there, I am using IFX 2024.2.1 20240711

Ron_Green
Moderator

I cannot reproduce the issue.  I replaced the OMP directives in the 3 loops with SIMD directives as shown below.

I compiled with the 2024.2.0 compiler, which is the same as 2024.2.1; there was no change in the compiler between these two versions.

rm -Rf *.o *.mod a.out
ifx -what -V -O2 -r8 -qopenmp -c io.f90
ifx -what -V -O2 -r8 -qopenmp -c weno.f90
ifx -what -V -O2 -r8 -qopenmp -c Euler_PDE.f90
ifx -what -V -O2 -r8 -qopenmp -c Euler_nflux.f90
ifx -what -V -O2 -r8 -qopenmp -c AWENO_solver.f90
ifx -what -V -O2 -r8 -qopenmp -c problem.f90
ifx -what -V -O2 -r8 -qopenmp main.F90 io.o weno.o Euler_PDE.o Euler_nflux.o AWENO_solver.f90 problem.o
                ! BUG with IFX:
                ! If this loop uses OMP DO SIMD or simply OMP SIMD, and the program is compiled with IFX with args containing -qopenmp -O2, then the total num of iterations will be 1250, and the result is wrong!
                ! Under the above condition, if compiled with IFX with -O0 or compiled with IFORT or GFORTRAN, the total num of iterations will be 1259, and the result is correct.
                !rwg !$omp parallel do schedule(static)
                !$omp do simd
                do i = 1-ste_r, Nx+ste_r
                    sonics(i) = SQRT( gamma * abs( u_pri(3,i) / u_pri(1,i) ) ) ! stable implementation
                end do
                !$omp end do simd
                !rwg !$omp end parallel do

        if (disp_correction .or. use_flux_limiter) then
            ! compute the exact flux (can be and should be done from outside)

                ! BUG with IFX:
                ! If this loop uses OMP DO SIMD or simply OMP SIMD, and the program is compiled with IFX with args containing -qopenmp -O2, then the total num of iterations will be 1250, and the result is wrong!
                ! Under the above condition, if compiled with IFX with -O0 or compiled with IFORT or GFORTRAN, the total num of iterations will be 1259, and the result is correct.
            !rwg !$omp parallel do schedule(static)
            !$omp do simd
            do i = 1-ste_r, Nx+ste_r
                FF(:,i) = Euler_advective_flux(u_con(:,i), u_pri(3,i), [1.0], 1)
            end do
            !$omp end do simd
            !rwg !$omp end parallel do

        end if

        if (interp_method == CH_RI) then
            ! compute the Riemann invariants (can be and should be done from outside)

                ! BUG with IFX:
                ! If this loop uses OMP DO SIMD or simply OMP SIMD, and the program is compiled with IFX with args containing -qopenmp -O2, then the total num of iterations will be 1250, and the result is wrong!
                ! Under the above condition, if compiled with IFX with -O0 or compiled with IFORT or GFORTRAN, the total num of iterations will be 1259, and the result is correct.
            !rwg !$omp parallel do schedule(static) firstprivate(gamma_coef)
            !$omp do simd
            do i = 1-ste_r, Nx+ste_r
                RIs(1,i) = u_pri(2,i) - gamma_coef * sonics(i)
                RIs(2,i) = sqrt(u_pri(3,i)**(1.0/gamma) / u_pri(1,i))
                RIs(3,i) = u_pri(2,i) + gamma_coef * sonics(i)
            end do
            !$omp end do simd
            !rwg !$omp end parallel do

 

Results, trimmed down, are below. Note that I am on Red Hat Linux; I will try this on Windows when my server comes back up.

more /etc/redhat-release

Red Hat Enterprise Linux release 8.6 (Ootpa)

model name : Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz

t= 3.7989637457480699E-02
t= 3.7989637457480699E-02
Solving completed.
total number of time steps=  1259
cpu time= 1.6401E+00 s
OMP_num_threads=  8
Program terminates successfully.

 

Ron_Green
Moderator

My Windows server shows the same result 1259 time steps as expected.

Windows options: /O2 /real-size:64 /Qopenmp

Ran on 72 threads on a 2-processor Xeon Gold 6140 system running Windows.

Yue-Wu
Beginner

Hi Ron,

Thank you very much for your effort. I can now get the correct results using "-O3 -r8 -qopenmp -fpp". By the way, sorry that I did not include all of my compiler options. I tried several combinations and found that the problem appears to be with -ipo; my results are listed below. I now think that the problem I encountered has NOTHING to do with OpenMP or SIMD, because the same error also appears when I compile without OpenMP.

Moreover, I suspect that this problem is caused by inlining: when I test on Windows with "/Qipo /fpp /Qopenmp /real-size:64" (like setting -ipo on Linux), I get correct results when I add "/Ob0" and wrong results when I add "/Ob2".
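One way to narrow this down on Linux is to keep -ipo and the optimization level fixed and vary only the inlining level. The following is a rough sketch, not a verified recipe: it assumes that -inline-level=N is the Linux counterpart of /ObN, that the source file names match the build commands earlier in the thread, and the helper names (inline_variants, print_inline_builds) and output names (inline0.out and so on) are made up for illustration. It only prints the build commands so they can be reviewed before running anything.

```shell
# Sketch: hold -O2 -ipo fixed and vary only the inlining level.
# Assumption: -inline-level=N is the Linux counterpart of /ObN on Windows.
# inline_variants and print_inline_builds are hypothetical helper names.

FC="${FC:-ifx}"
BASE_FLAGS="-O2 -ipo -r8 -qopenmp -fpp"
SOURCES="io.f90 weno.f90 Euler_PDE.f90 Euler_nflux.f90 AWENO_solver.f90 problem.f90 main.F90"

# One flag set per inlining level: 0 disables inlining,
# 2 leaves inlining to the compiler's discretion.
inline_variants() (
    for level in 0 1 2; do
        printf '%s -inline-level=%s\n' "$BASE_FLAGS" "$level"
    done
)

# Print the build commands without executing them.
print_inline_builds() (
    inline_variants | while IFS= read -r flags; do
        level=${flags##*=}   # text after the last '=' is the inlining level
        printf '%s %s %s -o inline%s.out\n' "$FC" "$flags" "$SOURCES" "$level"
    done
)

print_inline_builds
```

If the build with level 0 gives 1259 time steps and the build with level 2 gives 1250, that would match what I see with /Ob0 and /Ob2 on Windows.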

Interestingly, EVERY time I get the "wrong" result, the iteration count and the plotted solution are the same. It behaves very much like a deterministic bug.

Options                        Correct plot and correct iteration count (== 1259)
-O1 -ipo   -r8 -qopenmp -fpp   yes
-O2 -ipo   -r8 -qopenmp -fpp   NO
-O3 -ipo   -r8 -qopenmp -fpp   NO
-O1        -r8 -qopenmp -fpp   yes
-O2        -r8 -qopenmp -fpp   yes
-O3        -r8 -qopenmp -fpp   yes
-O1 -xHost -r8 -qopenmp -fpp   yes
-O2 -xHost -r8 -qopenmp -fpp   yes
-O3 -xHost -r8 -qopenmp -fpp   yes
-O1 -ipo   -r8          -fpp   yes
-O2 -ipo   -r8          -fpp   NO
-O3 -ipo   -r8          -fpp   NO
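
The sweep over these option sets can be scripted. The following is a minimal sketch, assuming the source files from the build commands earlier in the thread are in the current directory, that the program needs no input, and that it prints the "total number of time steps=" line shown in the earlier output. The helper names (flag_matrix, extract_steps, run_sweep) and the sweep.out executable name are hypothetical.

```shell
# Sketch: rebuild and rerun once per option set, then report the time step count.
# Assumptions: FC defaults to ifx, sources are in the current directory, and the
# program reads no input. All function names here are hypothetical.

FC="${FC:-ifx}"
SOURCES="io.f90 weno.f90 Euler_PDE.f90 Euler_nflux.f90 AWENO_solver.f90 problem.f90 main.F90"

# Print one option set per line (the 12 rows of the table).
flag_matrix() (
    for opt in -O1 -O2 -O3; do printf '%s\n' "$opt -ipo -r8 -qopenmp -fpp"; done
    for opt in -O1 -O2 -O3; do printf '%s\n' "$opt -r8 -qopenmp -fpp"; done
    for opt in -O1 -O2 -O3; do printf '%s\n' "$opt -xHost -r8 -qopenmp -fpp"; done
    for opt in -O1 -O2 -O3; do printf '%s\n' "$opt -ipo -r8 -fpp"; done
)

# Read program output on stdin and print the integer from a line such as
# "total number of time steps=  1259".
extract_steps() {
    sed -n 's/^ *total number of time steps= *\([0-9][0-9]*\).*/\1/p'
}

# Build and run once per option set; 1259 is the expected count.
run_sweep() (
    flag_matrix | while IFS= read -r flags; do
        rm -f ./*.o ./*.mod ./sweep.out
        # $flags and $SOURCES are left unquoted on purpose so they split into words.
        if ! $FC $flags $SOURCES -o sweep.out >/dev/null 2>&1; then
            printf '%-32s build failed\n' "$flags"
            continue
        fi
        # Redirect stdin so the program cannot consume the remaining option sets.
        steps=$(./sweep.out </dev/null | extract_steps)
        printf '%-32s steps=%s\n' "$flags" "${steps:-?}"
    done
)
```

Calling run_sweep prints one line per option set. Running it as FC=ifort should repeat the same sweep with ifort for comparison, since these options are accepted by both compilers.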

 

Ron_Green
Moderator

This is great news. IPO changes the vectorization behavior through inlining. I can check the -qopt-report output to confirm, but I suspect ABS and maybe SQRT get inlined with the IPO option.

This code runs very quickly with OpenMP. Do you see any need for IPO? Maybe it could simply be avoided if the performance without it is good enough.

I will run some tests with IPO along with -fp-model options and timings to see if we can get IPO and maintain the same convergence timesteps. 

Yue-Wu
Beginner

Yes. I am trying -ipo just because this code is a building block of a future "big" computation code.

 

I think that -fp-model is not crucial (I have actually tested several -fp-model settings). The reason is that, if I compile with -r16 (i.e. /real-size:128) and disable OpenMP, then:

"-O3 -xHost -fpp" gives the correct plot and the correct iteration count;

"-O3 -ipo -xHost -fpp" gives an incorrect plot (almost the same as the incorrect ones with -r8) and an incorrect iteration count.

My opinion is that a "structural and logical" error, rather than a floating-point one, occurs when "-ipo" is used with "-O2" or "-O3".

Remark: this error does not happen with ifort.
