Solved: Possible bug with ifort/ifx 2023.2.0 and OpenMP SIMD

Gaurav-Saxena · 07-17-2024

I compile the following Fortran program:

program hello
use omp_lib
implicit none
integer, parameter::ns = 300, ny = 960, nx = 360
integer, parameter::EXTRA = 0
integer :: ix, iy, is
double precision, allocatable::b2stbr_phys_sna(:),sna0(:,:,:,:),na(:,:,:)
double precision :: T_START, T_END


allocate(b2stbr_phys_sna(0:ns-1 + EXTRA))
allocate(sna0(-1:nx,-1:ny,0:1,0:ns-1 + EXTRA))
allocate(na(-1:nx,-1:ny,0:ns-1 + EXTRA))

b2stbr_phys_sna=0.0
sna0 = 0.01
na = 0.02

T_START = omp_get_wtime()
!$OMP PARALLEL DO DEFAULT(NONE) COLLAPSE(2) SHARED(sna0,na) PRIVATE(is,iy,ix) REDUCTION(+:b2stbr_phys_sna)
do is=0,ns-1
do iy=-1,ny
!$OMP SIMD REDUCTION(+:b2stbr_phys_sna)
do ix=-1,nx
b2stbr_phys_sna(is)=b2stbr_phys_sna(is)+sna0(ix,iy,0,is)+sna0(ix,iy,1,is)*na(ix,iy,is)
enddo
enddo
enddo
!$OMP END PARALLEL DO
T_END = omp_get_wtime()

deallocate(b2stbr_phys_sna)
deallocate(sna0)
deallocate(na)

PRINT *, "Work took", T_END - T_START, "seconds"

end program hello

with:

ifx -g -O2 -qopt-report=3 -qopenmp -xhost -mprefer-vector-width=512 ifx_test.f90 -o ifx_test.exe

and with ifort as:

ifort -g -O2 -qopt-report=3 -qopenmp -xhost -qopt-zmm-usage=high ifx_test.f90 -o ifx_test.exe

I then set export OMP_NUM_THREADS=2 and run the executable with ./ifx_test.exe

It produces a segmentation fault.

With gfortran/13.2.0 compiling like

gfortran ifx_test.f90 -fopenmp -fopenmp-simd -O3 -o g_ifx_test.exe

and running with export OMP_NUM_THREADS=2

produces no error.

When I remove the
!$OMP SIMD REDUCTION(+:b2stbr_phys_sna)

line (with any number of threads), it always runs successfully with ifort / ifx.

Could this be an ifort / ifx compiler bug with OpenMP SIMD?

jimdempseyatthecove · 07-17-2024

The loop structures you have do not present multiple thread writes to the same cells of array b2stbr_phys_sna, and therefore the REDUCTION clause on the !$OMP PARALLEL... directive is not required.

The REDUCTION clause on the !$OMP SIMD directive should not be required either. The compiler optimization should see that a summation is being performed into a scalar. The LHS of the = is scalar, the RHS can be vectorized, and there are no loop-order dependencies.

You can use VTune on fully optimized code, and examine the Disassembly to see if the code was vectorized.

Jim Dempsey

Gaurav-Saxena · 07-18-2024

Hi Jim,

Many thanks for the explanation. I now fully understand why the reduction clause is not required on the OpenMP parallel region. Further, if I remove the reduction clause on the OMP SIMD directive, the compiler still vectorizes the innermost loop. I will remove them. Still, it is a bit strange that 2 or more threads cause a segmentation fault with ifort/ifx 2023.2.0 and not with gfortran 13.2.0. Many thanks again. Please consider this as solved and closed.

Best regards,

Gaurav

jimdempseyatthecove · 07-19-2024

I agree that a segmentation fault should not have resulted from the superfluous REDUCTION clauses. Having them in there would have resulted in each thread getting a (stack) copy of the b2stbr_phys_sna array, size 300 (2400 bytes), significantly less than the ~1MB default stack size.

Jim Dempsey
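
The 2400-byte figure in the post above can be checked with a few lines of Fortran. This is only a sketch: the program name and variable names are made up for illustration, and it assumes that double precision occupies 64 bits, which is typical but not guaranteed by the standard. It counts one private copy of the array and says nothing about any additional copies a compiler may create for SIMD lanes.

```fortran
program private_copy_size
  implicit none
  integer, parameter :: ns = 300          ! array extent from the original post
  double precision :: sample              ! only its type is inspected, not its value
  integer :: bytes_per_element, bytes_per_copy

  ! storage_size returns the size in bits, so divide by 8 to get bytes
  bytes_per_element = storage_size(sample) / 8
  bytes_per_copy = ns * bytes_per_element

  print *, "bytes per private copy:", bytes_per_copy
end program private_copy_size
```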

TobiasK · 08-01-2024

@Gaurav-Saxena

The reduction clause on the simd construct is a must. omp simd just tells the compiler to vectorize but does not guarantee that the reduction is handled correctly. Please set a higher OMP_STACKSIZE, e.g. OMP_STACKSIZE=1G.

Anyway, if you are writing this code for a CPU platform, I would change the loop nest and remove the collapse(2) clause. The outer loop is large enough to handle CPU threading, and that way you don't need a reduction over the array b2stbr_phys_sna:

! t is a scalar accumulator and must be declared, e.g. double precision :: t
!$OMP PARALLEL DO DEFAULT(NONE) SHARED(sna0,na,b2stbr_phys_sna) PRIVATE(t,is,iy,ix)
do is=0,ns-1
t=0.d0
do iy=-1,ny
!$OMP SIMD REDUCTION(+:t)
do ix=-1,nx
t=t+sna0(ix,iy,0,is)+sna0(ix,iy,1,is)*na(ix,iy,is)
enddo
enddo
b2stbr_phys_sna(is)=b2stbr_phys_sna(is)+t
enddo
!$OMP END PARALLEL DO

jimdempseyatthecove · 08-02-2024

Tobias,

In the above case, t is private and scalar. The compiler optimization should have been able to determine that the code was suitable for vector-wide summation within the do ix loop, and vector-wide reduction upon exit of the loop, all without the !$omp simd reduction clause. I believe it worked this way with ifort; has this become necessary with ifx?

Considering that the same source code can be compiled with and without -openmp you'd want/expect the optimization behavior to be the same.

Jim Dempsey

TobiasK · 08-20-2024

@jimdempseyatthecove

Sorry for the long delay, but I wanted to be sure about this. I talked to the vectorizer and OpenMP teams. I know that in some cases the compiler is able to vectorize with omp simd without the reduction clause. However, the standard does not guarantee that the code is correct if the reduction clause is omitted. I have seen cases where the compiler generated wrong code when the reduction clause was not present. So in short: reduction is a must if you want the compiler to generate correct code. The initial code posted works fine but needs an increased stack size for the array reduction.
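
For readers who want to try the scalar-accumulator variant discussed in this thread, here is a minimal self-contained sketch. It is not the original poster's program: the array sizes are shrunk for illustration, the literals use the d0 suffix, the program name and the closed-form check are additions, and it assumes the accumulated t is added to b2stbr_phys_sna(is) once per is, after the iy loop. The reduction clause is kept on the simd construct, as recommended above.

```fortran
program scalar_reduction_sketch
  implicit none
  integer, parameter :: ns = 4, ny = 6, nx = 8   ! small illustrative sizes (assumption)
  integer :: ix, iy, is
  double precision :: t, expected
  double precision, allocatable :: b2stbr_phys_sna(:), sna0(:,:,:,:), na(:,:,:)

  allocate(b2stbr_phys_sna(0:ns-1))
  allocate(sna0(-1:nx,-1:ny,0:1,0:ns-1))
  allocate(na(-1:nx,-1:ny,0:ns-1))

  b2stbr_phys_sna = 0.0d0
  sna0 = 0.01d0
  na = 0.02d0

  ! Threads split the is loop, so each element of b2stbr_phys_sna is written
  ! by exactly one thread and no array reduction is needed.
  !$OMP PARALLEL DO DEFAULT(NONE) SHARED(sna0,na,b2stbr_phys_sna) PRIVATE(t,is,iy,ix)
  do is = 0, ns-1
    t = 0.0d0
    do iy = -1, ny
      ! The scalar t is declared as a SIMD reduction variable.
      !$OMP SIMD REDUCTION(+:t)
      do ix = -1, nx
        t = t + sna0(ix,iy,0,is) + sna0(ix,iy,1,is)*na(ix,iy,is)
      enddo
    enddo
    b2stbr_phys_sna(is) = b2stbr_phys_sna(is) + t
  enddo
  !$OMP END PARALLEL DO

  ! Every (ix,iy) pair contributes the same amount, so the sum has a closed form.
  expected = dble((nx+2)*(ny+2)) * (0.01d0 + 0.01d0*0.02d0)
  if (maxval(abs(b2stbr_phys_sna - expected)) > 1.0d-12) then
    error stop "unexpected sum"
  end if
  print *, "all sums match the expected value"

  deallocate(b2stbr_phys_sna, sna0, na)
end program scalar_reduction_sketch
```

It should build with the OpenMP flags already used in this thread (-qopenmp for ifx/ifort, -fopenmp for gfortran). Because the only reduction variable is a scalar, no per-thread copy of the whole array is involved.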