Intel® Fortran Compiler

Possible bug with ifort/ifx 2023.2.0 and OpenMP SIMD

Gaurav-Saxena
Beginner
983 Views

I compile the following Fortran program:

program hello
  use omp_lib
  implicit none
  integer, parameter :: ns = 300, ny = 960, nx = 360
  integer, parameter :: EXTRA = 0
  integer :: ix, iy, is
  double precision, allocatable :: b2stbr_phys_sna(:), sna0(:,:,:,:), na(:,:,:)
  double precision :: T_START, T_END

  allocate(b2stbr_phys_sna(0:ns-1 + EXTRA))
  allocate(sna0(-1:nx,-1:ny,0:1,0:ns-1 + EXTRA))
  allocate(na(-1:nx,-1:ny,0:ns-1 + EXTRA))

  b2stbr_phys_sna = 0.0
  sna0 = 0.01
  na = 0.02

  T_START = omp_get_wtime()
  !$OMP PARALLEL DO DEFAULT(NONE) COLLAPSE(2) SHARED(sna0,na) PRIVATE(is,iy,ix) REDUCTION(+:b2stbr_phys_sna)
  do is = 0, ns-1
    do iy = -1, ny
      !$OMP SIMD REDUCTION(+:b2stbr_phys_sna)
      do ix = -1, nx
        b2stbr_phys_sna(is) = b2stbr_phys_sna(is) + sna0(ix,iy,0,is) + sna0(ix,iy,1,is)*na(ix,iy,is)
      enddo
    enddo
  enddo
  !$OMP END PARALLEL DO
  T_END = omp_get_wtime()

  deallocate(b2stbr_phys_sna)
  deallocate(sna0)
  deallocate(na)

  PRINT *, "Work took", T_END - T_START, "seconds"

end program hello

with:

ifx -g -O2 -qopt-report=3 -qopenmp -xhost -mprefer-vector-width=512 ifx_test.f90 -o ifx_test.exe

and with ifort as:

ifort -g -O2 -qopt-report=3 -qopenmp -xhost -qopt-zmm-usage=high ifx_test.f90 -o ifx_test.exe

I then run export OMP_NUM_THREADS=2 followed by ./ifx_test.exe, which produces a segmentation fault.

With gfortran 13.2.0, compiling with

gfortran ifx_test.f90 -fopenmp -fopenmp-simd -O3 -o g_ifx_test.exe

and running with export OMP_NUM_THREADS=2 produces no error.

When I remove the line

!$OMP SIMD REDUCTION(+:b2stbr_phys_sna)

it always runs successfully with ifort/ifx, with any number of threads.

Could this be an ifort/ifx compiler bug with OpenMP SIMD?

1 Solution
jimdempseyatthecove
Honored Contributor III
938 Views

The loop structure you have does not present multiple threads writing to the same cells of array b2stbr_phys_sna, and therefore the REDUCTION clause on the !$OMP PARALLEL... directive is not required.

The REDUCTION clause on the !$OMP SIMD should not be required either. The compiler's optimizer should see that a summation is being performed into a scalar: the LHS of the = is a scalar, the RHS can be vectorized, and there are no loop-order dependencies.

You can use VTune on the fully optimized code and examine the disassembly to see whether the code was vectorized.

Jim Dempsey

6 Replies
Gaurav-Saxena
Beginner
876 Views

Hi Jim,

Many thanks for the explanation. I now fully understand why the reduction clause is not required on the OpenMP parallel region. Further, if I remove the reduction clause on the OMP SIMD directive, the compiler still vectorizes the innermost loop. I will remove them. Still, it is a bit strange that two or more threads cause a segmentation fault with ifort/ifx 2023.2.0 and not with gfortran 13.2.0. Many thanks again. Please consider this as solved and closed.

Best regards,

Gaurav

jimdempseyatthecove
Honored Contributor III
813 Views

I agree that a segmentation fault should not have resulted from the superfluous REDUCTION clauses. Having them in there would have resulted in each thread getting a (stack) copy of the b2stbr_phys_sna array, which has 300 elements (2400 bytes), significantly less than the ~1MB default stack size.
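
The size estimate above can be checked with a few lines of Fortran. This is a minimal sketch, not part of the original program: the program name and the stand-in array b are hypothetical, and the 2400-byte figure assumes the usual 8-byte double precision.

```fortran
program reduction_copy_size
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: ns = 300          ! same extent as in the original post
  double precision, allocatable :: b(:)   ! hypothetical stand-in for b2stbr_phys_sna
  integer(int64) :: bytes

  allocate(b(0:ns-1))

  ! storage_size returns the size of one element in bits
  bytes = int(size(b), int64) * int(storage_size(b), int64) / 8_int64

  print *, 'elements:', size(b), ' bytes per private copy:', bytes
  deallocate(b)
end program reduction_copy_size
```

The number printed is the size of one private copy of the array; how many such copies a given compiler creates for the PARALLEL DO and SIMD reductions is not something this sketch can show.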

 

Jim Dempsey

TobiasK
Moderator
675 Views

@Gaurav-Saxena

The reduction clause on the simd construct is a must. omp simd only tells the compiler to vectorize; it does not guarantee that the reduction is handled correctly. Please set a higher OMP_STACKSIZE, e.g. OMP_STACKSIZE=1G.

Anyway, if you are writing this code for a CPU platform, I would change the loop nest and remove the collapse(2) clause. The outer loop is large enough to handle CPU threading, and that way you don't need a reduction over the array b2stbr_phys_sna:


!$OMP PARALLEL DO DEFAULT(NONE) SHARED(sna0,na,b2stbr_phys_sna) PRIVATE(t,is,iy,ix)
do is=0,ns-1
  t=0.d0
  do iy=-1,ny
    !$OMP SIMD REDUCTION(+:t)
    do ix=-1,nx
      t=t+sna0(ix,iy,0,is)+sna0(ix,iy,1,is)*na(ix,iy,is)
    enddo
  enddo
  b2stbr_phys_sna(is)=b2stbr_phys_sna(is)+t
enddo
!$OMP END PARALLEL DO
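
For reference, the fragment above can be dropped into a complete program along these lines. This is a sketch under assumptions: t is declared double precision (the fragment does not declare it), the extents are reduced from the original post so the run is short, and the closing check against a closed-form sum is added only for illustration. Without COLLAPSE(2), each value of is is handled by exactly one thread, which is why b2stbr_phys_sna can be SHARED here.

```fortran
program restructured_sum
  implicit none
  ! Smaller extents than in the original post (an assumption, to keep the run short)
  integer, parameter :: ns = 30, ny = 96, nx = 36
  integer :: ix, iy, is
  double precision :: t          ! scalar accumulator, private to each thread
  double precision :: expected   ! closed-form value, used only for the check below
  double precision, allocatable :: b2stbr_phys_sna(:), sna0(:,:,:,:), na(:,:,:)

  allocate(b2stbr_phys_sna(0:ns-1))
  allocate(sna0(-1:nx,-1:ny,0:1,0:ns-1))
  allocate(na(-1:nx,-1:ny,0:ns-1))

  b2stbr_phys_sna = 0.0d0
  sna0 = 0.01d0
  na = 0.02d0

  !$OMP PARALLEL DO DEFAULT(NONE) SHARED(sna0,na,b2stbr_phys_sna) PRIVATE(t,is,iy,ix)
  do is = 0, ns-1
    t = 0.d0
    do iy = -1, ny
      !$OMP SIMD REDUCTION(+:t)
      do ix = -1, nx
        t = t + sna0(ix,iy,0,is) + sna0(ix,iy,1,is)*na(ix,iy,is)
      enddo
    enddo
    ! Only the thread that owns this value of is writes this element
    b2stbr_phys_sna(is) = b2stbr_phys_sna(is) + t
  enddo
  !$OMP END PARALLEL DO

  ! Every (ix,iy) point contributes 0.01 + 0.01*0.02, so the sum per is is known
  expected = dble(nx+2) * dble(ny+2) * (0.01d0 + 0.01d0*0.02d0)
  if (maxval(abs(b2stbr_phys_sna - expected)) > 1.0d-9 * expected) error stop 'unexpected sum'
  print *, 'max relative error:', maxval(abs(b2stbr_phys_sna - expected)) / expected

  deallocate(b2stbr_phys_sna, sna0, na)
end program restructured_sum
```

It should build with the same commands as in the original post, for example with -qopenmp for ifx or -fopenmp for gfortran.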


jimdempseyatthecove
Honored Contributor III
638 Views

Tobias,

In the above case, t is private and scalar. The compiler's optimizer should have been able to determine that the code was suitable for vector-wide summation within the do ix loop and a vector-wide reduction upon exit of the loop, all without the !$omp simd reduction clause. I believe it worked this way with ifort; has the clause become necessary with ifx?
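
What "vector-wide summation" followed by a "vector-wide reduction" means can be emulated by hand. This is a rough sketch, assuming 8 lanes (eight doubles per 512-bit register) and a hypothetical array x holding the per-element terms. It says nothing about the code ifort or ifx actually generate.

```fortran
program lane_sum_sketch
  implicit none
  integer, parameter :: n = 362      ! length of the ix loop in the thread (-1:360)
  integer, parameter :: w = 8        ! assumed vector width: 8 doubles per 512-bit register
  double precision :: x(n), lanes(w), t_vec, t_ref
  integer :: i, nfull

  x = 0.0102d0                       ! stands in for 0.01 + 0.01*0.02 per element
  lanes = 0.d0
  nfull = (n / w) * w                ! largest multiple of w not exceeding n

  ! Vector-wide summation: each lane accumulates its own partial sum
  do i = 1, nfull, w
    lanes = lanes + x(i:i+w-1)
  enddo

  ! Vector-wide reduction on loop exit: combine the lanes, then add the remainder
  t_vec = sum(lanes) + sum(x(nfull+1:n))

  ! Compare against a plain sum of the same data
  t_ref = sum(x)
  if (abs(t_vec - t_ref) > 1.0d-12 * abs(t_ref)) error stop 'lane sum mismatch'
  print *, 'lane sum:', t_vec, ' reference:', t_ref
end program lane_sum_sketch
```

The lanes add the terms in a different order than the sequential loop does, so the two sums can differ in the last bits.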

 

Considering that the same source code can be compiled with and without -qopenmp, you'd want/expect the optimization behavior to be the same.
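
The point about one source serving both builds rests on OpenMP directives being plain comments to a compiler that is not asked to process OpenMP. A minimal sketch, using the standard !$ conditional-compilation sentinel; the program and variable names are hypothetical.

```fortran
program serial_or_openmp
  !$ use omp_lib
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  double precision :: a(n), t

  a = 0.5d0
  t = 0.d0

  ! A comment to a non-OpenMP build, a SIMD reduction directive to an OpenMP build
  !$OMP SIMD REDUCTION(+:t)
  do i = 1, n
    t = t + a(i)
  enddo

  print *, 'sum =', t
  ! The next line is compiled only when OpenMP processing is enabled
  !$ print *, 'built with OpenMP, max threads =', omp_get_max_threads()
end program serial_or_openmp
```

Building it once with and once without the OpenMP option, and comparing the optimization reports, is one way to look at the behavior asked about here.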

 

Jim Dempsey

TobiasK
Moderator
469 Views

@jimdempseyatthecove


Sorry for the long delay, but I wanted to be sure about this. I talked to the vectorizer and OpenMP teams. I know that in some cases the compiler is able to vectorize with omp simd without the reduction clause. However, the standard does not guarantee that the code is correct if the reduction clause is omitted. I have seen cases where the compiler generated wrong code when the reduction clause was not present. So in short: the reduction clause is a must if you want the compiler to generate correct code. The initial code posted works fine but needs an increased stack size for the array reduction.

