I compile the following Fortran program:
program hello
  use omp_lib
  implicit none
  integer, parameter::ns = 300, ny = 960, nx = 360
  integer, parameter::EXTRA = 0
  integer :: ix, iy, is
  double precision, allocatable::b2stbr_phys_sna(:),sna0(:,:,:,:),na(:,:,:)
  double precision :: T_START, T_END
  allocate(b2stbr_phys_sna(0:ns-1 + EXTRA))
  allocate(sna0(-1:nx,-1:ny,0:1,0:ns-1 + EXTRA))
  allocate(na(-1:nx,-1:ny,0:ns-1 + EXTRA))
  b2stbr_phys_sna=0.0
  sna0 = 0.01
  na = 0.02
  T_START = omp_get_wtime()
  !$OMP PARALLEL DO DEFAULT(NONE) COLLAPSE(2) SHARED(sna0,na) PRIVATE(is,iy,ix) REDUCTION(+:b2stbr_phys_sna)
  do is=0,ns-1
    do iy=-1,ny
      !$OMP SIMD REDUCTION(+:b2stbr_phys_sna)
      do ix=-1,nx
        b2stbr_phys_sna(is)=b2stbr_phys_sna(is)+sna0(ix,iy,0,is)+sna0(ix,iy,1,is)*na(ix,iy,is)
      enddo
    enddo
  enddo
  !$OMP END PARALLEL DO
  T_END = omp_get_wtime()
  deallocate(b2stbr_phys_sna)
  deallocate(sna0)
  deallocate(na)
  PRINT *, "Work took", T_END - T_START, "seconds"
end program hello
with:
ifx -g -O2 -qopt-report=3 -qopenmp -xhost -mprefer-vector-width=512 ifx_test.f90 -o ifx_test.exe
and with ifort as:
ifort -g -O2 -qopt-report=3 -qopenmp -xhost -qopt-zmm-usage=high ifx_test.f90 -o ifx_test.exe
I then set export OMP_NUM_THREADS=2 and run ./ifx_test.exe
It produces a segmentation fault.
With gfortran 13.2.0, compiling with
gfortran ifx_test.f90 -fopenmp -fopenmp-simd -O3 -o g_ifx_test.exe
and running with export OMP_NUM_THREADS=2 produces no error.
When I remove the
!$OMP SIMD REDUCTION(+:b2stbr_phys_sna)
line (with any number of threads), it always runs successfully with ifort / ifx.
Could this be an ifort / ifx compiler bug with OpenMP SIMD?
The loop structures you have do not present multiple thread writes to the same cells of array b2stbr_phys_sna, and therefore the REDUCTION clause on the !$OMP PARALLEL... directive is not required.
The REDUCTION clause on the !$OMP SIMD should not be required either. The compiler optimization should see that a summation is being performed to a scalar. The LHS of the = is scalar, the RHS can be vectorized, and there are no loop-order dependencies.
You can use VTune on fully optimized code, and examine the Disassembly to see if the code was vectorized.
Jim Dempsey
Hi Jim,
Many thanks for the explanation. I now fully understand why the reduction clause is not required on the OpenMP parallel region. Further, if I remove the reduction clause on the OMP SIMD directive, the compiler still vectorizes the innermost loop. I will remove them. Still, it is a bit strange that 2 or more threads cause a segmentation fault with ifort/ifx 2023.2.0 and not with gfortran 13.2.0. Many thanks again. Please consider this as solved and closed.
Best regards,
Gaurav
I agree that a segmentation fault should not have resulted from the superfluous REDUCTION clauses. With them in place, each thread gets a (stack) copy of the b2stbr_phys_sna array: 300 elements (2400 bytes), significantly less than the ~1MB default stack size.
Jim Dempsey
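For reference, the size estimate above can be reproduced with a minimal sketch. The module and function names (copy_size, private_copy_bytes) are hypothetical, and it assumes double precision occupies 8 bytes, which is what storage_size reports with the compilers discussed in this thread:

```fortran
module copy_size
  implicit none
contains
  ! Bytes needed for one private copy of a double precision array
  ! with n elements. storage_size returns the size in bits.
  pure function private_copy_bytes(n) result(nbytes)
    integer, intent(in) :: n
    integer :: nbytes
    nbytes = n * storage_size(0.d0) / 8
  end function private_copy_bytes
end module copy_size
```

With n = 300 (the value of ns in the original program) this gives 2400 bytes per private copy, matching the figure quoted above.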
The reduction clause on the simd construct is a must. omp simd only tells the compiler to vectorize; it does not guarantee that the reduction is handled correctly. Please set a higher OMP_STACKSIZE, e.g. OMP_STACKSIZE=1G.
Anyway, if you are writing this code for a CPU platform, I would change the loop nest and remove the collapse(2) clause. The outer loop is large enough to handle CPU threading, and that way you don't need a reduction over the array b2stbr_phys_sna:
!$OMP PARALLEL DO DEFAULT(NONE) SHARED(sna0,na,b2stbr_phys_sna) PRIVATE(t,is,iy,ix)
do is=0,ns-1
  t=0.d0
  do iy=-1,ny
    !$OMP SIMD REDUCTION(+:t)
    do ix=-1,nx
      t=t+sna0(ix,iy,0,is)+sna0(ix,iy,1,is)*na(ix,iy,is)
    enddo
  enddo
  b2stbr_phys_sna(is)=b2stbr_phys_sna(is)+t
enddo
!$OMP END PARALLEL DO
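To make the fragment above easier to try, here is a minimal self-contained sketch of the same loop nest wrapped in a subroutine. The module name row_sums, the subroutine name accumulate, and the argument acc (standing in for b2stbr_phys_sna) are hypothetical. The sizes are passed in as arguments so the routine can be exercised with small arrays, and t is declared explicitly, which the fragment leaves out.

```fortran
module row_sums
  implicit none
contains
  ! For every is, add the sum over (ix,iy) of
  ! sna0(ix,iy,0,is) + sna0(ix,iy,1,is)*na(ix,iy,is) to acc(is).
  subroutine accumulate(sna0, na, acc, nx, ny, ns)
    integer, intent(in) :: nx, ny, ns
    double precision, intent(in) :: sna0(-1:nx, -1:ny, 0:1, 0:ns-1)
    double precision, intent(in) :: na(-1:nx, -1:ny, 0:ns-1)
    double precision, intent(inout) :: acc(0:ns-1)
    integer :: ix, iy, is
    double precision :: t   ! per-thread scalar accumulator

    ! Only the is loop is distributed over threads (no COLLAPSE).
    !$OMP PARALLEL DO DEFAULT(NONE) SHARED(sna0,na,acc,nx,ny,ns) PRIVATE(t,is,iy,ix)
    do is = 0, ns-1
      t = 0.d0
      do iy = -1, ny
        ! Scalar reduction inside the vectorized loop.
        !$OMP SIMD REDUCTION(+:t)
        do ix = -1, nx
          t = t + sna0(ix,iy,0,is) + sna0(ix,iy,1,is)*na(ix,iy,is)
        enddo
      enddo
      acc(is) = acc(is) + t
    enddo
    !$OMP END PARALLEL DO
  end subroutine accumulate
end module row_sums
```

Because each value of is is handled by exactly one thread, acc(is) is written by a single thread, so the only reduction left is the scalar one on t.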
Tobias,
In the above case, t is private and scalar. The compiler optimization should have been able to determine that the code was suitable for vector-wide summation within the do ix loop, and vector-wide reduction upon exit of the loop, all without the !$omp simd reduction clause. I believe it worked this way with ifort. Has this become necessary with ifx?
Considering that the same source code can be compiled with and without -qopenmp, you'd want/expect the optimization behavior to be the same.
Jim Dempsey
Sorry for the long delay, but I wanted to be sure about this. I talked to the vectorizer and OpenMP teams. I know that in some cases the compiler is able to vectorize with omp simd without the reduction clause. However, the standard does not guarantee that the code is correct if the reduction clause is omitted, and I have seen cases where the compiler generated wrong code when the reduction clause was not present. So in short: the reduction clause is a must if you want the compiler to generate correct code. The initial code posted works fine but needs an increased stack size for the array reduction.
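As an illustration of the pattern from the first post with both reduction clauses kept, here is a minimal sketch wrapped in a subroutine so it can be called with small test sizes. The module name array_reduction_demo, the subroutine name accumulate_collapsed, and the argument acc (standing in for b2stbr_phys_sna) are hypothetical. The sketch does not set OMP_STACKSIZE itself; at the original problem size the advice above is to raise it in the environment before running.

```fortran
module array_reduction_demo
  implicit none
contains
  ! Same loop nest as the first post. COLLAPSE(2) distributes (is,iy)
  ! pairs over threads, so different threads can update the same acc(is);
  ! the array reduction on the parallel loop combines their private copies.
  subroutine accumulate_collapsed(sna0, na, acc, nx, ny, ns)
    integer, intent(in) :: nx, ny, ns
    double precision, intent(in) :: sna0(-1:nx, -1:ny, 0:1, 0:ns-1)
    double precision, intent(in) :: na(-1:nx, -1:ny, 0:ns-1)
    ! Assumed to be allocated with bounds 0:ns-1 by the caller.
    double precision, allocatable, intent(inout) :: acc(:)
    integer :: ix, iy, is

    !$OMP PARALLEL DO DEFAULT(NONE) COLLAPSE(2) SHARED(sna0,na,nx,ny,ns) PRIVATE(is,iy,ix) REDUCTION(+:acc)
    do is = 0, ns-1
      do iy = -1, ny
        ! Array reduction repeated on the SIMD loop, as in the original.
        !$OMP SIMD REDUCTION(+:acc)
        do ix = -1, nx
          acc(is) = acc(is) + sna0(ix,iy,0,is) + sna0(ix,iy,1,is)*na(ix,iy,is)
        enddo
      enddo
    enddo
    !$OMP END PARALLEL DO
  end subroutine accumulate_collapsed
end module array_reduction_demo
```

Each private copy of acc lives on the thread's stack in the implementations discussed here, which is why the stack size setting matters for this variant and not for a scalar accumulator.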