I have the following code which causes the Intel Fortran compiler to cause a runtime SIGSEGV crash for array sizes 600^2:
do iter = 1, num_iterations !$OMP PARALLEL SHARED( A, A_new, outside, inside ) !$OMP WORKSHARE A_new(:, :) = outside(:, :) * A(:, :) + inside(:, :) * 0.25_rp * & ( cshift(A(:, :), dim = 1, shift = 1 ) + & cshift(A(:, :), dim = 1, shift = -1 ) + & cshift(A(:, :), dim = 2, shift = 1 ) + & cshift(A(:, :), dim = 2, shift = -1 )) A(:, :) = A_new(:, :) !$OMP END WORKSHARE !$OMP END PARALLEL end do
If I switch off OpenMP, it works fine so I think there is something wrong with the Intel OpenMP runtime system. This is for Intel Fortran version 17.0.0 20160721, and this also happens for Intel Fortran 18.0.0. I have attached a tar file which contains the two Fortran source code files. Just type "make" and execute the resulting code.
Your code is incorrect. The last statement inside the workshare has temporal dependencies with the first statement inside the workshare..
This said, the programming error would cause SIGSEGV if any of the threads stack size was insufficient.
There is a secondary problem with the above code (when executed by multiple threads in the workshare). Perhapse Steve L. can answer this.
Considering that a thread's WORKSHARE slice of the arrays, in particular array A, just what does cshift(A(:,:)...) use as a source, and equally important produce as a destination, as well as what is the slice of the temporary result?
IOW cshift (IMHO) is incompatible/ambiguous with use within WORKSHARE.
Check your extended compiler code generation report regarding parallelization. You will likely find that the above WORKSHARE was not parallelized due to the code being incompatible with being sliced by WORKSHARE. This will result in the serial code requiring a stack size of (assuming REAL(8)):
600 * 600 * 8 = 2,985,600 bytes per temp, * 8 temps = 23,884,800 bytes
i.e. you will need an available stack size of at least 24MB at the point of execution of the above code.
The above code is horribly inefficient
Try this for starters:
integer :: LBA1, LBA2, S1, S2, IP1, IM1, JP1, JM1 LBA1 = LBOUND(A,DIM=1) LBA2 = LBOUND(A,DIM=2) S1 = SIZE(A,DIM=1) S2 = SIZE(A,DIM=2) do iter = 1, num_iterations !$OMP PARALLEL SHARED( A, outside, inside, LBA1, LBA2, S1, S2 ), PRIVATE(I, J, IP1, IM1, JP1, JM1) !OMP DO COLLAPSE(2) DO J=0,S2-1 DO I=0,S1-1 IP1 = MOD(I+1,S1) IM1 = MOD(I+S1-1,S1) JP1 = MOD(I+1,S2) JM1 = MOD(I+S2-1,S2) A(I+LBA1, J+LBA2) = outside(I+LBA1, J+LBA2) * A(I+LBA1, J+LBA2) + inside(I+LBA1, J+LBA2) * 0.25_rp * & ( A(IP1+LBA1, J+LBA2) + & A(IM1+LBA1, J+LBA2) + & A(I+LBA1, JP1+LBA2) + & A(I+LBA1, JM1+LBA2)) END DO END DO !$OMP END PARALLEL end do
The above (untested) code can improve vectorization by specializing the perimeter cells (in scalar), then vectorizing the interior cells). I will leave that as an exercise for you.
Many thanks for your critique - it is much appreciated. I cannot have barriers in workshare blocks, so I have to create another workshare block to ensure correctness. Thanks for spotting that mistake. The line
A_new(:, :) = outside(:, :) * A(:, :) + inside(:, :) * 0.25_rp * & ( cshift(A(:, :), dim = 1, shift = 1 ) + & cshift(A(:, :), dim = 1, shift = -1 ) + & cshift(A(:, :), dim = 2, shift = 1 ) + & cshift(A(:, :), dim = 2, shift = -1 ))
Has A_new(:, :) on the left-hand side and A(:, :) on the right-hand side. You are spot-on about the stack size. I increased it with "ulimit -s unlimited".
What you ask of the compiler is incongruous with the requirements of WORKSHARE.
Within WORKSHARE, SomeArray(:,:) represents a section of the array. One, or both, of the dimensions are not the complete range of the original array.
IOW the A(:,:) as used within the cshift (within the WORKSHARE), would conceptually be:
cshift(A(SliceLowerBound1:SliceUpperBound1, SliceLowerBound2:SliceUpperBound2), dim = 1, shift=1)...
IOW the cshift is performed on .NOT. the entire array, but rather some slice of the array.
If (when) the slice does not incorporate the full range of the selected dimension to shift, the rotation would occur using the lower and upper bounds of the new slice and not that of the original array. This is not consistent with what the serial code would do (i.e. rotate about the full range of the array on specified dim).