Hello all, I've been trying, without much success, to compile and run code with Intel 19 (19.0.5 and 184.108.40.206) that had been working fine with ifort 18.0.5. I've been getting various segfaults that seem to occur at the intersection of OpenMP and O3 optimization. I was able to come up with a simple code (below) that reproduces a problem, although I'm not certain it is precisely the problem I'm having in my real code, as it doesn't manifest in quite the same way.
For the simple code, I find that when I compile with "-qopenmp" and run with several threads I correctly get
Checksum 1 = 151500000.000000
Checksum 2 = 151500000.000000
Checksum 3 = 0.000000000000000E+000
whereas when using optimization I instead get,
Flags: -qopenmp -O3
Checksum 1 = 151500000.000000
Checksum 2 = 151500000.000000
Checksum 3 = 1617187600.00000
The code works in 18.xx with both compiler options. I also notice that if I try to compile the simple code with "-qopenmp -O3 -fPIC" I get the strange compiler error,
/tmp/ifortz68jYX.i90: catastrophic error: **Internal compiler error: segmentation violation signal raised** Please report this error along with the circumstances in which it occurred in a Software Problem Report. Note: File and line given may not be explicit cause of this error.
If I compile my full code with "-fPIC", however, I do not get that error (although the code still doesn't work). In the simple reproducer the code runs but the numbers are wrong, whereas in my real code I get a segfault that seems to be related to the array-valued function used in the OMP loop nest. The size of the array is only 3x1, though, and I'm setting both "ulimit -s unlimited" and OMP_STACKSIZE to values that should be more than sufficient.
Thanks in advance for any help you can provide.
module math
  use, intrinsic :: iso_fortran_env
  implicit none
  integer, parameter :: rp = REAL64
  integer, parameter :: NDIM = 3
  integer, parameter :: N = 100
contains
  function cross(a, b)
    real(rp), dimension(NDIM), intent(in) :: a, b
    real(rp), dimension(NDIM) :: cross
    cross(1) = a(2)*b(3) - a(3)*b(2)
    cross(2) = a(3)*b(1) - a(1)*b(3)
    cross(3) = a(1)*b(2) - a(2)*b(1)
  end function cross
end module math

program ompbug
  use math
  implicit none
  integer :: i,j,k
  real(rp), allocatable, dimension(:,:,:,:) :: Q1,Q2,Q3
  real(rp), dimension(NDIM) :: V1,V2

  allocate(Q1(N,N,N,NDIM))
  allocate(Q2(N,N,N,NDIM))
  allocate(Q3(N,N,N,NDIM))

  !$OMP PARALLEL DO default(shared) collapse(2) &
  !$OMP private(i,j,k,V1,V2)
  do i=1,N
    do j=1,N
      do k=1,N
        V1 = [1.0*i,1.0*j,1.0*k]
        V2 = [1.0*k,1.0*j,1.0*i]
        Q1(i,j,k,:) = V1
        Q2(i,j,k,:) = V2
        Q3(i,j,k,:) = cross(V1,V2)
      enddo
    enddo
  enddo

  write(*,*) 'Checksum 1 = ', sum(Q1)
  write(*,*) 'Checksum 2 = ', sum(Q2)
  write(*,*) 'Checksum 3 = ', sum(Q3)
end program ompbug
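For reference, the correct checksums can be worked out by hand, which makes it easy to tell which build is wrong. Each of i, j and k runs over 1..N, so sum(Q1) = sum(Q2) = 3 * N**2 * N*(N+1)/2, which is 151500000 for N = 100 and matches the "-qopenmp" run above. With V1 = (i,j,k) and V2 = (k,j,i), the three cross-product components add up to 2*i*j - 2*j*k + k*k - i*i, which cancels by symmetry when summed over the whole grid, so sum(Q3) should be exactly zero. The following is a minimal serial sketch of that check (no OpenMP); the program name checksum_ref is hypothetical and not part of the reproducer:

```fortran
program checksum_ref
   use, intrinsic :: iso_fortran_env, only: REAL64
   implicit none
   integer, parameter :: rp = REAL64
   integer, parameter :: N = 100
   integer :: i, j, k
   real(rp) :: s1, s3, expected1

   ! Closed form: each of the three indices contributes N*N * (1 + 2 + ... + N).
   expected1 = 3.0_rp * real(N, rp)**2 * real(N*(N + 1)/2, rp)

   s1 = 0.0_rp
   s3 = 0.0_rp
   do k = 1, N
      do j = 1, N
         do i = 1, N
            ! V1 = (i,j,k), V2 = (k,j,i): accumulate sum(V1) and sum(cross(V1,V2)).
            s1 = s1 + real(i + j + k, rp)
            s3 = s3 + real((j*i - k*j) + (k*k - i*i) + (i*j - j*k), rp)
         end do
      end do
   end do

   write(*,*) 'Expected checksum 1 and 2 = ', expected1
   write(*,*) 'Serial   checksum 1 and 2 = ', s1
   write(*,*) 'Serial   checksum 3       = ', s3

   ! All values are integers well below 2**53, so exact comparison is safe here.
   if (s1 /= expected1 .or. s3 /= 0.0_rp) error stop 'reference checksum mismatch'
end program checksum_ref
```

Any build of the reproducer whose Checksum 3 is non-zero, such as the "-qopenmp -O3" result above, is therefore producing wrong values rather than a rounding difference.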
-fPIC is not the problem. The loop transformations applied at O3 are what break the generated code. At first I thought it was the COLLAPSE clause, but I removed it and the error remains. It's the loop transformations, combined with the array syntax for the 3 NDIM elements, that confuse the loop optimizer.
I'll write up a bug report. You know that O2 works, right? Use that as a workaround for now.
I appreciate this may be a minimal example of a problem, but depending on the default SCHEDULE, the example as provided could have a high memory <> cache transfer demand, which could be reduced by having each thread modify memory that is local to it.
I would suggest the following change which, if it is possible in your real code, could improve performance. Given N = 100 (i.e., much larger than the number of threads), I would also expect that collapse(2) does not make a significant difference.
  allocate (Q1(NDIM,N,N,N))
  allocate (Q2(NDIM,N,N,N))
  allocate (Q3(NDIM,N,N,N))

  !$OMP PARALLEL DO default(shared) collapse(2) &
  !$OMP private(i,j,k,V1,V2)
  do k=1,N
    do j=1,N
      do i=1,N
        V1 = [1.0*i,1.0*j,1.0*k]
        V2 = [1.0*k,1.0*j,1.0*i]
        Q1(:,i,j,k) = V1
        Q2(:,i,j,k) = V2
        Q3(:,i,j,k) = cross(V1,V2)
      enddo
    enddo
  enddo
Those are my thoughts; I would be interested to hear if you disagree.
Ronald and John, thank you both for taking the time to look at my example and your helpful comments. It took me a while to come up with a simple example that caused *A* problem, so it might be the case that this isn't sufficiently representative as it doesn't manifest in the exact same way as the problem with my code.
Ronald: Reducing the optimization level to O2 seems to fix this specific example, but when running my full code, even with O2, there seem to be NaNs popping up somewhere, and I am having trouble getting a useful traceback. Again, the code runs fine with gfortran and ifort 18.xx.
Regarding your comment about the loop transform and array syntax, this general kind of loop nest is something I use a lot for vectors on a 3D grid. Is there some way of structuring the loop nest differently so that I can use some vector formula in the loop body and assign it to individual cells without confusing the compiler?
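One possible restructuring, sketched below, keeps the vector formula in one place but passes the components as scalars, so the loop body contains no array-valued function result, no array constructor and no private work arrays. This is only an illustration: the names math_scalar, cross_xyz and loop_scalar are hypothetical, and whether this form actually avoids the 19.x optimizer problem has not been tested here.

```fortran
module math_scalar
   use, intrinsic :: iso_fortran_env, only: REAL64
   implicit none
   integer, parameter :: rp = REAL64
contains
   ! Hypothetical scalar-argument variant of cross(): same formula,
   ! but the result comes back through three scalar arguments.
   pure subroutine cross_xyz(ax, ay, az, bx, by, bz, cx, cy, cz)
      real(rp), intent(in)  :: ax, ay, az, bx, by, bz
      real(rp), intent(out) :: cx, cy, cz
      cx = ay*bz - az*by
      cy = az*bx - ax*bz
      cz = ax*by - ay*bx
   end subroutine cross_xyz
end module math_scalar

program loop_scalar
   use math_scalar
   implicit none
   integer, parameter :: N = 100
   integer :: i, j, k
   real(rp), allocatable :: Q1(:,:,:,:), Q2(:,:,:,:), Q3(:,:,:,:)

   allocate(Q1(N,N,N,3), Q2(N,N,N,3), Q3(N,N,N,3))

   ! Loop order k,j,i so that the innermost index is the fastest-varying one.
   !$OMP PARALLEL DO default(shared) collapse(2) private(i,j,k)
   do k = 1, N
      do j = 1, N
         do i = 1, N
            Q1(i,j,k,1) = real(i, rp)
            Q1(i,j,k,2) = real(j, rp)
            Q1(i,j,k,3) = real(k, rp)
            Q2(i,j,k,1) = real(k, rp)
            Q2(i,j,k,2) = real(j, rp)
            Q2(i,j,k,3) = real(i, rp)
            call cross_xyz(Q1(i,j,k,1), Q1(i,j,k,2), Q1(i,j,k,3), &
                           Q2(i,j,k,1), Q2(i,j,k,2), Q2(i,j,k,3), &
                           Q3(i,j,k,1), Q3(i,j,k,2), Q3(i,j,k,3))
         end do
      end do
   end do
   !$OMP END PARALLEL DO

   write(*,*) 'Checksum 1 = ', sum(Q1)
   write(*,*) 'Checksum 2 = ', sum(Q2)
   write(*,*) 'Checksum 3 = ', sum(Q3)

   ! Same initial data as the reproducer, so Checksum 3 must be exactly zero.
   if (sum(Q3) /= 0.0_rp) error stop 'checksum 3 should be zero'
end program loop_scalar
```

The trade-off is readability: the body no longer reads as a vector formula, so this is more of a diagnostic variant to compare against the array-syntax version than a recommended style.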
John: Yes, this was a minimal example. I take both your points about the memory access pattern and the collapse. For Ni,Nj,Nk cells I'd typically have Q(Ni,Nj,Nk,NDIM) and loop nest order k,j,i.
For the collapse, I typically find that I'm running 2 threads/core on the supercomputers we run on, so N=100 in just the outer-most loop isn't significantly larger than the 64 threads I'd be using. Also, depending on the MPI decomposition (we usually run 1 MPI rank/node, with OMP within the node) you may have small-ish dimensions in K or J. I don't think I understand your statement about modifying memory locally for each thread, could you clarify that?
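The thread-count argument can be put into numbers. The sketch below assumes a static schedule that deals iterations out as evenly as possible and equal work per iteration (both are assumptions, since the actual default schedule is implementation-defined), and compares parallelizing only the outer loop with collapse(2) for 64 threads. The program name collapse_balance is hypothetical.

```fortran
program collapse_balance
   implicit none
   integer, parameter :: nthreads = 64   ! thread count quoted in the post

   call report('N=100, outer loop only', 100)
   call report('N=100, collapse(2)    ', 100*100)
   call report('N=65,  outer loop only', 65)
   call report('N=65,  collapse(2)    ', 65*65)

contains

   subroutine report(label, niter)
      character(len=*), intent(in) :: label
      integer, intent(in) :: niter
      integer :: max_per_thread
      real :: efficiency

      ! The busiest thread gets ceiling(niter/nthreads) iterations and
      ! determines the wall time; efficiency is ideal time / actual time.
      max_per_thread = (niter + nthreads - 1) / nthreads
      efficiency = real(niter) / real(nthreads * max_per_thread)

      write(*,'(a,2x,a,i0,2x,a,f6.3)') label, 'max iterations/thread = ', &
         max_per_thread, 'efficiency = ', efficiency
   end subroutine report
end program collapse_balance
```

Under these assumptions, N = 100 on the outer loop alone leaves the busiest thread with 2 iterations while 28 threads have only 1, for an efficiency of about 78%, whereas collapse(2) brings it to about 99.5%. For N = 65 the gap is larger: roughly 51% without collapse and about 98.5% with it.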
Thanks to both of you for your help!
My suggestion was to overcome an anticipated memory coherence problem when using Q1(i,...,:) which would probably result in many cache updates.
Q1(:,...,i) would definitely mitigate this problem.
Even Q1(k,j,i,:) would be better, as then Q1(k,...) might mean that each thread cache goes to a different memory page.
I would expect that with "Q1(i..." and 64 threads, you would be paying a large memory <> cache performance penalty, although my experience is with only 8-12 threads. As the number of threads increases, I expect there are lots of these issues that would reduce the thread efficiency.
The use of collapse(2) would certainly help when N=65!
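To make the layout difference concrete, the sketch below computes element offsets for the two layouts using Fortran's column-major storage order. It only does index arithmetic and allocates nothing; the names layout_stride, off_a and off_b are hypothetical, and the element size of 8 bytes assumes REAL64 data as in the reproducer.

```fortran
program layout_stride
   implicit none
   integer, parameter :: N = 100, NDIM = 3
   integer, parameter :: elem_bytes = 8   ! one REAL64 element
   integer :: comp_a, cell_a, comp_b, cell_b

   ! Layout A is Q(N,N,N,NDIM); layout B is Q(NDIM,N,N,N).
   ! comp_*: distance between components 1 and 2 of the same cell.
   ! cell_*: distance between the same component of cells i and i+1.
   comp_a = off_a(1,1,1,2) - off_a(1,1,1,1)
   cell_a = off_a(2,1,1,1) - off_a(1,1,1,1)
   comp_b = off_b(2,1,1,1) - off_b(1,1,1,1)
   cell_b = off_b(1,2,1,1) - off_b(1,1,1,1)

   write(*,'(a,i0,a,i0)') 'Q(N,N,N,NDIM): component stride (bytes) = ', &
      comp_a*elem_bytes, ', cell stride (bytes) = ', cell_a*elem_bytes
   write(*,'(a,i0,a,i0)') 'Q(NDIM,N,N,N): component stride (bytes) = ', &
      comp_b*elem_bytes, ', cell stride (bytes) = ', cell_b*elem_bytes

   if (comp_a /= N*N*N .or. cell_a /= 1) error stop 'unexpected layout A strides'
   if (comp_b /= 1 .or. cell_b /= NDIM) error stop 'unexpected layout B strides'

contains

   ! Zero-based element offset of Q(i,j,k,d) in an array declared Q(N,N,N,NDIM).
   pure integer function off_a(i, j, k, d)
      integer, intent(in) :: i, j, k, d
      off_a = (i-1) + N*((j-1) + N*((k-1) + N*(d-1)))
   end function off_a

   ! Zero-based element offset of Q(d,i,j,k) in an array declared Q(NDIM,N,N,N).
   pure integer function off_b(d, i, j, k)
      integer, intent(in) :: d, i, j, k
      off_b = (d-1) + NDIM*((i-1) + N*((j-1) + N*(k-1)))
   end function off_b
end program layout_stride
```

With Q(N,N,N,NDIM), the three components of one cell are N**3 elements (8,000,000 bytes) apart, so an assignment such as Q1(i,j,k,:) = V1 touches three widely separated locations, while cells that differ only in i are 8 bytes apart. With Q(NDIM,N,N,N) the three components of a cell are adjacent and neighbouring cells are 24 bytes apart.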
John (and Kareem)
While the initialization loop as written in #1 might benefit from the index rearrangement that John posts in #4, your compute-intensive loops may benefit more from the original arrangement if you can construct your calls to pass in an array of X, an array of Y and an array of Z, and thus be amenable to vectorization. It is difficult to offer the best advice for your application without seeing the complete code and being able to study it using VTune.
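A minimal sketch of that idea, assuming the original Q(N,N,N,NDIM) layout: each call receives whole contiguous lines of X, Y and Z components along the first index, so the cross product becomes three unit-stride array expressions. The names math_soa, cross_lines and soa_demo are hypothetical, and whether the compiler actually vectorizes these statements would need to be confirmed with an optimization report or VTune.

```fortran
module math_soa
   use, intrinsic :: iso_fortran_env, only: REAL64
   implicit none
   integer, parameter :: rp = REAL64
contains
   ! Hypothetical line-wise cross product: each argument is a 1D line of
   ! one component, and each statement is a whole-array expression.
   pure subroutine cross_lines(ax, ay, az, bx, by, bz, cx, cy, cz)
      real(rp), intent(in)  :: ax(:), ay(:), az(:), bx(:), by(:), bz(:)
      real(rp), intent(out) :: cx(:), cy(:), cz(:)
      cx = ay*bz - az*by
      cy = az*bx - ax*bz
      cz = ax*by - ay*bx
   end subroutine cross_lines
end module math_soa

program soa_demo
   use math_soa
   implicit none
   integer, parameter :: N = 100
   integer :: i, j, k
   real(rp), allocatable :: Q1(:,:,:,:), Q2(:,:,:,:), Q3(:,:,:,:)

   allocate(Q1(N,N,N,3), Q2(N,N,N,3), Q3(N,N,N,3))

   ! Threads share the (k,j) space; each iteration handles one full i-line.
   !$OMP PARALLEL DO default(shared) collapse(2) private(i,j,k)
   do k = 1, N
      do j = 1, N
         do i = 1, N
            Q1(i,j,k,1) = real(i, rp)
            Q1(i,j,k,2) = real(j, rp)
            Q1(i,j,k,3) = real(k, rp)
            Q2(i,j,k,1) = real(k, rp)
            Q2(i,j,k,2) = real(j, rp)
            Q2(i,j,k,3) = real(i, rp)
         end do
         ! Q(:,j,k,d) is a contiguous section in this layout.
         call cross_lines(Q1(:,j,k,1), Q1(:,j,k,2), Q1(:,j,k,3), &
                          Q2(:,j,k,1), Q2(:,j,k,2), Q2(:,j,k,3), &
                          Q3(:,j,k,1), Q3(:,j,k,2), Q3(:,j,k,3))
      end do
   end do
   !$OMP END PARALLEL DO

   write(*,*) 'Checksum 3 = ', sum(Q3)

   ! Same initial data as the reproducer in #1, so the sum must be exactly zero.
   if (sum(Q3) /= 0.0_rp) error stop 'checksum 3 should be zero'
end program soa_demo
```

This keeps the layout from #1 and moves the parallelism to the k and j loops, which is the opposite trade-off to the Q(NDIM,N,N,N) arrangement suggested in #4.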
We're going through some older bug reports and I tested this reproducer. It is working correctly with the latest compiler release 2021.4.0 which is part of oneAPI HPC Toolkit.
Please give it a try and let me know how it works for you.