Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

SEGFAULT with OpenMP. Stack Problem?

arktos
Beginner
1,708 Views
Hello All,

I have what appears to be a common segfault problem with OpenMP, but ulimit and
KMP_STACKSIZE (and the variants I've seen based on these) seem to have no effect.
I wonder if anyone has a suggestion about this.

Background: a Q9650 system with 8 GB of RAM running openSUSE 11.1 (64-bit). Compiler: ifort 11.0.
Code: Fortran 77, originally written to run on Cray machines (Y-MP and later), now being modified
to run with OpenMP on Intel processors.

Problem: segfaults when memory use increases. The remedies turned up by web searches
appear ineffective.

The code below is a minimal version that reproduces the problem; it was identified and
extracted from the full program.


[plain]      Common / WORK / W1(20,20,21,21),W2(20,20,21,21),W3(20,20,21,21)

....


c NOTE: N1= 64 N2=N3=21 M2=M3=10


!$OMP PARALLEL DEFAULT(SHARED)
!$OMP+ PRIVATE( Iw, JEL, KEL, JGLL, KGLL, W1, W2, W3 )
!$OMP DO
Do Iw = 1, N1
c
c - Interface condition of the PRESSURE GRADIENT.
c
Do KEL = 2, N3
Do JEL = 1, N2
Do JGLL = 1, M2
c
W1(JGLL,1,JEL,KEL) = W1(JGLL,1,JEL,KEL) + W1(JGLL,M3,JEL,KEL-1)
W2(JGLL,1,JEL,KEL) = W2(JGLL,1,JEL,KEL) + W2(JGLL,M3,JEL,KEL-1)
W3(JGLL,1,JEL,KEL) = W3(JGLL,1,JEL,KEL) + W3(JGLL,M3,JEL,KEL-1)
c
End Do
End Do
End Do
c
End Do
!$OMP END DO
!$OMP END PARALLEL
[/plain]


Any ideas what may be the matter?

Thank you.
--






12 Replies
jimdempseyatthecove
Honored Contributor III
1,708 Views

One of the problems I see is that you are making PRIVATE copies of arrays contained in COMMON (and very large arrays, too).

Try something like this:
[cpp]! in module
type    TypeThreadContext
  SEQUENCE
  REAL, pointer :: W1(:,:,:,:)
  REAL, pointer :: W2(:,:,:,:)
  REAL, pointer :: W3(:,:,:,:)
end type TypeThreadContext

type(TypeThreadContext) :: ThreadContext
COMMON /CONTEXT/ ThreadContext
!$OMP THREADPRIVATE(/CONTEXT/)
-----------------------------
! in initialization code once only
! (nullify the pointers before the first associated() test)
!$OMP PARALLEL
  if(.not. associated(ThreadContext%W1)) allocate(ThreadContext%W1(20,20,21,21))
  if(.not. associated(ThreadContext%W2)) allocate(ThreadContext%W2(20,20,21,21))
  if(.not. associated(ThreadContext%W3)) allocate(ThreadContext%W3(20,20,21,21))
!$OMP END PARALLEL
---------------------------
[/cpp]

Don't forget to deallocate in your finish-up code.

I think you can use ALLOCATABLE in the threadprivate area; at the time I wrote this, the compiler would accept pointers but not allocatables.
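
For what it's worth, a minimal sketch of the ALLOCATABLE variant, assuming a compiler version that accepts allocatables in a THREADPRIVATE directive (the module name ThreadWork is just a placeholder):

[cpp]! allocatable variant, sketch only
module ThreadWork
  real, allocatable :: W1(:,:,:,:), W2(:,:,:,:), W3(:,:,:,:)
!$OMP THREADPRIVATE(W1, W2, W3)
end module ThreadWork

! in initialization code once only (in a routine that USEs ThreadWork)
!$OMP PARALLEL
  if(.not. allocated(W1)) allocate(W1(20,20,21,21))
  if(.not. allocated(W2)) allocate(W2(20,20,21,21))
  if(.not. allocated(W3)) allocate(W3(20,20,21,21))
!$OMP END PARALLEL
[/cpp]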

Jim Dempsey
arktos
Beginner
1,708 Views


Thanks Jim. I will try this and see how it goes.
--

Ron_Green
Moderator
1,708 Views
You didn't specify your compiler options.

Try adding this before the outer loop:

cDEC$ NOVECTOR

to disable vectorization on the loop nest. Does -vec-report indicate that any of the loops in question are vectorized?

I notice that the RHS and LHS in the assignment statements overlap, so some really large array temporaries may be getting generated in the vectorized case.

ron
jimdempseyatthecove
Honored Contributor III
1,708 Views
Quoting - arktos


Thanks Jim. I will try this and see how it goes.
--


One other point to make.

If you are using nested parallel regions you will have to initialize the W1, W2, ... allocations in those threads as well.
It might not hurt to make the allocation a subroutine and insert a call to it into the code that uses W1, ...:

subroutine foo(...
! near top
if(.not. associated(W1)) call InitW1234() ! some code like this
do i = 1,yourBigLoop
... ! code using W1, ...
end do
! leave W1,W2,... allocated for next time
end subroutine foo

The if(.not. associated(...)) check is a lightweight test (a single integer word being tested).
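
A minimal sketch of what InitW1234 might look like, assuming the TypeThreadContext / THREADPRIVATE(/CONTEXT/) setup from my earlier post (the module name ThreadContextMod is just a placeholder):

[cpp]subroutine InitW1234()
  ! sketch only: allocate this thread's private copies on first use
  use ThreadContextMod   ! hypothetical module holding TypeThreadContext
  type(TypeThreadContext) :: ThreadContext
  COMMON /CONTEXT/ ThreadContext
!$OMP THREADPRIVATE(/CONTEXT/)

  if(.not. associated(ThreadContext%W1)) allocate(ThreadContext%W1(20,20,21,21))
  if(.not. associated(ThreadContext%W2)) allocate(ThreadContext%W2(20,20,21,21))
  if(.not. associated(ThreadContext%W3)) allocate(ThreadContext%W3(20,20,21,21))
end subroutine InitW1234
[/cpp]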

Jim Dempsey


jimdempseyatthecove
Honored Contributor III
1,708 Views

Also,

Rudra (a frequent poster) sent me some code that exhibited a stack allocation problem with OpenMP on my system (Windows XP x64). He did not report this as a problem on Linux x??, but it may have been a latent problem waiting to happen.

This problem occurred in the startup code, i.e. the error would occur _prior_ to reaching the first statement in main. It couldn't be fixed by futzing with OMP_STACKSIZE; it was some sort of compiler problem. A rearrangement of the code got it working.

Is your problem occurring before you can reach the first statement in your program?
I may have some suggestions to work around that if that is your problem.

Jim Dempsey
TimP
Honored Contributor III
1,708 Views
Quoting - Ron_Green

I notice that the RHS and LHS in the assignment statements overlap, so some really large array temporaries may be getting generated in the vectorized case.

I don't see how there can be an overlap, unless one of N3, N2, M2, M3 were out of bounds. Would the compilation behave better if all of them were compile-time constants (e.g. PARAMETER)? You certainly don't want array temporaries in a case like this. If M2 is only 10, and the compiler vectorizes while optimizing for 20 but has to take it as a variable, vectorization probably slows things down even without temporaries, so NOVECTOR is a reasonable choice.
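
For instance, with the values quoted in the original post, making them compile-time constants might look like this (sketch only):

[plain]c - compile-time loop bounds (values from the NOTE in the original post)
      Integer N1, N2, N3, M2, M3
      Parameter ( N1 = 64, N2 = 21, N3 = 21, M2 = 10, M3 = 10 )
[/plain]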
arktos
Beginner
1,708 Views


Just to add some answers to your questions; I missed them last night (it was 1 or 2 am here).


@jim

The program has executed quite a few subroutines before reaching
the subroutine with the problem we are discussing. One of
the preceding subroutines has been multithreaded with no
problem (but it does not use W1, W2 & W3).


@ron

ifort -c -O3 -fpp -openmp -parallel


--------------------------------------------------------------------------





Thank you all.

I should have mentioned that all reals are *8.

I have tried the following with exactly the same parameters as before:

[plain]!$OMP  PARALLEL DEFAULT(SHARED)
!$OMP+ PRIVATE( Iw,JEL,KEL,JGLL,KGLL, W1, W2, W3 , T1 )

!$OMP DO
Do Iw = 1, N1
c
c - Interface condition of the PRESSURE GRADIENT.
c
Do KEL = 2, N3
Do JEL = 1, N2
Do JGLL = 1, M2
c
T1 = DFLOAT( JGLL)
W1(JGLL,1,JEL,KEL) = 3.0d0*T1
W2(JGLL,1,JEL,KEL) = 2.6565d0*T1*T1
W3(JGLL,1,JEL,KEL) = 3.876d0*T1
c
End Do
End Do
End Do
c
End Do
!$OMP END DO
!$OMP END PARALLEL
[/plain]

This segfaults.

I have also tried:

[cpp]c
Do KEL = 2, N3
Do JEL = 1, N2
cDEC$ NOVECTOR
Do JGLL = 1, M2
c
T1 = DFLOAT( JGLL)
W1(JGLL,1,JEL,KEL) = 3.0d0*T1
W2(JGLL,1,JEL,KEL) = 2.6565d0*T1*T1
W3(JGLL,1,JEL,KEL) = 3.876d0*T1
c
End Do
End Do
End Do
[/cpp]

and

[cpp]c
cDEC$ NOVECTOR
Do KEL = 2, N3
Do JEL = 1, N2
Do JGLL = 1, M2
c
T1 = DFLOAT( JGLL)
W1(JGLL,1,JEL,KEL) = 3.0d0*T1
W2(JGLL,1,JEL,KEL) = 2.6565d0*T1*T1
W3(JGLL,1,JEL,KEL) = 3.876d0*T1
c
End Do
End Do
End Do
[/cpp]

Both segfault.

I have yet to try Jim's suggestions.

I would like to increase the size of the arrays still further (so that I can increase my simulation
Reynolds number), so I am looking for something I have proper control over.

--
arktos
Beginner
1,708 Views

OK.

I have used the suggestion made by Jim and the segfault seems to be gone.

I have also increased the sizes of the arrays involved and it still seems OK.

I still need to compare the results against a standard case to see
whether the numbers at the end of each step are the same.

Thanks again Jim.
--
jimdempseyatthecove
Honored Contributor III
1,708 Views
Quoting - arktos
[original post quoted in full above]

Arktos,

Set aside the SEGFAULT issue for a moment and look at the code sample you presented.

Ask yourself: What do you expect the PRIVATE clause will be doing for you for W1, W2 and W3?

Answer: it creates separate arrays for use by the additional threads beyond the current thread. These additional arrays will be uninitialized.

Ask yourself: Why are you using uninitialized arrays to compute your pressure gradient?

And why is JGLL iterating over only half the cells?

To address the first concern: you code as if the W1, W2, W3 data are to be preserved per thread, from parallel region to parallel region. If so, then these data must reside in a thread-private area (or be obtained from arrays indexed off the thread number, being careful with nested OpenMP levels). I addressed the issue of thread-private data earlier.

But Wait.

Then you issue an !$OMP DO inside the parallel region. This means a portion of the DO Iw iteration space will be run by each thread, which implies that a thread-specific portion of the work on each uninitialized copy of W1, W2, W3 will be run independently. So you are producing parts of your result from undefined, uninitialized data.

I think you do not want W1, W2, W3 as private, but not seeing all your code, it is hard to give good advice.

Once you resolve the PRIVATE issue, then, if the SEGFAULT remains, I suggest you simplify the parallel region's interaction with the main code by making the triple loop into a subroutine, placed at the bottom of the file and called from where the loop was. Note that you can pass in the arguments M2, N2, N3, W1, W2, W3 (as sketched below).
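
A minimal sketch of that refactoring (AddIface is just a placeholder name; M3 is passed as well since it appears in the loop body):

[cpp]      Subroutine AddIface( M2, M3, N2, N3, W1, W2, W3 )
c - sketch only: the leading extents (20,20) must match the
c - caller's declarations of W1, W2, W3
      IMPLICIT REAL*8 (A-H,O-Z)
      Real*8 W1(20,20,N2,N3), W2(20,20,N2,N3), W3(20,20,N2,N3)
c
      Do KEL = 2, N3
      Do JEL = 1, N2
      Do JGLL = 1, M2
      W1(JGLL,1,JEL,KEL) = W1(JGLL,1,JEL,KEL) + W1(JGLL,M3,JEL,KEL-1)
      W2(JGLL,1,JEL,KEL) = W2(JGLL,1,JEL,KEL) + W2(JGLL,M3,JEL,KEL-1)
      W3(JGLL,1,JEL,KEL) = W3(JGLL,1,JEL,KEL) + W3(JGLL,M3,JEL,KEL-1)
      End Do
      End Do
      End Do
c
      Return
      End
[/cpp]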
arktos
Beginner
1,708 Views


Jim,

What you say is OK, but as I mentioned in the original post, the section of code
that I posted here was a minimal piece of code that reproduced the problem. It does
no useful work as shown here.

Originally I did not want to post the complete subroutine, so as not to place too
much strain on people who wanted to go through it.

Yes, W1, W2 and W3 are temporary storage arrays and are initialised and used as shown
in the complete subroutine below (THREADPRIVATE may be overkill here).

I located the problem area by selectively commenting out each loop in succession.

I should mention that, as of now, I have some numerical differences in the solution and I am
looking into this.

Here is the original parallel section of the subroutine:


[plain]      Subroutine Newu(Nx,Ny,Nz,NLy,NLZ,U,V,W,P)
c -----------------------------------------
c
c
IMPLICIT REAL*8 (A-H,O-Z)
include 'blkio.h'
include 'blkdata.h'
include 'blkwork.h'
include 'blktable1.h'
include 'blktable2.h'
include 'blktable3.h'
c
c
Real*8 U(NX,NY,NZ,NLY,NLZ), V(NX,NY,NZ,NLY,NLZ),
. W(NX,NY,NZ,NLY,NLZ), P(NY*NZ*NLY*NLZ,NX)

include 'omp_lib.h'
c
Write(ndout,'(a)')' * Newu : IN '
c
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP+ PRIVATE(I,Iw,IIw,IID,JEL,KEL,JGLL,KGLL,NBLOCK,LBLK,DYH,DZH,RYZ,
!$OMP+ Wv,DPDX,DPDY,DPDZ,PMN,PMM,W1,W2,W3 )

c KEL = omp_get_max_threads()
c write(6,*)' -- MAX THREADS =', KEL

c - Go through each streamwise wavenumber.

!$OMP DO
Do Iw = 1, N1
c
IID = IIDA( Iw )
c
IIw = Iw
If( Iw .eq. 1 ) IIw = 2
c
NBLOCK = 0
Do KEL = 1, N3
Do JEL = 1, N2
c
NBLOCK = NBLOCK + 1
LBLK = (NBLOCK-1)*NPEL
c
DYH = 0.5D0*DYELM(JEL,KEL)
DZH = 0.5D0*DZELM(JEL,KEL)
c
c
c - Calculate the PRESSURE GRADIENTS.
c
Do KGLL = 1, M3
Do JGLL = 1, M2
c
DPDX = 0.0D0
DPDY = 0.0D0
DPDZ = 0.0D0
c
Do I = 1, NPEL
c
PMN = P(LBLK+I,Iw)
PMM = P(LBLK+I,IIw+IID)
c
DPDX = DPDX + PMM*DPDXM(I,JGLL,KGLL)
DPDY = DPDY + PMN*DPDYM(I,JGLL,KGLL)
DPDZ = DPDZ + PMN*DPDZM(I,JGLL,KGLL)
c
END DO
c
c - Note : no negative sign for the x derivative since both the
c - multiplication by the complex unity and the use of the sign
c - flag IID produces the correct overall sign.
c
W1(JGLL,KGLL,JEL,KEL) = IID * DT * DYH * DZH * Wv * DPDX
W2(JGLL,KGLL,JEL,KEL) = DT * DZH * DPDY
W3(JGLL,KGLL,JEL,KEL) = DT * DYH * DPDZ
c
End Do
End Do
End Do
End Do
c
c - Interface condition of the PRESSURE GRADIENT.
c
Do KEL = 2, N3
Do JEL = 1, N2
Do JGLL = 1, M2
c
W1(JGLL,1,JEL,KEL) = W1(JGLL,1,JEL,KEL) + W1(JGLL,M3,JEL,KEL-1)
W2(JGLL,1,JEL,KEL) = W2(JGLL,1,JEL,KEL) + W2(JGLL,M3,JEL,KEL-1)
W3(JGLL,1,JEL,KEL) = W3(JGLL,1,JEL,KEL) + W3(JGLL,M3,JEL,KEL-1)
c
End Do
End Do
End Do

c
Do KEL = 1, N3-1
Do JEL = 1, N2
Do JGLL = 1, M2
c
c W1(JGLL,M3,JEL,KEL) = W1(JGLL,1,JEL,KEL+1)
c W2(JGLL,M3,JEL,KEL) = W2(JGLL,1,JEL,KEL+1)
c W3(JGLL,M3,JEL,KEL) = W3(JGLL,1,JEL,KEL+1)
c
End Do
End Do
End Do
c

c
c Do JEL = 2, N2
c Do KEL = 1, N3
c Do KGLL = 1, M3
c
c W1(1,KGLL,JEL,KEL) = W1(1,KGLL,JEL,KEL) + W1(M2,KGLL,JEL-1,KEL)
c W2(1,KGLL,JEL,KEL) = W2(1,KGLL,JEL,KEL) + W2(M2,KGLL,JEL-1,KEL)
c W3(1,KGLL,JEL,KEL) = W3(1,KGLL,JEL,KEL) + W3(M2,KGLL,JEL-1,KEL)
c
c End Do
c End Do
c End Do
c
c
Do JEL = 1, N2-1
Do KEL = 1, N3
Do KGLL = 1, M3
c W1(M2,KGLL,JEL,KEL) = W1(1,KGLL,JEL+1,KEL)
c W2(M2,KGLL,JEL,KEL) = W2(1,KGLL,JEL+1,KEL)
c W3(M2,KGLL,JEL,KEL) = W3(1,KGLL,JEL+1,KEL)
End Do
End Do
End Do
c
c
c - Find the NEW VELOCITIES.
c
Do KEL = 1, N3
Do JEL = 1, N2
c
Do KGLL = 1, M3
Do JGLL = 1, M2
c
c RYZ = RBI(JGLL,KGLL,JEL,KEL)
c DPDX = W1(JGLL,KGLL,JEL,KEL)
c DPDY = W2(JGLL,KGLL,JEL,KEL)
c DPDZ = W3(JGLL,KGLL,JEL,KEL)
c
c U(Iw,JGLL,KGLL,JEL,KEL) = RYZ*( U(Iw,JGLL,KGLL,JEL,KEL) + DPDX )
c V(Iw,JGLL,KGLL,JEL,KEL) = RYZ*( V(Iw,JGLL,KGLL,JEL,KEL) + DPDY )
c W(Iw,JGLL,KGLL,JEL,KEL) = RYZ*( W(Iw,JGLL,KGLL,JEL,KEL) + DPDZ )
c
End Do
End Do
End Do
End Do
c
c
End Do
!$OMP END DO
!$OMP END PARALLEL
c - End of Iw loop.
[/plain]

....

Return
End

jimdempseyatthecove
Honored Contributor III
1,708 Views

Arktos,

Thread-private (persistent) copies of W1, W2, W3 would be best when Newu is called many times; this would reduce the possibility of memory fragmentation at runtime.

Try this:

[cpp]Subroutine Newu(Nx,Ny,Nz,NLy,NLZ,U,V,W,P)
...
! *** remove the COMMON declaration of W1, W2, W3
! *** add to the subroutine scope, but do not allocate here

real*8, allocatable :: W1(:,:,:,:), W2(:,:,:,:), W3(:,:,:,:)
...

!$OMP  PARALLEL DEFAULT(SHARED)
!$OMP+ PRIVATE(I,Iw,IIw,IID,JEL,KEL,JGLL,KGLL,NBLOCK,LBLK,DYH,DZH,RYZ,
!$OMP+         Wv,DPDX,DPDY,DPDZ,PMN,PMM,W1,W2,W3 )

! *** add allocation here
	allocate(W1(M2,M3,N2,N3), STAT = I)
	if(I .ne. 0) call YourFatalMemoryAllocationRoutine()

	allocate(W2(M2,M3,N2,N3), STAT = I)
	if(I .ne. 0) call YourFatalMemoryAllocationRoutine()

	allocate(W3(M2,M3,N2,N3), STAT = I)
	if(I .ne. 0) call YourFatalMemoryAllocationRoutine()

c     KEL = omp_get_max_threads()
c     write(6,*)' -- MAX THREADS =', KEL
c - Go through each streamwise wavenumber.
!$OMP DO
      Do Iw = 1, N1
	...
!$OMP END DO

deallocate(W3)
deallocate(W2)
deallocate(W1)
!$OMP END PARALLEL
[/cpp]

Jim Dempsey
arktos
Beginner
1,708 Views

Jim,

Your latest suggestion removes the segfault too. Also, the numerical values seem
to be in agreement with a standard case I have for checking the results.

I will run a few more tests and then I will mark the thread as resolved.

This will be the second of the four computationally intensive subroutines that
will be using multithreading in this code. The remaining two are more complicated
and it is quite likely that something interesting will show up. And then I have 4 more
simulation codes to multithread...

Many thanks to everyone for their contribution.
--




