Segfault in omp program. Need support from Intel!! - Page 2

may_ka · 03-19-2017

Hi,

One of my programs crashes when running a threaded version. Running it inside gdb produced output that left me helpless:

[New LWP 397493]

Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 397493]
0x0000000001dad557 in _INTERNAL_25_______src_kmp_barrier_cpp_5de9139b::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) ()

gdb bt yielded:

#0 0x0000000001dad557 in _INTERNAL_25_______src_kmp_barrier_cpp_5de9139b::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) ()
#1 0x0000000001dae38b in __kmp_fork_barrier(int, int) ()
#2 0x0000000001d150c0 in __kmp_launch_thread ()
#3 0x0000000001d5d341 in _INTERNAL_26_______src_z_Linux_util_cpp_47afea4b::__kmp_launch_worker(void*) ()
#4 0x0000000001eb3ff7 in start_thread ()
#5 0x0000000001f2507b in clone ()

To give an idea of the program's structure, a code snippet that mimics what the program does is given below. However, this is just for illustration; I have not tested whether the snippet will produce the same segfault.

Module Mod_Root
  Implicit none
  Type :: root
  End type root
End Module Mod_Root
Module Mod_Sigma
  use Mod_Root, only: root
  Implicit None
  Type, abstract, extends(root) :: Sigma
    Real, Pointer, contiguous :: PreMult(:,:), PostMult(:,:)
  contains
    Procedure(SubMult), PAss, Public, Deferred :: Mult
  end type Sigma
  Abstract Interface
    Subroutine SubMult(this)
      Import Sigma
      Class(Sigma), Intent(In) :: this
    End Subroutine SubMult
  End Interface
  Private :: SubMult
End Module Mod_Sigma
Module Mod_Sigma_Type_A
  use Mod_Sigma, only: Sigma
  Type, extends(Sigma) :: Sigma_Type_A
    Real, Allocatable :: Mat(:,:,:)
  contains
    Procedure, Pass, Public :: Mult=>SubMult
  End type Sigma_Type_A
  Private :: SubMult
contains
  Subroutine SubMult(this)
    Implicit None
    Class(Sigma_Type_A), Intent(In) :: this
    Integer :: i
    Do i=1,size(this%Mat,3)
      this%PostMult(i,:)=matmul(this%PreMult(i,:),this%Mat(:,:,i))
    End Do
  End Subroutine SubMult
End Module Mod_Sigma_Type_A
Module Mod_Sigma_Type_B
  use Mod_Sigma, only: Sigma
  Type, extends(Sigma) :: Sigma_Type_B
    Real, Allocatable :: Mat(:,:)
  contains
    Procedure, Pass, Public :: Mult=>SubMult
  End type Sigma_Type_B
  Private :: SubMult
contains
  Subroutine SubMult(this)
    Implicit None
    Class(Sigma_Type_B), Intent(In) :: this
    this%PostMult=matmul(this%PreMult,this%Mat)
  End Subroutine SubMult
End Module Mod_Sigma_Type_B
Module Mod_Struct
  use Mod_Root, only: root
  use Mod_Sigma, only: sigma
  Type,extends(root), abstract :: Struct
    Class(Sigma), Allocatable :: Sigma
  Contains
    Procedure(SubMult), Public, PAss, Deferred :: Mult
  End type Struct
  Type :: StructPt
    CLass(Struct), Pointer :: pt
  end type StructPt
  Abstract interface
    Subroutine SubMult(this)
      Import Struct
      Class(Struct), Intent(InOut), Target :: this
    end Subroutine SubMult
  End interface
End Module Mod_Struct
Module Mod_Struct_A
  use Mod_Struct
  Type, extends(Struct) :: Struct_Type_A
    Real, Allocatable :: Mat1(:,:), Mat2(:,:)
  Contains
    Procedure, Pass, Public :: Mult => SubMultSigma
  End type Struct_Type_A
  Private :: SubMultSigma
contains
  Subroutine SubMultSigma(this)
    Implicit None
    Class(Struct_Type_A), Intent(InOut), Target :: this
    this%Sigma%PreMult=>this%Mat1
    this%Sigma%PostMult=>this%Mat2
    call this%Sigma%Mult()
  End Subroutine SubMultSigma
End Module Mod_Struct_A
Program Test
  use Mod_Struct
  use Mod_Struct_A
  use Mod_Sigma_Type_A
  use Mod_Sigma_Type_B
  Type(Struct_Type_A), Target :: a, b
  Class(StructPt), Allocatable :: x(:)
  Integer :: i
  allocate(Sigma_Type_A::a%sigma)
  allocate(Sigma_Type_B::b%sigma)
  Allocate(x(2))
  x(1)%pt=>a;x(2)%pt=>b
  !$OMP PARALLEL DO PRIVATE(i)
  Do i=1,2
    call x(i)%pt%Mult()
  End Do
  !$OMP END PARALLEL DO
End Program Test

The segfault in my program occurs in a location similar to the call to x(i)%pt%Mult, but only if b%sigma has been allocated as type "Sigma_Type_B". If both a and b have been allocated as type "Sigma_Type_A", the program runs fine regardless of the size of the relevant arrays. Moreover, threaded or unthreaded, the program always runs when the involved arrays are small. However, when the arrays occupy up to 200GB of RAM and different type allocations are used, it crashes.

The ifort version is 17.01, the Linux version is CentOS 7 with kernel 3.10, the stack size is set to unlimited, and OMP_STACKSIZE is set to 32MB.

compiler flags were

-assume byterecl -warn nounused -warn declarations -O0 -static -check all -traceback -warn interface -check noarg_temp_created -mkl=parallel -qopenmp

No errors or warnings occurred at compile time or at run time. The program ran on a machine with 56 "Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz" processors and 512GB RAM.

Given the compiler flags I used, and having run the program inside gdb, I am running out of ideas at this point. It would be great if someone from Intel could look into this. I could supply an executable and a data set that triggers the segfault.

Thanks a lot.

Martyn_C_Intel · 03-22-2017

I'm not sure you can reconcile "contiguous" with

this%Sigma%MPreMult=>this%Mat1(1:this%dim1,:)
this%Sigma%MPostMult=>this%Mat2(1:this%dim1,:)

if the lower bound of Mat1 and Mat2 is zero. There's a gap between each column.
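To illustrate the gap, here is a minimal sketch with hypothetical sizes. The names mat1, dim1 and ncol stand in for the real declarations, which are not shown here. It queries the Fortran 2008 intrinsic is_contiguous for a whole array and for a section that skips row 0. The pointers are deliberately declared without "contiguous" so that the pointer assignments themselves are unproblematic.

```fortran
program contiguity_sketch
  implicit none
  ! Hypothetical sizes; the real Mat1 and dim1 are not shown in this thread.
  integer, parameter :: dim1 = 4, ncol = 3
  real, allocatable, target :: mat1(:,:)
  real, pointer :: whole(:,:), part(:,:)

  allocate(mat1(0:dim1, ncol))   ! first dimension has lower bound 0
  mat1 = 0.0

  whole => mat1                  ! every element, in storage order
  part  => mat1(1:dim1, :)       ! leaves out row 0 of each column

  print *, 'whole array contiguous:    ', is_contiguous(whole)
  print *, 'section 1:dim1 contiguous: ', is_contiguous(part)
end program contiguity_sketch
```

One would expect the whole-array pointer to be reported as contiguous and the section not, because the element in row 0 of every column sits between consecutive columns of the section.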

Incidentally, you are not calling MKL, so the -mkl setting is irrelevant. You are calling the Fortran intrinsic matmul(). This won't be threaded unless you compile with -O3 -parallel or -qopt-matmul.

But most importantly, you need a much larger value of the thread stack size. I was able to build and run both variants successfully with 8 threads and OMP_STACKSIZE=5000M. I didn't try to determine optimum values.

may_ka · 03-22-2017

Hi Martyn,

Thanks for the comment, but I am struggling to understand why one version runs while the other fails. I assume that both have the same RAM, stack, and OMP stack demands.

Thanks.

jimdempseyatthecove · 03-23-2017

FWIW, the OMP_STACKSIZE setting affects the stack sizes of the additional OpenMP threads, but not the stack size of the main thread.

Jim Dempsey

Martyn_C_Intel · 03-23-2017

Yes, but the main thread stack size was already set to unlimited. I found that the test program ran with OMP_STACKSIZE=2000M but failed with 1800M.

This style of modern Fortran code can sometimes result in the compiler making a lot of temporary array copies. For example, the "contiguous" keyword doesn't simply assert that an assumed-shape array or pointer is contiguous; it requires the compiler to ensure that this is so. If the compiler isn't sure that the assumed-shape array or pointer will be contiguous in all circumstances, it will generate a temporary copy that is contiguous, normally on the stack. If this happens within an OpenMP thread, the copy goes on the thread stack. In addition, any automatic objects that need to be private to a thread go onto each thread's stack. This may be why removing the "contiguous" keyword makes the program work.

I think it plausible, though I don't know for sure, that the call involving "SPECIAL" is causing a temporary copy and so using more stack space. I note that the array size corresponding to your first printout is just under 2GB (57 * 4343921 * 8 bytes). It's possible that the compiler worries about the contiguity of the data in the special type case, but it's too hard to figure out from reading the code whether that is so, or whether the compiler could do better. The warning about contiguity reported by Kevin, which comes from the source lines I called out in my previous post, is also an indication that something like this is going on.
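The size estimate can be checked against the OMP_STACKSIZE values tried above with a short calculation. This is only a sketch: the names n1, n2 and elem_bytes are illustrative labels for the three factors quoted in the post, and it assumes that the M suffix of OMP_STACKSIZE denotes 1024*1024 bytes.

```fortran
program stack_estimate
  use iso_fortran_env, only: int64, real64
  implicit none
  ! Factors quoted in the post: two array extents and 8 bytes per element.
  integer(int64), parameter :: n1 = 57_int64, n2 = 4343921_int64
  integer(int64), parameter :: elem_bytes = 8_int64
  ! Assumption: the M suffix of OMP_STACKSIZE means 1024*1024 bytes.
  integer(int64), parameter :: mib = 1024_int64 * 1024_int64
  integer(int64) :: nbytes

  nbytes = n1 * n2 * elem_bytes
  print '(a,i0)',   'bytes in one copy: ', nbytes
  print '(a,f0.1)', 'MiB in one copy:   ', real(nbytes, real64) / real(mib, real64)
  print '(a,l1)',   'fits in 1800M:     ', nbytes <= 1800_int64 * mib
  print '(a,l1)',   'fits in 2000M:     ', nbytes <= 2000_int64 * mib
end program stack_estimate
```

Under that assumption a single copy comes to 1,980,827,976 bytes, roughly 1889 MiB, which would fit in a 2000M thread stack but not in an 1800M one. That is consistent with the thresholds observed, although the thread stack holds other data as well, so it is suggestive rather than conclusive.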

Incidentally, there is an alternative to increasing the thread stack size. If you compile with -heap-arrays, which causes temporary arrays to be allocated on the heap instead of on the stack, you don't need to increase OMP_STACKSIZE. The downside, for an OpenMP program, is that the synchronization required to keep all those allocations threadsafe can sometimes impact performance, especially if the number of threads is large.
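For work arrays that the programmer controls, a similar effect can be had by hand, without the flag, by declaring them allocatable so that they live on the heap. The sketch below is a hypothetical illustration with made-up names and sizes (work, tmp, n, ncase), not code from the program in question. It does nothing for temporaries the compiler creates on its own, which is what -heap-arrays addresses.

```fortran
program heap_work_array
  implicit none
  integer, parameter :: n = 1000, ncase = 4   ! hypothetical sizes
  real :: total(ncase)
  integer :: i

  !$omp parallel do private(i)
  do i = 1, ncase
    total(i) = work(n, i)      ! each iteration writes its own element
  end do
  !$omp end parallel do
  print *, total
contains
  function work(m, k) result(s)
    integer, intent(in) :: m, k
    real :: s
    ! An automatic array "real :: tmp(m, m)" would normally be placed on the
    ! calling thread's stack; an allocatable array is placed on the heap.
    real, allocatable :: tmp(:,:)
    allocate(tmp(m, m))
    tmp = real(k)
    s = sum(tmp) / real(m) / real(m)
    ! tmp is deallocated automatically when the function returns
  end function work
end program heap_work_array
```

Each call allocates and frees its own tmp, so the same caveat applies as with the flag: frequent allocations from many threads can cost performance.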

TimP · 03-24-2017

Even at OMP_STACKSIZE=20M, it may be difficult to run a large number of threads (such as 180).

Steve_Lionel · 03-24-2017

Also keep in mind that setting the program stack size to "unlimited" really means to set it to whatever maximum is configured in the kernel. It isn't really unlimited.