Segfault in omp program. Need support from Intel!!

may_ka · ‎03-19-2017

Hi,

one of my programs is crashing when runnig a threaded version. When running it inside gdb the output left me helpless:

[New LWP 397493]

Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 397493]
0x0000000001dad557 in _INTERNAL_25_______src_kmp_barrier_cpp_5de9139b::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) ()

gdb bt yielded:

#0 0x0000000001dad557 in _INTERNAL_25_______src_kmp_barrier_cpp_5de9139b::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) ()
#1 0x0000000001dae38b in __kmp_fork_barrier(int, int) ()
#2 0x0000000001d150c0 in __kmp_launch_thread ()
#3 0x0000000001d5d341 in _INTERNAL_26_______src_z_Linux_util_cpp_47afea4b::__kmp_launch_worker(void*) ()
#4 0x0000000001eb3ff7 in start_thread ()
#5 0x0000000001f2507b in clone ()

To get an idea about parts of the structure of the program a code snippet which mimics what the program is doing is given below. However, this is just for examplification, I have not tested whether the snippet will produce the same segfaut.

Module Mod_Root
  Implicit none
  Type :: root
  End type root
End Module Mod_Root
Module Mod_Sigma
  use Mod_Root, only: root
  Implicit None
  Type, abstract, extends(root) :: Sigma
    Real, Pointer, contiguous :: PreMult(:,:), PostMult(:,:)
  contains
    Procedure(SubMult), PAss, Public, Deferred :: Mult
  end type Sigma
  Abstract Interface
    Subroutine SubMult(this)
      Import Sigma
      Class(Sigma), Intent(In) :: this
    End Subroutine SubMult
  End Interface
  Private :: SubMult
End Module Mod_Sigma
Module Mod_Sigma_Type_A
  use Mod_Sigma, only: Sigma
  Type, extends(Sigma) :: Sigma_Type_A
    Real, Allocatable :: Mat(:,:,:)
  contains
    Procedure, Pass, Public :: Mult=>SubMult
  End type Sigma_Type_A
  Private :: SubMult
contains
  Subroutine SubMult(this)
    Implicit None
    Class(Sigma_Type_A), Intent(In) :: this
    Integer :: i
    Do i=1,size(this%Mat,3)
      this%PostMult(i,:)=matmul(this%PreMult(i,:),this%Mat(:,:,i))
    End Do
  End Subroutine SubMult
End Module Mod_Sigma_Type_A
Module Mod_Sigma_Type_B
  use Mod_Sigma, only: Sigma
  Type, extends(Sigma) :: Sigma_Type_B
    Real, Allocatable :: Mat(:,:)
  contains
    Procedure, Pass, Public :: Mult=>SubMult
  End type Sigma_Type_B
  Private :: SubMult
contains
  Subroutine SubMult(this)
    Implicit None
    Class(Sigma_Type_B), Intent(In) :: this
    this%PostMult=matmul(this%PreMult,this%Mat)
  End Subroutine SubMult
End Module Mod_Sigma_Type_B
Module Mod_Struct
  use Mod_Root, only: root
  use Mod_Sigma, only: sigma
  Type,extends(root), abstract :: Struct
    Class(Sigma), Allocatable :: Sigma
  Contains
    Procedure(SubMult), Public, PAss, Deferred :: Mult
  End type Struct
  Type :: StructPt
    CLass(Struct), Pointer :: pt
  end type StructPt
  Abstract interface
    Subroutine SubMult(this)
      Import Struct
      Class(Struct), Intent(InOut), Target :: this
    end Subroutine SubMult
  End interface
End Module Mod_Struct
Module Mod_Struct_A
  use Mod_Struct
  Type, extends(Struct) :: Struct_Type_A
    Real, Allocatable :: Mat1(:,:), Mat2(:,:)
  Contains
    Procedure, Pass, Public :: Mult => SubMultSigma
  End type Struct_Type_A
  Private :: SubMultSigma
contains
  Subroutine SubMultSigma(this)
    Implicit None
    Class(Struct_Type_A), Intent(InOut), Target :: this
    this%Sigma%PreMult=>this%Mat1
    this%Sigma%PostMult=>this%Mat2
    call this%Sigma%Mult()
  End Subroutine SubMultSigma
End Module Mod_Struct_A
Program Test
  use Mod_Struct
  use Mod_Struct_A
  use Mod_Sigma_Type_A
  use Mod_Sigma_Type_B
  Type(Struct_Type_A), Target :: a, b
  Class(StructPt), Allocatable :: x(:)
  Integer :: i
  allocate(Sigma_Type_A::a%sigma)
  allocate(Sigma_Type_B::b%sigma)
  Allocate(x(2))
  x(1)%pt=>a;x(2)%pt=>b
  !$OMP PARALLEL DO PRIVATE(i)
  Do i=1,2
    call x(i)%pt%Mult()
  End Do
  !$OMP END PARALLEL DO
End Program Test

The segfault in my progrram occurs in a location similar to when calling x(i)%pt%Mult, but only if b%sigma has been allocated as type "Sigma_Type_B". If both, a and b, has been allocated as type "Sigma_Type_A", the program runs fine invaribaly of the size of the relevant arrays. Moreover, threaded or unthreaded the pogram always runs when the involved arrays are small. However, when arrays occupy up to 200GB of RAM and different type allocations are used, it crashes.

ifort version is 17.01, linux version is centos 7 kerner 3.10, stack size is set to unlimited, omp_stacksize to 32MB.

compiler flags were

-assume byterecl -warn nounused -warn declarations -O0 -static -check all -traceback -warn interface -check noarg_temp_created -mkl=parallel -qopenmp

Neither at compile time nor at run time any errors or warnings occured. The pogram ran on a machine with 56 "Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz" processors and 512GB RAM.

Given the compiler flags I used and running the program inside gdb I am running out of ideas at this point. It would be great if one form Intel could look into this. I could suppliy an executable and a data set which triggers the segfault.

Thanks a lot.

jimdempseyatthecove · ‎03-20-2017

The following will definitely create an array temporary:

this%PostMult(i,:)=matmul(this%PreMult(i,:),this%Mat(:,:,i))

I am not sure about this%PostMult=matmul(this%PreMult,this%Mat) creating an array temporary (probably does internally) and not sure of size.

You may be running out of RAM.

Is the failure occurring in a test program where you loop starting with small arrays, increasing the sizes each iteration until it crashes?

If so, try a test that starts with an initial allocation just larger than the failing size. If this run the first iteration but the next one/few fail, then it is likely an allocation issue.

Edit: The issue may not be the quantity of RAM, but the fragmentation of the RAM due to sequence of allocations and deallocations (plus heap manager). The allocation failure may be occurring inside matmul (or in the statement) when it tries to obtain an array temporary (and you have no means to add a STATUS=xxx to check allocation failure).

Jim Dempsey

jimdempseyatthecove · ‎03-20-2017

Note, if your problem is found to be caused by memory fragmentation, then rearranging your explicit allocation might improve matters. However, if you get stuck, you might try locating a Linux "Low Fragmentation" Heap manager. I found http://jemalloc.net/ but I haven't used this and cannot attest to if this works with Intel Visual Fortran. If fragmentation is your problem, and if you find an alternate heap manager that resolves (postpones further) the issue, then please report back your findings.

Jim Dempsey

may_ka · ‎03-20-2017

Hi, thanks for the response. As I wrote above, the code was just for examplification about the nested structure, I have not tested whether it will trigger the seg fault. Moreover, when I run the real program using only a single sigma type, everything goes alright although it may need 5 x more ram. In addition, I have a working omp version which differ in its structure only that sigma is not an extra class. The structure of sigma is incorrporated into struct having a 2d or 3d matrix alternatively allocated at start. Also, I made the matmul call in the example for simplicity. In "reality" that is an mkl-blas call. However, both approaches, matmul and mkl fail, but only for a particular class (say Sigma_Type_B), but not if class sigma in a and b is allocated to "Sigma_Type_B". However, note that the example code is for visualisation of the "class-soaked" nested structure. It may not result in a seg fault when run.

Thanks.

may_ka · ‎03-20-2017

Why should there be an allocation with the matmul call??? Although not given in the example code, all arrays are allocated to proper and fitting size at the program start. Implicit allocations are avoided. Moreover, I don't think that mkl-blas calls allow for automatic re-allocation.

Thanks

jimdempseyatthecove · ‎03-20-2017

Why should there be an allocation with the matmul call???

this%PostMult(i,:)=matmul(this%PreMult(i,:),this%Mat(:,:,i))

The ":" on the right most index of this%PostMult(i,:) and this%PreMult(i,:) specify non-contiguous array sections. The matmul (or user called subroutine/function) typically require contiguous array sections. Therefor the compiler will auto-collect the input(s) and auto-scatter the output(s) via use of an array temporary. The gather/scatter may be omitted depending on the interface or lack thereof, with regard to INTENT(IN) INTENT(OUT) and/or INTENT(INOUT), or lack of INTENT specification.

If you can re-arrange your array indices to permit

this%PostMult(:,i)=matmul(this%PreMult(:,i),this%Mat(:,:,i)) ! change index order on PostMult, PreMult

Then this may eliminate the array temporary (as well as gather/scatter).

Jim Dempsey

may_ka · ‎03-21-2017

Hi, beside the correct point raised by Jim, but which may not cause the problem, here is a working example.

Module Data_Kind
  Implicit None
  Integer, Parameter :: IkXL=Selected_Int_Kind(12)
  Integer, Parameter :: IkL=Selected_Int_Kind(8)
  Integer, Parameter :: IkM=Selected_Int_Kind(4)
  Integer, Parameter :: IkS=Selected_Int_Kind(2)
  Integer(Ikl), Parameter :: RkDbl=Selected_Real_Kind(15,100)
  Integer(Ikl), Parameter :: RkSgl=Selected_Real_Kind(6,37)
  Real(rkdbl), Parameter :: RSZero=10e-12
End Module Data_Kind
Module Mod_Root
  use Data_Kind
  Implicit none
  Type :: root
  End type root
End Module Mod_Root
Module Mod_Sigma
  use Data_Kind
  use Mod_Root, only: root
  Implicit None
  Type, abstract, extends(root) :: Sigma
    Real(rkdbl), Pointer, contiguous :: MPreMult(:,:), MPostMult(:,:)
    Real(rkdbl), Pointer, contiguous :: VPreMult(:), VPostMult(:)
  contains
    Procedure(SubMult), PAss, Public, Deferred :: Mult
    Procedure(SubInit), PAss, Public, Deferred :: Init
  end type Sigma
  Abstract Interface
    Subroutine SubMult(this)
      Import Sigma
      Class(Sigma), Intent(In) :: this
    End Subroutine SubMult
    Subroutine SubInit(this,dim1,dim2)
      use data_kind
      Import Sigma
      Class(Sigma), Intent(InOut) :: this
      Integer(Ikxl), Intent(in) :: dim1
      Integer(Ikxl), Intent(in), optional :: dim2
    End Subroutine SubInit
  End Interface
  Private :: SubMult, SubInit
End Module Mod_Sigma
Module Mod_Sigma_Type_Special
  use Data_Kind
  use Mod_Sigma, only: Sigma
  Type, extends(Sigma) :: Sigma_Type_Special
    Real(rkdbl), Allocatable :: Mat(:,:), dd(:)
  contains
    Procedure, Pass, Public :: Init=> SubInit
    Procedure, Pass, Public :: Mult=>SubMult
  End type Sigma_Type_Special
  Private :: SubMult, SubInit
contains
  Subroutine SubInit(this,dim1,dim2)
    Class(Sigma_Type_Special), Intent(InOut) :: this
    Integer(Ikxl), Intent(in) :: dim1
    Integer(Ikxl), Intent(in), optional :: dim2
    outer:block
      if(.not.present(dim2)) tHen
        write(*,*) "error";exit outer
      End if
      if(dim1<=0.or.dim2<=0) Then
        write(*,*) "error";exit outer
      End if
      write(*,"(*(g0:"",""))") "Special dim1",dim1,"dim2",dim2
      Allocate(this%Mat(dim1,dim1),this%dd(dim2))
    end block outer
  end Subroutine SubInit
  Subroutine SubMult(this)
    !$ use omp_lib
    Implicit None
    Class(Sigma_Type_Special), Intent(In) :: this
    Integer(Ikxl) :: i
    this%MPostMult=matmul(this%MPreMult,this%Mat)
    !$ call omp_set_num_threads(size(this%MPostMult,2))
    !$OMP PARALLEL DO PRIVATE(i)
    Do i=1,size(this%MPostMult,2)
      this%MPostMult(:,i)=this%MPostMult(:,i)*this%dd
    End Do
    !$OMP END PARALLEL DO
  End Subroutine SubMult
End Module Mod_Sigma_Type_Special
Module Mod_Sigma_S
  use Data_Kind
  use Mod_Sigma, only: Sigma
  Type, abstract, extends(Sigma) :: Sigma_S
  End type Sigma_S
End Module Mod_Sigma_S
Module Mod_Sigma_Type_3D
  use Data_Kind
  use Mod_Sigma_S, only: Sigma_S
  Type, extends(Sigma_S) :: Sigma_Type_3D
    Real(rkdbl), Allocatable :: Mat(:,:,:)
  contains
    Procedure, Pass, Public :: Init=> SubInit
    Procedure, Pass, Public :: Mult=>SubMult
  End type Sigma_Type_3D
  Private :: SubMult, SubInit
contains
  Subroutine SubInit(this,dim1,dim2)
    Class(Sigma_Type_3D), Intent(InOut) :: this
    Integer(Ikxl), Intent(in) :: dim1
    Integer(Ikxl), Intent(in), optional :: dim2
    outer:block
      if(.not.present(dim2)) tHen
        write(*,*) "error";exit outer
      End if
      if(dim1<=0.or.dim2<=0) Then
        write(*,*) "error";exit outer
      End if
      write(*,"(*(g0:"",""))") "3D dim1",dim1,"dim2",dim2
      Allocate(this%Mat(dim1,dim1,dim2))
    end block outer
  end Subroutine SubInit
  Subroutine SubMult(this)
    Implicit None
    Class(Sigma_Type_3D), Intent(In) :: this
    Integer(Ikxl) :: i
    !$ call omp_set_num_threads(size(this%MPostMult,2))
    !$OMP PARALLEL DO PRIVATE(i)
    Do i=1,size(this%Mat,3)
      this%MPostMult(i,:)=matmul(this%MPreMult(i,:),this%Mat(:,:,i))
    End Do
    !$OMP END PARALLEL DO
  End Subroutine SubMult
End Module Mod_Sigma_Type_3D
Module Mod_Sigma_Type_1D
  use Data_Kind
  use Mod_Sigma_S, only: Sigma_S
  Type, extends(Sigma_S) :: Sigma_Type_1D
    Real(rkdbl), Allocatable :: vec(:)
  contains
    Procedure, Pass, Public :: Init=> SubInit
    Procedure, Pass, Public :: Mult=>SubMult
  End type Sigma_Type_1D
  Private :: SubMult, SubInit
contains
  Subroutine SubInit(this,dim1,dim2)
    Class(Sigma_Type_1D), Intent(InOut) :: this
    Integer(Ikxl), Intent(in) :: dim1
    Integer(Ikxl), Intent(in), optional :: dim2
    outer:block
      if(dim1==0) Then
        write(*,*) "error";exit outer
      End if
      Allocate(this%vec(dim1))
    end block outer
  end Subroutine SubInit
  Subroutine SubMult(this)
    !$ use omp_lib
    Implicit None
    Class(Sigma_Type_1D), Intent(In) :: this
    Integer(Ikxl) :: i
    if(associated(this%MPostMult)) tHen
      !$ call omp_set_num_threads(size(this%MPostMult,2))
      !$OMP PARALLEL DO PRIVATE(i)
      Do i=1,size(this%MPreMult,2)
        this%MPostMult(:,i)=this%MPreMult(:,i)*this%vec
      End Do
      !$OMP END PARALLEL DO
    Elseif(associated(this%VPostMult)) Then
      !$ call omp_set_num_threads(40)
      !$OMP PARALLEL Do
      Do i=1,size(this%VPostMult)
         this%VPostMult(i)=this%VPreMult(i)*this%vec(i)
      End Do
      !$OMP END PARALLEL DO
    End if
  End Subroutine SubMult
End Module Mod_Sigma_Type_1D
Module Mod_Struct
  use Data_Kind
  use Mod_Root, only: root
  use Mod_Sigma, only: sigma
  Type,extends(root), abstract :: Struct
    Class(Sigma), Allocatable :: Sigma
  Contains
    Procedure(SubInit), Public, Pass, Deferred :: Init
    Procedure(SubMult), Public, PAss, Deferred :: Mult
  End type Struct
  Type :: StructPt
    CLass(Struct), Pointer :: pt
  end type StructPt
  Abstract interface
    Subroutine SubMult(this)
      Import Struct
      Class(Struct), Intent(InOut), Target :: this
    end Subroutine SubMult
    Subroutine SubInit(this,dim1,dim2,what)
      use data_kind
      Import Struct
      Class(Struct), Intent(InOut) :: this
      Integer(Ikxl), Intent(In) :: dim1
      Integer(Ikxl), Intent(In), optional :: dim2
      Character(len=*), Intent(In), optional :: what
    end Subroutine SubInit
  End interface
End Module Mod_Struct
Module Mod_Struct_A
  use Data_Kind
  use Mod_Struct, only: Struct
  use Mod_Sigma_Type_3D, only: Sigma_Type_3D
  use Mod_Sigma_Type_Special, only: Sigma_Type_Special
  Type, extends(Struct) :: Struct_Type_A
    Real(rkdbl), Allocatable :: Mat1(:,:), Mat2(:,:)
    Integer(ikxl) :: dim1, dim2
  Contains
    Procedure, Pass, Public :: Mult => SubMultSigma
    Procedure, Pass, Public :: Init => SubInit
  End type Struct_Type_A
  Private :: SubMultSigma, SubInit
contains
  Subroutine SubInit(this,dim1,dim2,What)
    Class(Struct_Type_A), Intent(InOut) :: this
    Integer(Ikxl), Intent(In) :: dim1
    Integer(Ikxl), Intent(In), optional :: dim2
    Character(len=*), Intent(In), optional :: What
    outer:block
      if(.not.present(dim2)) Then
        write(*,*) "error"; exit outer
      end if
      this%dim1=dim1;this%dim2=dim2
      Allocate(this%Mat1(0:dim1,dim2),this%Mat2(0:dim1,dim2))
      if(present(what)) Then
        Select Case(trim(adjustL(what)))
        Case("3D")
          Allocate(Sigma_Type_3D::this%sigma)
          call this%sigma%init(dim1=dim2,dim2=dim1)
        Case("SPECIAL")
          Allocate(Sigma_Type_Special::this%sigma)
          call this%sigma%init(dim1=dim2,dim2=dim1)
        End Select
      Else
        Allocate(Sigma_Type_3D::this%sigma)
        call this%sigma%init(dim1=dim2,dim2=dim1)
      End if
    End block outer
  End Subroutine SubInit
  Subroutine SubMultSigma(this)
    Implicit None
    Class(Struct_Type_A), Intent(InOut), Target :: this
    this%Sigma%MPreMult=>this%Mat1(1:this%dim1,:)
    this%Sigma%MPostMult=>this%Mat2(1:this%dim1,:)
    call this%Sigma%Mult()
  End Subroutine SubMultSigma
End Module Mod_Struct_A
Module Mod_Struct_B
  use Data_Kind
  use Mod_Sigma_Type_1D, only: Sigma_Type_1D
  use Mod_Struct, only: Struct
  Type, extends(Struct) :: Struct_Type_B
    Real(rkdbl), Allocatable :: Mat1(:), Mat2(:)
    Integer(ikxl) :: dim1
  Contains
    Procedure, Pass, Public :: Mult => SubMultSigma
    Procedure, Pass, Public :: Init => SubInit
  End type Struct_Type_B
  Private :: SubMultSigma, SubInit
contains
  Subroutine SubInit(this,dim1,dim2,What)
    Class(Struct_Type_B), Intent(InOut) :: this
    Integer(Ikxl), Intent(In) :: dim1
    Integer(Ikxl), Intent(In), optional :: dim2
    Character(len=*), Intent(In), optional :: what
    outer:block
      this%dim1=dim1
      Allocate(this%Mat1(0:dim1),this%Mat2(0:dim1))
      Allocate(Sigma_Type_1D::this%sigma)
      call this%sigma%init(dim1=dim1)
    End block outer
  End Subroutine SubInit
  Subroutine SubMultSigma(this)
    Implicit None
    Class(Struct_Type_B), Intent(InOut), Target :: this
    this%Sigma%VPreMult=>this%Mat1(1:this%dim1)
    this%Sigma%VPostMult=>this%Mat2(1:this%dim1)
    call this%Sigma%Mult()
  End Subroutine SubMultSigma
End Module Mod_Struct_B
Program Test
  !$ use omp_lib
  use Data_Kind
  use Mod_Struct
  use Mod_Struct_A
  use Mod_Struct_B
  Type(Struct_Type_B), Target :: fi
  Type(Struct_Type_A), Target :: ge, geG, sxh, pe
  Class(StructPt), Allocatable :: x(:)
  Integer :: i
  Integer(Ikxl), Parameter :: nFi=162469876, nLGG=794, nLSxH=533346, nLPE=1564626
  Integer(Ikxl), Parameter :: nFGen=57, nFGGS=57, nFSxH=49, nFPE=4
  Integer(Ikxl), Parameter :: nLGen=4343921
  call fi%init(dim1=nFi)
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  !!@@@@ lower ram demand but seg-fault
  call ge%init(dim1=nLGen,dim2=nFGen,what="SPECIAL")
  !!@@@@ higher ram demand but running
  !call ge%init(dim1=nLGen,dim2=nFGen,what="3D")
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  call geG%init(dim1=nLGG,dim2=nFGen,what="3D")
  call pe%init(dim1=nLPE,dim2=nFPE,what="3D")
  call sxh%init(dim1=nLSxH,dim2=nFSxH,what="3D")
  Allocate(x(5))
  x(1)%pt=>fi
  x(2)%pt=>pe;x(3)%pt=>sxh;x(4)%pt=>geG
  x(5)%pt=>ge
  !$ call omp_set_nested(.TRUE.)
  !$OMP PARALLEL DO PRIVATE(i)
  Do i=1,size(x)
    write(*,*) i
    call x(i)%pt%Mult()
    write(*,*) i
  End Do
  !$OMP END PARALLEL DO
End Program Test

When "ge" is initialized with "what=3D", which is much more RAM demanding, the pogram runs. But when "ge" is initialized with "what=SPECIAL", which should be very RAM economical, the pogram crashes with a seg-fault. Running it in gdb gives:

[New LWP 430243]

Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 430243]
0x0000000000576c7a in __intel_avx_rep_memset ()
(gdb) bt
#0  0x0000000000576c7a in __intel_avx_rep_memset ()
#1  0x0000000000402c71 in mod_sigma_type_special_mp_submult_ ()
#2  0x0000000000402b32 in MAIN__ ()
#3  0x000000000046ae23 in __kmp_invoke_microtask ()
#4  0x0000000000423a90 in __kmp_invoke_task_func ()
#5  0x0000000000422d55 in __kmp_launch_thread ()
#6  0x000000000046b211 in _INTERNAL_24_______src_z_Linux_util_c_54df53be::__kmp_launch_worker(void*) ()
#7  0x0000000000587914 in start_thread (arg=0x2aaf2ffff900) at pthread_create.c:312
#8  0x00000000005f81e9 in clone ()
(gdb)

Compiler command was

ifort -warn nounused -warn declarations -O3 -static -warn interface -mkl=parallel -qopenmp tmp.f90

The pogram ran on a machine with 56 "Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz" processors and 512GB RAM.

ifort version is 17.01, stack size is set to unlimited, omp_stacksize to 32MB.

Thanks for any idea??

jimdempseyatthecove · ‎03-21-2017

There are some potential, and glaring issues.

1) you are using nested parallel regions, nothing wrong with this in particular, however your code is not making any considerations for oversubscription issue. This could result in your outer region establishing a thread pool of 56 threads, and each inner region having each thread establish its own thread pool of 56 threads. IOW 56 * 56 threads, or 3,136 threads.

2) Your compile line option is using -mkl=parallel. When a serial program uses MKL you specify -mkl=parallel to get the advantage of parallelization (in your serial program) for your calls into MKL. Unless you take special programming considerations, parallel programs should use -mkl=serial. Not taking such programming considerations could result in each thread calling MKL (via matmul calling MKL) could result in that call for that thread generating an additional 56 thread pool. IOW potentially 56*56*56 threads (175,616). The actual number of thread generated may be affected by additional default parameters and/or environment variables

Suggestions:

Change your ifort option to -mkl=serial, and balance the number of threads using the outer loop with the number of threads spawned for each thread in the inner loop.

Examples

2 threads on outer loop, 28 threads on inner loop (2 * 28 = 56)
4 threads on outer loop, 14 threads on inner loop ( 4 * 14 = 56)

Once you do this, and if it runs successfully, then you might experiment with some additional tuning.

MKL_NUM_THREADS=4
-mkl=parallel
2 threads outer region, 7 threads inner region (2*7*4 = 56)

Then you can experiment with slight oversubscription (at one of the layers at a time).

Jim Dempsey

may_ka · ‎03-21-2017

Hi Jim.

I am(was) aware of the oversubscription problem. Solving it seems only trivial for the example because in reality all these hard coded numbers are variables. Thus, not only the length of the outer loop is defined by the input data (the number of objects), also dimensions of all arrays in each direction are unknown at compile time. Moreover, in reality there might be more than two levels of loops. I made versions using mkl=serial for the original program, and as far as I remember speed was reduced. I'll post numbers about that soon.

I am still struggleing to understand why the segfault occurs when changing object classes. The oversubscription problem is virulent invariable of the object class. Moreover, when reducing the first dimension of some of the Mat1 and Mat2 arrays, which does NOT change number of theards called because they are derived from the second dimension, the pogram runs. Finally, when incorporating the "sigma" structure into "struct", abandoning the "sigma" class (which is just a helper to make stuff flexible and clear), and putting the multiplications in some "if allocated" statements, the program runs as well, of coures with the same level of oversubscription.

So how does this all fit together??? I was more certain about an omp bug (thats why I called for intel help in the thread title but no one bothered ...... hm?)!!

Thanks.

jimdempseyatthecove · ‎03-21-2017

Can you get the segfault to occur while debugging *** with the debugger set to trap on segfault. Then see if you can examine the disassembly code to determine what address is being reference, by which thread, then walk up the call stack. At some point you will (may) reach a Fortran source statement (or some meaningfull address in a support library). This information, in the right hands, may yield a solution (coding error) or strong indication of a bug inside the Fortran runtime, C runtime, and/or library code.

Your original problem (#1) had segfault in __kmp_hyper_barrier_release
This latest problem occurs in __intel_avx_rep_memset

Something has changed??

Also, you are compiling (linking) with -static. The Intel OpenMP library is distributed only as a shared library. What you need to check is to as if the linker is linking the Intel shared OpenMP library... or if the linker is linking in gcc's static OpenMP library. Mixing vendors threading libraries may be problematic.

Jim Dempsey

Kevin_D_Intel · ‎03-21-2017

I have been trying your test case in post #7 but with limited progress. My systems lack static libs necessary to link with –static and while I can link w/o –static and run as is with a single thread, it does not run when initialized with “ge” as you indicated, so I did not feel I was even close to replicating the issue. I am inquiring w/others for assistance.

may_ka · ‎03-21-2017

Hi Jim,

the example code is just an example, the original debugging output is from the original, not posted example. However, the original program crashes at a structural component 100% similar to that in the example program, and the remedy for the orginal program is the same as for the example: use the same object class or incorporate the object characteristics and abandon that class. However, I'll follow up your advice on the linker and try to get my head around your debugging option although this is uncharted territory for me at the moment.

Thanks

may_ka · ‎03-21-2017

Hi Kevin,

thanks for the post. This gives me a bit of hope because people at intel looking into it as well. I don't know whether I will be successfull following Jim's proposal because of personal incapacity. As I mentioned in one of my earlier comments I could you supply an executable (or with several thousand lines of source code and a makefile) and a data set which triggers the fault. Let me know.

Thanks

Kevin_D_Intel · ‎03-21-2017

A little more progress. For the test case in post #7, using either initialization method (3D or SPECIAL), our internal development compiler flags a condition about a pointer with CONTIGUOUS attribute being made to a non-contiguous target.

If I remove the CONTIGUOUS attribute (lines 22 & 23) the test case then runs with 17.0.1.

I don't have much familiarity with this. Is that a helpful information?

may_ka · ‎03-21-2017

Hi kevin,

thanks for the info. I had a thought in that direction already. The problem is that the matrices mat1 and mat2 have their first dimension from 0 to N where the pointer is associated from 1 to N. This should be contiguous with regard to the array because it is a contiguous section, but not contiguous with regard how the array is stored. From my understanding the contiguous attribute relates to the array and to its storage, but I am not sure about it. Maybe you or Jim can clarify.

Thanks

may_ka · ‎03-21-2017

Kevin,

I removed the contiguous attribute but on my server it's still crashing. It is also crashing when removing "-static". Intrerestingly then I get this output:

*** longjmp causes uninitialized stack frame ***: ./a.out terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7329f)[0x2b59ed3f729f]
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x2b59ed49242c]
/lib/x86_64-linux-gnu/libc.so.6(+0x10e33d)[0x2b59ed49233d]
/lib/x86_64-linux-gnu/libc.so.6(__longjmp_chk+0x29)[0x2b59ed492299]
./a.out[0x488bd2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x2b59ed176330]
/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0xb5241)[0x2b59ecb72241]
/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0x5595c)[0x2b59ecb1295c]
/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0x571d8)[0x2b59ecb141d8]
/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0x80110)[0x2b59ecb3d110]
/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0xb1193)[0x2b59ecb6e193]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x2b59ed16e184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x2b59ed481bed]

Maybe that is of any help.

may_ka · ‎03-21-2017

Jim,

when using "-mkl=sequential" the crash is still virulent. With "-static" removed the crash report is:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
a.out              0000000000489D61  Unknown               Unknown  Unknown
a.out              0000000000487E9B  Unknown               Unknown  Unknown
a.out              0000000000450CE4  Unknown               Unknown  Unknown
a.out              0000000000450AF6  Unknown               Unknown  Unknown
a.out              00000000004320A9  Unknown               Unknown  Unknown
a.out              0000000000407416  Unknown               Unknown  Unknown
libpthread-2.19.s  00002B0E0BD2F330  Unknown               Unknown  Unknown
a.out              0000000000494E3A  Unknown               Unknown  Unknown
a.out              0000000000403E64  Unknown               Unknown  Unknown
a.out              0000000000403D22  Unknown               Unknown  Unknown
libiomp5.so        00002B0E0BA2CD13  __kmp_invoke_micr     Unknown  Unknown
libiomp5.so        00002B0E0B9FCB17  Unknown               Unknown  Unknown
libiomp5.so        00002B0E0B9FC1C5  Unknown               Unknown  Unknown
libiomp5.so        00002B0E0BA2D193  Unknown               Unknown  Unknown
libpthread-2.19.s  00002B0E0BD27184  Unknown               Unknown  Unknown
libc-2.19.so       00002B0E0C03ABED  clone                 Unknown  Unknown

Moreover, the crash remains even with all omp flags removed except the one around the outer loop in the main program. This should rule out the oversubscription problem as a possible cause. The only remedy is still to change the object class for sigma in object "ge". I can imagine that it more and more boils down to a library bug somewhere.

Thanks for any comment.

Thanks

may_ka · ‎03-21-2017

Hi Jim and Kevin,

I changed to "-mkl=sequential" and removed "-static" with no luck. Still crashing. Interestingly it also crashes when removing all omp flags except those in the main program. This also rules out the oversubscription as a possible cause. The seg fault report was:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
a.out              0000000000489811  Unknown               Unknown  Unknown
a.out              000000000048794B  Unknown               Unknown  Unknown
a.out              0000000000450794  Unknown               Unknown  Unknown
a.out              00000000004505A6  Unknown               Unknown  Unknown
a.out              0000000000431B59  Unknown               Unknown  Unknown
a.out              0000000000406EC6  Unknown               Unknown  Unknown
libpthread-2.19.s  00002AFD84EB3330  Unknown               Unknown  Unknown
a.out              00000000004948FA  Unknown               Unknown  Unknown
a.out              0000000000403D47  Unknown               Unknown  Unknown
a.out              0000000000403C22  Unknown               Unknown  Unknown
libiomp5.so        00002AFD84BB0D13  __kmp_invoke_micr     Unknown  Unknown
libiomp5.so        00002AFD84B80B17  Unknown               Unknown  Unknown
libiomp5.so        00002AFD84B801C5  Unknown               Unknown  Unknown
libiomp5.so        00002AFD84BB1193  Unknown               Unknown  Unknown
libpthread-2.19.s  00002AFD84EAB184  Unknown               Unknown  Unknown
libc-2.19.so       00002AFD851BEBED  clone                 Unknown  Unknown

compiler command was:

ifort -warn nounused -warn declarations -O3 -warn interface -mkl=sequential -qopenmp tmp.f90

It ran on a machine with 32 "Intel(R) Xeon(R) E5-2630 v3" processors and 256G of RAM (Note that with omp flags removed the program will only have 5 threads). Still the only remedy is to change class.

Thanks for any comment.

jimdempseyatthecove · ‎03-22-2017

In you code example you have:

this%Sigma%PreMult=>this%Mat1
this%Sigma%PostMult=>this%Mat2

Where Mat1 and Mat2 are allocatables, but nowhere have they been allocated. Your actual code may have performed the allocations.

A pointer can have a stride other than 1. While removing contiguous would resolve a misunderstanding with the Fortran code, it may present an issue with MKL Which requires contiguous array sections.

You could quickly test this out:

  Subroutine SubMult(this)
    Implicit None
    Class(Sigma_Type_A), Intent(In) :: this
    Integer :: i
    real, allocatable :: tempPostMult(:), tempPpremult(:)
    if(size(this%PostMult,1) /= size(this%Mat,3)) stop
    if(size(this%PostMult,1) /= size(this%PreMult,1)) stop
    if(size(this%PostMult,2) /= size(this%PreMult,2)) stop
   Do i=1,size(this%Mat,3)
      tempPremult = this%PreMult(i,:)
      tempPostMult = matmul(tempPremult,this%Mat(:,:,i))
      this%PostMult(i,:)=tempPostMult 
    End Do
  End Subroutine SubMult

And do the same thing in the parallel regions using matmul (remembering to make private the temp arrays).

The intention of the code is to help identify (eliminate) potential conflicts, and not as a recommended solution.

Jim Dempsey

may_ka · ‎03-22-2017

Hi Jim,

Mat1 and Mat2 are allocated in line 223 and 267 of the working example. With regard to the contiguous attribute which is in the working example I wondered in #15 about it and I am still not sure whethe I am using it correctly. As written in #15, Mat1 and Mat2 have their first dimension starting at zero and ending at N, whereas the pointer is associated from 1 to N. From my understanding that should be ok for a pointer with the contiguous attribute.

Thanks

may_ka · ‎03-22-2017

Jim, your example is with regard to the 3D matrix, which is perfectly fine. However, the actual solution to that problem is to reshape the 3D Matrix into a 1D vector. The multiplication with a 2D Matrix is then done by iterating over a product loop using columns of the 2D Matrix and contiguous sections of the vectorized 3D matrix. Since the number of multiplications and additons remains the same, depending on the hardware this gives a speed up between 0 and 20% (if I remember correctly) For the sake of simplicity have not posted that code here (its rather linear algebra than compiler related). If you are interested let me know I can get you a copy.