- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
one of my programs is crashing when runnig a threaded version. When running it inside gdb the output left me helpless:
[New LWP 397493]
Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 397493]
0x0000000001dad557 in _INTERNAL_25_______src_kmp_barrier_cpp_5de9139b::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) ()
gdb bt yielded:
#0 0x0000000001dad557 in _INTERNAL_25_______src_kmp_barrier_cpp_5de9139b::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) ()
#1 0x0000000001dae38b in __kmp_fork_barrier(int, int) ()
#2 0x0000000001d150c0 in __kmp_launch_thread ()
#3 0x0000000001d5d341 in _INTERNAL_26_______src_z_Linux_util_cpp_47afea4b::__kmp_launch_worker(void*) ()
#4 0x0000000001eb3ff7 in start_thread ()
#5 0x0000000001f2507b in clone ()
To get an idea about parts of the structure of the program a code snippet which mimics what the program is doing is given below. However, this is just for examplification, I have not tested whether the snippet will produce the same segfaut.
Module Mod_Root
Implicit none
Type :: root
End type root
End Module Mod_Root
Module Mod_Sigma
use Mod_Root, only: root
Implicit None
Type, abstract, extends(root) :: Sigma
Real, Pointer, contiguous :: PreMult(:,:), PostMult(:,:)
contains
Procedure(SubMult), PAss, Public, Deferred :: Mult
end type Sigma
Abstract Interface
Subroutine SubMult(this)
Import Sigma
Class(Sigma), Intent(In) :: this
End Subroutine SubMult
End Interface
Private :: SubMult
End Module Mod_Sigma
Module Mod_Sigma_Type_A
use Mod_Sigma, only: Sigma
Type, extends(Sigma) :: Sigma_Type_A
Real, Allocatable :: Mat(:,:,:)
contains
Procedure, Pass, Public :: Mult=>SubMult
End type Sigma_Type_A
Private :: SubMult
contains
Subroutine SubMult(this)
Implicit None
Class(Sigma_Type_A), Intent(In) :: this
Integer :: i
Do i=1,size(this%Mat,3)
this%PostMult(i,:)=matmul(this%PreMult(i,:),this%Mat(:,:,i))
End Do
End Subroutine SubMult
End Module Mod_Sigma_Type_A
Module Mod_Sigma_Type_B
use Mod_Sigma, only: Sigma
Type, extends(Sigma) :: Sigma_Type_B
Real, Allocatable :: Mat(:,:)
contains
Procedure, Pass, Public :: Mult=>SubMult
End type Sigma_Type_B
Private :: SubMult
contains
Subroutine SubMult(this)
Implicit None
Class(Sigma_Type_B), Intent(In) :: this
this%PostMult=matmul(this%PreMult,this%Mat)
End Subroutine SubMult
End Module Mod_Sigma_Type_B
Module Mod_Struct
use Mod_Root, only: root
use Mod_Sigma, only: sigma
Type,extends(root), abstract :: Struct
Class(Sigma), Allocatable :: Sigma
Contains
Procedure(SubMult), Public, PAss, Deferred :: Mult
End type Struct
Type :: StructPt
CLass(Struct), Pointer :: pt
end type StructPt
Abstract interface
Subroutine SubMult(this)
Import Struct
Class(Struct), Intent(InOut), Target :: this
end Subroutine SubMult
End interface
End Module Mod_Struct
Module Mod_Struct_A
use Mod_Struct
Type, extends(Struct) :: Struct_Type_A
Real, Allocatable :: Mat1(:,:), Mat2(:,:)
Contains
Procedure, Pass, Public :: Mult => SubMultSigma
End type Struct_Type_A
Private :: SubMultSigma
contains
Subroutine SubMultSigma(this)
Implicit None
Class(Struct_Type_A), Intent(InOut), Target :: this
this%Sigma%PreMult=>this%Mat1
this%Sigma%PostMult=>this%Mat2
call this%Sigma%Mult()
End Subroutine SubMultSigma
End Module Mod_Struct_A
Program Test
use Mod_Struct
use Mod_Struct_A
use Mod_Sigma_Type_A
use Mod_Sigma_Type_B
Type(Struct_Type_A), Target :: a, b
Class(StructPt), Allocatable :: x(:)
Integer :: i
allocate(Sigma_Type_A::a%sigma)
allocate(Sigma_Type_B::b%sigma)
Allocate(x(2))
x(1)%pt=>a;x(2)%pt=>b
!$OMP PARALLEL DO PRIVATE(i)
Do i=1,2
call x(i)%pt%Mult()
End Do
!$OMP END PARALLEL DO
End Program Test
The segfault in my progrram occurs in a location similar to when calling x(i)%pt%Mult, but only if b%sigma has been allocated as type "Sigma_Type_B". If both, a and b, has been allocated as type "Sigma_Type_A", the program runs fine invaribaly of the size of the relevant arrays. Moreover, threaded or unthreaded the pogram always runs when the involved arrays are small. However, when arrays occupy up to 200GB of RAM and different type allocations are used, it crashes.
ifort version is 17.01, linux version is centos 7 kerner 3.10, stack size is set to unlimited, omp_stacksize to 32MB.
compiler flags were
-assume byterecl -warn nounused -warn declarations -O0 -static -check all -traceback -warn interface -check noarg_temp_created -mkl=parallel -qopenmp
Neither at compile time nor at run time any errors or warnings occured. The pogram ran on a machine with 56 "Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz" processors and 512GB RAM.
Given the compiler flags I used and running the program inside gdb I am running out of ideas at this point. It would be great if one form Intel could look into this. I could suppliy an executable and a data set which triggers the segfault.
Thanks a lot.
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The following will definitely create an array temporary:
this%PostMult(i,:)=matmul(this%PreMult(i,:),this%Mat(:,:,i))
I am not sure about this%PostMult=matmul(this%PreMult,this%Mat) creating an array temporary (probably does internally) and not sure of size.
You may be running out of RAM.
Is the failure occurring in a test program where you loop starting with small arrays, increasing the sizes each iteration until it crashes?
If so, try a test that starts with an initial allocation just larger than the failing size. If this run the first iteration but the next one/few fail, then it is likely an allocation issue.
Edit: The issue may not be the quantity of RAM, but the fragmentation of the RAM due to sequence of allocations and deallocations (plus heap manager). The allocation failure may be occurring inside matmul (or in the statement) when it tries to obtain an array temporary (and you have no means to add a STATUS=xxx to check allocation failure).
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Note, if your problem is found to be caused by memory fragmentation, then rearranging your explicit allocation might improve matters. However, if you get stuck, you might try locating a Linux "Low Fragmentation" Heap manager. I found http://jemalloc.net/ but I haven't used this and cannot attest to if this works with Intel Visual Fortran. If fragmentation is your problem, and if you find an alternate heap manager that resolves (postpones further) the issue, then please report back your findings.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi, thanks for the response. As I wrote above, the code was just for examplification about the nested structure, I have not tested whether it will trigger the seg fault. Moreover, when I run the real program using only a single sigma type, everything goes alright although it may need 5 x more ram. In addition, I have a working omp version which differ in its structure only that sigma is not an extra class. The structure of sigma is incorrporated into struct having a 2d or 3d matrix alternatively allocated at start. Also, I made the matmul call in the example for simplicity. In "reality" that is an mkl-blas call. However, both approaches, matmul and mkl fail, but only for a particular class (say Sigma_Type_B), but not if class sigma in a and b is allocated to "Sigma_Type_B". However, note that the example code is for visualisation of the "class-soaked" nested structure. It may not result in a seg fault when run.
Thanks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Why should there be an allocation with the matmul call??? Although not given in the example code, all arrays are allocated to proper and fitting size at the program start. Implicit allocations are avoided. Moreover, I don't think that mkl-blas calls allow for automatic re-allocation.
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Why should there be an allocation with the matmul call???
this%PostMult(i,:)=matmul(this%PreMult(i,:),this%Mat(:,:,i))
The ":" on the right most index of this%PostMult(i,:) and this%PreMult(i,:) specify non-contiguous array sections. The matmul (or user called subroutine/function) typically require contiguous array sections. Therefor the compiler will auto-collect the input(s) and auto-scatter the output(s) via use of an array temporary. The gather/scatter may be omitted depending on the interface or lack thereof, with regard to INTENT(IN) INTENT(OUT) and/or INTENT(INOUT), or lack of INTENT specification.
If you can re-arrange your array indices to permit
this%PostMult(:,i)=matmul(this%PreMult(:,i),this%Mat(:,:,i)) ! change index order on PostMult, PreMult
Then this may eliminate the array temporary (as well as gather/scatter).
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi, beside the correct point raised by Jim, but which may not cause the problem, here is a working example.
Module Data_Kind
Implicit None
Integer, Parameter :: IkXL=Selected_Int_Kind(12)
Integer, Parameter :: IkL=Selected_Int_Kind(8)
Integer, Parameter :: IkM=Selected_Int_Kind(4)
Integer, Parameter :: IkS=Selected_Int_Kind(2)
Integer(Ikl), Parameter :: RkDbl=Selected_Real_Kind(15,100)
Integer(Ikl), Parameter :: RkSgl=Selected_Real_Kind(6,37)
Real(rkdbl), Parameter :: RSZero=10e-12
End Module Data_Kind
Module Mod_Root
use Data_Kind
Implicit none
Type :: root
End type root
End Module Mod_Root
Module Mod_Sigma
use Data_Kind
use Mod_Root, only: root
Implicit None
Type, abstract, extends(root) :: Sigma
Real(rkdbl), Pointer, contiguous :: MPreMult(:,:), MPostMult(:,:)
Real(rkdbl), Pointer, contiguous :: VPreMult(:), VPostMult(:)
contains
Procedure(SubMult), PAss, Public, Deferred :: Mult
Procedure(SubInit), PAss, Public, Deferred :: Init
end type Sigma
Abstract Interface
Subroutine SubMult(this)
Import Sigma
Class(Sigma), Intent(In) :: this
End Subroutine SubMult
Subroutine SubInit(this,dim1,dim2)
use data_kind
Import Sigma
Class(Sigma), Intent(InOut) :: this
Integer(Ikxl), Intent(in) :: dim1
Integer(Ikxl), Intent(in), optional :: dim2
End Subroutine SubInit
End Interface
Private :: SubMult, SubInit
End Module Mod_Sigma
Module Mod_Sigma_Type_Special
use Data_Kind
use Mod_Sigma, only: Sigma
Type, extends(Sigma) :: Sigma_Type_Special
Real(rkdbl), Allocatable :: Mat(:,:), dd(:)
contains
Procedure, Pass, Public :: Init=> SubInit
Procedure, Pass, Public :: Mult=>SubMult
End type Sigma_Type_Special
Private :: SubMult, SubInit
contains
Subroutine SubInit(this,dim1,dim2)
Class(Sigma_Type_Special), Intent(InOut) :: this
Integer(Ikxl), Intent(in) :: dim1
Integer(Ikxl), Intent(in), optional :: dim2
outer:block
if(.not.present(dim2)) tHen
write(*,*) "error";exit outer
End if
if(dim1<=0.or.dim2<=0) Then
write(*,*) "error";exit outer
End if
write(*,"(*(g0:"",""))") "Special dim1",dim1,"dim2",dim2
Allocate(this%Mat(dim1,dim1),this%dd(dim2))
end block outer
end Subroutine SubInit
Subroutine SubMult(this)
!$ use omp_lib
Implicit None
Class(Sigma_Type_Special), Intent(In) :: this
Integer(Ikxl) :: i
this%MPostMult=matmul(this%MPreMult,this%Mat)
!$ call omp_set_num_threads(size(this%MPostMult,2))
!$OMP PARALLEL DO PRIVATE(i)
Do i=1,size(this%MPostMult,2)
this%MPostMult(:,i)=this%MPostMult(:,i)*this%dd
End Do
!$OMP END PARALLEL DO
End Subroutine SubMult
End Module Mod_Sigma_Type_Special
Module Mod_Sigma_S
use Data_Kind
use Mod_Sigma, only: Sigma
Type, abstract, extends(Sigma) :: Sigma_S
End type Sigma_S
End Module Mod_Sigma_S
Module Mod_Sigma_Type_3D
use Data_Kind
use Mod_Sigma_S, only: Sigma_S
Type, extends(Sigma_S) :: Sigma_Type_3D
Real(rkdbl), Allocatable :: Mat(:,:,:)
contains
Procedure, Pass, Public :: Init=> SubInit
Procedure, Pass, Public :: Mult=>SubMult
End type Sigma_Type_3D
Private :: SubMult, SubInit
contains
Subroutine SubInit(this,dim1,dim2)
Class(Sigma_Type_3D), Intent(InOut) :: this
Integer(Ikxl), Intent(in) :: dim1
Integer(Ikxl), Intent(in), optional :: dim2
outer:block
if(.not.present(dim2)) tHen
write(*,*) "error";exit outer
End if
if(dim1<=0.or.dim2<=0) Then
write(*,*) "error";exit outer
End if
write(*,"(*(g0:"",""))") "3D dim1",dim1,"dim2",dim2
Allocate(this%Mat(dim1,dim1,dim2))
end block outer
end Subroutine SubInit
Subroutine SubMult(this)
Implicit None
Class(Sigma_Type_3D), Intent(In) :: this
Integer(Ikxl) :: i
!$ call omp_set_num_threads(size(this%MPostMult,2))
!$OMP PARALLEL DO PRIVATE(i)
Do i=1,size(this%Mat,3)
this%MPostMult(i,:)=matmul(this%MPreMult(i,:),this%Mat(:,:,i))
End Do
!$OMP END PARALLEL DO
End Subroutine SubMult
End Module Mod_Sigma_Type_3D
Module Mod_Sigma_Type_1D
use Data_Kind
use Mod_Sigma_S, only: Sigma_S
Type, extends(Sigma_S) :: Sigma_Type_1D
Real(rkdbl), Allocatable :: vec(:)
contains
Procedure, Pass, Public :: Init=> SubInit
Procedure, Pass, Public :: Mult=>SubMult
End type Sigma_Type_1D
Private :: SubMult, SubInit
contains
Subroutine SubInit(this,dim1,dim2)
Class(Sigma_Type_1D), Intent(InOut) :: this
Integer(Ikxl), Intent(in) :: dim1
Integer(Ikxl), Intent(in), optional :: dim2
outer:block
if(dim1==0) Then
write(*,*) "error";exit outer
End if
Allocate(this%vec(dim1))
end block outer
end Subroutine SubInit
Subroutine SubMult(this)
!$ use omp_lib
Implicit None
Class(Sigma_Type_1D), Intent(In) :: this
Integer(Ikxl) :: i
if(associated(this%MPostMult)) tHen
!$ call omp_set_num_threads(size(this%MPostMult,2))
!$OMP PARALLEL DO PRIVATE(i)
Do i=1,size(this%MPreMult,2)
this%MPostMult(:,i)=this%MPreMult(:,i)*this%vec
End Do
!$OMP END PARALLEL DO
Elseif(associated(this%VPostMult)) Then
!$ call omp_set_num_threads(40)
!$OMP PARALLEL Do
Do i=1,size(this%VPostMult)
this%VPostMult(i)=this%VPreMult(i)*this%vec(i)
End Do
!$OMP END PARALLEL DO
End if
End Subroutine SubMult
End Module Mod_Sigma_Type_1D
Module Mod_Struct
use Data_Kind
use Mod_Root, only: root
use Mod_Sigma, only: sigma
Type,extends(root), abstract :: Struct
Class(Sigma), Allocatable :: Sigma
Contains
Procedure(SubInit), Public, Pass, Deferred :: Init
Procedure(SubMult), Public, PAss, Deferred :: Mult
End type Struct
Type :: StructPt
CLass(Struct), Pointer :: pt
end type StructPt
Abstract interface
Subroutine SubMult(this)
Import Struct
Class(Struct), Intent(InOut), Target :: this
end Subroutine SubMult
Subroutine SubInit(this,dim1,dim2,what)
use data_kind
Import Struct
Class(Struct), Intent(InOut) :: this
Integer(Ikxl), Intent(In) :: dim1
Integer(Ikxl), Intent(In), optional :: dim2
Character(len=*), Intent(In), optional :: what
end Subroutine SubInit
End interface
End Module Mod_Struct
Module Mod_Struct_A
use Data_Kind
use Mod_Struct, only: Struct
use Mod_Sigma_Type_3D, only: Sigma_Type_3D
use Mod_Sigma_Type_Special, only: Sigma_Type_Special
Type, extends(Struct) :: Struct_Type_A
Real(rkdbl), Allocatable :: Mat1(:,:), Mat2(:,:)
Integer(ikxl) :: dim1, dim2
Contains
Procedure, Pass, Public :: Mult => SubMultSigma
Procedure, Pass, Public :: Init => SubInit
End type Struct_Type_A
Private :: SubMultSigma, SubInit
contains
Subroutine SubInit(this,dim1,dim2,What)
Class(Struct_Type_A), Intent(InOut) :: this
Integer(Ikxl), Intent(In) :: dim1
Integer(Ikxl), Intent(In), optional :: dim2
Character(len=*), Intent(In), optional :: What
outer:block
if(.not.present(dim2)) Then
write(*,*) "error"; exit outer
end if
this%dim1=dim1;this%dim2=dim2
Allocate(this%Mat1(0:dim1,dim2),this%Mat2(0:dim1,dim2))
if(present(what)) Then
Select Case(trim(adjustL(what)))
Case("3D")
Allocate(Sigma_Type_3D::this%sigma)
call this%sigma%init(dim1=dim2,dim2=dim1)
Case("SPECIAL")
Allocate(Sigma_Type_Special::this%sigma)
call this%sigma%init(dim1=dim2,dim2=dim1)
End Select
Else
Allocate(Sigma_Type_3D::this%sigma)
call this%sigma%init(dim1=dim2,dim2=dim1)
End if
End block outer
End Subroutine SubInit
Subroutine SubMultSigma(this)
Implicit None
Class(Struct_Type_A), Intent(InOut), Target :: this
this%Sigma%MPreMult=>this%Mat1(1:this%dim1,:)
this%Sigma%MPostMult=>this%Mat2(1:this%dim1,:)
call this%Sigma%Mult()
End Subroutine SubMultSigma
End Module Mod_Struct_A
Module Mod_Struct_B
use Data_Kind
use Mod_Sigma_Type_1D, only: Sigma_Type_1D
use Mod_Struct, only: Struct
Type, extends(Struct) :: Struct_Type_B
Real(rkdbl), Allocatable :: Mat1(:), Mat2(:)
Integer(ikxl) :: dim1
Contains
Procedure, Pass, Public :: Mult => SubMultSigma
Procedure, Pass, Public :: Init => SubInit
End type Struct_Type_B
Private :: SubMultSigma, SubInit
contains
Subroutine SubInit(this,dim1,dim2,What)
Class(Struct_Type_B), Intent(InOut) :: this
Integer(Ikxl), Intent(In) :: dim1
Integer(Ikxl), Intent(In), optional :: dim2
Character(len=*), Intent(In), optional :: what
outer:block
this%dim1=dim1
Allocate(this%Mat1(0:dim1),this%Mat2(0:dim1))
Allocate(Sigma_Type_1D::this%sigma)
call this%sigma%init(dim1=dim1)
End block outer
End Subroutine SubInit
Subroutine SubMultSigma(this)
Implicit None
Class(Struct_Type_B), Intent(InOut), Target :: this
this%Sigma%VPreMult=>this%Mat1(1:this%dim1)
this%Sigma%VPostMult=>this%Mat2(1:this%dim1)
call this%Sigma%Mult()
End Subroutine SubMultSigma
End Module Mod_Struct_B
Program Test
!$ use omp_lib
use Data_Kind
use Mod_Struct
use Mod_Struct_A
use Mod_Struct_B
Type(Struct_Type_B), Target :: fi
Type(Struct_Type_A), Target :: ge, geG, sxh, pe
Class(StructPt), Allocatable :: x(:)
Integer :: i
Integer(Ikxl), Parameter :: nFi=162469876, nLGG=794, nLSxH=533346, nLPE=1564626
Integer(Ikxl), Parameter :: nFGen=57, nFGGS=57, nFSxH=49, nFPE=4
Integer(Ikxl), Parameter :: nLGen=4343921
call fi%init(dim1=nFi)
!!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
!!@@@@ lower ram demand but seg-fault
call ge%init(dim1=nLGen,dim2=nFGen,what="SPECIAL")
!!@@@@ higher ram demand but running
!call ge%init(dim1=nLGen,dim2=nFGen,what="3D")
!!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
call geG%init(dim1=nLGG,dim2=nFGen,what="3D")
call pe%init(dim1=nLPE,dim2=nFPE,what="3D")
call sxh%init(dim1=nLSxH,dim2=nFSxH,what="3D")
Allocate(x(5))
x(1)%pt=>fi
x(2)%pt=>pe;x(3)%pt=>sxh;x(4)%pt=>geG
x(5)%pt=>ge
!$ call omp_set_nested(.TRUE.)
!$OMP PARALLEL DO PRIVATE(i)
Do i=1,size(x)
write(*,*) i
call x(i)%pt%Mult()
write(*,*) i
End Do
!$OMP END PARALLEL DO
End Program Test
When "ge" is initialized with "what=3D", which is much more RAM demanding, the pogram runs. But when "ge" is initialized with "what=SPECIAL", which should be very RAM economical, the pogram crashes with a seg-fault. Running it in gdb gives:
[New LWP 430243] Program received signal SIGSEGV, Segmentation fault. [Switching to LWP 430243] 0x0000000000576c7a in __intel_avx_rep_memset () (gdb) bt #0 0x0000000000576c7a in __intel_avx_rep_memset () #1 0x0000000000402c71 in mod_sigma_type_special_mp_submult_ () #2 0x0000000000402b32 in MAIN__ () #3 0x000000000046ae23 in __kmp_invoke_microtask () #4 0x0000000000423a90 in __kmp_invoke_task_func () #5 0x0000000000422d55 in __kmp_launch_thread () #6 0x000000000046b211 in _INTERNAL_24_______src_z_Linux_util_c_54df53be::__kmp_launch_worker(void*) () #7 0x0000000000587914 in start_thread (arg=0x2aaf2ffff900) at pthread_create.c:312 #8 0x00000000005f81e9 in clone () (gdb)
Compiler command was
ifort -warn nounused -warn declarations -O3 -static -warn interface -mkl=parallel -qopenmp tmp.f90
The pogram ran on a machine with 56 "Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz" processors and 512GB RAM.
ifort version is 17.01, stack size is set to unlimited, omp_stacksize to 32MB.
Thanks for any idea??
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
There are some potential, and glaring issues.
1) you are using nested parallel regions, nothing wrong with this in particular, however your code is not making any considerations for oversubscription issue. This could result in your outer region establishing a thread pool of 56 threads, and each inner region having each thread establish its own thread pool of 56 threads. IOW 56 * 56 threads, or 3,136 threads.
2) Your compile line option is using -mkl=parallel. When a serial program uses MKL you specify -mkl=parallel to get the advantage of parallelization (in your serial program) for your calls into MKL. Unless you take special programming considerations, parallel programs should use -mkl=serial. Not taking such programming considerations could result in each thread calling MKL (via matmul calling MKL) could result in that call for that thread generating an additional 56 thread pool. IOW potentially 56*56*56 threads (175,616). The actual number of thread generated may be affected by additional default parameters and/or environment variables
Suggestions:
Change your ifort option to -mkl=serial, and balance the number of threads using the outer loop with the number of threads spawned for each thread in the inner loop.
Examples
2 threads on outer loop, 28 threads on inner loop (2 * 28 = 56)
4 threads on outer loop, 14 threads on inner loop ( 4 * 14 = 56)
Once you do this, and if it runs successfully, then you might experiment with some additional tuning.
MKL_NUM_THREADS=4
-mkl=parallel
2 threads outer region, 7 threads inner region (2*7*4 = 56)
Then you can experiment with slight oversubscription (at one of the layers at a time).
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Jim.
I am(was) aware of the oversubscription problem. Solving it seems only trivial for the example because in reality all these hard coded numbers are variables. Thus, not only the length of the outer loop is defined by the input data (the number of objects), also dimensions of all arrays in each direction are unknown at compile time. Moreover, in reality there might be more than two levels of loops. I made versions using mkl=serial for the original program, and as far as I remember speed was reduced. I'll post numbers about that soon.
I am still struggleing to understand why the segfault occurs when changing object classes. The oversubscription problem is virulent invariable of the object class. Moreover, when reducing the first dimension of some of the Mat1 and Mat2 arrays, which does NOT change number of theards called because they are derived from the second dimension, the pogram runs. Finally, when incorporating the "sigma" structure into "struct", abandoning the "sigma" class (which is just a helper to make stuff flexible and clear), and putting the multiplications in some "if allocated" statements, the program runs as well, of coures with the same level of oversubscription.
So how does this all fit together??? I was more certain about an omp bug (thats why I called for intel help in the thread title but no one bothered ...... hm?)!!
Thanks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Can you get the segfault to occur while debugging *** with the debugger set to trap on segfault. Then see if you can examine the disassembly code to determine what address is being reference, by which thread, then walk up the call stack. At some point you will (may) reach a Fortran source statement (or some meaningfull address in a support library). This information, in the right hands, may yield a solution (coding error) or strong indication of a bug inside the Fortran runtime, C runtime, and/or library code.
Your original problem (#1) had segfault in __kmp_hyper_barrier_release
This latest problem occurs in __intel_avx_rep_memset
Something has changed??
Also, you are compiling (linking) with -static. The Intel OpenMP library is distributed only as a shared library. What you need to check is to as if the linker is linking the Intel shared OpenMP library... or if the linker is linking in gcc's static OpenMP library. Mixing vendors threading libraries may be problematic.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have been trying your test case in post #7 but with limited progress. My systems lack static libs necessary to link with –static and while I can link w/o –static and run as is with a single thread, it does not run when initialized with “ge” as you indicated, so I did not feel I was even close to replicating the issue. I am inquiring w/others for assistance.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Jim,
the example code is just an example, the original debugging output is from the original, not posted example. However, the original program crashes at a structural component 100% similar to that in the example program, and the remedy for the orginal program is the same as for the example: use the same object class or incorporate the object characteristics and abandon that class. However, I'll follow up your advice on the linker and try to get my head around your debugging option although this is uncharted territory for me at the moment.
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Kevin,
thanks for the post. This gives me a bit of hope because people at intel looking into it as well. I don't know whether I will be successfull following Jim's proposal because of personal incapacity. As I mentioned in one of my earlier comments I could you supply an executable (or with several thousand lines of source code and a makefile) and a data set which triggers the fault. Let me know.
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
A little more progress. For the test case in post #7, using either initialization method (3D or SPECIAL), our internal development compiler flags a condition about a pointer with CONTIGUOUS attribute being made to a non-contiguous target.
If I remove the CONTIGUOUS attribute (lines 22 & 23) the test case then runs with 17.0.1.
I don't have much familiarity with this. Is that a helpful information?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi kevin,
thanks for the info. I had a thought in that direction already. The problem is that the matrices mat1 and mat2 have their first dimension from 0 to N where the pointer is associated from 1 to N. This should be contiguous with regard to the array because it is a contiguous section, but not contiguous with regard how the array is stored. From my understanding the contiguous attribute relates to the array and to its storage, but I am not sure about it. Maybe you or Jim can clarify.
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Kevin,
I removed the contiguous attribute but on my server it's still crashing. It is also crashing when removing "-static". Intrerestingly then I get this output:
*** longjmp causes uninitialized stack frame ***: ./a.out terminated ======= Backtrace: ========= /lib/x86_64-linux-gnu/libc.so.6(+0x7329f)[0x2b59ed3f729f] /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x2b59ed49242c] /lib/x86_64-linux-gnu/libc.so.6(+0x10e33d)[0x2b59ed49233d] /lib/x86_64-linux-gnu/libc.so.6(__longjmp_chk+0x29)[0x2b59ed492299] ./a.out[0x488bd2] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x2b59ed176330] /opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0xb5241)[0x2b59ecb72241] /opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0x5595c)[0x2b59ecb1295c] /opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0x571d8)[0x2b59ecb141d8] /opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0x80110)[0x2b59ecb3d110] /opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0xb1193)[0x2b59ecb6e193] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x2b59ed16e184] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x2b59ed481bed]
Maybe that is of any help.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Jim,
when using "-mkl=sequential" the crash is still virulent. With "-static" removed the crash report is:
forrtl: severe (174): SIGSEGV, segmentation fault occurred Image PC Routine Line Source a.out 0000000000489D61 Unknown Unknown Unknown a.out 0000000000487E9B Unknown Unknown Unknown a.out 0000000000450CE4 Unknown Unknown Unknown a.out 0000000000450AF6 Unknown Unknown Unknown a.out 00000000004320A9 Unknown Unknown Unknown a.out 0000000000407416 Unknown Unknown Unknown libpthread-2.19.s 00002B0E0BD2F330 Unknown Unknown Unknown a.out 0000000000494E3A Unknown Unknown Unknown a.out 0000000000403E64 Unknown Unknown Unknown a.out 0000000000403D22 Unknown Unknown Unknown libiomp5.so 00002B0E0BA2CD13 __kmp_invoke_micr Unknown Unknown libiomp5.so 00002B0E0B9FCB17 Unknown Unknown Unknown libiomp5.so 00002B0E0B9FC1C5 Unknown Unknown Unknown libiomp5.so 00002B0E0BA2D193 Unknown Unknown Unknown libpthread-2.19.s 00002B0E0BD27184 Unknown Unknown Unknown libc-2.19.so 00002B0E0C03ABED clone Unknown Unknown
Moreover, the crash remains even with all omp flags removed except the one around the outer loop in the main program. This should rule out the oversubscription problem as a possible cause. The only remedy is still to change the object class for sigma in object "ge". I can imagine that it more and more boils down to a library bug somewhere.
Thanks for any comment.
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Jim and Kevin,
I changed to "-mkl=sequential" and removed "-static" with no luck. Still crashing. Interestingly it also crashes when removing all omp flags except those in the main program. This also rules out the oversubscription as a possible cause. The seg fault report was:
forrtl: severe (174): SIGSEGV, segmentation fault occurred Image PC Routine Line Source a.out 0000000000489811 Unknown Unknown Unknown a.out 000000000048794B Unknown Unknown Unknown a.out 0000000000450794 Unknown Unknown Unknown a.out 00000000004505A6 Unknown Unknown Unknown a.out 0000000000431B59 Unknown Unknown Unknown a.out 0000000000406EC6 Unknown Unknown Unknown libpthread-2.19.s 00002AFD84EB3330 Unknown Unknown Unknown a.out 00000000004948FA Unknown Unknown Unknown a.out 0000000000403D47 Unknown Unknown Unknown a.out 0000000000403C22 Unknown Unknown Unknown libiomp5.so 00002AFD84BB0D13 __kmp_invoke_micr Unknown Unknown libiomp5.so 00002AFD84B80B17 Unknown Unknown Unknown libiomp5.so 00002AFD84B801C5 Unknown Unknown Unknown libiomp5.so 00002AFD84BB1193 Unknown Unknown Unknown libpthread-2.19.s 00002AFD84EAB184 Unknown Unknown Unknown libc-2.19.so 00002AFD851BEBED clone Unknown Unknown
compiler command was:
ifort -warn nounused -warn declarations -O3 -warn interface -mkl=sequential -qopenmp tmp.f90
It ran on a machine with 32 "Intel(R) Xeon(R) E5-2630 v3" processors and 256G of RAM (Note that with omp flags removed the program will only have 5 threads). Still the only remedy is to change class.
Thanks for any comment.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
In you code example you have:
this%Sigma%PreMult=>this%Mat1
this%Sigma%PostMult=>this%Mat2
Where Mat1 and Mat2 are allocatables, but nowhere have they been allocated. Your actual code may have performed the allocations.
A pointer can have a stride other than 1. While removing contiguous would resolve a misunderstanding with the Fortran code, it may present an issue with MKL Which requires contiguous array sections.
You could quickly test this out:
Subroutine SubMult(this) Implicit None Class(Sigma_Type_A), Intent(In) :: this Integer :: i real, allocatable :: tempPostMult(:), tempPpremult(:) if(size(this%PostMult,1) /= size(this%Mat,3)) stop if(size(this%PostMult,1) /= size(this%PreMult,1)) stop if(size(this%PostMult,2) /= size(this%PreMult,2)) stop Do i=1,size(this%Mat,3) tempPremult = this%PreMult(i,:) tempPostMult = matmul(tempPremult,this%Mat(:,:,i)) this%PostMult(i,:)=tempPostMult End Do End Subroutine SubMult
And do the same thing in the parallel regions using matmul (remembering to make private the temp arrays).
The intention of the code is to help identify (eliminate) potential conflicts, and not as a recommended solution.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Jim,
Mat1 and Mat2 are allocated in line 223 and 267 of the working example. With regard to the contiguous attribute which is in the working example I wondered in #15 about it and I am still not sure whethe I am using it correctly. As written in #15, Mat1 and Mat2 have their first dimension starting at zero and ending at N, whereas the pointer is associated from 1 to N. From my understanding that should be ok for a pointer with the contiguous attribute.
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Jim, your example is with regard to the 3D matrix, which is perfectly fine. However, the actual solution to that problem is to reshape the 3D Matrix into a 1D vector. The multiplication with a 2D Matrix is then done by iterating over a product loop using columns of the 2D Matrix and contiguous sections of the vectorized 3D matrix. Since the number of multiplications and additons remains the same, depending on the hardware this gives a speed up between 0 and 20% (if I remember correctly) For the sake of simplicity have not posted that code here (its rather linear algebra than compiler related). If you are interested let me know I can get you a copy.
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page