Hi,
one of my programs crashes when running a threaded version. Running it inside gdb produced output that left me helpless:
[New LWP 397493]
Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 397493]
0x0000000001dad557 in _INTERNAL_25_______src_kmp_barrier_cpp_5de9139b::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) ()
The gdb backtrace (bt) yielded:
#0 0x0000000001dad557 in _INTERNAL_25_______src_kmp_barrier_cpp_5de9139b::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) ()
#1 0x0000000001dae38b in __kmp_fork_barrier(int, int) ()
#2 0x0000000001d150c0 in __kmp_launch_thread ()
#3 0x0000000001d5d341 in _INTERNAL_26_______src_z_Linux_util_cpp_47afea4b::__kmp_launch_worker(void*) ()
#4 0x0000000001eb3ff7 in start_thread ()
#5 0x0000000001f2507b in clone ()
To give an idea of the program's structure, the code snippet below mimics what the program does. However, it is only for illustration; I have not tested whether the snippet produces the same segfault.
Module Mod_Root
  Implicit none
  Type :: root
  End type root
End Module Mod_Root
Module Mod_Sigma
  use Mod_Root, only: root
  Implicit None
  Type, abstract, extends(root) :: Sigma
    Real, Pointer, contiguous :: PreMult(:,:), PostMult(:,:)
  contains
    Procedure(SubMult), PAss, Public, Deferred :: Mult
  end type Sigma
  Abstract Interface
    Subroutine SubMult(this)
      Import Sigma
      Class(Sigma), Intent(In) :: this
    End Subroutine SubMult
  End Interface
  Private :: SubMult
End Module Mod_Sigma
Module Mod_Sigma_Type_A
  use Mod_Sigma, only: Sigma
  Type, extends(Sigma) :: Sigma_Type_A
    Real, Allocatable :: Mat(:,:,:)
  contains
    Procedure, Pass, Public :: Mult=>SubMult
  End type Sigma_Type_A
  Private :: SubMult
contains
  Subroutine SubMult(this)
    Implicit None
    Class(Sigma_Type_A), Intent(In) :: this
    Integer :: i
    Do i=1,size(this%Mat,3)
      this%PostMult(i,:)=matmul(this%PreMult(i,:),this%Mat(:,:,i))
    End Do
  End Subroutine SubMult
End Module Mod_Sigma_Type_A
Module Mod_Sigma_Type_B
  use Mod_Sigma, only: Sigma
  Type, extends(Sigma) :: Sigma_Type_B
    Real, Allocatable :: Mat(:,:)
  contains
    Procedure, Pass, Public :: Mult=>SubMult
  End type Sigma_Type_B
  Private :: SubMult
contains
  Subroutine SubMult(this)
    Implicit None
    Class(Sigma_Type_B), Intent(In) :: this
    this%PostMult=matmul(this%PreMult,this%Mat)
  End Subroutine SubMult
End Module Mod_Sigma_Type_B
Module Mod_Struct
  use Mod_Root, only: root
  use Mod_Sigma, only: sigma
  Type,extends(root), abstract :: Struct
    Class(Sigma), Allocatable :: Sigma
  Contains
    Procedure(SubMult), Public, PAss, Deferred :: Mult
  End type Struct
  Type :: StructPt
    CLass(Struct), Pointer :: pt
  end type StructPt
  Abstract interface
    Subroutine SubMult(this)
      Import Struct
      Class(Struct), Intent(InOut), Target :: this
    end Subroutine SubMult
  End interface
End Module Mod_Struct
Module Mod_Struct_A
  use Mod_Struct
  Type, extends(Struct) :: Struct_Type_A
    Real, Allocatable :: Mat1(:,:), Mat2(:,:)
  Contains
    Procedure, Pass, Public :: Mult => SubMultSigma
  End type Struct_Type_A
  Private :: SubMultSigma
contains
  Subroutine SubMultSigma(this)
    Implicit None
    Class(Struct_Type_A), Intent(InOut), Target :: this
    this%Sigma%PreMult=>this%Mat1
    this%Sigma%PostMult=>this%Mat2
    call this%Sigma%Mult()
  End Subroutine SubMultSigma
End Module Mod_Struct_A
Program Test
  use Mod_Struct
  use Mod_Struct_A
  use Mod_Sigma_Type_A
  use Mod_Sigma_Type_B
  Type(Struct_Type_A), Target :: a, b
  Class(StructPt), Allocatable :: x(:)
  Integer :: i
  allocate(Sigma_Type_A::a%sigma)
  allocate(Sigma_Type_B::b%sigma)
  Allocate(x(2))
  x(1)%pt=>a;x(2)%pt=>b
  !$OMP PARALLEL DO PRIVATE(i)
  Do i=1,2
    call x(i)%pt%Mult()
  End Do
  !$OMP END PARALLEL DO
End Program Test
The segfault in my program occurs at a location similar to the call to x(i)%pt%Mult, but only if b%sigma has been allocated as type "Sigma_Type_B". If both a and b have been allocated as type "Sigma_Type_A", the program runs fine regardless of the size of the relevant arrays. Moreover, threaded or unthreaded, the program always runs when the involved arrays are small. However, when the arrays occupy up to 200GB of RAM and different type allocations are used, it crashes.
ifort version is 17.0.1, Linux version is CentOS 7 with kernel 3.10, stack size is set to unlimited, omp_stacksize to 32MB.
compiler flags were
-assume byterecl -warn nounused -warn declarations -O0 -static -check all -traceback -warn interface -check noarg_temp_created -mkl=parallel -qopenmp
No errors or warnings occurred at either compile time or run time. The program ran on a machine with 56 "Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz" processors and 512GB RAM.
Given the compiler flags I used, and having run the program inside gdb, I am running out of ideas at this point. It would be great if someone from Intel could look into this. I could supply an executable and a data set which triggers the segfault.
Thanks a lot.
The following will definitely create an array temporary:
this%PostMult(i,:)=matmul(this%PreMult(i,:),this%Mat(:,:,i))
I am not sure whether this%PostMult=matmul(this%PreMult,this%Mat) creates an array temporary (it probably does internally), and I am not sure of its size.
You may be running out of RAM.
Is the failure occurring in a test program where you loop starting with small arrays, increasing the sizes each iteration until it crashes?
If so, try a test that starts with an initial allocation just larger than the failing size. If this runs the first iteration but the next one or the next few fail, then it is likely an allocation issue.
Edit: The issue may not be the quantity of RAM, but the fragmentation of the RAM due to the sequence of allocations and deallocations (plus the heap manager). The allocation failure may be occurring inside matmul (or in the statement) when it tries to obtain an array temporary (and you have no means to add a STAT=xxx to check for allocation failure).
Jim Dempsey
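For diagnosis, the unchecked implicit temporary described above can be replaced by an explicitly allocated buffer whose allocation status is tested. The following is only a minimal sketch under assumptions: the names `pre`, `mat`, `tmp`, `n` and `k` and the sizes are hypothetical and not taken from the original program, and the compiler may still create an internal temporary for matmul itself.

```fortran
program explicit_temporary
  implicit none
  integer, parameter :: n = 1000, k = 57    ! hypothetical sizes
  real, allocatable :: pre(:,:), mat(:,:), tmp(:,:)
  integer :: ierr
  character(len=256) :: msg

  allocate(pre(n,k), mat(k,k), stat=ierr, errmsg=msg)
  if (ierr /= 0) then
     print *, 'input allocation failed: ', trim(msg)
     stop 1
  end if
  call random_number(pre)
  call random_number(mat)

  ! Allocate the result buffer explicitly so that a failure is reported
  ! through stat/errmsg instead of surfacing as a crash.
  allocate(tmp(n,k), stat=ierr, errmsg=msg)
  if (ierr /= 0) then
     print *, 'temporary allocation failed: ', trim(msg)
     stop 1
  end if

  tmp = matmul(pre, mat)
  print *, 'result shape: ', shape(tmp)
end program explicit_temporary
```

If the explicit allocation fails at the sizes that crash the real program, that would point towards the memory explanation; if it succeeds, the cause probably lies elsewhere.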
Note, if your problem is found to be caused by memory fragmentation, then rearranging your explicit allocations might improve matters. However, if you get stuck, you might try locating a Linux "Low Fragmentation" heap manager. I found http://jemalloc.net/ but I haven't used it and cannot attest to whether it works with Intel Fortran. If fragmentation is your problem, and if you find an alternate heap manager that resolves (or further postpones) the issue, then please report back your findings.
Jim Dempsey
Hi, thanks for the response. As I wrote above, the code was only meant to illustrate the nested structure; I have not tested whether it will trigger the segfault. Moreover, when I run the real program using only a single sigma type, everything works, although it may need 5 times more RAM. In addition, I have a working omp version which differs in its structure only in that sigma is not an extra class. The structure of sigma is incorporated into struct, which has either a 2D or a 3D matrix allocated at start. Also, I used the matmul call in the example for simplicity. In "reality" that is an mkl-blas call. However, both approaches, matmul and mkl, fail only when a particular class (say Sigma_Type_B) is mixed in, and not when sigma in both a and b is allocated as "Sigma_Type_B". Note again that the example code is for visualisation of the "class-soaked" nested structure. It may not result in a segfault when run.
Thanks.
Why should there be an allocation with the matmul call??? Although not given in the example code, all arrays are allocated to proper and fitting size at the program start. Implicit allocations are avoided. Moreover, I don't think that mkl-blas calls allow for automatic re-allocation.
Thanks
"Why should there be an allocation with the matmul call???"
this%PostMult(i,:)=matmul(this%PreMult(i,:),this%Mat(:,:,i))
The ":" on the right-most index of this%PostMult(i,:) and this%PreMult(i,:) specifies non-contiguous array sections. matmul (or a user-called subroutine/function) typically requires contiguous array sections. Therefore the compiler will automatically gather the input(s) and scatter the output(s) via an array temporary. The gather/scatter may be omitted depending on the interface or lack thereof, with regard to INTENT(IN), INTENT(OUT) and/or INTENT(INOUT), or lack of an INTENT specification.
If you can re-arrange your array indices to permit
this%PostMult(:,i)=matmul(this%PreMult(:,i),this%Mat(:,:,i)) ! change index order on PostMult, PreMult
Then this may eliminate the array temporary (as well as gather/scatter).
Jim Dempsey
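The difference between the two index orders can be checked directly with the Fortran 2008 intrinsic `is_contiguous`. This is a small illustrative sketch, assuming a compiler that supports the intrinsic; the array `a` and its shape are made up for the example.

```fortran
program section_contiguity
  implicit none
  real :: a(4,3)

  a = 0.0
  ! Fortran stores arrays column by column, so a(2,:) picks every 4th
  ! element while a(:,2) is one unbroken run of memory.
  print *, 'row    a(2,:) contiguous: ', is_contiguous(a(2,:))
  print *, 'column a(:,2) contiguous: ', is_contiguous(a(:,2))
end program section_contiguity
```

The row section should report F and the column section T, which is why swapping the index order may let the compiler skip the gather/scatter temporary.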
Hi, apart from the correct point raised by Jim, which may not be the cause of the problem, here is a working example.
Module Data_Kind
  Implicit None
  Integer, Parameter :: IkXL=Selected_Int_Kind(12)
  Integer, Parameter :: IkL=Selected_Int_Kind(8)
  Integer, Parameter :: IkM=Selected_Int_Kind(4)
  Integer, Parameter :: IkS=Selected_Int_Kind(2)
  Integer(Ikl), Parameter :: RkDbl=Selected_Real_Kind(15,100)
  Integer(Ikl), Parameter :: RkSgl=Selected_Real_Kind(6,37)
  Real(rkdbl), Parameter :: RSZero=10e-12
End Module Data_Kind
Module Mod_Root
  use Data_Kind
  Implicit none
  Type :: root
  End type root
End Module Mod_Root
Module Mod_Sigma
  use Data_Kind
  use Mod_Root, only: root
  Implicit None
  Type, abstract, extends(root) :: Sigma
    Real(rkdbl), Pointer, contiguous :: MPreMult(:,:), MPostMult(:,:)
    Real(rkdbl), Pointer, contiguous :: VPreMult(:), VPostMult(:)
  contains
    Procedure(SubMult), PAss, Public, Deferred :: Mult
    Procedure(SubInit), PAss, Public, Deferred :: Init
  end type Sigma
  Abstract Interface
    Subroutine SubMult(this)
      Import Sigma
      Class(Sigma), Intent(In) :: this
    End Subroutine SubMult
    Subroutine SubInit(this,dim1,dim2)
      use data_kind
      Import Sigma
      Class(Sigma), Intent(InOut) :: this
      Integer(Ikxl), Intent(in) :: dim1
      Integer(Ikxl), Intent(in), optional :: dim2
    End Subroutine SubInit
  End Interface
  Private :: SubMult, SubInit
End Module Mod_Sigma
Module Mod_Sigma_Type_Special
  use Data_Kind
  use Mod_Sigma, only: Sigma
  Type, extends(Sigma) :: Sigma_Type_Special
    Real(rkdbl), Allocatable :: Mat(:,:), dd(:)
  contains
    Procedure, Pass, Public :: Init=> SubInit
    Procedure, Pass, Public :: Mult=>SubMult
  End type Sigma_Type_Special
  Private :: SubMult, SubInit
contains
  Subroutine SubInit(this,dim1,dim2)
    Class(Sigma_Type_Special), Intent(InOut) :: this
    Integer(Ikxl), Intent(in) :: dim1
    Integer(Ikxl), Intent(in), optional :: dim2
    outer:block
      if(.not.present(dim2)) tHen
        write(*,*) "error";exit outer
      End if
      if(dim1<=0.or.dim2<=0) Then
        write(*,*) "error";exit outer
      End if
      write(*,"(*(g0:"",""))") "Special dim1",dim1,"dim2",dim2
      Allocate(this%Mat(dim1,dim1),this%dd(dim2))
    end block outer
  end Subroutine SubInit
  Subroutine SubMult(this)
    !$ use omp_lib
    Implicit None
    Class(Sigma_Type_Special), Intent(In) :: this
    Integer(Ikxl) :: i
    this%MPostMult=matmul(this%MPreMult,this%Mat)
    !$ call omp_set_num_threads(size(this%MPostMult,2))
    !$OMP PARALLEL DO PRIVATE(i)
    Do i=1,size(this%MPostMult,2)
      this%MPostMult(:,i)=this%MPostMult(:,i)*this%dd
    End Do
    !$OMP END PARALLEL DO
  End Subroutine SubMult
End Module Mod_Sigma_Type_Special
Module Mod_Sigma_S
  use Data_Kind
  use Mod_Sigma, only: Sigma
  Type, abstract, extends(Sigma) :: Sigma_S
  End type Sigma_S
End Module Mod_Sigma_S
Module Mod_Sigma_Type_3D
  use Data_Kind
  use Mod_Sigma_S, only: Sigma_S
  Type, extends(Sigma_S) :: Sigma_Type_3D
    Real(rkdbl), Allocatable :: Mat(:,:,:)
  contains
    Procedure, Pass, Public :: Init=> SubInit
    Procedure, Pass, Public :: Mult=>SubMult
  End type Sigma_Type_3D
  Private :: SubMult, SubInit
contains
  Subroutine SubInit(this,dim1,dim2)
    Class(Sigma_Type_3D), Intent(InOut) :: this
    Integer(Ikxl), Intent(in) :: dim1
    Integer(Ikxl), Intent(in), optional :: dim2
    outer:block
      if(.not.present(dim2)) tHen
        write(*,*) "error";exit outer
      End if
      if(dim1<=0.or.dim2<=0) Then
        write(*,*) "error";exit outer
      End if
      write(*,"(*(g0:"",""))") "3D dim1",dim1,"dim2",dim2
      Allocate(this%Mat(dim1,dim1,dim2))
    end block outer
  end Subroutine SubInit
  Subroutine SubMult(this)
    Implicit None
    Class(Sigma_Type_3D), Intent(In) :: this
    Integer(Ikxl) :: i
    !$ call omp_set_num_threads(size(this%MPostMult,2))
    !$OMP PARALLEL DO PRIVATE(i)
    Do i=1,size(this%Mat,3)
      this%MPostMult(i,:)=matmul(this%MPreMult(i,:),this%Mat(:,:,i))
    End Do
    !$OMP END PARALLEL DO
  End Subroutine SubMult
End Module Mod_Sigma_Type_3D
Module Mod_Sigma_Type_1D
  use Data_Kind
  use Mod_Sigma_S, only: Sigma_S
  Type, extends(Sigma_S) :: Sigma_Type_1D
    Real(rkdbl), Allocatable :: vec(:)
  contains
    Procedure, Pass, Public :: Init=> SubInit
    Procedure, Pass, Public :: Mult=>SubMult
  End type Sigma_Type_1D
  Private :: SubMult, SubInit
contains
  Subroutine SubInit(this,dim1,dim2)
    Class(Sigma_Type_1D), Intent(InOut) :: this
    Integer(Ikxl), Intent(in) :: dim1
    Integer(Ikxl), Intent(in), optional :: dim2
    outer:block
      if(dim1==0) Then
        write(*,*) "error";exit outer
      End if
      Allocate(this%vec(dim1))
    end block outer
  end Subroutine SubInit
  Subroutine SubMult(this)
    !$ use omp_lib
    Implicit None
    Class(Sigma_Type_1D), Intent(In) :: this
    Integer(Ikxl) :: i
    if(associated(this%MPostMult)) tHen
      !$ call omp_set_num_threads(size(this%MPostMult,2))
      !$OMP PARALLEL DO PRIVATE(i)
      Do i=1,size(this%MPreMult,2)
        this%MPostMult(:,i)=this%MPreMult(:,i)*this%vec
      End Do
      !$OMP END PARALLEL DO
    Elseif(associated(this%VPostMult)) Then
      !$ call omp_set_num_threads(40)
      !$OMP PARALLEL Do
      Do i=1,size(this%VPostMult)
        this%VPostMult(i)=this%VPreMult(i)*this%vec(i)
      End Do
      !$OMP END PARALLEL DO
    End if
  End Subroutine SubMult
End Module Mod_Sigma_Type_1D
Module Mod_Struct
  use Data_Kind
  use Mod_Root, only: root
  use Mod_Sigma, only: sigma
  Type,extends(root), abstract :: Struct
    Class(Sigma), Allocatable :: Sigma
  Contains
    Procedure(SubInit), Public, Pass, Deferred :: Init
    Procedure(SubMult), Public, PAss, Deferred :: Mult
  End type Struct
  Type :: StructPt
    CLass(Struct), Pointer :: pt
  end type StructPt
  Abstract interface
    Subroutine SubMult(this)
      Import Struct
      Class(Struct), Intent(InOut), Target :: this
    end Subroutine SubMult
    Subroutine SubInit(this,dim1,dim2,what)
      use data_kind
      Import Struct
      Class(Struct), Intent(InOut) :: this
      Integer(Ikxl), Intent(In) :: dim1
      Integer(Ikxl), Intent(In), optional :: dim2
      Character(len=*), Intent(In), optional :: what
    end Subroutine SubInit
  End interface
End Module Mod_Struct
Module Mod_Struct_A
  use Data_Kind
  use Mod_Struct, only: Struct
  use Mod_Sigma_Type_3D, only: Sigma_Type_3D
  use Mod_Sigma_Type_Special, only: Sigma_Type_Special
  Type, extends(Struct) :: Struct_Type_A
    Real(rkdbl), Allocatable :: Mat1(:,:), Mat2(:,:)
    Integer(ikxl) :: dim1, dim2
  Contains
    Procedure, Pass, Public :: Mult => SubMultSigma
    Procedure, Pass, Public :: Init => SubInit
  End type Struct_Type_A
  Private :: SubMultSigma, SubInit
contains
  Subroutine SubInit(this,dim1,dim2,What)
    Class(Struct_Type_A), Intent(InOut) :: this
    Integer(Ikxl), Intent(In) :: dim1
    Integer(Ikxl), Intent(In), optional :: dim2
    Character(len=*), Intent(In), optional :: What
    outer:block
      if(.not.present(dim2)) Then
        write(*,*) "error"; exit outer
      end if
      this%dim1=dim1;this%dim2=dim2
      Allocate(this%Mat1(0:dim1,dim2),this%Mat2(0:dim1,dim2))
      if(present(what)) Then
        Select Case(trim(adjustL(what)))
        Case("3D")
          Allocate(Sigma_Type_3D::this%sigma)
          call this%sigma%init(dim1=dim2,dim2=dim1)
        Case("SPECIAL")
          Allocate(Sigma_Type_Special::this%sigma)
          call this%sigma%init(dim1=dim2,dim2=dim1)
        End Select
      Else
        Allocate(Sigma_Type_3D::this%sigma)
        call this%sigma%init(dim1=dim2,dim2=dim1)
      End if
    End block outer
  End Subroutine SubInit
  Subroutine SubMultSigma(this)
    Implicit None
    Class(Struct_Type_A), Intent(InOut), Target :: this
    this%Sigma%MPreMult=>this%Mat1(1:this%dim1,:)
    this%Sigma%MPostMult=>this%Mat2(1:this%dim1,:)
    call this%Sigma%Mult()
  End Subroutine SubMultSigma
End Module Mod_Struct_A
Module Mod_Struct_B
  use Data_Kind
  use Mod_Sigma_Type_1D, only: Sigma_Type_1D
  use Mod_Struct, only: Struct
  Type, extends(Struct) :: Struct_Type_B
    Real(rkdbl), Allocatable :: Mat1(:), Mat2(:)
    Integer(ikxl) :: dim1
  Contains
    Procedure, Pass, Public :: Mult => SubMultSigma
    Procedure, Pass, Public :: Init => SubInit
  End type Struct_Type_B
  Private :: SubMultSigma, SubInit
contains
  Subroutine SubInit(this,dim1,dim2,What)
    Class(Struct_Type_B), Intent(InOut) :: this
    Integer(Ikxl), Intent(In) :: dim1
    Integer(Ikxl), Intent(In), optional :: dim2
    Character(len=*), Intent(In), optional :: what
    outer:block
      this%dim1=dim1
      Allocate(this%Mat1(0:dim1),this%Mat2(0:dim1))
      Allocate(Sigma_Type_1D::this%sigma)
      call this%sigma%init(dim1=dim1)
    End block outer
  End Subroutine SubInit
  Subroutine SubMultSigma(this)
    Implicit None
    Class(Struct_Type_B), Intent(InOut), Target :: this
    this%Sigma%VPreMult=>this%Mat1(1:this%dim1)
    this%Sigma%VPostMult=>this%Mat2(1:this%dim1)
    call this%Sigma%Mult()
  End Subroutine SubMultSigma
End Module Mod_Struct_B
Program Test
  !$ use omp_lib
  use Data_Kind
  use Mod_Struct
  use Mod_Struct_A
  use Mod_Struct_B
  Type(Struct_Type_B), Target :: fi
  Type(Struct_Type_A), Target :: ge, geG, sxh, pe
  Class(StructPt), Allocatable :: x(:)
  Integer :: i
  Integer(Ikxl), Parameter :: nFi=162469876, nLGG=794, nLSxH=533346, nLPE=1564626
  Integer(Ikxl), Parameter :: nFGen=57, nFGGS=57, nFSxH=49, nFPE=4
  Integer(Ikxl), Parameter :: nLGen=4343921
  call fi%init(dim1=nFi)
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  !!@@@@ lower ram demand but seg-fault
  call ge%init(dim1=nLGen,dim2=nFGen,what="SPECIAL")
  !!@@@@ higher ram demand but running
  !call ge%init(dim1=nLGen,dim2=nFGen,what="3D")
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  call geG%init(dim1=nLGG,dim2=nFGen,what="3D")
  call pe%init(dim1=nLPE,dim2=nFPE,what="3D")
  call sxh%init(dim1=nLSxH,dim2=nFSxH,what="3D")
  Allocate(x(5))
  x(1)%pt=>fi
  x(2)%pt=>pe;x(3)%pt=>sxh;x(4)%pt=>geG
  x(5)%pt=>ge
  !$ call omp_set_nested(.TRUE.)
  !$OMP PARALLEL DO PRIVATE(i)
  Do i=1,size(x)
    write(*,*) i
    call x(i)%pt%Mult()
    write(*,*) i
  End Do
  !$OMP END PARALLEL DO
End Program Test
When "ge" is initialized with what="3D", which is much more RAM demanding, the program runs. But when "ge" is initialized with what="SPECIAL", which should be very RAM economical, the program crashes with a segfault. Running it in gdb gives:
[New LWP 430243]
Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 430243]
0x0000000000576c7a in __intel_avx_rep_memset ()
(gdb) bt
#0 0x0000000000576c7a in __intel_avx_rep_memset ()
#1 0x0000000000402c71 in mod_sigma_type_special_mp_submult_ ()
#2 0x0000000000402b32 in MAIN__ ()
#3 0x000000000046ae23 in __kmp_invoke_microtask ()
#4 0x0000000000423a90 in __kmp_invoke_task_func ()
#5 0x0000000000422d55 in __kmp_launch_thread ()
#6 0x000000000046b211 in _INTERNAL_24_______src_z_Linux_util_c_54df53be::__kmp_launch_worker(void*) ()
#7 0x0000000000587914 in start_thread (arg=0x2aaf2ffff900) at pthread_create.c:312
#8 0x00000000005f81e9 in clone ()
(gdb)
Compiler command was
ifort -warn nounused -warn declarations -O3 -static -warn interface -mkl=parallel -qopenmp tmp.f90
The program ran on a machine with 56 "Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz" processors and 512GB RAM.
ifort version is 17.0.1, stack size is set to unlimited, omp_stacksize to 32MB.
Thanks for any ideas.
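The size of the matmul result that was left open earlier in the thread can be estimated from the dimensions in this test case. The sketch below only does the arithmetic; it assumes 8-byte reals for `rkdbl` and reads the reported 32MB as 32 MiB, and it says nothing about where the compiler actually places a temporary. With these numbers the whole-array matmul in the SPECIAL type yields a result of roughly 2 GB, each row-wise matmul in the 3D type yields 57 elements, and the 3D `Mat` array itself is roughly 113 GB.

```fortran
program temporary_size
  use iso_fortran_env, only: int64
  implicit none
  ! Dimensions copied from the test case above.
  integer(int64), parameter :: nLGen = 4343921_int64
  integer(int64), parameter :: nFGen = 57_int64
  ! Assumption: rkdbl maps to an 8-byte real.
  integer(int64), parameter :: bytes_per_real = 8_int64
  ! Assumption: the reported omp_stacksize of 32MB means 32 MiB.
  integer(int64), parameter :: omp_stack = 32_int64 * 1024_int64**2
  integer(int64) :: special_result, row_result, mat3d

  special_result = nLGen * nFGen * bytes_per_real          ! matmul(MPreMult, Mat), SPECIAL type
  row_result     = nFGen * bytes_per_real                  ! one row-wise matmul, 3D type
  mat3d          = nFGen * nFGen * nLGen * bytes_per_real  ! Mat(:,:,:) of the 3D type

  print '(a,i0)', 'SPECIAL matmul result (bytes): ', special_result
  print '(a,i0)', '3D row matmul result (bytes):  ', row_result
  print '(a,i0)', '3D Mat array (bytes):          ', mat3d
  print '(a,i0)', 'OMP stack size (bytes):        ', omp_stack
end program temporary_size
```

So the variant with the smaller total RAM demand is also the one with by far the largest single intermediate result. Whether that result is held on a thread stack or on the heap is not established in this thread; ifort's `-heap-arrays` option exists to move such temporaries to the heap, though nobody here reports trying it.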
There are some potential, and some glaring, issues.
1) You are using nested parallel regions. There is nothing wrong with this in particular; however, your code makes no allowance for oversubscription. This could result in your outer region establishing a thread pool of 56 threads, and each of those threads then establishing its own pool of 56 threads for the inner region. In other words, 56 * 56 threads, or 3,136 threads.
2) Your compile line uses -mkl=parallel. When a serial program uses MKL, you specify -mkl=parallel to get the advantage of parallelization (in your serial program) for your calls into MKL. Unless you take special programming considerations, parallel programs should use -mkl=sequential. Without such considerations, each thread calling MKL (via matmul calling MKL) could generate an additional pool of 56 threads for that call. In other words, potentially 56*56*56 threads (175,616). The actual number of threads generated may be affected by additional default parameters and/or environment variables.
Suggestions:
Change your ifort option to -mkl=sequential, and balance the number of threads on the outer loop against the number of threads each of them spawns for the inner loop.
Examples
2 threads on outer loop, 28 threads on inner loop (2 * 28 = 56)
4 threads on outer loop, 14 threads on inner loop ( 4 * 14 = 56)
Once you do this, and if it runs successfully, then you might experiment with some additional tuning.
MKL_NUM_THREADS=4
-mkl=parallel
2 threads outer region, 7 threads inner region (2*7*4 = 56)
Then you can experiment with slight oversubscription (at one of the layers at a time).
Jim Dempsey
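The outer/inner balancing suggested above can be sketched with `num_threads` clauses derived from the processor count. This is an illustration under assumptions, not the original program: the loop bounds, the variable names and the dummy summation are hypothetical, and it must be compiled with OpenMP enabled (for example `-qopenmp` with ifort).

```fortran
program thread_budget
  use omp_lib
  implicit none
  integer, parameter :: n_outer_iter = 2, n_inner_iter = 1000  ! hypothetical sizes
  integer :: n_total, n_outer, n_inner, i, j
  real :: s, partial(n_outer_iter)

  ! Split the available processors so that n_outer * n_inner <= n_total.
  n_total = omp_get_num_procs()
  n_outer = min(n_outer_iter, n_total)
  n_inner = max(1, n_total / n_outer)

  call omp_set_max_active_levels(2)   ! allow two nested parallel levels

  partial = 0.0
  !$omp parallel do num_threads(n_outer) private(i, j, s)
  do i = 1, n_outer_iter
     s = 0.0
     ! Each outer thread opens an inner team of n_inner threads.
     !$omp parallel do num_threads(n_inner) private(j) reduction(+:s)
     do j = 1, n_inner_iter
        s = s + 1.0
     end do
     !$omp end parallel do
     partial(i) = s
  end do
  !$omp end parallel do

  print *, 'outer threads:', n_outer, ' inner threads:', n_inner
  print *, 'partial sums: ', partial
end program thread_budget
```

Because the split is computed at run time, the same pattern applies when the outer loop length and the array dimensions are only known from the input data.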
Hi Jim.
I am (and was) aware of the oversubscription problem. Solving it only seems trivial for the example, because in reality all these hard-coded numbers are variables. Thus, not only is the length of the outer loop defined by the input data (the number of objects), the dimensions of all arrays in each direction are also unknown at compile time. Moreover, in reality there might be more than two levels of loops. I made versions using -mkl=sequential for the original program, and as far as I remember speed was reduced. I'll post numbers about that soon.
I am still struggling to understand why the segfault occurs when changing object classes. The oversubscription problem is present regardless of the object class. Moreover, when reducing the first dimension of some of the Mat1 and Mat2 arrays, which does NOT change the number of threads requested because that is derived from the second dimension, the program runs. Finally, when incorporating the "sigma" structure into "struct", abandoning the "sigma" class (which is just a helper to make things flexible and clear), and putting the multiplications in some "if allocated" statements, the program runs as well, of course with the same level of oversubscription.
So how does this all fit together??? I was more inclined to suspect an omp bug (that's why I called for Intel help in the thread title, but no one bothered ...... hm?)!!
Thanks.
Can you get the segfault to occur while debugging, with the debugger set to trap on segfault? Then see if you can examine the disassembly to determine what address is being referenced, and by which thread, then walk up the call stack. At some point you will (may) reach a Fortran source statement (or some meaningful address in a support library). This information, in the right hands, may yield a solution (coding error) or a strong indication of a bug inside the Fortran runtime, C runtime, and/or library code.
Your original problem (#1) had the segfault in __kmp_hyper_barrier_release
This latest problem occurs in __intel_avx_rep_memset
Something has changed??
Also, you are compiling (linking) with -static. The Intel OpenMP library is distributed only as a shared library. What you need to check is whether the linker is linking the Intel shared OpenMP library, or whether it is linking in gcc's static OpenMP library. Mixing vendors' threading libraries may be problematic.
Jim Dempsey
I have been trying your test case in post #7 but with limited progress. My systems lack the static libs necessary to link with -static, and while I can link without -static and run as is with a single thread, it does not run when "ge" is initialized as you indicated, so I did not feel I was even close to replicating the issue. I am asking others for assistance.
Hi Jim,
the example code is just an example; the original debugging output is from the original program, not from the posted example. However, the original program crashes at a structural component 100% similar to that in the example program, and the remedy for the original program is the same as for the example: use the same object class, or incorporate the object characteristics and abandon that class. However, I'll follow up on your advice about the linker and try to get my head around your debugging suggestion, although this is uncharted territory for me at the moment.
Thanks
Hi Kevin,
thanks for the post. This gives me a bit of hope, because people at Intel are looking into it as well. I don't know whether I will succeed in following Jim's proposal, given my own lack of experience. As I mentioned in one of my earlier comments, I could supply you with an executable (or with several thousand lines of source code and a makefile) and a data set which triggers the fault. Let me know.
Thanks
A little more progress. For the test case in post #7, using either initialization method (3D or SPECIAL), our internal development compiler flags a condition about a pointer with the CONTIGUOUS attribute being associated with a non-contiguous target.
If I remove the CONTIGUOUS attribute (lines 22 & 23 of the posted code, the declarations of the MPreMult/MPostMult and VPreMult/VPostMult pointers), the test case then runs with 17.0.1.
I don't have much familiarity with this. Is that helpful information?
Hi kevin,
thanks for the info. I had already had a thought in that direction. The problem is that the matrices mat1 and mat2 have their first dimension running from 0 to N, whereas the pointer is associated with 1 to N. This should be contiguous with regard to the array because it is a contiguous section, but not contiguous with regard to how the array is stored. From my understanding the contiguous attribute relates to the array and to its storage, but I am not sure about it. Maybe you or Jim can clarify.
Thanks
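The question raised here, whether a section starting at index 1 of an array with lower bound 0 is contiguous, can be tested with the Fortran 2008 intrinsic `is_contiguous`. The sketch below mirrors the shapes used in the test case at a tiny scale; the names `mat1`, `vec1`, `p1`, `p2` and the sizes are hypothetical, and the pointers are deliberately declared without CONTIGUOUS so that the association itself is always legal.

```fortran
program contiguous_section_check
  implicit none
  integer, parameter :: n = 4, m = 3              ! hypothetical sizes
  real, allocatable, target :: mat1(:,:), vec1(:)
  real, pointer :: p2(:,:), p1(:)

  allocate(mat1(0:n, m), vec1(0:n))
  mat1 = 0.0
  vec1 = 0.0

  ! 2D case as in Struct_Type_A: the section leaves out row 0, so one
  ! element is skipped at the start of every column.
  p2 => mat1(1:n, :)
  ! 1D case as in Struct_Type_B: only the first element is left out.
  p1 => vec1(1:n)

  print *, '2D section mat1(1:n,:) contiguous: ', is_contiguous(p2)
  print *, '1D section vec1(1:n)   contiguous: ', is_contiguous(p1)
end program contiguous_section_check
```

The 2D section should report F and the 1D section T. Contiguity in the sense of the CONTIGUOUS attribute refers to storage, so the 2D section with more than one column is not a valid target for a CONTIGUOUS pointer, which is consistent with the condition flagged by the development compiler.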
Kevin,
I removed the contiguous attribute, but on my server it is still crashing. It also crashes when "-static" is removed. Interestingly, I then get this output:
*** longjmp causes uninitialized stack frame ***: ./a.out terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7329f)[0x2b59ed3f729f]
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x2b59ed49242c]
/lib/x86_64-linux-gnu/libc.so.6(+0x10e33d)[0x2b59ed49233d]
/lib/x86_64-linux-gnu/libc.so.6(__longjmp_chk+0x29)[0x2b59ed492299]
./a.out[0x488bd2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x2b59ed176330]
/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0xb5241)[0x2b59ecb72241]
/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0x5595c)[0x2b59ecb1295c]
/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0x571d8)[0x2b59ecb141d8]
/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0x80110)[0x2b59ecb3d110]
/opt/intel/compilers_and_libraries_2017.1.132/linux/compiler/lib/intel64/libiomp5.so(+0xb1193)[0x2b59ecb6e193]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x2b59ed16e184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x2b59ed481bed]
Maybe that is of any help.
Jim,
when using "-mkl=sequential" the crash still occurs. With "-static" removed the crash report is:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line     Source
a.out              0000000000489D61  Unknown            Unknown  Unknown
a.out              0000000000487E9B  Unknown            Unknown  Unknown
a.out              0000000000450CE4  Unknown            Unknown  Unknown
a.out              0000000000450AF6  Unknown            Unknown  Unknown
a.out              00000000004320A9  Unknown            Unknown  Unknown
a.out              0000000000407416  Unknown            Unknown  Unknown
libpthread-2.19.s  00002B0E0BD2F330  Unknown            Unknown  Unknown
a.out              0000000000494E3A  Unknown            Unknown  Unknown
a.out              0000000000403E64  Unknown            Unknown  Unknown
a.out              0000000000403D22  Unknown            Unknown  Unknown
libiomp5.so        00002B0E0BA2CD13  __kmp_invoke_micr  Unknown  Unknown
libiomp5.so        00002B0E0B9FCB17  Unknown            Unknown  Unknown
libiomp5.so        00002B0E0B9FC1C5  Unknown            Unknown  Unknown
libiomp5.so        00002B0E0BA2D193  Unknown            Unknown  Unknown
libpthread-2.19.s  00002B0E0BD27184  Unknown            Unknown  Unknown
libc-2.19.so       00002B0E0C03ABED  clone              Unknown  Unknown
Moreover, the crash remains even with all omp flags removed except the one around the outer loop in the main program. This should rule out the oversubscription problem as a possible cause. The only remedy is still to change the object class for sigma in object "ge". I can imagine that it more and more boils down to a library bug somewhere.
Thanks for any comment.
Thanks
Hi Jim and Kevin,
I changed to "-mkl=sequential" and removed "-static" with no luck. Still crashing. Interestingly it also crashes when removing all omp flags except those in the main program. This also rules out the oversubscription as a possible cause. The seg fault report was:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line     Source
a.out              0000000000489811  Unknown            Unknown  Unknown
a.out              000000000048794B  Unknown            Unknown  Unknown
a.out              0000000000450794  Unknown            Unknown  Unknown
a.out              00000000004505A6  Unknown            Unknown  Unknown
a.out              0000000000431B59  Unknown            Unknown  Unknown
a.out              0000000000406EC6  Unknown            Unknown  Unknown
libpthread-2.19.s  00002AFD84EB3330  Unknown            Unknown  Unknown
a.out              00000000004948FA  Unknown            Unknown  Unknown
a.out              0000000000403D47  Unknown            Unknown  Unknown
a.out              0000000000403C22  Unknown            Unknown  Unknown
libiomp5.so        00002AFD84BB0D13  __kmp_invoke_micr  Unknown  Unknown
libiomp5.so        00002AFD84B80B17  Unknown            Unknown  Unknown
libiomp5.so        00002AFD84B801C5  Unknown            Unknown  Unknown
libiomp5.so        00002AFD84BB1193  Unknown            Unknown  Unknown
libpthread-2.19.s  00002AFD84EAB184  Unknown            Unknown  Unknown
libc-2.19.so       00002AFD851BEBED  clone              Unknown  Unknown
compiler command was:
ifort -warn nounused -warn declarations -O3 -warn interface -mkl=sequential -qopenmp tmp.f90
It ran on a machine with 32 "Intel(R) Xeon(R) E5-2630 v3" processors and 256GB of RAM (note that with the omp flags removed the program will only have 5 threads). The only remedy is still to change the class.
Thanks for any comment.
In your code example you have:
this%Sigma%PreMult=>this%Mat1
this%Sigma%PostMult=>this%Mat2
Where Mat1 and Mat2 are allocatables, but nowhere have they been allocated. Your actual code may have performed the allocations.
A pointer can have a stride other than 1. While removing contiguous would resolve a misunderstanding in the Fortran code, it may present an issue with MKL, which requires contiguous array sections.
You could quickly test this out:
Subroutine SubMult(this)
  Implicit None
  Class(Sigma_Type_A), Intent(In) :: this
  Integer :: i
  real, allocatable :: tempPostMult(:), tempPremult(:)
  ! sanity checks on the array shapes
  if(size(this%PostMult,1) /= size(this%Mat,3)) stop
  if(size(this%PostMult,1) /= size(this%PreMult,1)) stop
  if(size(this%PostMult,2) /= size(this%PreMult,2)) stop
  Do i=1,size(this%Mat,3)
    ! copy the strided row into an explicit contiguous temporary
    tempPremult = this%PreMult(i,:)
    tempPostMult = matmul(tempPremult,this%Mat(:,:,i))
    this%PostMult(i,:)=tempPostMult
  End Do
End Subroutine SubMult
And do the same thing in the parallel regions using matmul (remembering to make private the temp arrays).
The intention of the code is to help identify (eliminate) potential conflicts, and not as a recommended solution.
Jim Dempsey
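The parallel variant of this diagnostic, with the temporary arrays made private, might look like the following. It is a self-contained sketch with made-up names and sizes (`pre`, `post`, `mat`, `tmp_pre`, `tmp_post`, `n`, `k`), not the code of the original program, and like the serial version it is meant for narrowing down the problem rather than as a recommended solution.

```fortran
program private_temporaries
  implicit none
  integer, parameter :: n = 8, k = 3              ! hypothetical sizes
  real :: pre(n,k), post(n,k), mat(k,k,n)
  real, allocatable :: tmp_pre(:), tmp_post(:)
  integer :: i

  call random_number(pre)
  call random_number(mat)

  !$omp parallel private(i, tmp_pre, tmp_post)
  ! Each thread allocates its own pair of buffers.
  allocate(tmp_pre(k), tmp_post(k))
  !$omp do
  do i = 1, n
     tmp_pre   = pre(i,:)                          ! gather the strided row
     tmp_post  = matmul(tmp_pre, mat(:,:,i))
     post(i,:) = tmp_post                          ! scatter the result back
  end do
  !$omp end do
  deallocate(tmp_pre, tmp_post)
  !$omp end parallel

  print *, 'max deviation from direct matmul: ', &
           maxval(abs(post(n,:) - matmul(pre(n,:), mat(:,:,n))))
end program private_temporaries
```

Allocating the buffers explicitly inside the parallel region avoids relying on automatic reallocation on assignment, which depends on compiler settings.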
Hi Jim,
Mat1 and Mat2 are allocated in lines 223 and 267 of the working example (in the SubInit routines of Struct_Type_A and Struct_Type_B). With regard to the contiguous attribute in the working example, I wondered about it in #15 and I am still not sure whether I am using it correctly. As written in #15, Mat1 and Mat2 have their first dimension starting at zero and ending at N, whereas the pointer is associated with 1 to N. From my understanding that should be OK for a pointer with the contiguous attribute.
Thanks
Jim, your example concerns the 3D matrix, which is perfectly fine. However, my actual solution to that problem is to reshape the 3D matrix into a 1D vector. The multiplication with a 2D matrix is then done by iterating over a product loop using columns of the 2D matrix and contiguous sections of the vectorized 3D matrix. Since the number of multiplications and additions remains the same, this gives a speed-up of between 0 and 20% depending on the hardware (if I remember correctly). For the sake of simplicity I have not posted that code here (it is more about linear algebra than about the compiler). If you are interested, let me know and I can get you a copy.