Hi,
one of my programs crashes when I run the threaded version. Running it inside gdb produced output that left me helpless:
[New LWP 397493]
Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 397493]
0x0000000001dad557 in _INTERNAL_25_______src_kmp_barrier_cpp_5de9139b::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) ()
gdb bt yielded:
#0 0x0000000001dad557 in _INTERNAL_25_______src_kmp_barrier_cpp_5de9139b::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) ()
#1 0x0000000001dae38b in __kmp_fork_barrier(int, int) ()
#2 0x0000000001d150c0 in __kmp_launch_thread ()
#3 0x0000000001d5d341 in _INTERNAL_26_______src_z_Linux_util_cpp_47afea4b::__kmp_launch_worker(void*) ()
#4 0x0000000001eb3ff7 in start_thread ()
#5 0x0000000001f2507b in clone ()
To give an idea of the structure of the program, a code snippet that mimics what the program does is given below. However, it is only an illustration; I have not tested whether the snippet produces the same segfault.
Module Mod_Root
  Implicit none
  Type :: root
  End type root
End Module Mod_Root

Module Mod_Sigma
  use Mod_Root, only: root
  Implicit None
  Type, abstract, extends(root) :: Sigma
    Real, Pointer, contiguous :: PreMult(:,:), PostMult(:,:)
  contains
    Procedure(SubMult), Pass, Public, Deferred :: Mult
  end type Sigma
  Abstract Interface
    Subroutine SubMult(this)
      Import Sigma
      Class(Sigma), Intent(In) :: this
    End Subroutine SubMult
  End Interface
  Private :: SubMult
End Module Mod_Sigma

Module Mod_Sigma_Type_A
  use Mod_Sigma, only: Sigma
  Type, extends(Sigma) :: Sigma_Type_A
    Real, Allocatable :: Mat(:,:,:)
  contains
    Procedure, Pass, Public :: Mult=>SubMult
  End type Sigma_Type_A
  Private :: SubMult
contains
  Subroutine SubMult(this)
    Implicit None
    Class(Sigma_Type_A), Intent(In) :: this
    Integer :: i
    Do i=1,size(this%Mat,3)
      this%PostMult(i,:)=matmul(this%PreMult(i,:),this%Mat(:,:,i))
    End Do
  End Subroutine SubMult
End Module Mod_Sigma_Type_A

Module Mod_Sigma_Type_B
  use Mod_Sigma, only: Sigma
  Type, extends(Sigma) :: Sigma_Type_B
    Real, Allocatable :: Mat(:,:)
  contains
    Procedure, Pass, Public :: Mult=>SubMult
  End type Sigma_Type_B
  Private :: SubMult
contains
  Subroutine SubMult(this)
    Implicit None
    Class(Sigma_Type_B), Intent(In) :: this
    this%PostMult=matmul(this%PreMult,this%Mat)
  End Subroutine SubMult
End Module Mod_Sigma_Type_B

Module Mod_Struct
  use Mod_Root, only: root
  use Mod_Sigma, only: sigma
  Type, extends(root), abstract :: Struct
    Class(Sigma), Allocatable :: Sigma
  Contains
    Procedure(SubMult), Public, Pass, Deferred :: Mult
  End type Struct
  Type :: StructPt
    Class(Struct), Pointer :: pt
  end type StructPt
  Abstract interface
    Subroutine SubMult(this)
      Import Struct
      Class(Struct), Intent(InOut), Target :: this
    end Subroutine SubMult
  End interface
End Module Mod_Struct

Module Mod_Struct_A
  use Mod_Struct
  Type, extends(Struct) :: Struct_Type_A
    Real, Allocatable :: Mat1(:,:), Mat2(:,:)
  Contains
    Procedure, Pass, Public :: Mult => SubMultSigma
  End type Struct_Type_A
  Private :: SubMultSigma
contains
  Subroutine SubMultSigma(this)
    Implicit None
    Class(Struct_Type_A), Intent(InOut), Target :: this
    this%Sigma%PreMult=>this%Mat1
    this%Sigma%PostMult=>this%Mat2
    call this%Sigma%Mult()
  End Subroutine SubMultSigma
End Module Mod_Struct_A

Program Test
  use Mod_Struct
  use Mod_Struct_A
  use Mod_Sigma_Type_A
  use Mod_Sigma_Type_B
  Type(Struct_Type_A), Target :: a, b
  Class(StructPt), Allocatable :: x(:)
  Integer :: i
  allocate(Sigma_Type_A::a%sigma)
  allocate(Sigma_Type_B::b%sigma)
  Allocate(x(2))
  x(1)%pt=>a; x(2)%pt=>b
  !$OMP PARALLEL DO PRIVATE(i)
  Do i=1,2
    call x(i)%pt%Mult()
  End Do
  !$OMP END PARALLEL DO
End Program Test
The segfault in my program occurs at a location similar to the call to x(i)%pt%Mult, but only if b%sigma has been allocated as type "Sigma_Type_B". If both a and b have been allocated as type "Sigma_Type_A", the program runs fine regardless of the size of the relevant arrays. Moreover, threaded or unthreaded, the program always runs when the arrays involved are small. However, when the arrays occupy up to 200GB of RAM and different type allocations are used, it crashes.
ifort version is 17.01, linux version is centos 7 kernel 3.10, stack size is set to unlimited, omp_stacksize to 32MB.
compiler flags were
-assume byterecl -warn nounused -warn declarations -O0 -static -check all -traceback -warn interface -check noarg_temp_created -mkl=parallel -qopenmp
No errors or warnings occurred at compile time or at run time. The program ran on a machine with 56 "Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz" processors and 512GB RAM.
Given the compiler flags I used and the results of running the program inside gdb, I am running out of ideas at this point. It would be great if someone from Intel could look into this. I could supply an executable and a data set that triggers the segfault.
Thanks a lot.
I'm not sure you can reconcile "contiguous" with
this%Sigma%MPreMult=>this%Mat1(1:this%dim1,:)
this%Sigma%MPostMult=>this%Mat2(1:this%dim1,:)
if the lower bound of Mat1 and Mat2 is zero. There's a gap between each column.
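The gap described here can be illustrated with a small standalone sketch. It is not taken from the original program: the names `mat`, `whole` and `section` and the tiny sizes are hypothetical, and it assumes a compiler that provides the Fortran 2008 intrinsic `is_contiguous`. The pointers are deliberately declared without the `contiguous` attribute, because associating a `contiguous` pointer with a non-contiguous section is not permitted.

```fortran
program contiguity_demo
  implicit none
  ! Hypothetical sizes, chosen only for illustration.
  integer, parameter :: dim1 = 3, dim2 = 2
  ! The first dimension starts at zero, as in the case discussed above.
  real, target  :: mat(0:dim1, dim2)
  real, pointer :: whole(:,:), section(:,:)

  mat = 0.0
  whole   => mat               ! every element of mat, in storage order
  section => mat(1:dim1, :)    ! skips row 0, so each column is separated by a gap

  ! The whole array is expected to be reported as contiguous,
  ! the section as non-contiguous.
  print *, 'whole array contiguous:    ', is_contiguous(whole)
  print *, 'section 1:dim1 contiguous: ', is_contiguous(section)
end program contiguity_demo
```

If the section is later passed where contiguous storage is required, a compiler may have to copy it into a temporary first, which is where extra stack usage can come from.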
Incidentally, you are not calling MKL, so the -mkl setting is irrelevant. You are calling the Fortran intrinsic matmul(). This won't be threaded unless you compile with -O3 -parallel or -qopt-matmul.
But most important, you need a much larger value of the thread stack size. I was able to build and run both variants successfully with 8 threads and OMP_STACKSIZE=5000M . I didn't try to determine optimum values.
Hi Martyn,
thanks for the comment. But I am struggling to understand why one version runs while the other fails. I assume that both have the same RAM, stack and OMP stack demands.
Thanks.
FWIW The OMP_STACKSIZE setting affects the additional OpenMP threads stack sizes but not the main thread stack size.
Jim Dempsey
Yes, but the main thread stack size was already set to unlimited. I found that the test program ran with OMP_STACKSIZE=2000M but failed with 1800M.
This style of modern Fortran code can sometimes result in the compiler making a lot of temporary array copies. For example, the "contiguous" keyword doesn't simply assert that an assumed shape array or pointer is contiguous; it requires the compiler to ensure that is so. If the compiler isn't sure that the assumed shape array or pointer will be contiguous in all circumstances, it will generate a temporary copy that is contiguous, normally on the stack. If this happens within an OpenMP thread, it goes on the thread stack. Plus any automatic objects that need to be private to a thread will go onto each thread stack. This may be why removing the "contiguous" keyword makes the program work. I think it plausible, though I don't know for sure, that the call involving "SPECIAL" is causing a temporary copy and so using more stack space. I note that the array size corresponding to your first printout is just under 2GB (57 * 4343921 * 8). It's possible that the compiler worries about the contiguity of the data in the special type case. But that's too hard to figure out from reading the code, or whether the compiler could do better. The warning about contiguity reported by Kevin, which comes from the source lines I called out in my previous post, is also an indication that something like this is going on.
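The size estimate quoted above can be checked with a few lines of Fortran. This is only a sketch of the arithmetic: the dimensions 57 and 4343921 and the 8 bytes per element are taken from the post, while the variable names are hypothetical. A 64-bit integer kind is used so the product cannot overflow a default integer.

```fortran
program temp_copy_size
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Dimensions and element size as quoted in the post above.
  integer(int64), parameter :: rows = 57_int64
  integer(int64), parameter :: cols = 4343921_int64
  integer(int64), parameter :: bytes_per_element = 8_int64
  integer(int64) :: nbytes

  ! Size of one full temporary copy of the array, in bytes.
  nbytes = rows * cols * bytes_per_element

  print '(a,i0)',   'bytes: ', nbytes
  ! The same size in units of 1024*1024 bytes.
  print '(a,f0.1)', 'MiB:   ', real(nbytes, dp) / 1024.0_dp**2
end program temp_copy_size
```

The product is 1,980,827,976 bytes, roughly 1889 MiB. That lies between the OMP_STACKSIZE=1800M setting that failed and the 2000M setting that ran, which is consistent with, though not proof of, a single temporary copy of this array landing on a thread stack.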
Incidentally, there is an alternative to increasing the thread stack size. If you compile with -heap-arrays, which causes temporary arrays to be allocated on the heap instead of on the stack, you don't need to increase OMP_STACKSIZE. The downside, for an OpenMP program, is that the synchronization required to keep all those allocations threadsafe can sometimes impact performance, especially if the number of threads is large.
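For temporaries that the programmer declares explicitly, a similar effect can be had in the source by using an allocatable work array instead of an automatic one. The sketch below shows that swapped-in technique, not what -heap-arrays does internally, and it does nothing about copies the compiler generates on its own. All names and sizes (`column_sum`, `work`, `nrows`, `ncols`) are hypothetical.

```fortran
program heap_work_array
  implicit none
  ! Hypothetical sizes, chosen only for illustration.
  integer, parameter :: nrows = 1000, ncols = 8
  real    :: total(ncols)
  integer :: j

  !$omp parallel do private(j)
  do j = 1, ncols
     total(j) = column_sum(nrows, j)
  end do
  !$omp end parallel do

  print *, total

contains

  function column_sum(n, j) result(s)
    integer, intent(in) :: n, j
    real :: s
    ! An automatic array such as "real :: work(n)" is typically placed on the
    ! stack of whichever thread calls this function. An allocatable array is
    ! allocated dynamically instead, so it does not grow the thread stack.
    real, allocatable :: work(:)
    integer :: i

    allocate(work(n))
    do i = 1, n
       work(i) = real(j)
    end do
    s = sum(work)
    ! work is deallocated automatically when the function returns.
  end function column_sum

end program heap_work_array
```

As with -heap-arrays, every call now performs a dynamic allocation, so the performance caveat about many threads allocating concurrently applies here too.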
Even at OMP_STACKSIZE=20M, it may be difficult to run a large number of threads (such as 180).
Also keep in mind that setting the program stack size to "unlimited" really means to set it to whatever maximum is configured in the kernel. It isn't really unlimited.