Segmentation fault when allocating (not so) large components of derived types

fercook · ‎01-14-2011

EDIT: I have tested the reproducer in a Linux box and it also throws a segmentation fault, I have taken this issue to Premier Support. Could somebody confirm the error, or enlighten me with what I am doing wrong?

Hi, I am running into a strange problem, maybe its just some silly option I must turn on.

I have derived types with allocatable components. If I try to allocate a large (but still very reasonable) amount of memory, I get a segmentation fault, or even a crash without warning. If I allocate the same amount of memory for a normal array I have no problems whatsoever. The critical value seems to be around 2^19 complex(8) elements, that is smaller than a 1000x1000 double precision complex matrix.

Here is a reproducer, perhaps its bloated because I didnt try hard to reduce it to bones, but it should do. I also do not have other machines to reproduce it on: the problem shows up on a 2009 MacBook Pro with 4GB of RAM, and the latest version of Composer XE.

Here is the output from the console:

[bash]$ ifort driver.f90 
$ ./a.out 
      523818
 Success with       523818
 Just a number to confuse the optimizer
 (-3.00000000000000,0.000000000000000E+000)
 Normal allocation is OK with       524288
Segmentation fault
[/bash]

And here is the code

[fortran]module Tensor_Class

  implicit none

  complex(8),parameter :: II=(0.0d0,1.0d0)

  type,private :: Tensor
  	private
  	integer :: Initialized=.false.
  contains
    procedure,public :: getDimensions => getDimensions_Of_Tensor
  end type Tensor

  type,public,extends(Tensor) :: Vector
  	private
  	complex(8),allocatable :: data(:)
  contains
  end type Vector

!###############################
!#####  Operators and methods
!###############################

  interface new_Tensor
     module procedure new_Vector
  end interface

  interface assignment (=)
     module procedure  new_Tensor_fromAssignment
  end interface

 contains

!######################################################################################
!#####                           Creation operators
!######################################################################################

   function new_Vector (dim1) result (this)
     integer,intent(in) :: dim1
     type(Vector) :: this
     real(8) :: randomnumbersR(dim1) ,randomnumbersI(dim1)
     if(this%Initialized) deallocate(this%data)
     allocate(this%data(dim1))

     Call random_number(randomnumbersR)
     call random_number(randomnumbersI)

     This%data=randomnumbersR+II*randomnumbersI

     this%Initialized=.true.

   end function new_Vector

 !##################################################################

   subroutine new_Tensor_fromAssignment(lhs,rhs)
     class(Vector),intent(out) :: lhs
     type(Vector),intent(in) :: rhs

     if(lhs%Initialized) deallocate(lhs%data)
     allocate(lhs%data(size(rhs%data,1)))
     lhs%data=rhs%data
     lhs%Initialized=.true.

   end subroutine new_Tensor_fromAssignment

!##################################################################
!#################   Polymorphic function         ################
!##################################################################

    function getDimensions_Of_Tensor(this) result(Dims)
        class(Tensor),intent(IN) :: this
!       type(Tensor1) :: Dims
        integer,allocatable :: Dims(:)

        select type (Typed_this => this)
            class is (Vector)
                allocate(Dims(1))
                Dims=shape(Typed_this%data)
            class is (Tensor)
                print *,'Dimensions not defined'
                return
        end select

   end function getDimensions_Of_Tensor

end module Tensor_Class
module Tensor_Class_fun

 use Tensor_Class

implicit none

 contains

 subroutine type_creation(dims)

  type(Vector) :: aVector
  integer :: dims,resultdims(1)

  CALL random_seed()
  aVector=new_Tensor(dims)
  resultdims=aVector%GetDimensions()
  print *,resultdims

 end subroutine type_creation

end module Tensor_Class_fun




program main

use Tensor_Class_fun

implicit none
integer :: dims
complex(8),allocatable:: localVector(:)

dims=2**19-470
call type_creation(dims)

Print *,'Success with ',dims

dims=2**19

allocate(localVector(dims))
localVector=1.0d0
localVector(1000)=localVector(dims)-5.0d0*localVector(1000)
print *,'Just a number to confuse the optimizer',localVector(1000)+localvector(2999)
Print *,'Normal allocation is OK with ',dims
deallocate(localVector) !Free mem to keep it clean

call type_creation(dims)

Print *,'Success?'

end program main
[/fortran]

fercook · ‎01-17-2011

I got super fast-and-friendly response from the Premier Support team (thanks Roland), who found that the problem seems to be in the size of the stack.

In Linux, you get rid of this problem by issuing 'ulimit -s unlimited', but on the Mac there is a hard 64MB limit in the kernel that you cannot overcome. The only workaround is to compile using the -heap-arrays option, which makes the compiler use the heap instead of the stack for storing temporary arrays. With this option, the problem goes away (seehttp://software.intel.com/en-us/articles/intel-fortran-compiler-increased-stack-usage-of-80-or-higher-compilers-causes-segmentation-fault/ )

I asked the Support team, but I also wanted to ask here: Is there too high a performance penalty for using the heap instead of the stack? Or, more important for my project, is it bad that I am using too much stack space?

Reinspecting my code, I see that I create the new object through an assignment, which probably creates a temporary (thus using the stack). Am I right? The main reason to do this was code cleanliness, but I am totally willing to get rig of it for performance reasons...so my questions are:

If I created new objects via a subroutine, would I get rid of the stack usage? And if so, would that make the program faster? I know I could just do it and test, but it is some Klines of code that I have to change, so I would appreciate a lot advice from somebody more experienced on this.

On the other hand, if using GBs of stack is ok under Linux, I can live with the code as is, since the production program will end up in a Linux cluster, while I write and test on the Mac using -heap-arrays.

TimP · ‎01-17-2011

Quoting fercook

Is there too high a performance penalty for using the heap instead of the stack? Or, more important for my project, is it bad that I am using too much stack space?
Reinspecting my code, I see that I create the new object through an assignment, which probably creates a temporary (thus using the stack). Am I right? The main reason to do this was code cleanliness, but I am totally willing to get rig of it for performance reasons...so my questions are:
If I created new objects via a subroutine, would I get rid of the stack usage? And if so, would that make the program faster? I know I could just do it and test, but it is some Klines of code that I have to change, so I would appreciate a lot advice from somebody more experienced on this.
On the other hand, if using GBs of stack is ok under Linux, I can live with the code as is, since the production program will end up in a Linux cluster, while I write and test on the Mac using -heap-arrays.

-heap-arrays:n (heap allocation beginning with size n) should help avoid performance problems with frequent heap allocation of small arrays, in those cases where the compiler knows the size of allocations.
Heap allocation is expected to cost more time than stack, but should not have significant cost if done outside inner loops. You want to avoid forcing users of your program unnecessarily to deal with stack settings.
Threading compatible options, such as RECURSIVE procedure declaration, or compile options with equivalent effect (-openmp, -auto, ....) switch implied allocation of local arrays upon procedure entry from heap to stack. This is required for thread safety. So you can use subroutine entry creation of objects to imply heap only if the subroutine will not undergo threading, thus there may be bigger performance implications. To use, cluster computing leads to considering threading as well.

martinotte · ‎01-20-2011

Hi,

On Apple OSX, you can increase the stack size of an executable at compile-time using the -stack_size argument to the linker. For example:

ifort-Wl,-stack_size -Wl,0x10000000 driver.f90

would set the stack size to 1GB. If you are compiling a 64-bit executable, you can go much larger:

ifort-Wl,-stack_size -Wl,0x40000000 driver.f90

would set the stack size to 4 GB. No need to use slower heap arrays,

Martin Otte