Solved: stack overflow even with allocatable array

Stephen_W_ · ‎09-04-2017

I don't understand why the following program throws a stack overflow, since all the large arrays are allocatable. This is just an example: y = sum(u) works, so why does the partial sum fail?

program test
    implicit none
    integer, parameter :: n = 1000000, m = 10
    real(8), allocatable, dimension(:,:) :: x
    real(8) :: y
    allocate(x(n,m))
    x = 1.0d0
    call tt(x,y,n,m)
    print *, y
end

subroutine tt(x,y,n,m)
    implicit none
    integer, intent(in) :: n, m
    real(8), intent(in), dimension(n,m) :: x
    real(8), intent(out) :: y
    real(8), allocatable, dimension(:) :: z
    real(8), allocatable, dimension(:,:) :: u
    allocate(z(n),u(n,m))
    u = exp(x)
    z = sum(u,2)

    y = sum(z)

    return
end

Many thanks

mecej4 · ‎09-04-2017

See the description of the effects of using the /assume:norealloc_lhs option at https://software.intel.com/en-us/node/678232 .

Stephen_W_ · ‎09-04-2017

mecej4 wrote:

See the description of the effects of using the /assume:norealloc_lhs option at https://software.intel.com/en-us/node/678232 .

It works when compiled with /assume:norealloc_lhs. But since z was already allocated with the correct shape and size, why does it throw a stack overflow? Could you show me the details?

mecej4 · ‎09-05-2017

This is what the documentation (link given in #2) says:

Option standard-realloc-lhs (the default), tells the compiler that when the left-hand side of an assignment is an allocatable object, it should be reallocated to the shape of the right-hand side of the assignment before the assignment occurs. This is the current Fortran Standard definition. This feature may cause extra overhead at run time. This option has the same effect as option assume realloc_lhs.

I think that what happens is that a temporary array is allocated on the stack to hold the array-valued expression to the right of the '=' in the assignment statement. Under F2008 rules, the variable in question is deallocated, reallocated with the correct size (it is irrelevant whether or not the previous size was already correct), the temporary array is copied to the newly allocated variable, and the temporary array is marked for possible deletion during a subsequent garbage collection.

Only the compiler authors can tell us the details, and users are discouraged from asking for such details (because a revision of the compiler can make the response invalid).

Lorri_M_Intel · ‎09-05-2017

mecej4 is correct; the compiler does, in fact, create a temp for the SUM call at line 21 of the original program (z = sum(u,2)), and it creates that temp on the stack by default. You can override the "create on the stack" behavior by using the /heap-arrays command line switch.

--Lorri

jimdempseyatthecove · ‎09-05-2017

>>it creates the temp on the stack by default

IMHO this is a deficiency in the compiler. When the output array has the correct shape for the output of SUM(array,dim), then no temporary should be created at all. The same should hold true for conforming array expressions. IOW when it is not necessary to perform a reallocation, the deallocation & allocation should be bypassed. Performing the deallocation & allocation in this case is an unnecessary trip through a critical section (twice). In a multi-threaded application this potentially creates a bottleneck.

I suggest that your compiler optimization team construct a multi-threaded test (say, on a 256-thread system) in which reallocation of the left-hand side is performed.

Jim Dempsey

JVanB · ‎09-05-2017

I noticed that if you make array Z fixed shape the problem doesn't occur, but if you try

z(:) = sum(u,2)

stack overflow still happens on ifort 16.0. To me, this behavior is an issue because assigning to an array section should turn off the effects of /assume:realloc_lhs for this assignment statement, so no temporary array should be necessary.

I don't see where the Fortran standard requires deallocation of the variable. It does of course say that if the variable and its subobjects don't all have the same dynamic type and shape as the expression, deallocation must occur. Does this wording in the standard mean that if a complete dynamic type and shape match occurs, no deallocation should happen? I tried also looking at the section on pointer association and it didn't say that any pointers whose target was the allocatable variable before the assignment would have undefined association status in the case where deallocation is not necessary. It didn't say that such pointers remained associated in this case, either.

You can't necessarily tell until you have evaluated the expression whether deallocation might be necessary. Consider the case where the expression has a reference to a function with an allocatable result or a reference to a transformational intrinsic with shape determined by expressions that must be evaluated at runtime. This makes it harder for the compiler to always use the storage space of the variable to build the result of the expression.
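To illustrate the point about shapes that are only known at run time, here is a minimal sketch. The module name runtime_shape_demo and the function first_squares are hypothetical, made up for this example, and it assumes Fortran 2003 automatic reallocation is in effect (for ifort, the realloc_lhs behavior discussed above).

```fortran
module runtime_shape_demo
    implicit none
contains
    ! Hypothetical helper: the size of the result depends on the
    ! run-time value of k, so the caller cannot know it in advance.
    function first_squares(k) result(r)
        integer, intent(in) :: k
        real(8), allocatable :: r(:)
        integer :: i
        allocate(r(k))
        do i = 1, k
            r(i) = real(i*i, 8)
        end do
    end function first_squares
end module

program realloc_demo
    use runtime_shape_demo
    implicit none
    real(8), allocatable :: z(:)
    integer :: k

    allocate(z(3))
    k = 5
    ! The shapes differ (3 versus 5), so under Fortran 2003 rules z is
    ! reallocated to the shape of the function result.
    z = first_squares(k)
    print *, size(z)
end program
```

Since the compiler cannot see the result shape before first_squares returns, it has to keep the result somewhere other than z until it knows whether z must be reallocated. That is presumably one reason a general-purpose code path uses a temporary.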

JVanB · ‎09-05-2017

In spite of the issues noted in my last post, I found a syntax that avoids a temporary array:

program test
    implicit none
    integer, parameter :: n = 1000000, m = 10
    real(8), allocatable, dimension(:,:) :: x
    real(8) :: y
    allocate(x(n,m))
    x = 1.0d0
    call tt(x,y,n,m)
    print *, y
end

subroutine tt(x,y,n,m)
    implicit none
    integer, intent(in) :: n, m
    real(8), intent(in), dimension(n,m) :: x
    real(8), intent(out) :: y
    real(8), allocatable, dimension(:) :: z
    real(8), allocatable, dimension(:,:) :: u
    allocate(z(n),u(n,m))
    u = exp(x)
!    z = sum(u,2)
    call ImLike(z,u,n,m)

    y = sum(z)

    return

    contains
        subroutine ImLike(a,b,i,j)
            integer i, j
            real(kind(z)) a(i)
            real(kind(u)) b(i,j)
            a = sum(b,2)

            return
        end subroutine ImLike
end

Stephen_W_ · ‎09-17-2017

Compiling with /nostandard-realloc-lhs works, but when combined with the /Qopenmp option, it throws a stack overflow again.

Steve_Lionel · ‎09-17-2017

/nostandard-realloc-lhs has no effect whatsoever on stack usage. All it does is disable a Fortran 2003 feature to do automatic (re)allocation of allocatable arrays in an assignment, assuming that you have already allocated it properly. It has no effect on use of temporaries and just adds a bit of performance to apps that don't need the reallocation check.

OpenMP imposes its own demands on stack usage, not all of which can be helped with /heap-arrays (though that does help some.) You may need to play with both stack reserve size and the OMP_STACKSIZE environment variable if you encounter stack overflows in OpenMP applications.

John_Campbell · ‎09-18-2017

y = SUM (u) also works, without the need for "z"

Stephen_W_ · ‎09-20-2017

Sure, this is just an example to show something like sum(x,2).

John Campbell wrote:

y = SUM (u) also works, without the need for "z"

John_Campbell · ‎09-24-2017

As a more serious question: would ifort, by default, create temporary copies of u and z for the following code example?

subroutine tt(x,y,n,m)
    implicit none
    integer, intent(in) :: n, m
    real(8), intent(in), dimension(n,m) :: x
    real(8), intent(out) :: y
    real(8), allocatable, dimension(:) :: z
    real(8), allocatable, dimension(:,:) :: u

    allocate (u(n,m))
    u = exp(x)

    allocate (z(n))
    z = sum(u,2)

    y = sum(z)

    return
end

If it would, this indicates a problem with any F90 approach to using ALLOCATE.

I would have hoped that if the last action on the array was an ALLOCATE, then creating a temporary copy would not be considered. It is not a big leap to extend this to the case where the size of the array does not change.

Or am I misunderstanding this thread?

jimdempseyatthecove · ‎09-24-2017

To carry John's question one step further:

Suppose "u = exp(x)" does indeed both create a temporary and perform a reallocation of the left-hand side. What happens then with:

allocate(u(1-1234:n-1234,1-5678:m-5678))
u = exp(x)

Then what are the bounds of the subscripts???

The excess of all good things is mischievous

Jim Dempsey

Steve_Lionel · ‎09-24-2017

The compiler, theoretically, could generate a separate code path that avoids a temp, but I am fairly certain ifort doesn't do that.

As for Jim's question, the result of exp(x) always has 1 as the lower bound for each dimension, the same as for any array expression. The shape (rank and extents) matches the argument. This is how it must be, given that the argument could be an array section or have vector subscripts.
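The rule about lower bounds can be checked directly with LBOUND and UBOUND. This is a small sketch with made-up declared bounds (0:4 and -2:1), chosen only for illustration:

```fortran
program expr_bounds
    implicit none
    real(8) :: x(0:4, -2:1)

    x = 1.0d0

    ! Whole array: LBOUND reports the declared lower bounds (0 and -2).
    print *, lbound(x, 1), lbound(x, 2)

    ! Array expression: lower bounds are 1 in every dimension.
    print *, lbound(exp(x), 1), lbound(exp(x), 2)

    ! The shape is unchanged, so the upper bounds equal the extents (5 and 4).
    print *, ubound(exp(x), 1), ubound(exp(x), 2)

    ! An array section behaves like an expression: lower bound 1.
    print *, lbound(x(2:4, :), 1)
end program
```

Only a whole array carries its declared bounds. Sections and expressions are described by shape alone, which is why SHAPE, not the bounds, decides whether the left-hand side must be reallocated.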

jimdempseyatthecove · ‎09-24-2017

Steve,

My concern is not that the result of exp (prior to the =) has subscripts starting at 1, but rather that, due to an unnecessary reallocation of the left-hand side (rls), the resulting array gets re-based. In older code this did not happen. While my example subscripts may have been silly, the coder may reasonably want to use 0-based indexing.

allocate(u(0:n-1,0:m-1))
u = exp(x)

IMHO too many of your compiler engineering optimization strategies are based on an assumption that these allocations are stack based (absent of critical section) as opposed to heap based (with critical section). It looks like they may have taken a shortcut and repurposed MOVE_ALLOC(exp(x), u).

Jim Dempsey

Steve_Lionel · ‎09-24-2017

Jim, would you please show a complete example with output that illustrates your point? I'm not getting it.

The standard says that in the case you show, the result of exp(x) has a lower bound of 1 for each dimension. However, in the assignment to u (in your example), if the SHAPE of u matches the SHAPE of exp(x), u does not get reallocated and whatever lower bounds it had before remain the same. Keep in mind that SHAPE is rank (number of dimensions) and extents (number of elements in each dimension); the lower bound doesn't enter into it. There is no MOVE_ALLOC done; it's a copy of data. A sufficiently clever compiler could store the exp values directly into the result (after reallocation if required) rather than creating a temp first and then doing an array copy. I don't know if ifort is there yet.

However, if u has a SHAPE different from exp(x), then it will get reallocated with 1 as the lower bound for each dimension.

You're correct that the default is to use the stack for temps, because it's faster. You're also correct that this is problematic for large temps, which is why I have promoted the use of /heap-arrays for years, and argued in favor of making that the default.

jimdempseyatthecove · ‎09-24-2017

Steve,

The compiler is actually doing what I believe is correct given the circumstances. Here is a test:

subroutine tt(x,y,n,m)
    implicit none
    integer, intent(in) :: n, m
    real(8), intent(in), dimension(n,m) :: x
    real(8), intent(out) :: y
    real(8), allocatable, dimension(:,:) :: u

    allocate (u(0:n-1,0:m-1))
    print *,lbound(u,1),lbound(u,2)
    u = exp(x)
    print *,lbound(u,1),lbound(u,2)
    deallocate(u)
    u = exp(x)
    print *,lbound(u,1),lbound(u,2)
    deallocate(u)
    allocate (u(0:m-1,0:n-1))   ! non-matching sizes
    u = exp(x)
    print *,lbound(u,1),lbound(u,2)
    return
end

program rls_issue
    implicit none
    integer, parameter :: n=11
    integer, parameter :: m=22
    real(8) :: x(n,m), y
    call RANDOM_NUMBER(x)
    call tt(x,y,n,m)
    print *,y
end program rls_issue

Output:
 0 0
 0 0
 1 1
 1 1
 -9.255963134931783E+061

As I read this output, the behavior is correct given the circumstances:

Reallocation (and re-basing of the indexes) did not occur when reallocation of the left-hand side was not required.

In the last case, where reallocation was required, the 0-based indexing was not maintained, but I had no expectation that it should be (in that case).

I haven't tested V18 since I haven't installed it yet; there appear to be issues with unallocated arrays.

Jim Dempsey

Steve_Lionel · ‎09-24-2017

Version 18 gives the same result for the bounds. Your subroutine never assigns to y, so its value is undefined.

John_Campbell · ‎09-24-2017

Does ifort 2018 use the stack for any of the following examples?
I would consider most stack overflow errors to be a compiler bug, or at least a sign that the compiler is not smart enough.

subroutine jim (x,y,n,m)
    implicit none
    integer, intent(in) :: n, m
    real(8), intent(in), dimension(n,m) :: x
    real(8), intent(out) :: y
    real(8), allocatable, dimension(:,:) :: u

    print *,'jim'
    allocate (u(0:n-1,0:m-1))
    print *,lbound(u,1),lbound(u,2),ubound(u,1),ubound(u,2)
    u = exp(x)
    print *,lbound(u,1),lbound(u,2),ubound(u,1),ubound(u,2)
    deallocate(u)

    u = exp(x)
    print *,lbound(u,1),lbound(u,2),ubound(u,1),ubound(u,2)
    deallocate(u)

    allocate (u(0:m-1,0:n-1))   ! non-matching sizes
    u = exp(x)
    print *,lbound(u,1),lbound(u,2),ubound(u,1),ubound(u,2)
    y = sum (u)
    return
end subroutine jim

subroutine john (x,y,n,m)
    implicit none
    integer, intent(in) :: n, m
    real(8), intent(in), dimension(n,m) :: x
    real(8), intent(out) :: y
    real(8), allocatable, dimension(:) :: z
    real(8), allocatable, dimension(:,:) :: u

    print *,'john'
    allocate (u(n,m))
    u = exp(x)
    print *,lbound(u,1),lbound(u,2),ubound(u,1),ubound(u,2)

    allocate (z(n))
    z = sum(u,2)
    print *,lbound(z,1),ubound(z,1)

    y = sum(z)

    return
end subroutine john

subroutine default (x,y,n,m)
    implicit none
    integer, intent(in) :: n, m
    real(8), intent(in), dimension(n,m) :: x
    real(8), intent(out) :: y
    real(8), allocatable, dimension(:) :: z
    real(8), allocatable, dimension(:,:) :: u

    print *,'default'
    u = exp(x)
    print *,lbound(u,1),lbound(u,2),ubound(u,1),ubound(u,2)

    z = sum(u,2)
    print *,lbound(z,1),ubound(z,1)

    y = sum(z)

    return
end subroutine default

program rls_issue
    implicit none
    integer, parameter :: n=11000
    integer, parameter :: m=2200
    real(8) :: x(n,m), y

    call RANDOM_NUMBER(x)
    call jim (x,y,n,m)
    print *,y

    call john (x,y,n,m)
    print *,y

    call default (x,y,n,m)
    print *,y

end program rls_issue

!Output:
! jim
!           0           0       10999        2199
!           0           0       10999        2199
!           1           1       11000        2200
!           1           1       11000        2200
!   41579331.634002581     
! john
!           1           1       11000        2200
!           1       11000
!   41579331.634005338     
! default
!           1           1       11000        2200
!           1       11000
!   41579331.634005338

Steve_Lionel · ‎09-25-2017

The only stack temp usage I spotted was for the assignments of sum(u,2). Everything else seemed to be done "in place".
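Since the sum(u,2) assignment is the one place where a stack temp was reported, one portable way to sidestep it is to accumulate the row sums with explicit loops. This is a sketch, not something proposed in the thread: the module name row_sum_demo and the routine row_sums are hypothetical, and the assumption is that a loop made up only of scalar operations gives the compiler no array-valued expression to hold in a temporary.

```fortran
module row_sum_demo
    implicit none
contains
    ! Hypothetical replacement for z = sum(u,2): every operation inside
    ! the loops is scalar, so no array-valued result has to be stored.
    subroutine row_sums(u, z, n, m)
        integer, intent(in) :: n, m
        real(8), intent(in) :: u(n,m)
        real(8), intent(out) :: z(n)
        integer :: i, j

        z = 0.0d0
        do j = 1, m          ! inner loop runs over the first index,
            do i = 1, n      ! matching Fortran's column-major storage
                z(i) = z(i) + u(i,j)
            end do
        end do
    end subroutine row_sums
end module

program loop_demo
    use row_sum_demo
    implicit none
    integer, parameter :: n = 1000000, m = 10
    real(8), allocatable :: u(:,:), z(:)

    allocate(u(n,m), z(n))
    u = 1.0d0
    call row_sums(u, z, n, m)
    print *, z(1), z(n)      ! each row holds ten ones, so each sum is 10
end program
```

This trades the compactness of the intrinsic for predictable memory use. Whether it is faster than sum(u,2) with /heap-arrays depends on the compiler and options, so it is worth measuring rather than assuming.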