Solved: Stack overflow on array copy

Benedikt_R_ · ‎01-29-2015

Hi

I'm using "Intel(R) Visual Fortran Compiler XE for applications running on IA-32, Version 15.0.0.108 Build 20140726"

My program crashes with a stack overflow. On standard error there's a stack trace pointing to a line of code where part of an array is copied:

          arr(i0: j0)= arr(i1: j1)

I read somewhere that the compiler has to create a temporary copy of the copied portion because it cannot determine whether the source and target memory overlap (https://software.intel.com/en-us/node/524873).

Actually, I do know that they do not overlap. Is there a way to give this "promise" to the compiler, to force it to generate code that does not make a copy?

Benedikt

JVanB · ‎01-29-2015

      DO CONCURRENT (I = i0:j0)
         arr(I) = arr(i1-i0+I)
      END DO
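
For anyone who wants to try this in isolation, here is a minimal sketch that wraps the loop in a subroutine. The module and routine names (section_copy, copy_disjoint) and the runtime overlap check are assumptions added for illustration; they are not from the original post. DO CONCURRENT is only a promise by the programmer that the iterations are independent, so the check is there to catch a broken promise early.

```fortran
module section_copy
   implicit none
contains
   ! Copy arr(i1:i1+(j0-i0)) onto arr(i0:j0) without an array temporary.
   ! The caller promises that the two sections do not overlap.
   subroutine copy_disjoint(arr, i0, j0, i1)
      integer, intent(inout) :: arr(:)
      integer, intent(in)    :: i0, j0, i1
      integer :: i, j1

      j1 = i1 + (j0 - i0)
      ! Hypothetical safety net: stop if a non-empty range overlaps its source.
      if (j0 >= i0 .and. i0 <= j1 .and. i1 <= j0) then
         error stop 'copy_disjoint: sections overlap'
      end if

      do concurrent (i = i0:j0)
         arr(i) = arr(i1 - i0 + i)
      end do
   end subroutine copy_disjoint
end module section_copy

program demo_do_concurrent
   use section_copy
   implicit none
   integer :: k
   integer :: arr(10)

   arr = [(k, k = 1, 10)]
   call copy_disjoint(arr, 1, 3, 6)      ! same effect as arr(1:3) = arr(6:8)
   print '(*(i0,:,1x))', arr             ! 6 7 8 4 5 6 7 8 9 10
end program demo_do_concurrent
```

Whether the compiler really skips the temporary for this form is something to confirm for your own compiler version and options.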

Steven_L_Intel1 · ‎01-29-2015

I am not aware of ways to tell the compiler to avoid the copy in this case. I suggest compiling with /heap-arrays (Fortran > Optimization > Heap Arrays > 0)

Steven_L_Intel1 · ‎01-29-2015

The DO CONCURRENT helps the compiler decide that the loop is safe to vectorize.

FortranFan · ‎01-29-2015

The following should also work:

      Assoc_Arr: ASSOCIATE ( tmp1 => array(i0:j0), tmp2 => array(i1:j1) )
         tmp1 = tmp2
      END ASSOCIATE Assoc_Arr
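
A self-contained version of the same idea, as a sketch only: the names copy_via_associate, dst and src are made up for this example, and whether ASSOCIATE actually lets the compiler drop the temporary is compiler-dependent, not something the language guarantees. The demo uses sections that do not overlap and checks the result.

```fortran
module associate_copy
   implicit none
contains
   ! Copy array(i1:j1) onto array(i0:j0) through two associate names.
   ! Only meaningful when the two sections have the same size and
   ! do not overlap.
   subroutine copy_via_associate(array, i0, j0, i1, j1)
      real,    intent(inout) :: array(:)
      integer, intent(in)    :: i0, j0, i1, j1

      associate (dst => array(i0:j0), src => array(i1:j1))
         dst = src
      end associate
   end subroutine copy_via_associate
end module associate_copy

program demo_associate
   use associate_copy
   implicit none
   integer, parameter :: n = 10
   integer :: k
   real    :: array(n)

   array = [(real(k), k = 1, n)]
   call copy_via_associate(array, 1, 4, 6, 9)   ! array(1:4) = array(6:9)

   if (any(array(1:4) /= array(6:9))) error stop 'copy mismatch'
   print '(*(f4.1,:,1x))', array
end program demo_associate
```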

JVanB · ‎01-29-2015

Hmmm... I didn't think that ASSOCIATE had rules about aliasing like procedures do. Let's try an experiment:

program P
   implicit none
   integer arr(10)
   integer i
   integer, parameter :: arr0(size(arr)) = [(i,i=1,size(arr))]
   integer i0,j0,i1,j1

   write(*,'(a)') 'Array assignment'
   arr = arr0
   call set1(i0,j0,i1,j1)
   arr(i0:j0) = arr(i1:j1)
   write(*,5) arr

   arr = arr0
   call set2(i0,j0,i1,j1)
   arr(i0:j0) = arr(i1:j1)
   write(*,5) arr

   write(*,'(a)') 'Associate'
   arr = arr0
   call set1(i0,j0,i1,j1)
   associate(T0 => arr(i0:j0), T1 => arr(i1:j1))
      T0 = T1
   end associate
   write(*,5) arr

   arr = arr0
   call set2(i0,j0,i1,j1)
   associate(T0 => arr(i0:j0), T1 => arr(i1:j1))
      T0 = T1
   end associate
   write(*,5) arr

   write(*,'(a)') 'Subroutine'
   arr = arr0
   call set1(i0,j0,i1,j1)
   call copy(arr(i0:j0),arr(i1:j1),j0-i0+1)
   write(*,5) arr

   arr = arr0
   call set2(i0,j0,i1,j1)
   call copy(arr(i0:j0),arr(i1:j1),j0-i0+1)
   write(*,5) arr

   write(*,'(a)') 'Forall'
   arr = arr0
   call set1(i0,j0,i1,j1)
   forall(i=i0:j0) arr(i) = arr(i1-i0+i)
   write(*,5) arr

   arr = arr0
   call set2(i0,j0,i1,j1)
   forall(i=i0:j0) arr(i) = arr(i1-i0+i)
   write(*,5) arr

   write(*,'(a)') 'Do concurrent'
   arr = arr0
   call set1(i0,j0,i1,j1)
   do concurrent(i=i0:j0)
      arr(i) = arr(i1-i0+i)
   end do
   write(*,5) arr

   arr = arr0
   call set2(i0,j0,i1,j1)
   do concurrent(i=i0:j0)
      arr(i) = arr(i1-i0+i)
   end do
   write(*,5) arr

5 format(*(i0:1x))
   contains
      subroutine copy(arr0,arr1,n)
         integer n
         integer arr0(n),arr1(n)
         arr0 = arr1
      end subroutine copy
end program P

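subroutine set1(i0,j0,i1,j1)
   implicit none
   integer, intent(out) :: i0,j0,i1,j1
   i0 = 3
   j0 = 5
   i1 = 4
   j1 = 6
end subroutine set1

subroutine set2(i0,j0,i1,j1)
   implicit none
   integer, intent(out) :: i0,j0,i1,j1
   i0 = 5
   j0 = 7
   i1 = 4
   j1 = 6
end subroutine set2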

Output with ifort:

Array assignment
1 2 4 5 6 6 7 8 9 10
1 2 3 4 4 5 6 8 9 10
Associate
1 2 4 5 6 6 7 8 9 10
1 2 3 4 4 4 4 8 9 10
Subroutine
1 2 4 5 6 6 7 8 9 10
1 2 3 4 4 4 4 8 9 10
Forall
1 2 4 5 6 6 7 8 9 10
1 2 3 4 4 5 6 8 9 10
Do concurrent
1 2 4 5 6 6 7 8 9 10
1 2 3 4 4 4 4 8 9 10

So ASSOCIATE has the same result as the subroutine and DO CONCURRENT which do have aliasing rules. I get similar results for gfortran. Where does it talk about aliasing rules for the ASSOCIATE construct in the standard?
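
One way to read the two different second lines above, offered as an illustration and not as a statement about the code ifort actually generates: "1 2 3 4 4 5 6 8 9 10" is what you get when the right-hand side is fully evaluated before anything is stored, and "1 2 3 4 4 4 4 8 9 10" is what a plain front-to-back element copy produces when the target starts one element after the source. The sketch below reproduces both with ordinary, conforming code; the routine names are hypothetical.

```fortran
module overlap_demo
   implicit none
contains
   ! Array-assignment semantics: read the whole source first, then store.
   subroutine copy_with_temp(arr, i0, j0, i1)
      integer, intent(inout) :: arr(:)
      integer, intent(in)    :: i0, j0, i1
      integer, allocatable   :: tmp(:)

      allocate(tmp(j0 - i0 + 1))
      tmp = arr(i1:i1 + (j0 - i0))
      arr(i0:j0) = tmp
   end subroutine copy_with_temp

   ! Front-to-back element copy with no temporary.
   subroutine copy_forward(arr, i0, j0, i1)
      integer, intent(inout) :: arr(:)
      integer, intent(in)    :: i0, j0, i1
      integer :: i

      do i = i0, j0
         arr(i) = arr(i1 - i0 + i)
      end do
   end subroutine copy_forward
end module overlap_demo

program demo_overlap
   use overlap_demo
   implicit none
   integer :: k
   integer :: a(10), b(10)

   a = [(k, k = 1, 10)]
   b = a

   ! Same indices as set2: target 5:7, source 4:6.
   call copy_with_temp(a, 5, 7, 4)   ! a becomes 1 2 3 4 4 5 6 8 9 10
   call copy_forward(b, 5, 7, 4)     ! b becomes 1 2 3 4 4 4 4 8 9 10

   print '(a,*(1x,i0))', 'with temporary:', a
   print '(a,*(1x,i0))', 'forward loop:  ', b
end program demo_overlap
```

With the set1 indices (target 3:5, source 4:6) both routines give 1 2 4 5 6 6 7 8 9 10, because a forward copy never reads an element it has already overwritten when the source lies after the target.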

FortranFan · ‎01-29-2015

Repeat Offender wrote:

...

So ASSOCIATE has the same result as the subroutine and DO CONCURRENT which do have aliasing rules. I get similar results for gfortran. Where does it talk about aliasing rules for the ASSOCIATE construct in the standard?

I assume you've looked at J3/04-007, the working draft of the Fortran 2003 standard dated May 10, 2004: sections 8.1.4 and 16.4.1.5. I have a tough time reading these standards documents, so it's The Fortran 2003 Handbook by Adams et al. to the rescue: section 8.2.2 of this book says about the association during the execution of the ASSOCIATE construct, "This process is somewhat similar to what happens in a procedure call with the associate name taking the role of the dummy argument." In the context of this thread, I personally would prefer ASSOCIATE over DO CONCURRENT.

andrew_4619 · ‎01-31-2015

Should you compile with /Qparallel, or is that implicit in /Qmkl? If you're not enabling auto-parallelization, I believe there is no benefit to DO CONCURRENT.

FortranFan · ‎01-31-2015

Here's another test using floating point variables you can consider:

   PROGRAM p

      USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY : I4 => INT32, DP => REAL64

      !..
      IMPLICIT NONE

      !..
      INTEGER(I4), PARAMETER :: MAXARR = 2**28
      INTEGER(I4), PARAMETER :: MAXREPEAT = 5
      INTEGER(I4) :: i0
      INTEGER(I4) :: j0
      INTEGER(I4) :: i1
      INTEGER(I4) :: j1
      INTEGER(I4) :: Istat
      INTEGER(I4) :: I
      INTEGER(I4) :: Counter
      INTEGER(I4) :: Overlap
      REAL(DP), PARAMETER :: EPSILON_DP = EPSILON(1.0_dp)
      REAL(DP), ALLOCATABLE :: array(:)
      REAL(DP) :: Start_Time = 0.0_dp
      REAL(DP) :: End_Time = 0.0_dp
      REAL(DP) :: CpuTimes_Assign(MAXREPEAT)
      REAL(DP) :: CpuTimes_DoConcurrent(MAXREPEAT)
      REAL(DP) :: CpuTimes_Associate(MAXREPEAT)
      CHARACTER(LEN=*), PARAMETER :: FMT_CPU = "(A, T40, F8.3, A)"
      CHARACTER(LEN=2048) :: ErrorAlloc

      !..
      ALLOCATE(array(MAXARR), SOURCE=0.0_dp, STAT=Istat, ERRMSG=ErrorAlloc)
      IF (Istat /= 0) THEN
         PRINT *, " Allocation of array failed: ", ErrorAlloc(1:LEN_TRIM(ErrorAlloc))
         STOP
      END IF

      !..
      Overlap = 0
      PRINT *, " ** With Overlap of ", Overlap
      i0 = 1
      j0 = MAXARR/2
      i1 = j0 + 1 - Overlap
      j1 = i1 + MAXARR/2 - 1

      CpuTimes_Assign = 0.0_dp
      CpuTimes_DoConcurrent = 0.0_dp
      CpuTimes_Associate = 0.0_dp

      PRINT *, "Array assignment:"
      Loop_Repeat_Assign: DO Counter = 1, MAXREPEAT

         PRINT *, "   Trial ", Counter

         !.. Initialize the array using random numbers
         CALL RANDOM_NUMBER(array)

         !..
         CALL CPU_TIME(Start_Time)

         !..
         array(i0:j0) = array(i1:j1)

         CALL CPU_TIME(End_Time)

         !..
         CpuTimes_Assign(Counter) = (End_Time - Start_Time)

         !..
         IF (ABS(array(j0)-array(j1)) > EPSILON_DP) THEN
            PRINT *, " Copy failed."
            CYCLE Loop_Repeat_Assign
         END IF

         IF (Counter == 1) THEN
            PRINT *, " array(i0) = ", array(i0)
            PRINT *, " array(j0) = ", array(j0)
            PRINT *, " array(i1) = ", array(i1)
            PRINT *, " array(j1) = ", array(j1)
         END IF
         
         WRITE(*, FMT=FMT_CPU) "   CPU Time: ", CpuTimes_Assign(Counter), " seconds."

      END DO Loop_Repeat_Assign

      PRINT *, "DO CONCURRENT:"
      Loop_Repeat_DO: DO Counter = 1, MAXREPEAT

         PRINT *, "   Trial ", Counter

         !.. Initialize the array using random numbers
         CALL RANDOM_NUMBER(array)

         !..
         CALL CPU_TIME(Start_Time)

         !..
         DO CONCURRENT ( I = i0:j0 )
            array(I) = array(i1 - i0 + I)
         END DO

         CALL CPU_TIME(End_Time)

         !..
         CpuTimes_DoConcurrent(Counter) = (End_Time - Start_Time)

         !..
         IF (ABS(array(j0)-array(j1)) > EPSILON_DP) THEN
            PRINT *, " Copy failed."
            CYCLE Loop_Repeat_DO
         END IF

         IF (Counter == 1) THEN
            PRINT *, " array(i0) = ", array(i0)
            PRINT *, " array(j0) = ", array(j0)
            PRINT *, " array(i1) = ", array(i1)
            PRINT *, " array(j1) = ", array(j1)
         END IF
         
         WRITE(*, FMT=FMT_CPU) "   CPU Time: ", CpuTimes_DoConcurrent(Counter), " seconds."

      END DO Loop_Repeat_DO

      PRINT *, "ASSOCIATE:"
      Loop_Repeat_Assoc: DO Counter = 1, MAXREPEAT

         PRINT *, "   Trial ", Counter

         !.. Initialize the array using random numbers
         CALL RANDOM_NUMBER(array)

         !..
         CALL CPU_TIME(Start_Time)

         Assoc_Arr: ASSOCIATE ( tmp1 => array(i0:j0), tmp2 => array(i1:j1) )
            tmp1 = tmp2
         END ASSOCIATE Assoc_Arr

         CALL CPU_TIME(End_Time)

         !..
         CpuTimes_Associate(Counter) = (End_Time - Start_Time)

         !..
         IF (ABS(array(j0)-array(j1)) > EPSILON_DP) THEN
            PRINT *, " Copy failed."
            CYCLE Loop_Repeat_Assoc
         END IF

         IF (Counter == 1) THEN
            PRINT *, " array(i0) = ", array(i0)
            PRINT *, " array(j0) = ", array(j0)
            PRINT *, " array(i1) = ", array(i1)
            PRINT *, " array(j1) = ", array(j1)
         END IF
         
         WRITE(*, FMT=FMT_CPU) "   CPU Time: ", CpuTimes_Associate(Counter), " seconds."

      END DO Loop_Repeat_Assoc

      !..
      WRITE(*, FMT=FMT_CPU) "Array Assignment: Average CPU Time ",                                  &
                            SUM(CpuTimes_Assign)/REAL(MAXREPEAT, KIND=DP),  " seconds."

      !..
      WRITE(*, FMT=FMT_CPU) "DO CONCURRENT:    Average CPU Time ",                                  &
                           SUM(CpuTimes_DoConcurrent)/REAL(MAXREPEAT, KIND=DP),                     &
                           " seconds."

      !..
      WRITE(*, FMT=FMT_CPU) "ASSOCIATE:        Average CPU Time ",                                  &
                           SUM(CpuTimes_Associate)/REAL(MAXREPEAT, KIND=DP),                        &
                           " seconds."

      !..
      DEALLOCATE(array, STAT=Istat, ERRMSG=ErrorAlloc)
      IF (Istat /= 0) THEN
         PRINT *, " Deallocation of array failed. ", ErrorAlloc(1:LEN_TRIM(ErrorAlloc))
         STOP
      END IF

      !..
      STOP

   END PROGRAM p

The results I observe:

  ** With Overlap of  0
 Array assignment:
    Trial  1
  array(i0) =  2.649255667434409E-003
  array(j0) =  0.829067121479056
  array(i1) =  2.649255667434409E-003
  array(j1) =  0.829067121479056
   CPU Time:                              0.499 seconds.
    Trial  2
   CPU Time:                              0.499 seconds.
    Trial  3
   CPU Time:                              0.499 seconds.
    Trial  4
   CPU Time:                              0.499 seconds.
    Trial  5
   CPU Time:                              0.484 seconds.
 DO CONCURRENT:
    Trial  1
  array(i0) =  0.657694300591952
  array(j0) =  0.556066471741371
  array(i1) =  0.657694300591952
  array(j1) =  0.556066471741371
   CPU Time:                              0.640 seconds.
    Trial  2
   CPU Time:                              0.624 seconds.
    Trial  3
   CPU Time:                              0.640 seconds.
    Trial  4
   CPU Time:                              0.655 seconds.
    Trial  5
   CPU Time:                              0.593 seconds.
 ASSOCIATE:
    Trial  1
  array(i0) =  0.865167717234815
  array(j0) =  0.555618009170298
  array(i1) =  0.865167717234815
  array(j1) =  0.555618009170298
   CPU Time:                              0.203 seconds.
    Trial  2
   CPU Time:                              0.203 seconds.
    Trial  3
   CPU Time:                              0.218 seconds.
    Trial  4
   CPU Time:                              0.203 seconds.
    Trial  5
   CPU Time:                              0.203 seconds.
Array Assignment: Average CPU Time        0.496 seconds.
DO CONCURRENT:    Average CPU Time        0.630 seconds.
ASSOCIATE:        Average CPU Time        0.206 seconds.

Compiled with:

ifort /c /nologo /O3 /Qparallel /heap-arrays0 /standard-semantics /stand:f08
/traceback /libs:static /threads

My observations generally have been that in order for DO CONCURRENT to be effective, the computational intensity or the array sizes need to be above a certain threshold; otherwise, the overhead of "setting up" the parallel operations can overwhelm the benefits.

jimdempseyatthecove · ‎02-01-2015

To the program from post #13 I added:

...
      
      SUBROUTINE COPY8(b0,e0,b1,e1)
      INTEGER b0,e0,b1,e1
      INTEGER i,b
      b = b1-b0
!DIR$ IVDEP
!DIR$ VECTOR NONTEMPORAL
      DO I = b0,e0
        Arr(i) = Arr(i+b)
      END DO
      END SUBROUTINE
...
        DO K=1,rep
          CALL Copy8(1,e0,b1,i)
          CALL Copy8(b1,i,1,e0)
        END DO
        Call STOPWATCH('LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL')
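
For readers without the full test program, here is a self-contained sketch of the same loop. It differs from the fragment above in that the array is passed as an argument instead of being taken from the enclosing scope, and the names nt_copy, copy_nontemporal and shift are made up for this example. The !DIR$ lines are Intel-specific directives; other compilers treat them as comments, so the code still runs there but without the hint. IVDEP is a promise that there is no loop-carried dependence, so it is only safe when the two ranges do not overlap.

```fortran
module nt_copy
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
contains
   ! Copy arr(b1:b1+(e0-b0)) onto arr(b0:e0).
   ! The caller guarantees the ranges do not overlap.
   subroutine copy_nontemporal(arr, b0, e0, b1)
      real(dp), intent(inout) :: arr(:)
      integer,  intent(in)    :: b0, e0, b1
      integer :: i, shift

      shift = b1 - b0
      !DIR$ IVDEP
      !DIR$ VECTOR NONTEMPORAL
      do i = b0, e0
         arr(i) = arr(i + shift)
      end do
   end subroutine copy_nontemporal
end module nt_copy

program demo_nt_copy
   use nt_copy
   implicit none
   integer, parameter :: n = 1000
   real(dp), allocatable :: arr(:)

   allocate(arr(2*n))
   call random_number(arr)

   call copy_nontemporal(arr, 1, n, n + 1)   ! lower half = upper half

   if (any(arr(1:n) /= arr(n+1:2*n))) error stop 'copy mismatch'
   print '(a)', 'copy ok'
end program demo_nt_copy
```

Non-temporal (streaming) stores bypass the cache, which tends to pay off only when the destination is large and not read again soon, so measure before adopting it for small sections.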

Results:

 Size:     8388608
      0.00 INIT
      0.00 Array Operator
      0.12 DO CONCURRENT
      0.62 Classical Loop
      0.41 Array Operator: Different Arrays
      0.16 BLAS
      0.12 LOOP WITH !DIR$ IVDEP
      0.13 associate
      0.08 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL
 Size:    16777216
      0.00 INIT
      0.00 Array Operator
      0.24 DO CONCURRENT
      1.21 Classical Loop
      0.78 Array Operator: Different Arrays
      0.23 BLAS
      0.24 LOOP WITH !DIR$ IVDEP
      0.24 associate
      0.16 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL
 Size:    33554432
      0.01 INIT
      0.00 Array Operator
      0.49 DO CONCURRENT
      2.43 Classical Loop
      1.55 Array Operator: Different Arrays
      0.46 BLAS
      0.47 LOOP WITH !DIR$ IVDEP
      0.49 associate
      0.31 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL
 Size:    67108864
      0.01 INIT
      0.00 Array Operator
      0.96 DO CONCURRENT
      4.85 Classical Loop
      3.07 Array Operator: Different Arrays
      0.93 BLAS
      0.92 LOOP WITH !DIR$ IVDEP
      0.97 associate
      0.63 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL

The loop with both !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL is clearly the winner at every size.

Jim Dempsey

jimdempseyatthecove · ‎02-01-2015

The above timings were obtained without /Qparallel and with the sequential MKL.

The following is with /Qparallel and the parallel MKL

 Size:     8388608
      0.00 INIT
      0.00 Array Operator
      0.12 DO CONCURRENT
      0.64 Classical Loop
      0.40 Array Operator: Different Arrays
      0.16 BLAS
      0.12 LOOP WITH !DIR$ IVDEP
      0.12 associate
      0.08 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL
 Size:    16777216
      0.00 INIT
      0.00 Array Operator
      0.24 DO CONCURRENT
      1.31 Classical Loop
      0.77 Array Operator: Different Arrays
      0.24 BLAS
      0.22 LOOP WITH !DIR$ IVDEP
      0.24 associate
      0.16 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL
 Size:    33554432
      0.01 INIT
      0.00 Array Operator
      0.48 DO CONCURRENT
      2.45 Classical Loop
      1.54 Array Operator: Different Arrays
      0.46 BLAS
      0.47 LOOP WITH !DIR$ IVDEP
      0.48 associate
      0.31 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL
 Size:    67108864
      0.01 INIT
      0.00 Array Operator
      0.97 DO CONCURRENT
      4.93 Classical Loop
      3.07 Array Operator: Different Arrays
      0.91 BLAS
      0.92 LOOP WITH !DIR$ IVDEP
      0.97 associate
      0.63 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL

BLAS improved marginally at the largest size (0.93 to 0.91), DO CONCURRENT is inconclusive, and IVDEP with NONTEMPORAL is the winner, about 1.44x faster than BLAS at that size (0.63 vs. 0.91).

Note that the above result should not be taken as a generalization; it holds for the specific conditions of the test program.

Jim Dempsey