Hi
I'm using "Intel(R) Visual Fortran Compiler XE for applications running on IA-32, Version 15.0.0.108 Build 20140726"
My program crashes with a stack overflow. On standard error there is a stack trace pointing to a line of code where part of an array is copied:
arr(i0: j0)= arr(i1: j1)
I read somewhere that the compiler has to create a temporary copy of the copied portion, because it cannot determine whether the source and target memory overlap (https://software.intel.com/en-us/node/524873).
Actually, I do know that they do not overlap. Is there a way to give this "promise" to the compiler, to force it to generate code that does not make a copy?
Benedikt
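One way to make the "promise" explicit is sketched below. It checks at run time that the two index ranges are disjoint and only then passes the two sections to a small subroutine with explicit-shape dummy arguments, for which the language rules forbid defining one argument while it aliases another. The names ranges_disjoint and copy_section and the index values are hypothetical, chosen only for illustration, and whether a given compiler actually omits the temporary inside the subroutine is an assumption to verify, not a guarantee.

```fortran
program disjoint_copy
   implicit none
   integer, parameter :: n = 10
   integer :: arr(n), i
   integer :: i0, j0, i1, j1

   arr = [(i, i = 1, n)]
   i0 = 1; j0 = 3      ! destination section (illustrative values)
   i1 = 6; j1 = 8      ! source section, chosen not to overlap the destination

   if (ranges_disjoint(i0, j0, i1, j1)) then
      ! The two actual arguments do not alias, so the call is conforming.
      call copy_section(arr(i0:j0), arr(i1:j1), j0 - i0 + 1)
   else
      ! Ordinary array assignment handles overlap correctly.
      arr(i0:j0) = arr(i1:j1)
   end if
   write(*, '(*(i0,1x))') arr

contains

   ! True when the index ranges [a0,a1] and [b0,b1] share no element.
   pure logical function ranges_disjoint(a0, a1, b0, b1)
      integer, intent(in) :: a0, a1, b0, b1
      ranges_disjoint = (a1 < b0) .or. (b1 < a0)
   end function ranges_disjoint

   ! Plain copy between two dummy arrays that must not overlap.
   subroutine copy_section(dst, src, m)
      integer, intent(in)  :: m
      integer, intent(out) :: dst(m)
      integer, intent(in)  :: src(m)
      dst = src
   end subroutine copy_section

end program disjoint_copy
```

With the values shown, elements 6 to 8 are copied into elements 1 to 3, so the array then reads 6 7 8 4 5 6 7 8 9 10.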
I am not aware of ways to tell the compiler to avoid the copy in this case. I suggest compiling with /heap-arrays (Fortran > Optimization > Heap Arrays > 0)
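If changing compiler options is not convenient, a similar effect can be sketched in source code by making the temporary explicit as an ALLOCATABLE array, which is normally placed on the heap rather than the stack. The program below is only an illustration under that assumption; the array size and index values are made up, and the name tmp is hypothetical.

```fortran
program explicit_temp
   implicit none
   integer, parameter :: n = 10
   integer :: arr(n), i
   integer :: i0, j0, i1, j1
   integer, allocatable :: tmp(:)

   arr = [(i, i = 1, n)]
   i0 = 5; j0 = 7      ! destination section
   i1 = 4; j1 = 6      ! source section (overlaps the destination here)

   ! Explicit temporary: allocated by ALLOCATE, so it does not use the stack.
   allocate(tmp(j1 - i1 + 1))
   tmp = arr(i1:j1)     ! tmp is a distinct variable, so no hidden copy is needed
   arr(i0:j0) = tmp
   deallocate(tmp)

   write(*, '(*(i0,1x))') arr
end program explicit_temp
```

This gives the same values as the plain array assignment would (1 2 3 4 4 5 6 8 9 10 for these indices), at the cost of one extra pass over the data.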
DO CONCURRENT (I = i0:j0)
arr(I) = arr(i1-i0+I)
END DO
The DO CONCURRENT helps the compiler decide that the loop is safe to vectorize.
The following should also work:
Assoc_Arr: ASSOCIATE ( tmp1 => array(i0:j0), tmp2 => array(i1:j1) )
tmp1 = tmp2
END ASSOCIATE Assoc_Arr
Hmmm... I didn't think that ASSOCIATE had rules about aliasing like procedures do. Let's try an experiment:
program P
   implicit none
   integer arr(10)
   integer i
   integer, parameter :: arr0(size(arr)) = [(i,i=1,size(arr))]
   integer i0,j0,i1,j1
   write(*,'(a)') 'Array assignment'
   arr = arr0
   call set1(i0,j0,i1,j1)
   arr(i0:j0) = arr(i1:j1)
   write(*,5) arr
   arr = arr0
   call set2(i0,j0,i1,j1)
   arr(i0:j0) = arr(i1:j1)
   write(*,5) arr
   write(*,'(a)') 'Associate'
   arr = arr0
   call set1(i0,j0,i1,j1)
   associate(T0 => arr(i0:j0), T1 => arr(i1:j1))
      T0 = T1
   end associate
   write(*,5) arr
   arr = arr0
   call set2(i0,j0,i1,j1)
   associate(T0 => arr(i0:j0), T1 => arr(i1:j1))
      T0 = T1
   end associate
   write(*,5) arr
   write(*,'(a)') 'Subroutine'
   arr = arr0
   call set1(i0,j0,i1,j1)
   call copy(arr(i0:j0),arr(i1:j1),j0-i0+1)
   write(*,5) arr
   arr = arr0
   call set2(i0,j0,i1,j1)
   call copy(arr(i0:j0),arr(i1:j1),j0-i0+1)
   write(*,5) arr
   write(*,'(a)') 'Forall'
   arr = arr0
   call set1(i0,j0,i1,j1)
   forall(i=i0:j0) arr(i) = arr(i1-i0+i)
   write(*,5) arr
   arr = arr0
   call set2(i0,j0,i1,j1)
   forall(i=i0:j0) arr(i) = arr(i1-i0+i)
   write(*,5) arr
   write(*,'(a)') 'Do concurrent'
   arr = arr0
   call set1(i0,j0,i1,j1)
   do concurrent(i=i0:j0)
      arr(i) = arr(i1-i0+i)
   end do
   write(*,5) arr
   arr = arr0
   call set2(i0,j0,i1,j1)
   do concurrent(i=i0:j0)
      arr(i) = arr(i1-i0+i)
   end do
   write(*,5) arr
   5 format(*(i0:1x))
contains
   subroutine copy(arr0,arr1,n)
      integer n
      integer arr0(n),arr1(n)
      arr0 = arr1
   end subroutine copy
end program P
subroutine set1(i0,j0,i1,j1)
   i0 = 3
   j0 = 5
   i1 = 4
   j1 = 6
end subroutine set1
subroutine set2(i0,j0,i1,j1)
   i0 = 5
   j0 = 7
   i1 = 4
   j1 = 6
end subroutine set2
Output with ifort:
Array assignment
1 2 4 5 6 6 7 8 9 10
1 2 3 4 4 5 6 8 9 10
Associate
1 2 4 5 6 6 7 8 9 10
1 2 3 4 4 4 4 8 9 10
Subroutine
1 2 4 5 6 6 7 8 9 10
1 2 3 4 4 4 4 8 9 10
Forall
1 2 4 5 6 6 7 8 9 10
1 2 3 4 4 5 6 8 9 10
Do concurrent
1 2 4 5 6 6 7 8 9 10
1 2 3 4 4 4 4 8 9 10
So ASSOCIATE has the same result as the subroutine and DO CONCURRENT which do have aliasing rules. I get similar results for gfortran. Where does it talk about aliasing rules for the ASSOCIATE construct in the standard?
Repeat Offender wrote:
...
So ASSOCIATE has the same result as the subroutine and DO CONCURRENT which do have aliasing rules. I get similar results for gfortran. Where does it talk about aliasing rules for the ASSOCIATE construct in the standard?
I assume you've referred to J3/04-007, the latest working draft of the Fortran 2003 standard, dated May 10, 2004: sections 8.1.4 and 16.4.1.5. Now, I have a tough time reading these standards documents, so it's The Fortran 2003 Handbook by Adams et al. to the rescue: section 8.2.2 of this book says about the association during the execution of the ASSOCIATE construct, "This process is somewhat similar to what happens in a procedure call with the associate name taking the role of the dummy argument." In the context of this thread, I personally would prefer ASSOCIATE over DO CONCURRENT.
Should you compile with /Qparallel, or is that implicit in /Qmkl? If you're not enabling auto-parallelization, there is no benefit to DO CONCURRENT, I believe.
Here's another test using floating point variables you can consider:
PROGRAM p
USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY : I4 => INT32, DP => REAL64
!..
IMPLICIT NONE
!..
INTEGER(I4), PARAMETER :: MAXARR = 2**28
INTEGER(I4), PARAMETER :: MAXREPEAT = 5
INTEGER(I4) :: i0
INTEGER(I4) :: j0
INTEGER(I4) :: i1
INTEGER(I4) :: j1
INTEGER(I4) :: Istat
INTEGER(I4) :: I
INTEGER(I4) :: Counter
INTEGER(I4) :: Overlap
REAL(DP), PARAMETER :: EPSILON_DP = EPSILON(1.0_dp)
REAL(DP), ALLOCATABLE :: array(:)
REAL(DP) :: Start_Time = 0.0_dp
REAL(DP) :: End_Time = 0.0_dp
REAL(DP) :: CpuTimes_Assign(MAXREPEAT)
REAL(DP) :: CpuTimes_DoConcurrent(MAXREPEAT)
REAL(DP) :: CpuTimes_Associate(MAXREPEAT)
CHARACTER(LEN=*), PARAMETER :: FMT_CPU = "(A, T40, F8.3, A)"
CHARACTER(LEN=2048) :: ErrorAlloc
!..
ALLOCATE(array(MAXARR), SOURCE=0.0_dp, STAT=Istat, ERRMSG=ErrorAlloc)
IF (Istat /= 0) THEN
PRINT *, " Allocation of array failed: ", ErrorAlloc(1:LEN_TRIM(ErrorAlloc))
STOP
END IF
!..
Overlap = 0
PRINT *, " ** With Overlap of ", Overlap
i0 = 1
j0 = MAXARR/2
i1 = j0 + 1 - Overlap
j1 = i1 + MAXARR/2 - 1
CpuTimes_Assign = 0.0_dp
CpuTimes_DoConcurrent = 0.0_dp
CpuTimes_Associate = 0.0_dp
PRINT *, "Array assignment:"
Loop_Repeat_Assign: DO Counter = 1, MAXREPEAT
PRINT *, " Trial ", Counter
!.. Initialize the array using random numbers
CALL RANDOM_NUMBER(array)
!..
CALL CPU_TIME(Start_Time)
!..
array(i0:j0) = array(i1:j1)
CALL CPU_TIME(End_Time)
!..
CpuTimes_Assign(Counter) = (End_Time - Start_Time)
!..
IF (ABS(array(j0)-array(j1)) > EPSILON_DP) THEN
PRINT *, " Copy failed."
CYCLE Loop_Repeat_Assign
END IF
IF (Counter == 1) THEN
PRINT *, " array(i0) = ", array(i0)
PRINT *, " array(j0) = ", array(j0)
PRINT *, " array(i1) = ", array(i1)
PRINT *, " array(j1) = ", array(j1)
END IF
WRITE(*, FMT=FMT_CPU) " CPU Time: ", CpuTimes_Assign(Counter), " seconds."
END DO Loop_Repeat_Assign
PRINT *, "DO CONCURRENT:"
Loop_Repeat_DO: DO Counter = 1, MAXREPEAT
PRINT *, " Trial ", Counter
!.. Initialize the array using random numbers
CALL RANDOM_NUMBER(array)
!..
CALL CPU_TIME(Start_Time)
!..
DO CONCURRENT ( I = i0:j0 )
array(I) = array(i1 - i0 + I)
END DO
CALL CPU_TIME(End_Time)
!..
CpuTimes_DoConcurrent(Counter) = (End_Time - Start_Time)
!..
IF (ABS(array(j0)-array(j1)) > EPSILON_DP) THEN
PRINT *, " Copy failed."
CYCLE Loop_Repeat_DO
END IF
IF (Counter == 1) THEN
PRINT *, " array(i0) = ", array(i0)
PRINT *, " array(j0) = ", array(j0)
PRINT *, " array(i1) = ", array(i1)
PRINT *, " array(j1) = ", array(j1)
END IF
WRITE(*, FMT=FMT_CPU) " CPU Time: ", CpuTimes_DoConcurrent(Counter), " seconds."
END DO Loop_Repeat_DO
PRINT *, "ASSOCIATE:"
Loop_Repeat_Assoc: DO Counter = 1, MAXREPEAT
PRINT *, " Trial ", Counter
!.. Initialize the array using random numbers
CALL RANDOM_NUMBER(array)
!..
CALL CPU_TIME(Start_Time)
Assoc_Arr: ASSOCIATE ( tmp1 => array(i0:j0), tmp2 => array(i1:j1) )
tmp1 = tmp2
END ASSOCIATE Assoc_Arr
CALL CPU_TIME(End_Time)
!..
CpuTimes_Associate(Counter) = (End_Time - Start_Time)
!..
IF (ABS(array(j0)-array(j1)) > EPSILON_DP) THEN
PRINT *, " Copy failed."
CYCLE Loop_Repeat_Assoc
END IF
IF (Counter == 1) THEN
PRINT *, " array(i0) = ", array(i0)
PRINT *, " array(j0) = ", array(j0)
PRINT *, " array(i1) = ", array(i1)
PRINT *, " array(j1) = ", array(j1)
END IF
WRITE(*, FMT=FMT_CPU) " CPU Time: ", CpuTimes_Associate(Counter), " seconds."
END DO Loop_Repeat_Assoc
!..
WRITE(*, FMT=FMT_CPU) "Array Assignment: Average CPU Time ", &
SUM(CpuTimes_Assign)/REAL(MAXREPEAT, KIND=DP), " seconds."
!..
WRITE(*, FMT=FMT_CPU) "DO CONCURRENT: Average CPU Time ", &
SUM(CpuTimes_DoConcurrent)/REAL(MAXREPEAT, KIND=DP), &
" seconds."
!..
WRITE(*, FMT=FMT_CPU) "ASSOCIATE: Average CPU Time ", &
SUM(CpuTimes_Associate)/REAL(MAXREPEAT, KIND=DP), &
" seconds."
!..
DEALLOCATE(array, STAT=Istat, ERRMSG=ErrorAlloc)
IF (Istat /= 0) THEN
PRINT *, " Deallocation of array failed. ", ErrorAlloc(1:LEN_TRIM(ErrorAlloc))
STOP
END IF
!..
STOP
END PROGRAM p
The results I observe:
** With Overlap of 0
Array assignment:
Trial 1
array(i0) = 2.649255667434409E-003
array(j0) = 0.829067121479056
array(i1) = 2.649255667434409E-003
array(j1) = 0.829067121479056
CPU Time: 0.499 seconds.
Trial 2
CPU Time: 0.499 seconds.
Trial 3
CPU Time: 0.499 seconds.
Trial 4
CPU Time: 0.499 seconds.
Trial 5
CPU Time: 0.484 seconds.
DO CONCURRENT:
Trial 1
array(i0) = 0.657694300591952
array(j0) = 0.556066471741371
array(i1) = 0.657694300591952
array(j1) = 0.556066471741371
CPU Time: 0.640 seconds.
Trial 2
CPU Time: 0.624 seconds.
Trial 3
CPU Time: 0.640 seconds.
Trial 4
CPU Time: 0.655 seconds.
Trial 5
CPU Time: 0.593 seconds.
ASSOCIATE:
Trial 1
array(i0) = 0.865167717234815
array(j0) = 0.555618009170298
array(i1) = 0.865167717234815
array(j1) = 0.555618009170298
CPU Time: 0.203 seconds.
Trial 2
CPU Time: 0.203 seconds.
Trial 3
CPU Time: 0.218 seconds.
Trial 4
CPU Time: 0.203 seconds.
Trial 5
CPU Time: 0.203 seconds.
Array Assignment: Average CPU Time 0.496 seconds.
DO CONCURRENT: Average CPU Time 0.630 seconds.
ASSOCIATE: Average CPU Time 0.206 seconds.
Compiled with:
ifort /c /nologo /O3 /Qparallel /heap-arrays0 /standard-semantics /stand:f08 /traceback /libs:static /threads
My observations generally have been that in order for DO CONCURRENT to be effective, the computational intensity or the array sizes need to be above a certain threshold; otherwise, the overhead of "setting up" the parallel operations can overwhelm the benefits.
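A caveat on the measurements: CPU_TIME reports processor time, and in a threaded run (for example with /Qparallel) that figure may be accumulated over all threads, so it can overstate the elapsed time. A wall-clock measurement with SYSTEM_CLOCK is sketched below as an alternative. This is a minimal illustration, not the timing code used for the results above; the array size and the variable names are assumptions.

```fortran
program wall_clock_timing
   use, intrinsic :: iso_fortran_env, only : int64, real64
   implicit none
   integer, parameter :: n = 2**22            ! illustrative size
   real(real64), allocatable :: a(:), b(:)
   integer(int64) :: t_start, t_end, ticks_per_second
   real(real64) :: elapsed

   allocate(a(n), b(n))
   call random_number(a)

   ! Query the clock resolution once, then read the counter around the copy.
   call system_clock(count_rate=ticks_per_second)
   call system_clock(count=t_start)
   b = a                                       ! operation being timed
   call system_clock(count=t_end)

   elapsed = real(t_end - t_start, real64) / real(ticks_per_second, real64)
   write(*, '(a, f8.3, a)') ' Wall-clock time: ', elapsed, ' seconds.'
   ! Print one copied element so the copy cannot be optimized away.
   write(*, *) ' b(n) = ', b(n)
end program wall_clock_timing
```

Using 64-bit integers for the counter usually gives a finer tick and avoids early wrap-around, though the actual resolution is processor dependent.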
To the program from post #13 I added:
...
SUBROUTINE COPY8(b0,e0,b1,e1)
INTEGER b0,e0,b1,e1
INTEGER i,b
b = b1-b0
!DIR$ IVDEP
!DIR$ VECTOR NONTEMPORAL
DO I = b0,e0
Arr(i) = Arr(i+b)
END DO
END SUBROUTINE
...
DO K=1,rep
CALL Copy8(1,e0,b1,i)
CALL Copy8(b1,i,1,e0)
END DO
Call STOPWATCH('LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL')
Results:
Size: 8388608
0.00 INIT
0.00 Array Operator
0.12 DO CONCURRENT
0.62 Classical Loop
0.41 Array Operator: Different Arrays
0.16 BLAS
0.12 LOOP WITH !DIR$ IVDEP
0.13 associate
0.08 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL
Size: 16777216
0.00 INIT
0.00 Array Operator
0.24 DO CONCURRENT
1.21 Classical Loop
0.78 Array Operator: Different Arrays
0.23 BLAS
0.24 LOOP WITH !DIR$ IVDEP
0.24 associate
0.16 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL
Size: 33554432
0.01 INIT
0.00 Array Operator
0.49 DO CONCURRENT
2.43 Classical Loop
1.55 Array Operator: Different Arrays
0.46 BLAS
0.47 LOOP WITH !DIR$ IVDEP
0.49 associate
0.31 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL
Size: 67108864
0.01 INIT
0.00 Array Operator
0.96 DO CONCURRENT
4.85 Classical Loop
3.07 Array Operator: Different Arrays
0.93 BLAS
0.92 LOOP WITH !DIR$ IVDEP
0.97 associate
0.63 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL
Clearly a winner
Jim Dempsey
The above timings were without /Qparallel and with the sequential MKL.
The following is with /Qparallel and the parallel MKL
Size: 8388608
0.00 INIT
0.00 Array Operator
0.12 DO CONCURRENT
0.64 Classical Loop
0.40 Array Operator: Different Arrays
0.16 BLAS
0.12 LOOP WITH !DIR$ IVDEP
0.12 associate
0.08 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL
Size: 16777216
0.00 INIT
0.00 Array Operator
0.24 DO CONCURRENT
1.31 Classical Loop
0.77 Array Operator: Different Arrays
0.24 BLAS
0.22 LOOP WITH !DIR$ IVDEP
0.24 associate
0.16 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL
Size: 33554432
0.01 INIT
0.00 Array Operator
0.48 DO CONCURRENT
2.45 Classical Loop
1.54 Array Operator: Different Arrays
0.46 BLAS
0.47 LOOP WITH !DIR$ IVDEP
0.48 associate
0.31 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL
Size: 67108864
0.01 INIT
0.00 Array Operator
0.97 DO CONCURRENT
4.93 Classical Loop
3.07 Array Operator: Different Arrays
0.91 BLAS
0.92 LOOP WITH !DIR$ IVDEP
0.97 associate
0.63 LOOP WITH !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL
BLAS marginally improved, DO CONCURRENT is inconclusive, IVDEP with NONTEMPORAL is the winner at 1.44x faster than BLAS.
Note, the above result is not to be taken as a generalization, rather it is for the specific conditions of the test program.
Jim Dempsey