Solved: Is it worth using array section assignment?

e745200 · 06-10-2015

Hi, all

I very much like being able to work with array sections in assignments such as

a(k:k+l) = b(j:j+l)

After some tests, however, I am surprised that this notation leads to much less efficient code than the good old equivalent loops:

do i = 0,l
   a(k+i) = b(j+i)
end do

I guess that the former generates a temporary before the assignment, and that this extra copy costs time.

Is there some option, directive or trick that can be used to avoid the performance loss?

Otherwise, is the cooler notation worth using if the old construct (in my simple experiment) is 1.5 to 2.3 times faster?

Thanks in advance for any hint.

PS. Here is my little example.

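      implicit none
      real(8), allocatable :: a(:)
      real :: t0,t1,t2
      integer :: i, n, m, maxtimes
      maxtimes = 10000
      n = 100000
      do while ( n <= 10000000)
         print *, 'n = ', n
         allocate(a(n))
         if ( .not. allocated(a) ) stop
         ! give the array defined values before it is read
         a = 1.0d0
         call cpu_time(t0)
         ! explicit loop: copy the upper half, reversed, into the lower half
         do m = 1, maxtimes
            do i = 1,n/2
                a(i) = a(n-i+1)
            end do
         end do
         call cpu_time(t1)
         ! the same copy written as an array section assignment
         do m = 1, maxtimes
            a(1:n/2) = a(n:n/2+1:-1)
         end do
         call cpu_time(t2)
         deallocate(a)
         ! columns: n, loop time, section time, section/loop ratio
         write(*,'(I8,3F10.3)') n, t1-t0, t2-t1, (t2-t1)/(t1-t0)
         n = n * 2
      end do
      end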

TimP · 06-10-2015

Unfortunately, as you surmised, ifort does allocate a temporary and perform a double copy for array section assignments within a given array, and it doesn't do much analysis to determine whether there is any actual possibility of overlap. Ideally, the allocation and deallocation would fall outside your test loop so you wouldn't see the time penalty for that.
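Why a temporary may be needed at all can be seen in a small sketch. It is only an illustration of the language semantics, not of what ifort actually emits: the program and variable names are made up for this example. An array section assignment must behave as if the whole right-hand side were evaluated before anything is stored, so with overlapping sections it can differ from a naive forward loop.

```fortran
program overlap_demo
  implicit none
  integer, parameter :: n = 6
  integer :: a(n), b(n), c(n), tmp(n-1), i

  a = [(i, i = 1, n)]
  b = a
  c = a

  ! Array section assignment: the right-hand side is (conceptually)
  ! evaluated in full before any element on the left is stored.
  a(2:n) = a(1:n-1)

  ! Naive forward loop over the same elements: each store feeds the
  ! next load, so the first value is smeared across the array.
  do i = 2, n
     b(i) = b(i-1)
  end do

  ! The "double copy" written out by hand: one copy into a temporary,
  ! one copy out of it. This always matches the section assignment.
  tmp = c(1:n-1)
  c(2:n) = tmp

  print '(a,6i3)', 'section assignment:', a
  print '(a,6i3)', 'forward loop:      ', b
  print '(a,6i3)', 'explicit temporary:', c
  if (any(a /= c)) error stop 'temporary sketch disagrees with section assignment'
end program overlap_demo
```

When the compiler can prove that source and destination do not overlap, or can pick a safe loop direction, the temporary is unnecessary. The benchmark in the first post is such a case, since it reads the upper half of the array and writes the lower half.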

The code you show wasn't optimized by ifort until fairly recently; AVX2 offers better ISA support for it.

Prior to ifort 16.0 beta, many such cases required !dir$ simd for optimization. !$omp simd didn't help (and isn't intended to work with array assignments). If you set the directive but there is actual overlap, it could produce wrong results.
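As a rough sketch of the directive mentioned above, here it is applied to the DO-loop form of the reversed copy from the first post. The program name and array size are chosen only for illustration. !dir$ simd is Intel-specific, so other compilers treat that line as an ordinary comment, and whether it changes the generated code depends on the compiler version.

```fortran
program simd_directive_sketch
  implicit none
  integer, parameter :: n = 1000
  real(8) :: a(n)
  integer :: i

  a = [(real(i, 8), i = 1, n)]

  ! Intel-specific directive asserting that the iterations are independent.
  ! That holds here because the loads come from elements n/2+1..n and the
  ! stores go to elements 1..n/2. With real overlap the assertion would be
  ! false and the results could be wrong.
  !dir$ simd
  do i = 1, n/2
     a(i) = a(n-i+1)
  end do

  ! Sanity check on the first and last copied elements.
  if (a(1) /= real(n, 8) .or. a(n/2) /= real(n/2 + 1, 8)) then
     error stop 'unexpected result'
  end if
  print *, a(1), a(n/2)
end program simd_directive_sketch
```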

If the array section is big enough (at least 4KB), you might look at the optimization report to see whether -O3 gave you streaming stores (you could force that with !dir$ vector nontemporal, which may work even with array assignment).

Many of us avoid choosing syntax for "coolness", although ideally the most readable code should not have to give up performance.
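A minimal sketch of that directive placed in front of an array assignment follows. The program name and sizes are invented for the example, and it is an assumption that a given ifort version honors the hint on an array assignment rather than only on DO loops. The optimization report is the way to check what was actually generated.

```fortran
program nontemporal_sketch
  implicit none
  integer, parameter :: n = 1000000   ! 8 MB per array, well above 4KB
  real(8), allocatable :: a(:), b(:)

  allocate(a(n), b(n))
  b = 1.0d0

  ! Intel-specific hint requesting streaming (nontemporal) stores for the
  ! statement that follows. Other compilers see only a comment, and the
  ! result of the copy is the same either way.
  !dir$ vector nontemporal
  a(1:n) = b(1:n)

  if (any(a /= 1.0d0)) error stop 'copy failed'
  print *, 'copied', n, 'elements'
end program nontemporal_sketch
```

Streaming stores bypass the cache, so they tend to pay off only when the destination is large and is not read again soon afterwards.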

Multi-rank array assignments are notorious for not generating optimum code (besides not working with OpenMP), so I have many rank 1 array assignments inside DO loops.
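The pattern described here can be sketched as follows, with made-up names and sizes. The rank-2 assignment and the loop of rank-1 column assignments compute the same thing. The loop form also gives a natural place for an OpenMP directive, which is treated as a comment unless OpenMP is enabled at compile time.

```fortran
program rank1_in_loop
  implicit none
  integer, parameter :: n = 4, m = 3
  real(8) :: a(n, m), b(n, m), c(n, m)
  integer :: i, j

  b = reshape([(real(i, 8), i = 1, n*m)], [n, m])

  ! Whole-array (rank-2) assignment.
  a = 2.0d0 * b

  ! The same operation as rank-1 assignments, one column per iteration.
  ! Columns are contiguous in memory, so each inner assignment is a
  ! unit-stride operation.
  !$omp parallel do
  do j = 1, m
     c(:, j) = 2.0d0 * b(:, j)
  end do
  !$omp end parallel do

  if (any(a /= c)) error stop 'loop form disagrees with whole-array form'
  print *, 'results match'
end program rank1_in_loop
```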

e745200 · 06-10-2015

Thanks a lot, Tim !

I'll check whether the directive you suggested helps avoid the performance loss in my case.

In the meantime, I've found that calling a subroutine which simply makes an assignment on full arrays (as seen from inside the subroutine) gives the best performance:

! a(k:k+l) = b(j:j+l)
  call sub1(a(k), b(j), l+1)
  ...

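  subroutine sub1(tgt, src, n)
  ! copy n contiguous elements from src to tgt;
  ! the caller must make sure the two do not overlap
  integer, intent(in) :: n
  real(8), intent(in) :: src(n)
  real(8), intent(out) :: tgt(n)
  tgt = src
  return
  end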

At this point, the final syntax is no different from using the BLAS dcopy, which was the initial stage of the code I was trying to improve ... just one less dependency...

TimP · 06-10-2015

Now you show cases using the same stride for source and destination. These have to be implemented differently from the case with different (including positive vs. negative) strides.

mkl dcopy could use parallel (threaded) code if you link with mkl:parallel and the case is large enough. That won't necessarily improve performance unless you are running a large enough case on multiple CPUs, and it depends somewhat on memory locality. Last time I looked, dcopy was unrolled more aggressively than other alternatives, to optimize performance for bigger arrays.

When you call the subroutine you are asserting (according to the Fortran standard) that there is no overlap. The compiler may make an intel_fast_memcpy substitution, including an automatic decision at run time on whether to use nontemporal stores. The optimization report would show this (with no report about vectorization).
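Because the no-overlap promise is made by the caller, one option is to check it before taking the subroutine route. The sketch below is only one way to do that. The module name copy_helpers and the function ranges_overlap are hypothetical helpers introduced for this example, and sub1 is the subroutine from the earlier post with intents added.

```fortran
module copy_helpers
  implicit none
contains

  ! Hypothetical helper: true when index ranges k..k+l and j..j+l
  ! within the same array share at least one element.
  pure logical function ranges_overlap(k, j, l)
    integer, intent(in) :: k, j, l
    ranges_overlap = abs(k - j) <= l
  end function ranges_overlap

  ! The copy routine from the thread. Passing overlapping actual
  ! arguments here would break the caller's promise to the compiler.
  subroutine sub1(tgt, src, n)
    integer, intent(in) :: n
    real(8), intent(out) :: tgt(n)
    real(8), intent(in) :: src(n)
    tgt = src
  end subroutine sub1

end module copy_helpers

program guarded_copy
  use copy_helpers
  implicit none
  real(8) :: a(10)
  integer :: i

  a = [(real(i, 8), i = 1, 10)]

  call copy_within(1, 6, 4)   ! a(1:5) <- a(6:10): disjoint, uses sub1
  if (any(a(1:5) /= a(6:10))) error stop 'disjoint copy failed'

  call copy_within(2, 1, 4)   ! a(2:6) <- a(1:5): overlapping, uses sections
  if (any(a(1:6) /= [6.0d0, 6.0d0, 7.0d0, 8.0d0, 9.0d0, 10.0d0])) then
     error stop 'overlapping copy failed'
  end if
  print '(10f5.1)', a

contains

  ! Copy a(j:j+l) into a(k:k+l), choosing the safe form when they overlap.
  subroutine copy_within(k, j, l)
    integer, intent(in) :: k, j, l
    if (ranges_overlap(k, j, l)) then
       a(k:k+l) = a(j:j+l)          ! language semantics handle overlap
    else
       call sub1(a(k), a(j), l + 1) ! fast path, no overlap by construction
    end if
  end subroutine copy_within

end program guarded_copy
```

When source and destination are different arrays, as in the a and b of the original question, the check is unnecessary and the plain call is enough.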

e745200 · 06-21-2015

Thanks again, Tim.

Yes, the strides in the first example were different, though not for any particular reason. Given what you said, I now see that it was perhaps not a happy choice. A good chance to explore the optimizer's behavior, though ;-)

is it worth using array section assignment ?