Hi, all
I would very much like to use array sections in assignments, such as
a(k:k+l) = b(j:j+l)
After some tests, however, I was surprised to find that this notation leads to much less efficient code than the good old equivalent loop:
do i = 0, l
   a(k+i) = b(j+i)
end do
I guess the former generates a temporary before the assignment, and that extra copy consumes time.
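For instance, I imagine the section assignment being expanded to something like the following (just my guess at the semantics when the compiler cannot rule out overlap, not necessarily what ifort actually emits):

! hypothetical expansion of a(k:k+l) = b(j:j+l):
! the right-hand side is evaluated into a temporary first,
! then the temporary is copied into the left-hand side
real(8), allocatable :: tmp(:)
allocate(tmp(0:l))
do i = 0, l
   tmp(i) = b(j+i)
end do
do i = 0, l
   a(k+i) = tmp(i)
end do
deallocate(tmp)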
Is there some option, directive, or trick that can be used to avoid losing performance?
Otherwise, what is the point of using such nicer notation if the old construct (in my simple experiment) is 1.5x to 2.3x faster than it?
Thanks in advance for any hint.
PS. Here is my little example.
implicit none
real(8), allocatable :: a(:)
real :: t0, t1, t2
integer :: i, n, m, maxtimes
maxtimes = 10000
n = 100000
do while ( n <= 10000000 )
   print *, 'n = ', n
   allocate(a(n))
   if ( .not. allocated(a) ) stop
   call cpu_time(t0)
   do m = 1, maxtimes
      do i = 1, n/2
         a(i) = a(n-i+1)
      end do
   end do
   call cpu_time(t1)
   do m = 1, maxtimes
      a(1:n/2) = a(n:n/2+1:-1)
   end do
   call cpu_time(t2)
   deallocate(a)
   write(*,'(I8,3F10.3)') n, t1-t0, t2-t1, (t2-t1)/(t1-t0)
   n = n * 2
end do
end
Unfortunately, as you surmised, ifort does allocate a temporary and perform a double copy for array section assignments within a given array, and it doesn't perform much analysis to determine whether there is an actual possibility of overlap. Ideally, the allocation and deallocation would fall outside your test loop so you wouldn't see the time penalty for that.
The code you show wasn't optimized by ifort until fairly recently; AVX2 offers better ISA support for it.
Prior to ifort 16.0 beta, many such cases required !dir$ simd for optimization. !$omp simd didn't help (and isn't intended to work with array assignments). If you set the directive but there is actual overlap, it could produce wrong results.
If the array section is big enough (at least 4KB) you might look at the optimization report to see whether -O3 gave you streaming stores (you could force them with !dir$ vector nontemporal, which may work even with an array assignment).
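For example, a sketch of how the directives might be applied (only safe if you know the sections really don't overlap; check the current ifort documentation for exact spelling and placement):

!dir$ simd
do i = 0, l
   a(k+i) = b(j+i)   ! directive asserts no dependence between iterations
end do

!dir$ vector nontemporal
a(k:k+l) = b(j:j+l)   ! request streaming (nontemporal) stores for a large copy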
Many of us avoid choosing syntax for "coolness", although it is desirable that the most readable code not give up performance.
Multi-rank array assignments are notorious for not generating optimum code (besides not working with OpenMP), so I have many rank 1 array assignments inside DO loops.
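For example, rather than one multi-rank assignment I tend to write the contiguous inner dimension as a rank-1 assignment (a sketch with made-up array names):

! instead of  c(1:n, 1:m) = d(1:n, 1:m)
do jcol = 1, m
   c(1:n, jcol) = d(1:n, jcol)   ! contiguous rank-1 assignment per column
end do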
Thanks a lot, Tim !
I'll check whether the directive you suggested can help avoid losing performance in my case.
In the meantime, I've found that calling a subroutine which simply performs the assignment on whole arrays (as seen from inside the subroutine) gives the best performance:
! a(k:k+l) = b(j:j+l)
call sub1(a(k), b(j), l+1)
...
subroutine sub1(tgt, src, n)
   integer :: n
   real(8) :: src(n)
   real(8) :: tgt(n)
   tgt = src
   return
end
At this point, the final syntax is not so different from using the BLAS dcopy, which was the starting point of the code I was trying to improve ... just one less dependency...
Now you show cases using the same stride for source and destination. These have to be implemented differently from the case with different (including positive vs. negative) strides.
MKL dcopy could use parallel (threaded) code if you link with mkl:parallel and the case is large enough. That won't necessarily improve performance unless you are running a large enough case on multiple CPUs, depending somewhat on memory locality. Last time I looked, dcopy was unrolled more aggressively than the other alternatives, to optimize performance for bigger arrays.
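With the standard BLAS interface, the equivalent of your sub1 call would be something along these lines (a sketch, using the l, j, k from your original snippet):

! copy l+1 elements of b, starting at b(j), into a, starting at a(k),
! both with unit stride
call dcopy(l+1, b(j), 1, a(k), 1)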
When you call the subroutine you are asserting (according to the Fortran standard) that there is no overlap. The compiler may make an intel_fast_memcpy substitution, including an automatic decision at run time whether to use nontemporal stores. The opt-report would show this (there would be no report about vectorization).
Thanks again, Tim.
Yes, the strides in the first example were different, but not by any particular choice, actually. Now I see that, given what you said, it was perhaps not a happy choice. A good chance to explore the optimizer's behavior, though ;-)
