Hi everyone,
I am working on optimizing sparse algorithms with Intel's Fortran compiler. After applying various optimization features, I want to make good use of data prefetching and cache utilization. To that end, I tested several plausible configurations of prefetch directives and intrinsic functions on both Intel Core i7 and AMD APU processors, but I did not get the expected results. In one specific case, however, I think I get real prefetching, which gives me a 3-4x speedup.
Following is the faster code:
[fortran]
DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: A2D, X, TEMP
DOUBLE PRECISION :: SUM
INTEGER :: SIZE, I, J, COUNT, BLS, I0

SIZE = 1000000
BLS = 21 * 25
ALLOCATE(A2D(0:BLS * SIZE - 1))
ALLOCATE(X(0:SIZE - 1))
ALLOCATE(TEMP(0:BLS - 1))

!DEC$ SIMD
DO J = 0, SIZE - 1
    DO I = 0, BLS - 1
        A2D(BLS * J + I) = I + J
    END DO
END DO

DO COUNT = 0, 50
    !$OMP PARALLEL SHARED(A2D, X, SIZE, BLS)
    !$OMP DO SCHEDULE(STATIC) PRIVATE(J, I, SUM, TEMP, I0)
    !DEC$ SIMD
    DO J = 0, SIZE - 1
        I0 = BLS * J
        DO I = 0, BLS - 1
            TEMP(I) = A2D(I0 + I)
        END DO
        SUM = 0.D0
        DO I = 0, BLS - 1
            SUM = SUM + TEMP(I) * 2.D0
        END DO
        X(J) = SUM
    END DO
    !$OMP END DO
    !$OMP END PARALLEL
END DO
[/fortran]
The following is the code I expected to be correct, but it is around 4 times slower (I think because the prefetch directive does not work):
[fortran]
DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: A2D, X
DOUBLE PRECISION :: SUM
INTEGER :: SIZE, I, J, COUNT, BLS, I0

SIZE = 1000000
BLS = 21 * 25
ALLOCATE(A2D(0:BLS * SIZE - 1))
ALLOCATE(X(0:SIZE - 1))

!DEC$ SIMD
DO J = 0, SIZE - 1
    DO I = 0, BLS - 1
        A2D(BLS * J + I) = I + J
    END DO
END DO

DO COUNT = 0, 50
    !$OMP PARALLEL SHARED(A2D, X, SIZE, BLS)
    !$OMP DO SCHEDULE(STATIC) PRIVATE(J, I, SUM, I0)
    !DEC$ PREFETCH A2D
    DO J = 0, SIZE - 1
        I0 = BLS * J
        SUM = 0.D0
        !DEC$ SIMD
        DO I = 0, BLS - 1
            SUM = SUM + A2D(I0 + I) * 2.D0
        END DO
        X(J) = SUM
    END DO
    !$OMP END DO
    !$OMP END PARALLEL
END DO
[/fortran]
I am really confused and need your help.
Sina,
Using:
Intel(R) Visual Fortran Composer XE 2013 SP1 Update 2 Integration for Microsoft Visual Studio* 2010, 14.0.0086.2010, Copyright (C) 2002-2014 Intel Corporation
and your original code without the temp array, the !DEC$ PREFETCH A2D had no effect (this is not an Itanium processor).
The compiler did unroll the loop and performed a SIMD reduction on SUM even without the REDUCTION(+:SUM) clause:
.B1.16:: ; Preds .B1.16 .B1.15
        movaps xmm15, XMMWORD PTR [r14] ;40.29
        add r15, 16 ;39.13
        mulpd xmm15, xmm1 ;40.41
        addpd xmm14, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [16+r14] ;40.29
        mulpd xmm15, xmm1 ;40.41
        addpd xmm2, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [32+r14] ;40.29
        mulpd xmm15, xmm1 ;40.41
        addpd xmm4, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [48+r14] ;40.29
        mulpd xmm15, xmm1 ;40.41
        addpd xmm5, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [64+r14] ;40.29
        mulpd xmm15, xmm1 ;40.41
        addpd xmm11, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [80+r14] ;40.29
        mulpd xmm15, xmm1 ;40.41
        addpd xmm12, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [96+r14] ;40.29
        mulpd xmm15, xmm1 ;40.41
        addpd xmm13, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [112+r14] ;40.29
        add r14, 128 ;39.13
        mulpd xmm15, xmm1 ;40.41
        addpd xmm3, xmm15 ;40.17
        cmp r15, r10 ;39.13
        jb .B1.16 ; Prob 82% ;39.13
In the above, the main portion of the inner loop was unrolled 8 times.
*** The above code is not optimal because xmm15 is used for all memory fetches, and each is immediately followed by a mulpd that is dependent on the completion of the movaps. It would have been more efficient to alternate registers and then interleave the unrolls.
Alternating registers without interleaving (no performance gain, merely used here for illustration):
.B1.16:: ; Preds .B1.16 .B1.15
        movaps xmm15, XMMWORD PTR [r14] ;40.29
        add r15, 16 ;39.13
        mulpd xmm15, xmm1 ;40.41
        addpd xmm14, xmm15 ;40.17
        movaps xmm6, XMMWORD PTR [16+r14] ;40.29
        mulpd xmm6, xmm1 ;40.41
        addpd xmm2, xmm6 ;40.17
        movaps xmm7, XMMWORD PTR [32+r14] ;40.29
        mulpd xmm7, xmm1 ;40.41
        addpd xmm4, xmm7 ;40.17
        movaps xmm8, XMMWORD PTR [48+r14] ;40.29
        mulpd xmm8, xmm1 ;40.41
        addpd xmm5, xmm8 ;40.17
        movaps xmm15, XMMWORD PTR [64+r14] ;40.29
        mulpd xmm15, xmm1 ;40.41
        addpd xmm11, xmm15 ;40.17
        movaps xmm6, XMMWORD PTR [80+r14] ;40.29
        mulpd xmm6, xmm1 ;40.41
        addpd xmm12, xmm6 ;40.17
        movaps xmm7, XMMWORD PTR [96+r14] ;40.29
        mulpd xmm7, xmm1 ;40.41
        addpd xmm13, xmm7 ;40.17
        movaps xmm8, XMMWORD PTR [112+r14] ;40.29
        add r14, 128 ;39.13
        mulpd xmm8, xmm1 ;40.41
        addpd xmm3, xmm8 ;40.17
        cmp r15, r10 ;39.13
        jb .B1.16 ; Prob 82% ;39.13
Note, xmm15 alternates with xmm6, xmm7, xmm8 (assuming that these have been made available for use).
Now using 4-way interleaving of an 8 unrolled loop:
        movaps xmm15, XMMWORD PTR [r14] ;40.29
        movaps xmm6, XMMWORD PTR [16+r14] ;40.29
        movaps xmm7, XMMWORD PTR [32+r14] ;40.29
.B1.16:: ; Preds .B1.16 .B1.15
        movaps xmm8, XMMWORD PTR [48+r14] ;40.29
        add r15, 16 ;39.13
        mulpd xmm15, xmm1 ;40.41
        addpd xmm14, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [64+r14] ;40.29
        mulpd xmm6, xmm1 ;40.41
        addpd xmm2, xmm6 ;40.17
        movaps xmm6, XMMWORD PTR [80+r14] ;40.29
        mulpd xmm7, xmm1 ;40.41
        addpd xmm4, xmm7 ;40.17
        movaps xmm7, XMMWORD PTR [96+r14] ;40.29
        mulpd xmm8, xmm1 ;40.41
        addpd xmm5, xmm8 ;40.17
        movaps xmm8, XMMWORD PTR [112+r14] ;40.29
        add r14, 128 ;39.13
        mulpd xmm15, xmm1 ;40.41
        addpd xmm11, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [r14] ;40.29
        mulpd xmm6, xmm1 ;40.41
        addpd xmm12, xmm6 ;40.17
        movaps xmm6, XMMWORD PTR [16+r14] ;40.29
        mulpd xmm7, xmm1 ;40.41
        addpd xmm13, xmm7 ;40.17
        movaps xmm7, XMMWORD PTR [32+r14] ;40.29
        mulpd xmm8, xmm1 ;40.41
        addpd xmm3, xmm8 ;40.17
        cmp r15, r10 ;39.13
        jb .B1.16 ; Prob 82% ;39.13
Note now how the memory latencies of the movaps instructions are filled by ~10 instructions that do not depend on the move latency.
The optimization gurus at Intel should really look at your test case, for two reasons:
It looks simple enough to occur in many places elsewhere.
And, it is low-hanging fruit (should be easy enough to fix).
Using MM_PREFETCH
!DEC$ SIMD
DO I = 0, BLS - 1
    CALL MM_PREFETCH(A2D(I0 + I + 128))
    SUM = SUM + A2D(I0 + I) * 2.D0
END DO
also produces non-optimal code:
.B1.17:: ; Preds .B1.17 .B1.16
        prefetcht0 BYTE PTR [1024+rbp+r15*8] ;40.22
        movaps xmm15, XMMWORD PTR [r14] ;41.29
        mulpd xmm15, xmm1 ;41.41
        addpd xmm13, xmm15 ;41.17
**      prefetcht0 BYTE PTR [1040+rbp+r15*8] ;40.22
        movaps xmm15, XMMWORD PTR [16+r14] ;41.29
        mulpd xmm15, xmm1 ;41.41
        addpd xmm2, xmm15 ;41.17
**      prefetcht0 BYTE PTR [1056+rbp+r15*8] ;40.22
        movaps xmm15, XMMWORD PTR [32+r14] ;41.29
        mulpd xmm15, xmm1 ;41.41
        addpd xmm3, xmm15 ;41.17
**      prefetcht0 BYTE PTR [1072+rbp+r15*8] ;40.22
        movaps xmm15, XMMWORD PTR [48+r14] ;41.29
        mulpd xmm15, xmm1 ;41.41
        addpd xmm4, xmm15 ;41.17
        prefetcht0 BYTE PTR [1088+rbp+r15*8] ;40.22
        movaps xmm15, XMMWORD PTR [64+r14] ;41.29
        mulpd xmm15, xmm1 ;41.41
        addpd xmm5, xmm15 ;41.17
**      prefetcht0 BYTE PTR [1104+rbp+r15*8] ;40.22
        movaps xmm15, XMMWORD PTR [80+r14] ;41.29
        mulpd xmm15, xmm1 ;41.41
        addpd xmm10, xmm15 ;41.17
**      prefetcht0 BYTE PTR [1120+rbp+r15*8] ;40.22
        movaps xmm15, XMMWORD PTR [96+r14] ;41.29
        mulpd xmm15, xmm1 ;41.41
        addpd xmm11, xmm15 ;41.17
**      prefetcht0 BYTE PTR [1136+rbp+r15*8] ;40.22
        add r15, 16 ;39.13
        movaps xmm15, XMMWORD PTR [112+r14] ;41.29
        add r14, 128 ;39.13
        mulpd xmm15, xmm1 ;41.41
        addpd xmm12, xmm15 ;41.17
        cmp r15, r13 ;39.13
        jb .B1.17 ; Prob 82% ;39.13
Note, the compiler unrolled the loop 8 times as before (and did not interleave), but more importantly, it did not strip the prefetches marked with **. Those prefetches can easily be determined to lie within the same cache line as the non-** lines. Removing those unnecessary prefetches would recover a few clock cycles.
Interleaving would be the biggest gain.
Even using /QxHost on a system with AVX, the compiler unrolled but did not interleave:
.B1.54:: ; Preds .B1.54 .B1.53 ; Infreq
        vmulpd ymm15, ymm1, YMMWORD PTR [rsi+rbx*8] ;43.41
        vaddpd ymm3, ymm3, ymm15 ;43.17
        vmulpd ymm15, ymm1, YMMWORD PTR [32+rsi+rbx*8] ;43.41
        vaddpd ymm2, ymm15, ymm2 ;43.17
        vmulpd ymm15, ymm1, YMMWORD PTR [64+rsi+rbx*8] ;43.41
        vaddpd ymm4, ymm15, ymm4 ;43.17
        vmulpd ymm15, ymm1, YMMWORD PTR [96+rsi+rbx*8] ;43.41
        vaddpd ymm5, ymm15, ymm5 ;43.17
        vmulpd ymm15, ymm1, YMMWORD PTR [128+rsi+rbx*8] ;43.41
        vaddpd ymm13, ymm15, ymm13 ;43.17
        vmulpd ymm15, ymm1, YMMWORD PTR [160+rsi+rbx*8] ;43.41
        vaddpd ymm14, ymm15, ymm14 ;43.17
        vmulpd ymm15, ymm1, YMMWORD PTR [192+rsi+rbx*8] ;43.41
        vaddpd ymm12, ymm15, ymm12 ;43.17
        vmulpd ymm15, ymm1, YMMWORD PTR [224+rsi+rbx*8] ;43.41
        add rbx, 32 ;41.13
        vaddpd ymm11, ymm15, ymm11 ;43.17
        cmp rbx, r15 ;41.13
        jb .B1.54 ; Prob 82% ;41.13
Jim Dempsey
I think you want the prefetch on the innermost loop. However, in cases like yours (streaming loads), the hardware prefetcher typically does better than software prefetching. Try the second code with no prefetching (and no temp array).
Jim Dempsey
I agree with Jim.
It seems unlikely you would choose to write this so verbosely, as all the work with SUM is equivalent to
X(J) = SUM( A2D(I0:I0+BLS-1)*2 )
If you want to learn about the prefetch directives, including some examples where it's likely to help in sparse matrix operations, see
https://software.intel.com/sites/default/files/article/326703/5.3-prefetching-on-mic-5.pdf
You could skip down to the Fortran examples on indirect prefetch, which would be valid for all CPUs supported by ifort.
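To sketch the indirect-prefetch idea in the sparse setting (a hedged illustration: the names ROWPTR, COL, VAL, Y, and DIST are hypothetical, not taken from the linked document), a CSR-style row sum gathers X through an index array, an access pattern the hardware prefetcher cannot predict, so prefetching a future gather target by hand can help:

```fortran
! Hypothetical CSR sparse row sum: the gather X(COL(K)) is where
! software prefetch can beat the hardware prefetcher.
INTEGER, PARAMETER :: DIST = 16    ! prefetch distance in elements (tune)
DOUBLE PRECISION :: S
INTEGER :: K
S = 0.D0
DO K = ROWPTR(J), ROWPTR(J + 1) - 1
    ! Prefetch the X element we will gather DIST iterations from now
    IF (K + DIST < ROWPTR(J + 1)) CALL MM_PREFETCH(X(COL(K + DIST)), 1)
    S = S + VAL(K) * X(COL(K))
END DO
Y(J) = S
```

The guard keeps the prefetch index inside the row; the distance must be tuned per machine.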
As I described above, the first code is 3-4 times faster, and in my opinion that should be due to a memory caching effect. I have read the specified document on prefetching and tried all the possible configurations; the PREFETCH directive and the MM_PREFETCH intrinsic do not work at all. In fact, now I just want to know why exactly the first code runs faster than any other possible combination.
Thanks for your attention; I look forward to hearing from you.
IVF has options to generate vectorization reports. These reports may give insight as to what is going on.
The following assumes the compiler options and runtime environment are the same (in other words, you have already assured they are running on an equal footing).
In the first piece of code you have the extra step of copying the data to a local array, but then reusing that data in the summation loop.
In the second piece of code you omit the copy to local array and fetch the data as you perform the summation.
Your BLS is 21 * 25 = 525, which is not a multiple of 2 (SSE) or 4 (AVX).
The first code, with the copy loop, will efficiently copy aligned or unaligned data from A2D to an aligned TEMP array, then use known-aligned data for the summation loop (data which is also contained within L1 cache).
The second code will generate unaligned instructions (due to the unknown alignment of the input data) and incur penalties when the data is not aligned.
The above are my presumptions; VTune will tell you what is really going on.
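One way to test the alignment presumption (a sketch under assumptions: BLSP is a name invented here, the padding scheme is illustrative, and directive support varies by ifort version) is to pad each block to a multiple of the vector width so that every block of A2D starts on the same alignment:

```fortran
! Sketch: pad the block length 525 up to 528 (a multiple of 4 doubles,
! i.e. 32 bytes for AVX) so each block starts on the same alignment,
! then assert alignment to the compiler. Treat this as an experiment.
INTEGER :: BLSP
BLSP = ((BLS + 3) / 4) * 4             ! 525 -> 528
ALLOCATE(A2D(0:BLSP * SIZE - 1))       ! also compile with /align:array32byte
DO J = 0, SIZE - 1
    I0 = BLSP * J
    SUM = 0.D0
    !DEC$ VECTOR ALIGNED
    DO I = 0, BLS - 1
        SUM = SUM + A2D(I0 + I) * 2.D0
    END DO
    X(J) = SUM
END DO
```

This trades about 0.6% extra memory for uniformly aligned block starts; VTune or a simple timing run would show whether the unaligned penalty was the real cost.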
Jim Dempsey
Dear Jim
Thanks for your useful reply. I agree: as you said, it should come down to L1 cache usage. I conclude that using local variables, together with the SIMD directive for the memory fetch, makes the compiler keep the prescribed variable in L1 cache. I wonder whether there might be a better approach for bringing a specific range of data from memory into cache. In fact, I thought features like the PREFETCH directive and the MM_PREFETCH intrinsic function were supposed to force similar behavior, but it seems they have no particular influence on the program's memory management.
Thanks again for your reply; I look forward to hearing from you.
>> I wonder whether there might be another better approach for bringing a specific range of data from memory to cache.
Well - maybe there is.
If your CPU has the TSX extensions (I do not have one that does), then in the first example code you presented (the one with the copy-to-temp-array loop), and if you set your block size smaller than the number of bufferable writes supported by your processor's TSX, then by placing the blocking loop into a transaction-protected region the copy to the temp array will not, at the time of the copy, perform writes to RAM. What is not known (to me) is what happens in a transactional region when you invalidate cache lines. IF you can invalidate cache lines AND doing so eliminates the write of the TEMP array to RAM, then you can insert a cache-line-invalidation loop for the temp array at the bottom of the block loop (inside the transactional region).
Note, if the above fully works, it would reduce the memory bandwidth requirements of this example by 30% (low-hanging fruit for CPU optimizations).
An alternative (not available) would be to extend the instruction set to permit something along the line of:
mov rcx, count; lea rsi, address; rep newPrefetch [rsi]
What the above would do (depending on implementation) is lift the prefetch instruction(s) out of the loop (reducing the instruction count in the loop), and/or inform the hardware prefetcher of the anticipated fetch pattern so that hardware prefetching can occur earlier.
The new instruction sequence could also be
newPrefetch [someAddress], count
where count is either immediate or register
An alternate form could be
newPrefetch [someAddress], stride << 32 + count;
Jim Dempsey
Jim, thanks for your help!
Dear Jim
I just don't understand the effect of the PREFETCH directive on memory management. It seems it does nothing at all, yet Intel's documentation says it forces such behavior.
For detailed information on software prefetching, consult the IVF documentation and do a Google search for:
software prefetch optimization site:intel.com
Then select a few of the promising hits.
In general terms, as you have programmed your sample code, the DO loop will have been vectorized and the iterations advance a vector at a time (with possible exceptions for unaligned data at the front and residual data at the end of the loop). When software prefetching is used, the compiler inserts into your code stream, either preceding or following
someRegister = [someAddress]
a prefetch instruction
prefetch [someAddress + someOffset]
When the !DEC$ PREFETCH directive lacks any further qualifiers, the compiler determines an appropriate offset. This may be a few cache lines ahead. You can (may) specify the distance ahead (in terms of cache lines) and the stride, in the event your reads are not adjacent.
Using software prefetching is going to cost you at least one clock cycle. Therefore, if prefetching is unwarranted under the circumstances, using it will slow you down a bit. Also, when used inappropriately, it can unnecessarily consume memory bandwidth. Experiment with it and learn under which circumstances it works and does not. There is no panacea with prefetching.
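As a concrete, hedged example of tuning the distance with MM_PREFETCH (PF_DIST is a starting guess to be measured, not a recommendation):

```fortran
! Sketch: the same summation loop with an explicit prefetch distance.
! PF_DIST is in array elements; too small hides no latency, too large
! can evict data before it is used. Prefetching past the end of A2D is
! safe: prefetch instructions do not fault on invalid addresses.
INTEGER, PARAMETER :: PF_DIST = 64
SUM = 0.D0
DO I = 0, BLS - 1
    CALL MM_PREFETCH(A2D(I0 + I + PF_DIST), 1)   ! hint 1 -> prefetcht1
    SUM = SUM + A2D(I0 + I) * 2.D0
END DO
```

Timing this with PF_DIST swept over, say, 16 to 512 is the quickest way to learn whether any distance helps on a given machine.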
Jim Dempsey
There's so much information in the references I gave earlier that I don't see us guessing what information you were looking for and didn't find.
Further to Jim's hint, you would want to turn on the relevant opt-report (the specific numbers change among ifort versions) about what the compiler chose to do with your prefetch directives (and what is the effect of various levels of /Qopt-prefetch).
You would probably want to work with an example relevant to your project where there is a reasonable expectation of prefetch making a difference.
Thanks, Jim, for your help. I have studied the documents before and checked many examples, but in the end I don't get reasonable results.
For Tim's information: if you try several test examples, you will find that !DEC$ PREFETCH does nothing.
I will use my own approach of working with local variables; thanks all for your contributions.
Depending on the version of your compiler, it may indeed be ignored.
Alternatively, there is the intrinsic subroutine MM_PREFETCH (look in the IVF index under "prefetches of data"):
subroutine spread_lf (a, b)
    PARAMETER (n = 1025)
    real*8 a(n,n), b(n,n), c(n)
    do j = 1,n
        do i = 1,100
            a(i, j) = b(i-1, j) + b(i+1, j)
            call mm_prefetch (a(i+20, j), 1)
            call mm_prefetch (b(i+21, j), 1)
        enddo
    enddo
    print *, a(2, 567)
    stop
end
Jim Dempsey
Dear Jim
I know the intrinsic function, but it also does not work at all; it seems the compiler ignores the intrinsic function call.
Thanks again for your attention.
The compiler doesn't ignore the call. What makes you think it does?
;;;     a(i, j) = b(i-1, j) + b(i+1, j)
;;;     call mm_prefetch (a(i+20, j), 1)
        prefetcht1 BYTE PTR [160+edx] ;8.11
        mov esi, DWORD PTR [12+ebp] ;7.16
;;;     call mm_prefetch (b(i+21, j), 1)
        mov edi, DWORD PTR [16+esp] ;9.11
        movsd xmm0, QWORD PTR [-8+eax+esi] ;7.16
        addsd xmm0, QWORD PTR [8+eax+esi] ;7.6
        movsd QWORD PTR [edx], xmm0 ;7.6
        prefetcht1 BYTE PTR [168+edi] ;9.11
        mov esi, 1 ;
As I described above, I have tried various configurations of the PREFETCH directive and the MM_PREFETCH intrinsic function in my Fortran code, following all the details described in Intel's documentation, but they have no influence on the CPU time, while using a local variable (in the way shown in my first code example) noticeably decreases the run time.
Sina,
I found this:
There are two primary methods of issuing prefetch instructions. One is by using compiler directives and the other is by using compiler intrinsics.
PREFETCH and NOPREFETCH Directives
The PREFETCH and NOPREFETCH directives are supported by Itanium® processors only.
Therefore, unless the compiler has changed, your Core i7 Intel64/IA-32 build won't compile the directive into a prefetch. The compiler may have changed to support !DEC$ PREFETCH; Tim or Steve might have more to add on this.
If the prefetches using MM_PREFETCH are being inserted into your code AND you see no beneficial effect, then a potential cause is that your prefetch is not far enough ahead of the execution.
Can you show how you are using MM_PREFETCH and, using the debug disassembly window or VTune, capture the text of the assembly code of the inner loop (just the inner loop)?
Jim Dempsey
You can also compile with the option to generate an assembly listing. From the command line this is /FAs - it's also in the Output Files property page in Visual Studio.
The PREFETCH directive is currently supported for Intel Xeon Phi only. MM_PREFETCH is supported on all processors.
Prefetching doesn't necessarily improve performance, especially on the last several generations of Intel processors. (An exception is Xeon Phi). It was of more use on Pentium III and Pentium 4. Using VTune Amplifier XE to analyze application performance and looking for memory stalls is a far better use of your time than a scattershot guessing approach.
Sina
Try SIMD REDUCTION:
SUM = 0.D0
!DEC$ SIMD REDUCTION(+:SUM)
DO I = 0, BLS - 1
    SUM = SUM + A2D(I0 + I) * 2.D0
END DO
Jim Dempsey