Hi everyone,
I am working on optimizing sparse algorithms with Intel's Fortran compiler. After applying various optimization features, I want to make good use of data prefetching and cache utilization. To that end, I tested several plausible configurations of prefetch directives and intrinsic functions on both Intel Core i7 and AMD APU processors, but I did not get the expected results. In one specific case, however, I think I get real prefetching, which gives me a 3-4x speedup.
Following is the faster code:
[fortran]
DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: A2D, X, TEMP
DOUBLE PRECISION :: SUM
INTEGER :: SIZE, I, J, COUNT, BLS, I0

SIZE = 1000000
BLS = 21 * 25
ALLOCATE(A2D(0:BLS * SIZE - 1))
ALLOCATE(X(0:SIZE - 1))
ALLOCATE(TEMP(0:BLS - 1))

!DEC$ SIMD
DO J = 0, SIZE - 1
    DO I = 0, BLS - 1
        A2D(BLS * J + I) = I + J
    END DO
END DO

DO COUNT = 0, 50
    !$OMP PARALLEL SHARED(A2D, X, SIZE, BLS)
    !$OMP DO SCHEDULE(STATIC) PRIVATE(J, I, SUM, TEMP, I0)
    !DEC$ SIMD
    DO J = 0, SIZE - 1
        I0 = BLS * J
        DO I = 0, BLS - 1
            TEMP(I) = A2D(I0 + I)
        END DO
        SUM = 0.D0
        DO I = 0, BLS - 1
            SUM = SUM + TEMP(I) * 2.D0
        END DO
        X(J) = SUM
    END DO
    !$OMP END DO
    !$OMP END PARALLEL
END DO
[/fortran]
The following is the code I expected to be correct, but it is around 4 times slower (I think because the prefetch directive does not work):
[fortran]
DOUBLE PRECISION, DIMENSION(:), ALLOCATABLE :: A2D, X
DOUBLE PRECISION :: SUM
INTEGER :: SIZE, I, J, COUNT, BLS, I0

SIZE = 1000000
BLS = 21 * 25
ALLOCATE(A2D(0:BLS * SIZE - 1))
ALLOCATE(X(0:SIZE - 1))

!DEC$ SIMD
DO J = 0, SIZE - 1
    DO I = 0, BLS - 1
        A2D(BLS * J + I) = I + J
    END DO
END DO

DO COUNT = 0, 50
    !$OMP PARALLEL SHARED(A2D, X, SIZE, BLS)
    !$OMP DO SCHEDULE(STATIC) PRIVATE(J, I, SUM, I0)
    !DEC$ PREFETCH A2D
    DO J = 0, SIZE - 1
        I0 = BLS * J
        SUM = 0.D0
        !DEC$ SIMD
        DO I = 0, BLS - 1
            SUM = SUM + A2D(I0 + I) * 2.D0
        END DO
        X(J) = SUM
    END DO
    !$OMP END DO
    !$OMP END PARALLEL
END DO
[/fortran]
I am really confused and need your help.
Sina,
Using:
Intel(R) Visual Fortran Composer XE 2013 SP1 Update 2 Integration for Microsoft Visual Studio* 2010, 14.0.0086.2010, Copyright (C) 2002-2014 Intel Corporation
and your original code without the temp array, the !DEC$ PREFETCH A2D had no effect (this is not an Itanium processor).
The compiler did unroll the loop and performed a SIMD reduction on SUM even without the REDUCTION(+:SUM) clause:
.B1.16:: ; Preds .B1.16 .B1.15
        movaps xmm15, XMMWORD PTR [r14] ;40.29
        add r15, 16 ;39.13
        mulpd xmm15, xmm1 ;40.41
        addpd xmm14, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [16+r14] ;40.29
        mulpd xmm15, xmm1 ;40.41
        addpd xmm2, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [32+r14] ;40.29
        mulpd xmm15, xmm1 ;40.41
        addpd xmm4, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [48+r14] ;40.29
        mulpd xmm15, xmm1 ;40.41
        addpd xmm5, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [64+r14] ;40.29
        mulpd xmm15, xmm1 ;40.41
        addpd xmm11, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [80+r14] ;40.29
        mulpd xmm15, xmm1 ;40.41
        addpd xmm12, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [96+r14] ;40.29
        mulpd xmm15, xmm1 ;40.41
        addpd xmm13, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [112+r14] ;40.29
        add r14, 128 ;39.13
        mulpd xmm15, xmm1 ;40.41
        addpd xmm3, xmm15 ;40.17
        cmp r15, r10 ;39.13
        jb .B1.16 ; Prob 82% ;39.13
In the above, the main portion of the inner loop was unrolled 8 times.
*** The above code is not optimal because xmm15 is used for all memory fetches, and each is immediately followed by a mulpd that is dependent on the completion of the movaps. It would have been more efficient to alternate registers and then interleave the unrolls.
Alternating registers without interleaving (no performance gain, merely used here for illustration):
.B1.16:: ; Preds .B1.16 .B1.15
        movaps xmm15, XMMWORD PTR [r14] ;40.29
        add r15, 16 ;39.13
        mulpd xmm15, xmm1 ;40.41
        addpd xmm14, xmm15 ;40.17
        movaps xmm6, XMMWORD PTR [16+r14] ;40.29
        mulpd xmm6, xmm1 ;40.41
        addpd xmm2, xmm6 ;40.17
        movaps xmm7, XMMWORD PTR [32+r14] ;40.29
        mulpd xmm7, xmm1 ;40.41
        addpd xmm4, xmm7 ;40.17
        movaps xmm8, XMMWORD PTR [48+r14] ;40.29
        mulpd xmm8, xmm1 ;40.41
        addpd xmm5, xmm8 ;40.17
        movaps xmm15, XMMWORD PTR [64+r14] ;40.29
        mulpd xmm15, xmm1 ;40.41
        addpd xmm11, xmm15 ;40.17
        movaps xmm6, XMMWORD PTR [80+r14] ;40.29
        mulpd xmm6, xmm1 ;40.41
        addpd xmm12, xmm6 ;40.17
        movaps xmm7, XMMWORD PTR [96+r14] ;40.29
        mulpd xmm7, xmm1 ;40.41
        addpd xmm13, xmm7 ;40.17
        movaps xmm8, XMMWORD PTR [112+r14] ;40.29
        add r14, 128 ;39.13
        mulpd xmm8, xmm1 ;40.41
        addpd xmm3, xmm8 ;40.17
        cmp r15, r10 ;39.13
        jb .B1.16 ; Prob 82% ;39.13
Note, xmm15 alternates with xmm6, xmm7, xmm8 (assuming that these have been made available for use).
Now using 4-way interleaving of an 8 unrolled loop:
        movaps xmm15, XMMWORD PTR [r14] ;40.29
        movaps xmm6, XMMWORD PTR [16+r14] ;40.29
        movaps xmm7, XMMWORD PTR [32+r14] ;40.29
.B1.16:: ; Preds .B1.16 .B1.15
        movaps xmm8, XMMWORD PTR [48+r14] ;40.29
        add r15, 16 ;39.13
        mulpd xmm15, xmm1 ;40.41
        addpd xmm14, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [64+r14] ;40.29
        mulpd xmm6, xmm1 ;40.41
        addpd xmm2, xmm6 ;40.17
        movaps xmm6, XMMWORD PTR [80+r14] ;40.29
        mulpd xmm7, xmm1 ;40.41
        addpd xmm4, xmm7 ;40.17
        movaps xmm7, XMMWORD PTR [96+r14] ;40.29
        mulpd xmm8, xmm1 ;40.41
        addpd xmm5, xmm8 ;40.17
        movaps xmm8, XMMWORD PTR [112+r14] ;40.29
        add r14, 128 ;39.13
        mulpd xmm15, xmm1 ;40.41
        addpd xmm11, xmm15 ;40.17
        movaps xmm15, XMMWORD PTR [r14] ;40.29
        mulpd xmm6, xmm1 ;40.41
        addpd xmm12, xmm6 ;40.17
        movaps xmm6, XMMWORD PTR [16+r14] ;40.29
        mulpd xmm7, xmm1 ;40.41
        addpd xmm13, xmm7 ;40.17
        movaps xmm7, XMMWORD PTR [32+r14] ;40.29
        mulpd xmm8, xmm1 ;40.41
        addpd xmm3, xmm8 ;40.17
        cmp r15, r10 ;39.13
        jb .B1.16 ; Prob 82% ;39.13
Note now how the memory latencies of the movaps instructions are filled by ~10 instructions that do not depend on the move latency.
The optimization gurus at Intel should really look at your test case, for two reasons:
It looks simple enough to occur in many places elsewhere.
And, it is low-hanging fruit (should be easy enough to fix).
Using MM_PREFETCH
!DEC$ SIMD
DO I = 0, BLS - 1
    CALL MM_PREFETCH(A2D(I0 + I + 128))
    SUM = SUM + A2D(I0 + I) * 2.D0
END DO
also produces non-optimal code:
.B1.17:: ; Preds .B1.17 .B1.16
        prefetcht0 BYTE PTR [1024+rbp+r15*8] ;40.22
        movaps xmm15, XMMWORD PTR [r14] ;41.29
        mulpd xmm15, xmm1 ;41.41
        addpd xmm13, xmm15 ;41.17
**      prefetcht0 BYTE PTR [1040+rbp+r15*8] ;40.22
        movaps xmm15, XMMWORD PTR [16+r14] ;41.29
        mulpd xmm15, xmm1 ;41.41
        addpd xmm2, xmm15 ;41.17
**      prefetcht0 BYTE PTR [1056+rbp+r15*8] ;40.22
        movaps xmm15, XMMWORD PTR [32+r14] ;41.29
        mulpd xmm15, xmm1 ;41.41
        addpd xmm3, xmm15 ;41.17
**      prefetcht0 BYTE PTR [1072+rbp+r15*8] ;40.22
        movaps xmm15, XMMWORD PTR [48+r14] ;41.29
        mulpd xmm15, xmm1 ;41.41
        addpd xmm4, xmm15 ;41.17
        prefetcht0 BYTE PTR [1088+rbp+r15*8] ;40.22
        movaps xmm15, XMMWORD PTR [64+r14] ;41.29
        mulpd xmm15, xmm1 ;41.41
        addpd xmm5, xmm15 ;41.17
**      prefetcht0 BYTE PTR [1104+rbp+r15*8] ;40.22
        movaps xmm15, XMMWORD PTR [80+r14] ;41.29
        mulpd xmm15, xmm1 ;41.41
        addpd xmm10, xmm15 ;41.17
**      prefetcht0 BYTE PTR [1120+rbp+r15*8] ;40.22
        movaps xmm15, XMMWORD PTR [96+r14] ;41.29
        mulpd xmm15, xmm1 ;41.41
        addpd xmm11, xmm15 ;41.17
**      prefetcht0 BYTE PTR [1136+rbp+r15*8] ;40.22
        add r15, 16 ;39.13
        movaps xmm15, XMMWORD PTR [112+r14] ;41.29
        add r14, 128 ;39.13
        mulpd xmm15, xmm1 ;41.41
        addpd xmm12, xmm15 ;41.17
        cmp r15, r13 ;39.13
        jb .B1.17 ; Prob 82% ;39.13
Note, the compiler unrolled the loop 8 times as before (and did not interleave), but more importantly, it did not strip the prefetches marked with **. Those prefetches can easily be determined to lie within the same cache line as the non-** lines. Removing those unnecessary prefetches would recover a few clock cycles.
Interleaving would be the biggest gain.
Even using /QxHost on a system with AVX, the compiler unrolled but did not interleave:
.B1.54:: ; Preds .B1.54 .B1.53 ; Infreq
        vmulpd ymm15, ymm1, YMMWORD PTR [rsi+rbx*8] ;43.41
        vaddpd ymm3, ymm3, ymm15 ;43.17
        vmulpd ymm15, ymm1, YMMWORD PTR [32+rsi+rbx*8] ;43.41
        vaddpd ymm2, ymm15, ymm2 ;43.17
        vmulpd ymm15, ymm1, YMMWORD PTR [64+rsi+rbx*8] ;43.41
        vaddpd ymm4, ymm15, ymm4 ;43.17
        vmulpd ymm15, ymm1, YMMWORD PTR [96+rsi+rbx*8] ;43.41
        vaddpd ymm5, ymm15, ymm5 ;43.17
        vmulpd ymm15, ymm1, YMMWORD PTR [128+rsi+rbx*8] ;43.41
        vaddpd ymm13, ymm15, ymm13 ;43.17
        vmulpd ymm15, ymm1, YMMWORD PTR [160+rsi+rbx*8] ;43.41
        vaddpd ymm14, ymm15, ymm14 ;43.17
        vmulpd ymm15, ymm1, YMMWORD PTR [192+rsi+rbx*8] ;43.41
        vaddpd ymm12, ymm15, ymm12 ;43.17
        vmulpd ymm15, ymm1, YMMWORD PTR [224+rsi+rbx*8] ;43.41
        add rbx, 32 ;41.13
        vaddpd ymm11, ymm15, ymm11 ;43.17
        cmp rbx, r15 ;41.13
        jb .B1.54 ; Prob 82% ;41.13
Jim Dempsey
I think you want the prefetch on the innermost loop. However, in cases like yours (streaming loads), the hardware prefetcher typically does better than software prefetching. Try the second code with no prefetching (and no temp array).
Jim Dempsey
I agree with Jim.
It seems unlikely you would choose to write this so verbosely, as all the work with SUM is equivalent to
X(J) = SUM( A2D(I0:I0+BLS-1)*2 )
If you want to learn about the prefetch directives, including some examples where it's likely to help in sparse matrix operations, see
https://software.intel.com/sites/default/files/article/326703/5.3-prefetching-on-mic-5.pdf
You could skip down to the Fortran examples on indirect prefetch, which would be valid for all CPUs supported by ifort.
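To sketch the indirect-prefetch idea in the sparse setting (a hedged illustration: the names ROWPTR, COL, VAL, Y, and DIST are hypothetical, not taken from the linked document), a CSR-style row sum gathers X through an index array, an access pattern the hardware prefetcher cannot predict, so prefetching a future gather target by hand can help:

```fortran
! Hypothetical CSR sparse row sum: the gather X(COL(K)) is where
! software prefetch can beat the hardware prefetcher.
INTEGER, PARAMETER :: DIST = 16    ! prefetch distance in elements (tune)
DOUBLE PRECISION :: S
INTEGER :: K
S = 0.D0
DO K = ROWPTR(J), ROWPTR(J + 1) - 1
    ! Prefetch the X element we will gather DIST iterations from now
    IF (K + DIST < ROWPTR(J + 1)) CALL MM_PREFETCH(X(COL(K + DIST)), 1)
    S = S + VAL(K) * X(COL(K))
END DO
Y(J) = S
```

The guard keeps the prefetch index inside the row; the distance must be tuned per machine.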
As I described above, the first code is 3-4 times faster, and in my opinion that should be due to a memory caching effect. I have read the specified document on prefetching and tried all the possible configurations; the PREFETCH directive and the MM_PREFETCH intrinsic do not work at all. In fact, now I just want to know why exactly the first code runs faster than any other possible combination.
Thanks for your attention; I look forward to hearing from you.
IVF has options to generate vectorization reports. These reports may give insight as to what is going on.
The following assumes the compiler options and runtime environment are the same (in other words, you have already assured they are running on an equal footing).
In the first piece of code you have the extra step of copying the data to a local array, but then reusing that data in the summation loop.
In the second piece of code you omit the copy to local array and fetch the data as you perform the summation.
Your BLS is 21 * 25 = 525, which is not a multiple of 2 (SSE) or 4 (AVX).
The first code, with the copy loop, will efficiently copy aligned or unaligned data from A2D to an aligned TEMP array, then use known-aligned data for the summation loop (data which is also contained within L1 cache).
The second code will generate unaligned instructions (due to the unknown alignment of the input data) and incur penalties when the data is not aligned.
The above are my presumptions; VTune will tell you what is really going on.
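One way to test the alignment presumption (a sketch under assumptions: BLSP is a name invented here, the padding scheme is illustrative, and directive support varies by ifort version) is to pad each block to a multiple of the vector width so that every block of A2D starts on the same alignment:

```fortran
! Sketch: pad the block length 525 up to 528 (a multiple of 4 doubles,
! i.e. 32 bytes for AVX) so each block starts on the same alignment,
! then assert alignment to the compiler. Treat this as an experiment.
INTEGER :: BLSP
BLSP = ((BLS + 3) / 4) * 4             ! 525 -> 528
ALLOCATE(A2D(0:BLSP * SIZE - 1))       ! also compile with /align:array32byte
DO J = 0, SIZE - 1
    I0 = BLSP * J
    SUM = 0.D0
    !DEC$ VECTOR ALIGNED
    DO I = 0, BLS - 1
        SUM = SUM + A2D(I0 + I) * 2.D0
    END DO
    X(J) = SUM
END DO
```

This trades about 0.6% extra memory for uniformly aligned block starts; VTune or a simple timing run would show whether the unaligned penalty was the real cost.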
Jim Dempsey
Dear Jim
Thanks for your useful reply. I agree: as you said, it should come down to L1 cache usage. I conclude that using local variables, together with the SIMD directive for the memory fetch, makes the compiler keep the prescribed variable in L1 cache. I wonder whether there might be a better approach for bringing a specific range of data from memory into cache. In fact, I thought features like the PREFETCH directive and the MM_PREFETCH intrinsic function were supposed to force similar behavior, but it seems they have no particular influence on the program's memory management.
Thanks again for your reply; I look forward to hearing from you.
>> I wonder whether there might be another better approach for bringing a specific range of data from memory to cache.
Well - maybe there is.
If your CPU has the TSX extensions (I do not have one that does), then in the first example code you presented (the one with the copy-to-temp-array loop), and if you set your block size smaller than the number of bufferable writes supported by your processor's TSX, then by placing the blocking loop into a transaction-protected region the copy to the temp array will not, at the time of the copy, perform writes to RAM. What is not known (to me) is what happens in a transactional region when you invalidate cache lines. IF you can invalidate cache lines AND doing so eliminates the write of the TEMP array to RAM, then you can insert a cache-line-invalidation loop for the temp array at the bottom of the block loop (inside the transactional region).
Note, if the above fully works, it would reduce the memory bandwidth requirements of this example by 30% (low-hanging fruit for CPU optimizations).
An alternative (not available) would be to extend the instruction set to permit something along the line of:
mov rcx, count; lea rsi, address; rep newPrefetch [rsi]
What the above would do (depending on implementation) is lift the prefetch instruction(s) out of the loop (reducing the instruction count in the loop), and/or inform the hardware prefetcher of the anticipated fetch pattern so that hardware prefetching can occur earlier.
The new instruction sequence could also be
newPrefetch [someAddress], count
where count is either immediate or register
An alternate form could be
newPrefetch [someAddress], stride << 32 + count;
Jim Dempsey
Jim, thanks for your help!
Dear Jim
I just don't understand the effect of the PREFETCH directive on memory management. It seems it does nothing at all, yet Intel's documentation says it forces such behavior.
For detailed information on software prefetching, consult the IVF documentation and do a Google search for:
software prefetch optimization site:intel.com
Then select a few of the promising hits.
In general terms, as you have programmed your sample code, the DO loop will have been vectorized and the iterations advance a vector at a time (with possible exceptions for unaligned data at the front and residual data at the end of the loop). When software prefetching is used, the compiler inserts into your code stream, either preceding or following
someRegister = [someAddress]
a prefetch instruction
prefetch [someAddress + someOffset]
When the !DEC$ PREFETCH directive lacks any further qualifiers, the compiler determines an appropriate offset. This may be a few cache lines ahead. You can (may) specify the distance ahead (in terms of cache lines) and the stride, in the event your reads are not adjacent.
Using software prefetching is going to cost you at least one clock cycle. Therefore, if prefetching is unwarranted under the circumstances, using it will slow you down a bit. Also, when used inappropriately, it can unnecessarily consume memory bandwidth. Experiment with it and learn under which circumstances it works and does not. There is no panacea with prefetching.
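As a concrete, hedged example of tuning the distance with MM_PREFETCH (PF_DIST is a starting guess to be measured, not a recommendation):

```fortran
! Sketch: the same summation loop with an explicit prefetch distance.
! PF_DIST is in array elements; too small hides no latency, too large
! can evict data before it is used. Prefetching past the end of A2D is
! safe: prefetch instructions do not fault on invalid addresses.
INTEGER, PARAMETER :: PF_DIST = 64
SUM = 0.D0
DO I = 0, BLS - 1
    CALL MM_PREFETCH(A2D(I0 + I + PF_DIST), 1)   ! hint 1 -> prefetcht1
    SUM = SUM + A2D(I0 + I) * 2.D0
END DO
```

Timing this with PF_DIST swept over, say, 16 to 512 is the quickest way to learn whether any distance helps on a given machine.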
Jim Dempsey
There's so much information in the references I gave earlier that I don't see us guessing what information you were looking for and didn't find.
Further to Jim's hint, you would want to turn on the relevant opt-report (the specific numbers change among ifort versions) about what the compiler chose to do with your prefetch directives (and what is the effect of various levels of /Qopt-prefetch).
You would probably want to work with an example relevant to your project where there is a reasonable expectation of prefetch making a difference.
Thanks, Jim, for your help. I have studied the documents before and checked many examples, but in the end I don't get reasonable results.
For Tim's information: if you try several test examples, you will find that !DEC$ PREFETCH does nothing.
I will use my own approach of working with local variables; thanks all for your contributions.
Depending on the version of your compiler, it may indeed be ignored.
Alternatively, there is the intrinsic subroutine MM_PREFETCH (look in the IVF index under "prefetches of data"):
subroutine spread_lf (a, b)
    PARAMETER (n = 1025)
    real*8 a(n,n), b(n,n), c(n)
    do j = 1,n
        do i = 1,100
            a(i, j) = b(i-1, j) + b(i+1, j)
            call mm_prefetch (a(i+20, j), 1)
            call mm_prefetch (b(i+21, j), 1)
        enddo
    enddo
    print *, a(2, 567)
    stop
end
Jim Dempsey
Dear Jim
I know the intrinsic function, but it also does not work at all; it seems the compiler ignores the intrinsic function call.
Thanks again for your attention.
The compiler doesn't ignore the call. What makes you think it does?
;;;     a(i, j) = b(i-1, j) + b(i+1, j)
;;;     call mm_prefetch (a(i+20, j), 1)
        prefetcht1 BYTE PTR [160+edx] ;8.11
        mov esi, DWORD PTR [12+ebp] ;7.16
;;;     call mm_prefetch (b(i+21, j), 1)
        mov edi, DWORD PTR [16+esp] ;9.11
        movsd xmm0, QWORD PTR [-8+eax+esi] ;7.16
        addsd xmm0, QWORD PTR [8+eax+esi] ;7.6
        movsd QWORD PTR [edx], xmm0 ;7.6
        prefetcht1 BYTE PTR [168+edi] ;9.11
        mov esi, 1 ;
As I described above, I have tried various configurations of the PREFETCH directive and the MM_PREFETCH intrinsic function in my Fortran code, following all the details described in Intel's documentation, but they have no influence on the CPU time, while using a local variable (in the way shown in my first code example) noticeably decreases the run time.
Sina,
I found this:
There are two primary methods of issuing prefetch instructions. One is by using compiler directives and the other is by using compiler intrinsics.
PREFETCH and NOPREFETCH Directives
The PREFETCH and NOPREFETCH directives are supported by Itanium® processors only.
Therefore, unless the compiler has changed, your Core i7 Intel64/IA-32 build won't compile the directive into a prefetch. The compiler may have changed to support !DEC$ PREFETCH; Tim or Steve might have more to add on this.
If the prefetches using MM_PREFETCH are being inserted into your code AND you see no beneficial effect, then a potential cause is that your prefetch is not far enough ahead of the execution.
Can you show how you are using MM_PREFETCH and, using the debug disassembly window or VTune, capture the text of the assembly code of the inner loop (just the inner loop)?
Jim Dempsey
You can also compile with the option to generate an assembly listing. From the command line this is /FAs - it's also in the Output Files property page in Visual Studio.
The PREFETCH directive is currently supported for Intel Xeon Phi only. MM_PREFETCH is supported on all processors.
Prefetching doesn't necessarily improve performance, especially on the last several generations of Intel processors. (An exception is Xeon Phi). It was of more use on Pentium III and Pentium 4. Using VTune Amplifier XE to analyze application performance and looking for memory stalls is a far better use of your time than a scattershot guessing approach.
Sina
Try SIMD REDUCTION:
SUM = 0.D0
!DEC$ SIMD REDUCTION(+:SUM)
DO I = 0, BLS - 1
    SUM = SUM + A2D(I0 + I) * 2.D0
END DO
Jim Dempsey