Hi,
I have a question about AVX instructions. I compiled my code using ifort 13 with -O2 and -xHost. I want to enable 256-bit wide AVX to perform four 64-bit floating-point operations per cycle.
Here is my first code piece:
623 !DIR$ SIMD
624 do ii = 1, Nc
625    ! diagonal components first
626    StrnRt(ii,1) = JAC(ii) * ( &
627        MT1(ii,1) * VelGrad1st(ii,1) &
628      + MT1(ii,2) * VelGrad1st(ii,3) )
...
640 end do
The assembly files show that the following instructions were generated for line 627:
vmulsd 8(%r8,%r14,8), %xmm1, %xmm3 #627.38
vmulpd %xmm6, %xmm5, %xmm11 #627.38
vmulpd %ymm5, %ymm4, %ymm10 #627.38
I understand why I got vmulsd. My question is why vmulpd %xmm6, %xmm5, %xmm11 was generated and what it stands for. I think vmulpd is an AVX instruction and should use ymm registers to give 256-bit wide vectorization.
For the second code piece:
643 !DIR$ SIMD
644 do ii = 1, Nc
645    ! diagonal components first
646    StrnRt(ii,1) = JAC(ii) * ( &
647        MT1(ii,1) * VelGrad1st(ii,1) &
648      + MT1(ii,2) * VelGrad1st(ii,4) &
649      + MT1(ii,3) * VelGrad1st(ii,7) )
...
685 end do
The assembly files show that the following instructions were generated for line 647:
vmulsd (%r12), %xmm4, %xmm6 #647.38
vmulpd %xmm11, %xmm10, %xmm0 #647.38
Here again I got vmulpd with xmm; I did not get vmulpd with ymm at all. I am worried that this code piece is only performing two 64-bit floating-point operations per cycle, rather than four.
I truly appreciate your help.
Best regards,
Wentao
In order to generate AVX instructions, you need -mAVX or -xHost.
When the compiler does not know the alignment of the arrays, it might revert to using xmm registers. A cache-line-split load, which may be permitted on some processor architectures, is often slower than performing two (or four) narrower operations in sequence. IOW, with the narrower operations the processor doesn't stall.
IIF the first dimension (Nc) is a multiple of your vector size .and. you know JAC, MT1 and VelGrad1st are vector aligned (for AVX), then try using:
!DIR$ SIMD VECTORLENGTH(4)
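For example, a minimal sketch on your first loop (check the exact directive spelling against your ifort 13 documentation, and remember that the alignment promise must actually hold, otherwise this is unsafe):

!DIR$ VECTOR ALIGNED
!DIR$ SIMD VECTORLENGTH(4)
do ii = 1, Nc
   ! diagonal components first
   StrnRt(ii,1) = JAC(ii) * ( &
       MT1(ii,1) * VelGrad1st(ii,1) &
     + MT1(ii,2) * VelGrad1st(ii,3) )
end do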
Jim Dempsey
Thank you for your reply.
Your explanations helped me understand the behavior in the first code piece.
I have also found out why I did not get AVX ymm in the second code piece: the loop body is too large (I only show part of it here), so there are probably not enough ymm registers. After I divided the big loop body into two smaller loops, AVX ymm instructions were generated for the second code piece.
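Schematically, the split was along these lines (the declarations and the exact grouping of statements here are only illustrative, not my real module code):

subroutine ComputeStrnRtSplit(Nc, JAC, MT1, VelGrad1st, StrnRt)
   implicit none
   integer, parameter :: rfreal = selected_real_kind(15)   ! illustrative kind
   integer,      intent(in)  :: Nc
   real(rfreal), intent(in)  :: JAC(Nc), MT1(Nc,9), VelGrad1st(Nc,9)
   real(rfreal), intent(out) :: StrnRt(Nc,6)
   integer :: ii

   ! first, smaller loop: diagonal components
   !DIR$ SIMD
   do ii = 1, Nc
      StrnRt(ii,1) = JAC(ii) * ( MT1(ii,1) * VelGrad1st(ii,1) &
                               + MT1(ii,2) * VelGrad1st(ii,4) &
                               + MT1(ii,3) * VelGrad1st(ii,7) )
      StrnRt(ii,2) = JAC(ii) * ( MT1(ii,4) * VelGrad1st(ii,2) &
                               + MT1(ii,5) * VelGrad1st(ii,5) &
                               + MT1(ii,6) * VelGrad1st(ii,8) )
   end do

   ! second, smaller loop: remaining components
   !DIR$ SIMD
   do ii = 1, Nc
      StrnRt(ii,3) = JAC(ii) * ( MT1(ii,7) * VelGrad1st(ii,3) &
                               + MT1(ii,8) * VelGrad1st(ii,6) &
                               + MT1(ii,9) * VelGrad1st(ii,9) )
      StrnRt(ii,4) = JAC(ii) * 0.5_rfreal * ( MT1(ii,4) * VelGrad1st(ii,1) &
                                            + MT1(ii,1) * VelGrad1st(ii,2) &
                                            + MT1(ii,5) * VelGrad1st(ii,4) &
                                            + MT1(ii,2) * VelGrad1st(ii,5) &
                                            + MT1(ii,6) * VelGrad1st(ii,7) &
                                            + MT1(ii,3) * VelGrad1st(ii,8) )
   end do
end subroutine ComputeStrnRtSplit

With fewer values live per iteration, the compiler was then willing to use full-width ymm operations.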
Best regards,
Wentao
Have you looked at the complete code of the loop?
On loops where the alignment is not known, there is usually extra code, called peel code, that runs the loop at smaller vector widths until alignment is attained; it then continues with the wider vector width. Could it be that you are examining the peel portion of the code?
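To illustrate what I mean by the peel/vector/remainder sections, here is a hand-written Fortran sketch of the decomposition the compiler performs (purely illustrative source, not compiler output; the real peel count is computed at run time from the actual addresses):

program peel_demo
   implicit none
   integer, parameter :: n = 1000, vl = 4
   real(8) :: a(n), b(n), c(n)
   integer :: ii, peel, nmain
   call random_number(a)
   call random_number(b)
   peel  = 3                              ! placeholder peel count
   nmain = peel + ((n - peel) / vl) * vl  ! last index covered by full vectors
   do ii = 1, peel                        ! peel section: scalar (vmulsd)
      c(ii) = a(ii) * b(ii)
   end do
   do ii = peel + 1, nmain, vl            ! main section: full-width vector (vmulpd %ymm)
      c(ii:ii+vl-1) = a(ii:ii+vl-1) * b(ii:ii+vl-1)
   end do
   do ii = nmain + 1, n                   ! remainder section: scalar again
      c(ii) = a(ii) * b(ii)
   end do
   print *, c(1), c(n)
end program peel_demo

A grep of the corresponding .s file shows scalar, 128-bit, and 256-bit multiplies all attributed to the same source line, which is what you are seeing.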
Jim Dempsey
Hi Jim,
Here is the code piece and corresponding assembly for line 647:
643 !DIR$ SIMD
644 do ii = 1, Nc
645    ! diagonal components first
646    StrnRt(ii,1) = JAC(ii) * ( &
647        MT1(ii,1) * VelGrad1st(ii,1) &
648      + MT1(ii,2) * VelGrad1st(ii,4) &
649      + MT1(ii,3) * VelGrad1st(ii,7) )
650
651    StrnRt(ii,2) = JAC(ii) * ( &
652        MT1(ii,4) * VelGrad1st(ii,2) &
653      + MT1(ii,5) * VelGrad1st(ii,5) &
654      + MT1(ii,6) * VelGrad1st(ii,8) )
655
656    StrnRt(ii,3) = JAC(ii) * ( &
657        MT1(ii,7) * VelGrad1st(ii,3) &
658      + MT1(ii,8) * VelGrad1st(ii,6) &
659      + MT1(ii,9) * VelGrad1st(ii,9) )
660
661 end do
login4$ grep '#647' ModNavierStokesRHS.s
movq 344(%rsp), %rbx #647.40
movq 88(%rbx), %rsi #647.40
movq %rsi, 112(%rsp) #647.40
movq %rsi, 160(%rsp) #647.40
movq (%rbx), %r10 #647.40
movq 64(%rbx), %r8 #647.40
movq 80(%rbx), %r9 #647.40
movq 56(%rbx), %rsi #647.40
movq %r10, 208(%rsp) #647.40
movq %r9, 104(%rsp) #647.40
movq %r8, 96(%rsp) #647.40
movq %rsi, 408(%rsp) #647.40
movq %rsi, 264(%rsp) #647.40
vmovsd 8(%rcx,%r11,8), %xmm0 #647.28
vmulsd 8(%rdx,%r11,8), %xmm0, %xmm2 #647.38
vmovupd 8(%r10,%r11,8), %xmm0 #647.28
vmovupd 8(%r14,%r11,8), %xmm1 #647.40
vinsertf128 $1, 24(%r10,%r11,8), %ymm0, %ymm2 #647.28
vinsertf128 $1, 24(%r14,%r11,8), %ymm1, %ymm3 #647.40
vmulpd %ymm3, %ymm2, %ymm8 #647.38
movq 280(%rsp), %r15 #647.28
vmovsd 8(%r15,%r11,8), %xmm0 #647.28
movq 272(%rsp), %r15 #647.38
vmulsd 8(%r15,%r11,8), %xmm0, %xmm2 #647.38
movq 296(%rsp), %r15 #647.28
addq %r8, %r15 #647.28
vmovsd (%r15), %xmm4 #647.28
vmovhpd (%r15,%rdx), %xmm4, %xmm6 #647.28
movq 152(%rsp), %r15 #647.40
addq %r9, %r15 #647.40
vmovsd (%r15), %xmm5 #647.40
vmovhpd (%r15,%r13), %xmm5, %xmm7 #647.40
vmulpd %xmm7, %xmm6, %xmm12 #647.38
movq 304(%rsp), %r15 #647.28
vmovsd (%rbx,%r15), %xmm0 #647.28
vmulsd (%rdi,%rcx), %xmm0, %xmm2 #647.38
The assembly lines for line 647 are a mixture of three things:
vmulsd (scalar)
vmulpd %xmm1, %xmm2, %xmm3
vmulpd %ymm1, %ymm2, %ymm3
I think the main loop body has been vectorized with AVX-256. The scalar and AVX-128 instructions appear here to deal with the peel and remainder loops.
Best regards,
Wentao
The code shown here is AVX-256 vectorized (on only one operation), but the memory accesses are effectively scalar, repacking data into 256-bit registers to allow for mis-alignment without penalty on early AVX platforms. If you could assure the compiler of 32-byte data alignment, you could get AVX-256 memory references. The usual ways of doing that are combinations of -align array32byte and !dir$ vector aligned or assume_aligned assertions. You aren't getting significant benefit from AVX in this code. Apparently, the compiler sees so many possible combinations of mis-alignment that it basically gives up on useful AVX.
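For example, a minimal sketch (directive spellings as documented for ifort; the 32-byte alignment itself still has to be guaranteed, e.g. by compiling with -align array32byte):

!DIR$ ASSUME_ALIGNED JAC:32, MT1:32, VelGrad1st:32, StrnRt:32
!DIR$ SIMD
do ii = 1, Nc
   StrnRt(ii,1) = JAC(ii) * ( &
       MT1(ii,1) * VelGrad1st(ii,1) &
     + MT1(ii,2) * VelGrad1st(ii,4) &
     + MT1(ii,3) * VelGrad1st(ii,7) )
end do

Note that for every column MT1(:,k) and VelGrad1st(:,k) to start aligned, the leading dimension also has to be a multiple of 4 doubles, as Jim pointed out earlier.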
If you are using 32-bit compile mode, you may be losing optimization due to not having enough independent pointer registers available. That would be a case where 64-bit mode could make a big difference.
The grep listing is no good for this purpose.
What you have above are code lines from all three potential sections of the compiled loop:
The peel section (a mix of scalars), the vectorized loop (note, there may be multiple of these, with one executed on a given machine for a particular alignment), then finally the residual section.
IOW, your listing is a mish-mash of the code of interest.
Do your grep for "6[456].[.]". Or simply look for the section yourself, and copy/paste the appropriate code.
This will get the full range of the loop (and a tad more).
Jim Dempsey
Thanks for your reply. I did not show the complete picture because I worried it would be too lengthy.
Here is the code:
656 !DIR$ SIMD
657 do ii = 1, Nc
658    StrnRt(ii,3) = JAC(ii) * ( &
659        MT1(ii,7) * VelGrad1st(ii,3) &
660      + MT1(ii,8) * VelGrad1st(ii,6) &
661      + MT1(ii,9) * VelGrad1st(ii,9) )
662
663    ! upper-half part of strain-rate tensor due to symmetry
664    StrnRt(ii,4) = JAC(ii) * 0.5_rfreal * ( &
665        MT1(ii,4) * VelGrad1st(ii,1) &
666      + MT1(ii,1) * VelGrad1st(ii,2) &
667      + MT1(ii,5) * VelGrad1st(ii,4) &
668      + MT1(ii,2) * VelGrad1st(ii,5) &
669      + MT1(ii,6) * VelGrad1st(ii,7) &
670      + MT1(ii,3) * VelGrad1st(ii,8) )
671 end do
Here are the corresponding assembly lines:
jb ..B10.63 # Prob 81% #644.9 # LOE rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 ..B10.64: # Preds ..B10.63 movq 432(%rsp), %r11 #658.11 movq 400(%rsp), %r14 #659.40 movq 200(%rsp), %rbx # movq (%r11), %rdi #658.11 movq 64(%r11), %rsi #658.11 movq 56(%r11), %r12 #658.11 movq 56(%r14), %r15 #659.40 movq %rdi, 192(%rsp) #658.11 movq 88(%r11), %r8 #658.11 movq 80(%r11), %rdi #658.11 movq %rsi, 184(%rsp) #658.11 movq %r12, 280(%rsp) #658.11 movq 264(%rsp), %rcx # movq (%rsp), %r9 # movq 8(%rsp), %r10 # movq 16(%rsp), %rax # movq (%r14), %rsi #659.40 movq 64(%r14), %r12 #659.40 movq 88(%r14), %r13 #659.40 movq 80(%r14), %r11 #659.40 movq %r15, 208(%rsp) #659.40 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.65: # Preds ..B10.64 ..B10.161 movq 152(%rsp), %r15 # movq 304(%rsp), %r14 # imulq %r14, %r15 # movq %r15, 136(%rsp) # movq 424(%rsp), %r15 # imulq %r14, %r15 # movq %r15, 128(%rsp) # movq %rdx, %r15 # imulq %r14, %r15 # movq %r15, 144(%rsp) # movq 256(%rsp), %r15 # imulq %r14, %r15 # movq %r15, 120(%rsp) # cmpq %rcx, %r14 #644.9 jae ..B10.69 # Prob 3% #644.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.66: # Preds ..B10.65 movq 416(%rsp), %r14 # movq %rdx, 272(%rsp) # movq %r13, 40(%rsp) # movq %r10, %r13 # movq %r11, 32(%rsp) # lea (%r14,%rdx), %r15 # imulq 408(%rsp), %rdx # imulq %r9, %r13 # subq %rdx, %r15 # movq 424(%rsp), %rdx # movq 168(%rsp), %r11 # movq %r8, 64(%rsp) # movq %rsi, 48(%rsp) # lea (%rbx,%rdx), %r14 # imulq %rax, %rdx # movq %r12, 24(%rsp) # movq 176(%rsp), %r12 # movq 160(%rsp), %rsi # movq 152(%rsp), %r8 # imulq %r11, %r12 # imulq %r8, %rsi # movq %rcx, 264(%rsp) # movq %rdx, %rcx # subq %r10, %rcx # addq %rsi, %r12 # negq %rcx # lea (,%r11,8), %rsi #649.40 addq %r14, %rcx # subq %r11, %rsi #649.40 movq %rdi, 56(%rsp) # subq %r13, %rcx # movq 216(%rsp), %rdi # movq %rax, 16(%rsp) # lea (%r10,%r10), %rax #648.28 movq %rcx, 72(%rsp) # lea (%r10,%r10,2), %rcx #649.28 negq %rax # negq %rcx #649.28 addq %rdx, %rax # addq %rdx, %rcx # movq %rbx, 200(%rsp) # lea (%rdi,%r11), %rbx # negq %rax # negq %rcx # addq %rdi, %rsi # subq %r12, %rbx # addq %r14, %rax # addq %r14, %rcx # subq %r12, %rsi # addq %r8, %rbx # movq %rbx, 80(%rsp) # subq %r13, %rax # movq 232(%rsp), %rbx # subq %r13, %rcx # addq %r8, %rsi # movq %rax, 88(%rsp) # movq %rcx, 328(%rsp) # movq %rsi, 320(%rsp) # movq 240(%rsp), %rcx # movq 224(%rsp), %rax # movq 256(%rsp), %rsi # imulq %rbx, %rcx # imulq %rsi, %rax # movq %r9, (%rsp) # lea (%rdi,%r11,4), %r9 # subq %r12, %r9 # addq %rax, %rcx # movq 248(%rsp), %rax # addq %r8, %r9 # movq %r9, 312(%rsp) # lea (,%r10,4), %r9 #652.28 negq %r9 # addq %r15, 144(%rsp) # lea (%rax,%rbx), %r15 # addq %rdx, %r9 # subq %rcx, %r15 # negq %r9 # addq %rsi, %r15 # movq %r15, 344(%rsp) # addq %r14, %r9 # subq %r13, %r9 # lea (%rdi,%r11,2), %r15 # subq %r12, %r15 # movq %r9, 112(%rsp) # lea (%r10,%r10,4), %r9 #653.28 addq %r8, %r15 # movq %r15, 336(%rsp) # movq %r9, %r15 #653.28 negq %r15 #653.28 addq %r10, %r9 #654.28 addq %rdx, %r15 # negq %r9 #654.28 negq %r15 # addq %r9, %rdx # addq %r14, %r15 # subq %rdx, %r14 # subq %r13, %r15 # subq %r13, %r14 # movq %r15, 104(%rsp) # lea (%r11,%r11,4), %r15 #653.40 addq %rdi, %r15 # lea (%rdi,%r11,8), %rdi # subq %r12, %r15 # subq %r12, %rdi # addq %r8, %r15 # addq %rdi, %r8 # movq %r8, 152(%rsp) # lea (%rax,%rbx,2), %r8 # subq %rcx, %r8 # movq %r15, 96(%rsp) # addq %r8, %rsi # movq %rsi, 256(%rsp) # movq %r10, 8(%rsp) # movq 152(%rsp), %r11 # movq 96(%rsp), %r9 # movq 104(%rsp), %r13 # movq 112(%rsp), 
%r8 # movq 88(%rsp), %r10 # movq 80(%rsp), %rax # movq 72(%rsp), %rdx # movq 144(%rsp), %rbx # movq 120(%rsp), %r12 # movq 128(%rsp), %rcx # movq 136(%rsp), %rsi # movq 304(%rsp), %rdi # # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 ..B10.67: # Preds ..B10.67 ..B10.66 movq 312(%rsp), %r15 #648.38 incq %rdi #644.9 vmovsd (%rcx,%r10), %xmm1 #648.28 vmovsd (%rcx,%rdx), %xmm0 #647.28 vmulsd (%rsi,%r15), %xmm1, %xmm3 #648.38 vmulsd (%rsi,%rax), %xmm0, %xmm2 #647.38 movq 328(%rsp), %r15 #649.28 vaddsd %xmm3, %xmm2, %xmm5 #648.26 vmovsd (%rcx,%r15), %xmm4 #649.28 movq 320(%rsp), %r15 #649.38 vmulsd (%rsi,%r15), %xmm4, %xmm6 #649.38 movq 344(%rsp), %r15 #646.11 vaddsd %xmm6, %xmm5, %xmm7 #649.26 vmulsd (%rbx), %xmm7, %xmm8 #646.11 vmovsd %xmm8, (%r12,%r15) #646.11 movq 336(%rsp), %r15 #652.38 vmovsd (%rcx,%r8), %xmm9 #652.28 vmovsd (%rcx,%r13), %xmm10 #653.28 vmulsd (%rsi,%r15), %xmm9, %xmm11 #652.38 vmulsd (%rsi,%r9), %xmm10, %xmm12 #653.38 vmovsd (%rcx,%r14), %xmm13 #654.28 vaddsd %xmm12, %xmm11, %xmm14 #653.26 vmulsd (%rsi,%r11), %xmm13, %xmm15 #654.38 movq 256(%rsp), %r15 #651.11 vaddsd %xmm15, %xmm14, %xmm0 #654.26 vmulsd (%rbx), %xmm0, %xmm1 #651.11 vmovsd %xmm1, (%r12,%r15) #651.11 addq 296(%rsp), %r12 #644.9 addq 272(%rsp), %rbx #644.9 addq 424(%rsp), %rcx #644.9 addq 288(%rsp), %rsi #644.9 cmpq 264(%rsp), %rdi #644.9 jb ..B10.67 # Prob 81% #644.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 ..B10.68: # Preds ..B10.55 ..B10.67 movq 200(%rsp), %rbx # movq 264(%rsp), %rcx # movq 24(%rsp), %r12 # movq 32(%rsp), %r11 # movq 40(%rsp), %r13 # movq 48(%rsp), %rsi # movq 56(%rsp), %rdi # movq 64(%rsp), %r8 # movq 272(%rsp), %rdx # movq (%rsp), %r9 # movq 8(%rsp), %r10 # movq 16(%rsp), %rax # # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.69: # Preds ..B10.53 ..B10.68 ..B10.65 movq 192(%rsp), %r14 #658.11 movq 184(%rsp), %r15 #658.11 movq %r14, 256(%rsp) #658.11 movq %r15, 232(%rsp) #658.11 movq 280(%rsp), %r14 #658.11 movq 208(%rsp), %r15 #659.40 movq %r8, 248(%rsp) #658.11 movq %rdi, 240(%rsp) #658.11 movq %r14, 288(%rsp) #658.11 movq %r14, 312(%rsp) # movq %rsi, 224(%rsp) #659.40 movq %r13, 216(%rsp) #659.40 movq %r11, 176(%rsp) #659.40 movq %r12, 168(%rsp) #659.40 movq %r15, 296(%rsp) #659.40 movq %r15, 304(%rsp) # cmpq $8, %rdx #657.9 jne ..B10.89 # Prob 50% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.70: # Preds ..B10.69 cmpq $8, 424(%rsp) #657.9 jne ..B10.89 # Prob 50% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.71: # Preds ..B10.70 cmpq $8, 208(%rsp) #657.9 jne ..B10.89 # Prob 50% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.72: # Preds ..B10.71 cmpq $8, 280(%rsp) #657.9 jne ..B10.92 # Prob 50% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.73: # Preds ..B10.72 cmpq $4, %rcx #657.9 jl ..B10.145 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.74: # Preds ..B10.73 movq 408(%rsp), %r14 #657.9 lea (,%r14,8), %r15 #657.9 negq %r15 #657.9 addq 416(%rsp), %r15 #657.9 movq %r15, 72(%rsp) #657.9 lea 8(%r15), %r14 #657.9 andq $31, %r14 #657.9 movl %r14d, 80(%rsp) #657.9 testl %r14d, %r14d #657.9 je ..B10.77 # Prob 50% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 
r10 r11 r12 r13 r14 esi edi r8d r11d r12d r13d r14d sil dil r8b r11b r12b r13b r14b ..B10.75: # Preds ..B10.74 testb $7, 80(%rsp) #657.9 jne ..B10.145 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 esi edi r8d r11d r12d r13d r14d sil dil r8b r11b r12b r13b r14b ..B10.76: # Preds ..B10.75 negl %r14d #657.9 addl $32, %r14d #657.9 shrl $3, %r14d #657.9 movl %r14d, 80(%rsp) #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 esi edi r8d r11d r12d r13d r14d sil dil r8b r11b r12b r13b r14b ..B10.77: # Preds ..B10.76 ..B10.74 lea 4(%r14), %r15d #657.9 cmpq %r15, %rcx #657.9 jl ..B10.145 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 esi edi r8d r11d r12d r13d r14d sil dil r8b r11b r12b r13b r14b ..B10.78: # Preds ..B10.77 movq %rcx, 264(%rsp) # movq %rdx, 272(%rsp) # movl %ecx, %edx #657.9 subl %r14d, %edx #657.9 andl $3, %edx #657.9 subl %edx, %ecx #657.9 movq %r10, %rdx # imulq %r9, %rdx # movl %ecx, 168(%rsp) #657.9 lea (,%rax,8), %rcx # negq %rcx # movq %rbx, 200(%rsp) # addq %rbx, %rcx # movl %r14d, %r14d #657.9 lea (%r10,%r10,2), %rbx #669.28 movq $0, 128(%rsp) #657.9 lea (%rbx,%rbx), %r15 #669.28 subq %rdx, %r15 # subq %rdx, %rbx # addq %rcx, %r15 # addq %rcx, %rbx # movq %r15, 216(%rsp) # lea (%r10,%r10,4), %r15 #667.28 subq %rdx, %r15 # addq %rcx, %r15 # movq %r15, 224(%rsp) # lea (,%r10,4), %r15 #665.28 subq %rdx, %r15 # addq %rcx, %r15 # movq %r15, 232(%rsp) # lea (%r10,%r10), %r15 #668.28 movq %rbx, 240(%rsp) # lea (%r10,%r10,8), %rbx #661.28 subq %rdx, %r15 # subq %rdx, %rbx # addq %rcx, %r15 # addq %rcx, %rbx # movq %r15, 96(%rsp) # lea (,%r10,8), %r15 #660.28 movq %rbx, 104(%rsp) # movq %r15, %rbx # subq %rdx, %rbx # subq %r10, %r15 #659.28 addq %rcx, %rbx # subq %rdx, %r15 # movq %rbx, 88(%rsp) # movq %r10, %rbx # subq %rdx, %rbx # movq %r13, %rdx # imulq %r11, %rdx # addq %rcx, %rbx # addq %rcx, %r15 # movq %rbx, 80(%rsp) # lea (%rdx,%r12,8), %rbx # movq %r15, 248(%rsp) # lea (%r11,%r11,2), %rdx #659.40 movq %r14, 136(%rsp) #657.9 lea (%rsi,%rdx), %rcx # subq %rbx, %rcx # lea (%rsi,%rdx,2), %r15 # movq %rcx, 112(%rsp) # lea (%r11,%r11,8), %rdx #661.40 addq %rsi, %rdx # lea (%r11,%rsi), %rcx # subq %rbx, %rdx # subq %rbx, %rcx # movq %rdx, 304(%rsp) # lea (%rsi,%r11,2), %rdx # subq %rbx, %rdx # subq %rbx, %r15 # movq %rdx, 288(%rsp) # lea (%r11,%r11,4), %rdx #668.40 movq %rcx, 296(%rsp) # lea (,%r11,8), %rcx #669.40 addq %rsi, %rdx # subq %r11, %rcx #669.40 subq %rbx, %rdx # addq %rsi, %rcx # movq %r15, 120(%rsp) # lea (%rsi,%r11,4), %r15 # movq %rdx, 312(%rsp) # lea (%rsi,%r11,8), %rdx # subq %rbx, %r15 # subq %rbx, %rcx # subq %rbx, %rdx # movq %r8, %rbx # imulq %rdi, %rbx # movq %rdx, 152(%rsp) # movq 184(%rsp), %rdx # movq %r15, 256(%rsp) # movq %rcx, 160(%rsp) # lea (%rdi,%rdi,2), %rcx #658.11 lea (%rbx,%rdx,8), %r15 # movq 192(%rsp), %rbx # addq %rbx, %rcx # subq %r15, %rcx # movq %rcx, 176(%rsp) # lea (%rbx,%rdi,4), %rdx # subq %r15, %rdx # movq %rdx, 144(%rsp) # movq 264(%rsp), %rcx #657.9 movq 200(%rsp), %rbx #657.9 movq 272(%rsp), %rdx #657.9 testq %r14, %r14 #657.9 jbe ..B10.82 # Prob 3% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.79: # Preds ..B10.78 movq %rbx, 200(%rsp) #664.38 movq %rcx, 264(%rsp) #664.38 movq %r12, 24(%rsp) #664.38 movq %r11, 32(%rsp) #664.38 movq %r13, 40(%rsp) #664.38 movq %rsi, 48(%rsp) #664.38 movq %rdi, 56(%rsp) #664.38 movq %r8, 64(%rsp) #664.38 movq %rdx, 272(%rsp) #664.38 movq %r9, (%rsp) 
#664.38 movq %r10, 8(%rsp) #664.38 movq %rax, 16(%rsp) #664.38 vmovsd .L_2il0floatpacket.171(%rip), %xmm1 #664.38 movq 312(%rsp), %r12 #664.38 movq 256(%rsp), %r11 #664.38 movq 288(%rsp), %r10 #664.38 movq 296(%rsp), %r14 #664.38 movq 304(%rsp), %r9 #664.38 movq 120(%rsp), %rax #664.38 movq 112(%rsp), %rdx #664.38 movq 80(%rsp), %rcx #664.38 movq 88(%rsp), %rbx #664.38 movq 96(%rsp), %rsi #664.38 movq 104(%rsp), %rdi #664.38 movq 128(%rsp), %r13 #664.38 movq 72(%rsp), %r8 #664.38 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 xmm1 ..B10.80: # Preds ..B10.80 ..B10.79 movq 248(%rsp), %r15 #659.28 vmovsd 8(%rbx,%r13,8), %xmm3 #660.28 vmulsd 8(%rax,%r13,8), %xmm3, %xmm5 #660.38 vmovsd 8(%r15,%r13,8), %xmm2 #659.28 vmulsd 8(%rdx,%r13,8), %xmm2, %xmm4 #659.38 vmovsd 8(%rdi,%r13,8), %xmm6 #661.28 vaddsd %xmm5, %xmm4, %xmm7 #660.26 vmulsd 8(%r9,%r13,8), %xmm6, %xmm8 #661.38 movq 176(%rsp), %r15 #658.11 vaddsd %xmm8, %xmm7, %xmm9 #661.26 vmulsd 8(%r8,%r13,8), %xmm9, %xmm10 #658.11 vmovsd %xmm10, 8(%r15,%r13,8) #658.11 movq 232(%rsp), %r15 #665.28 vmovsd 8(%rcx,%r13,8), %xmm12 #666.28 vmulsd 8(%r10,%r13,8), %xmm12, %xmm14 #666.38 vmulsd 8(%r8,%r13,8), %xmm1, %xmm0 #664.38 vmovsd 8(%r15,%r13,8), %xmm11 #665.28 vmulsd 8(%r14,%r13,8), %xmm11, %xmm13 #665.38 movq 224(%rsp), %r15 #667.28 vaddsd %xmm14, %xmm13, %xmm2 #666.26 vmovsd 8(%r15,%r13,8), %xmm15 #667.28 vmulsd 8(%r11,%r13,8), %xmm15, %xmm3 #667.38 movq 216(%rsp), %r15 #669.28 vaddsd %xmm3, %xmm2, %xmm5 #667.26 vmovsd 8(%rsi,%r13,8), %xmm4 #668.28 vmulsd 8(%r12,%r13,8), %xmm4, %xmm6 #668.38 vmovsd 8(%r15,%r13,8), %xmm7 #669.28 vaddsd %xmm6, %xmm5, %xmm8 #668.26 movq 160(%rsp), %r15 #669.38 vmulsd 8(%r15,%r13,8), %xmm7, %xmm9 #669.38 movq 240(%rsp), %r15 #670.28 vaddsd %xmm9, %xmm8, %xmm11 #669.26 vmovsd 8(%r15,%r13,8), %xmm10 #670.28 movq 152(%rsp), %r15 #670.38 vmulsd 8(%r15,%r13,8), %xmm10, %xmm12 #670.38 movq 144(%rsp), %r15 #664.11 vaddsd %xmm12, %xmm11, %xmm13 #670.26 vmulsd %xmm13, %xmm0, %xmm0 #664.11 vmovsd %xmm0, 8(%r15,%r13,8) #664.11 incq %r13 #657.9 cmpq 136(%rsp), %r13 #657.9 jb ..B10.80 # Prob 81% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 xmm1 ..B10.81: # Preds ..B10.80 movq 200(%rsp), %rbx # movq 264(%rsp), %rcx # movq 24(%rsp), %r12 # movq 32(%rsp), %r11 # movq 40(%rsp), %r13 # movq 48(%rsp), %rsi # movq 56(%rsp), %rdi # movq 64(%rsp), %r8 # movq 272(%rsp), %rdx # movq (%rsp), %r9 # movq 8(%rsp), %r10 # movq 16(%rsp), %rax # # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.82: # Preds ..B10.78 ..B10.81 movslq 168(%rsp), %r14 #657.9 movq %rbx, 200(%rsp) #657.9 movq %rcx, 264(%rsp) #657.9 movq %r12, 24(%rsp) #657.9 movq %r11, 32(%rsp) #657.9 movq %r13, 40(%rsp) #657.9 movq %rsi, 48(%rsp) #657.9 movq %rdi, 56(%rsp) #657.9 movq %r8, 64(%rsp) #657.9 movq %r14, 128(%rsp) #657.9 movq %rdx, 272(%rsp) #657.9 movq %r9, (%rsp) #657.9 movq %r10, 8(%rsp) #657.9 movq %rax, 16(%rsp) #657.9 vmovupd .L_2il0floatpacket.173(%rip), %ymm1 #664.38 movq 120(%rsp), %r12 #657.9 movq 112(%rsp), %r11 #657.9 movq 248(%rsp), %r10 #657.9 movq 80(%rsp), %r14 #657.9 movq 88(%rsp), %r9 #657.9 movq 96(%rsp), %rax #657.9 movq 104(%rsp), %rdx #657.9 movq 240(%rsp), %rcx #657.9 movq 232(%rsp), %rbx #657.9 movq 224(%rsp), %rsi #657.9 movq 216(%rsp), %rdi #657.9 movq 136(%rsp), %r13 #657.9 movq 72(%rsp), %r8 #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 ymm1 ..B10.83: # Preds ..B10.83 ..B10.82 vmovupd 8(%r10,%r13,8), %xmm2 #659.28 vmovupd 8(%r11,%r13,8), %xmm3 #659.40 vmovupd 8(%r9,%r13,8), %xmm6 
#660.28 vmovupd 8(%r12,%r13,8), %xmm7 #660.40 vmovupd 8(%rdx,%r13,8), %xmm12 #661.28 movq 304(%rsp), %r15 #661.40 vmovupd 8(%r15,%r13,8), %xmm13 #661.40 vinsertf128 $1, 24(%r10,%r13,8), %ymm2, %ymm4 #659.28 vinsertf128 $1, 24(%r11,%r13,8), %ymm3, %ymm5 #659.40 vinsertf128 $1, 24(%r9,%r13,8), %ymm6, %ymm8 #660.28 vinsertf128 $1, 24(%r12,%r13,8), %ymm7, %ymm9 #660.40 vmulpd %ymm5, %ymm4, %ymm10 #659.38 vmulpd %ymm9, %ymm8, %ymm11 #660.38 vaddpd %ymm11, %ymm10, %ymm0 #660.26 vinsertf128 $1, 24(%rdx,%r13,8), %ymm12, %ymm14 #661.28 vinsertf128 $1, 24(%r15,%r13,8), %ymm13, %ymm15 #661.40 vmulpd %ymm15, %ymm14, %ymm2 #661.38 vaddpd %ymm2, %ymm0, %ymm3 #661.26 vmulpd 8(%r8,%r13,8), %ymm3, %ymm4 #658.11 movq 176(%rsp), %r15 #658.11 vmovupd %xmm4, 8(%r15,%r13,8) #658.11 vextractf128 $1, %ymm4, 24(%r15,%r13,8) #658.11 movq 296(%rsp), %r15 #665.40 vmovupd 8(%rbx,%r13,8), %xmm5 #665.28 vmovupd 8(%r15,%r13,8), %xmm6 #665.40 vmovupd 8(%r14,%r13,8), %xmm9 #666.28 vmovupd 8(%rsi,%r13,8), %xmm15 #667.28 vmulpd 8(%r8,%r13,8), %ymm1, %ymm0 #664.38 vinsertf128 $1, 24(%r15,%r13,8), %ymm6, %ymm8 #665.40 movq 288(%rsp), %r15 #666.40 vmovupd 8(%r15,%r13,8), %xmm10 #666.40 vinsertf128 $1, 24(%rbx,%r13,8), %ymm5, %ymm7 #665.28 vmulpd %ymm8, %ymm7, %ymm13 #665.38 vmovupd 8(%rax,%r13,8), %xmm7 #668.28 vinsertf128 $1, 24(%r15,%r13,8), %ymm10, %ymm12 #666.40 movq 256(%rsp), %r15 #667.40 vmovupd 8(%r15,%r13,8), %xmm2 #667.40 vinsertf128 $1, 24(%r14,%r13,8), %ymm9, %ymm11 #666.28 vmulpd %ymm12, %ymm11, %ymm14 #666.38 vaddpd %ymm14, %ymm13, %ymm5 #666.26 vmovupd 8(%rdi,%r13,8), %xmm13 #669.28 vinsertf128 $1, 24(%r15,%r13,8), %ymm2, %ymm4 #667.40 movq 312(%rsp), %r15 #668.40 vmovupd 8(%r15,%r13,8), %xmm8 #668.40 vinsertf128 $1, 24(%rsi,%r13,8), %ymm15, %ymm3 #667.28 vmulpd %ymm4, %ymm3, %ymm6 #667.38 vaddpd %ymm6, %ymm5, %ymm11 #667.26 vmovupd 8(%rcx,%r13,8), %xmm5 #670.28 vinsertf128 $1, 24(%r15,%r13,8), %ymm8, %ymm10 #668.40 movq 160(%rsp), %r15 #669.40 vmovupd 8(%r15,%r13,8), %xmm14 #669.40 vinsertf128 $1, 24(%rax,%r13,8), %ymm7, %ymm9 #668.28 vmulpd %ymm10, %ymm9, %ymm12 #668.38 vaddpd %ymm12, %ymm11, %ymm3 #668.26 vinsertf128 $1, 24(%r15,%r13,8), %ymm14, %ymm2 #669.40 movq 152(%rsp), %r15 #670.40 vmovupd 8(%r15,%r13,8), %xmm6 #670.40 vinsertf128 $1, 24(%rdi,%r13,8), %ymm13, %ymm13 #669.28 vmulpd %ymm2, %ymm13, %ymm4 #669.38 vaddpd %ymm4, %ymm3, %ymm9 #669.26 vinsertf128 $1, 24(%rcx,%r13,8), %ymm5, %ymm7 #670.28 vinsertf128 $1, 24(%r15,%r13,8), %ymm6, %ymm8 #670.40 vmulpd %ymm8, %ymm7, %ymm10 #670.38 vaddpd %ymm10, %ymm9, %ymm11 #670.26 vmulpd %ymm11, %ymm0, %ymm0 #664.11 movq 144(%rsp), %r15 #664.11 vmovupd %xmm0, 8(%r15,%r13,8) #664.11 vextractf128 $1, %ymm0, 24(%r15,%r13,8) #664.11 addq $4, %r13 #657.9 cmpq 128(%rsp), %r13 #657.9 jb ..B10.83 # Prob 81% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 ymm1 ..B10.84: # Preds ..B10.83 movq 200(%rsp), %rbx # movq 264(%rsp), %rcx # movq 24(%rsp), %r12 # movq 32(%rsp), %r11 # movq 40(%rsp), %r13 # movq 48(%rsp), %rsi # movq 56(%rsp), %rdi # movq 64(%rsp), %r8 # movq 272(%rsp), %rdx # movq (%rsp), %r9 # movq 8(%rsp), %r10 # movq 16(%rsp), %rax # # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.85: # Preds ..B10.84 ..B10.145 movl 168(%rsp), %r14d #657.9 movq $0, 176(%rsp) #657.9 lea 1(%r14), %r15d #657.9 movslq %r15d, %r15 #657.9 cmpq %r15, %rcx #657.9 jb ..B10.101 # Prob 3% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.86: # Preds ..B10.85 movq %rdx, 272(%rsp) # movq %r10, %rdx # imulq %r9, %rdx # movq %r9, (%rsp) # lea 
(,%rax,8), %r14 # negq %r14 # lea (%r10,%r10,2), %r9 #670.28 addq %rbx, %r14 # movq %rbx, 200(%rsp) # movslq 168(%rsp), %r15 # movq %r13, 40(%rsp) # imulq %r11, %r13 # movq %r8, 64(%rsp) # lea (%r14,%r9), %rbx # subq %rdx, %rbx # lea (%r14,%r9,2), %r9 # subq %rdx, %r9 # imulq %rdi, %r8 # movq %rcx, 264(%rsp) # lea (%rbx,%r15,8), %rbx # movq %rbx, 224(%rsp) # lea (%r9,%r15,8), %rbx # movq %rbx, 232(%rsp) # lea (%r14,%r10,2), %r9 # subq %rdx, %r9 # movq %r12, 24(%rsp) # movq %r11, 32(%rsp) # movq %rsi, 48(%rsp) # movq %rdi, 56(%rsp) # lea (%r9,%r15,8), %rbx # movq %rbx, 240(%rsp) # lea (%r10,%r10,4), %r9 #667.28 addq %r14, %r9 # subq %rdx, %r9 # movq %r10, 8(%rsp) # movq %rax, 16(%rsp) # vmovsd .L_2il0floatpacket.171(%rip), %xmm1 #664.38 lea (%r9,%r15,8), %rbx # movq %rbx, 248(%rsp) # lea (%r10,%r14), %r9 # subq %rdx, %r9 # lea (%r9,%r15,8), %rbx # movq %rbx, 256(%rsp) # lea (%r14,%r10,4), %r9 # subq %rdx, %r9 # lea (%r9,%r15,8), %rbx # movq %rbx, 288(%rsp) # lea (%r10,%r10,8), %r9 #661.28 addq %r14, %r9 # subq %rdx, %r9 # lea (%r9,%r15,8), %rbx # movq %rbx, 72(%rsp) # lea (%r14,%r10,8), %r9 # subq %rdx, %r9 # lea (%r9,%r15,8), %rbx # movq %rbx, 80(%rsp) # lea (,%r10,8), %r9 #659.28 subq %r10, %r9 #659.28 addq %r9, %r14 # movq %rcx, %r9 #657.9 subq %rdx, %r14 # subq %r15, %r9 #657.9 movq %r9, 304(%rsp) #657.9 lea (%r14,%r15,8), %rdx # movq 408(%rsp), %r14 # movq %rdx, 296(%rsp) # lea (,%r14,8), %rbx # negq %rbx # addq 416(%rsp), %rbx # lea (%rbx,%r15,8), %rdx # movq %rdx, 88(%rsp) # lea (%rsi,%r11,8), %rdx # subq %r13, %rdx # lea (,%r12,8), %rbx # subq %rbx, %rdx # lea (%rdx,%r15,8), %r14 # movq %r14, 96(%rsp) # lea (,%r11,8), %rdx #669.40 subq %r11, %rdx #669.40 addq %rsi, %rdx # subq %r13, %rdx # subq %rbx, %rdx # lea (%rdx,%r15,8), %r9 # movq %r9, 104(%rsp) # movq 104(%rsp), %rcx # lea (%r11,%r11,4), %rdx #668.40 addq %rsi, %rdx # subq %r13, %rdx # subq %rbx, %rdx # lea (%rdx,%r15,8), %r14 # movq %r14, 112(%rsp) # lea (%rsi,%r11,4), %rdx # subq %r13, %rdx # subq %rbx, %rdx # lea (%rdx,%r15,8), %r9 # movq %r9, 120(%rsp) # movq 120(%rsp), %rax # lea (%rsi,%r11,2), %rdx # subq %r13, %rdx # subq %rbx, %rdx # lea (%rdx,%r15,8), %r14 # movq %r14, 128(%rsp) # lea (%r11,%rsi), %rdx # subq %r13, %rdx # subq %rbx, %rdx # lea (%rdx,%r15,8), %r9 # movq %r9, 136(%rsp) # lea (%r11,%r11,8), %rdx #661.40 addq %rsi, %rdx # subq %r13, %rdx # subq %rbx, %rdx # lea (%rdx,%r15,8), %r14 # movq %r14, 144(%rsp) # movq 144(%rsp), %r10 # lea (%r11,%r11,2), %r14 #660.40 lea (%rsi,%r14,2), %rdx # addq %rsi, %r14 # subq %r13, %rdx # subq %r13, %r14 # subq %rbx, %rdx # subq %rbx, %r14 # movq 88(%rsp), %rsi # lea (%rdx,%r15,8), %r9 # movq 184(%rsp), %rdx # movq %r9, 152(%rsp) # lea (%r14,%r15,8), %r13 # movq %r13, 312(%rsp) # movq 192(%rsp), %r13 # lea (,%rdx,8), %r14 # movq 152(%rsp), %r11 # lea (%rdi,%rdi,2), %rdx #658.11 addq %r13, %rdx # subq %r8, %rdx # lea (%r13,%rdi,4), %rbx # subq %r8, %rbx # subq %r14, %rdx # subq %r14, %rbx # movq 80(%rsp), %rdi # lea (%rdx,%r15,8), %r14 # lea (%rbx,%r15,8), %r9 # movq %r9, 160(%rsp) # movq %r14, 216(%rsp) # movq 160(%rsp), %r12 # movq 136(%rsp), %r14 # movq 128(%rsp), %r9 # movq 112(%rsp), %rdx # movq 96(%rsp), %rbx # movq 72(%rsp), %r8 # movq 176(%rsp), %r13 # # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 xmm1 ..B10.87: # Preds ..B10.87 ..B10.86 movq 296(%rsp), %r15 #659.28 vmovsd 8(%rdi,%r13,8), %xmm3 #660.28 vmulsd 8(%r11,%r13,8), %xmm3, %xmm5 #660.38 vmovsd 8(%r15,%r13,8), %xmm2 #659.28 movq 312(%rsp), %r15 #659.38 vmovsd 8(%r8,%r13,8), %xmm6 #661.28 vmulsd 
8(%r10,%r13,8), %xmm6, %xmm8 #661.38 vmulsd 8(%r15,%r13,8), %xmm2, %xmm4 #659.38 movq 216(%rsp), %r15 #658.11 vaddsd %xmm5, %xmm4, %xmm7 #660.26 vaddsd %xmm8, %xmm7, %xmm9 #661.26 vmulsd 8(%rsi,%r13,8), %xmm9, %xmm10 #658.11 vmovsd %xmm10, 8(%r15,%r13,8) #658.11 movq 288(%rsp), %r15 #665.28 vmulsd 8(%rsi,%r13,8), %xmm1, %xmm0 #664.38 vmovsd 8(%r15,%r13,8), %xmm11 #665.28 movq 256(%rsp), %r15 #666.28 vmulsd 8(%r14,%r13,8), %xmm11, %xmm13 #665.38 vmovsd 8(%r15,%r13,8), %xmm12 #666.28 vmulsd 8(%r9,%r13,8), %xmm12, %xmm14 #666.38 movq 248(%rsp), %r15 #667.28 vaddsd %xmm14, %xmm13, %xmm2 #666.26 vmovsd 8(%r15,%r13,8), %xmm15 #667.28 vmulsd 8(%rax,%r13,8), %xmm15, %xmm3 #667.38 movq 240(%rsp), %r15 #668.28 vaddsd %xmm3, %xmm2, %xmm5 #667.26 vmovsd 8(%r15,%r13,8), %xmm4 #668.28 vmulsd 8(%rdx,%r13,8), %xmm4, %xmm6 #668.38 movq 232(%rsp), %r15 #669.28 vaddsd %xmm6, %xmm5, %xmm8 #668.26 vmovsd 8(%r15,%r13,8), %xmm7 #669.28 vmulsd 8(%rcx,%r13,8), %xmm7, %xmm9 #669.38 movq 224(%rsp), %r15 #670.28 vaddsd %xmm9, %xmm8, %xmm11 #669.26 vmovsd 8(%r15,%r13,8), %xmm10 #670.28 vmulsd 8(%rbx,%r13,8), %xmm10, %xmm12 #670.38 vaddsd %xmm12, %xmm11, %xmm13 #670.26 vmulsd %xmm13, %xmm0, %xmm0 #664.11 vmovsd %xmm0, 8(%r12,%r13,8) #664.11 incq %r13 #657.9 cmpq 304(%rsp), %r13 #657.9 jb ..B10.87 # Prob 81% #657.9 jmp ..B10.100 # Prob 100% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 xmm1 ..B10.89: # Preds ..B10.69 ..B10.70 ..B10.71 cmpq $0, 208(%rsp) #657.9 je ..B10.156 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.90: # Preds ..B10.89 cmpq $0, 424(%rsp) #657.9 je ..B10.156 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.91: # Preds ..B10.90 testq %rdx, %rdx #657.9 je ..B10.156 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.92: # Preds ..B10.72 ..B10.91 cmpq $0, 280(%rsp) #657.9 je ..B10.156 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.93: # Preds ..B10.92 cmpq $2, %rcx #657.9 jl ..B10.156 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.94: # Preds ..B10.93 movq 416(%rsp), %r14 # xorl %r15d, %r15d #657.9 movq %rcx, 264(%rsp) # andl $-2, %ecx #657.9 movq %rdx, 272(%rsp) # movslq %ecx, %rcx #657.9 movq %rcx, 320(%rsp) #657.9 lea (%r14,%rdx), %rcx # imulq 408(%rsp), %rdx # imulq %r11, %r13 # imulq %rdi, %r8 # subq %rdx, %rcx # movq 424(%rsp), %rdx # movq %rbx, 200(%rsp) # movq %r9, (%rsp) # movq %rcx, 72(%rsp) # lea (%rbx,%rdx), %r14 # movq %r10, %rbx # lea (%r10,%r10,2), %rcx #670.28 imulq %rax, %rdx # imulq %r9, %rbx # movq %r10, %r9 # subq %rdx, %r14 # subq %rbx, %r9 # lea (%r10,%r10), %rdx #668.28 addq %r14, %r9 # subq %rbx, %rdx # movq %r9, 144(%rsp) # movq %rcx, %r9 # subq %rbx, %r9 # addq %r14, %rdx # addq %r14, %r9 # addq %rcx, %rcx #669.28 movq %r9, 128(%rsp) # lea (%r10,%r10,4), %r9 #667.28 subq %rbx, %r9 # subq %rbx, %rcx # addq %r14, %r9 # addq %r14, %rcx # movq %r9, 112(%rsp) # movq 208(%rsp), %r9 # imulq %r9, %r12 # movq %rdx, 136(%rsp) # lea (,%r10,4), %rdx #665.28 subq %rbx, %rdx # subq %r9, %r12 # addq %r14, %rdx # lea (%r11,%r11,2), %r9 #659.40 movq %rdx, 120(%rsp) # lea (,%r10,8), %rdx #659.28 movq %rcx, 160(%rsp) # movq %rdx, %rcx #659.28 subq %r10, 
%rcx #659.28 subq %rbx, %rdx # subq %rbx, %rcx # addq %r14, %rdx # addq %r14, %rcx # movq %rcx, 152(%rsp) # lea (%rsi,%r9), %rcx # movq %rdx, 328(%rsp) # lea (%rsi,%r9,2), %rdx # subq %r13, %rdx # lea (%r10,%r10,8), %r9 #661.28 subq %rbx, %r9 # lea (%r11,%r11,8), %rbx #661.40 addq %rsi, %rbx # subq %r12, %rdx # subq %r13, %rbx # subq %r13, %rcx # subq %r12, %rbx # subq %r12, %rcx # movq %rdx, 336(%rsp) # addq %r9, %r14 # movq %rbx, 376(%rsp) # movq 280(%rsp), %rdx # movq 184(%rsp), %rbx # imulq %rdx, %rbx # movq 192(%rsp), %r9 # subq %rdx, %rbx # movq %rcx, 104(%rsp) # lea (%rdi,%rdi,2), %rcx #658.11 addq %r9, %rcx # lea (%r11,%rsi), %rdx # subq %r13, %rdx # subq %r8, %rcx # subq %r12, %rdx # subq %rbx, %rcx # movq %rdx, 360(%rsp) # lea (%rsi,%r11,4), %rdx # movq %rcx, 80(%rsp) # lea (%rsi,%r11,2), %rcx # subq %r13, %rdx # subq %r13, %rcx # subq %r12, %rdx # subq %r12, %rcx # movq %rdx, 344(%rsp) # lea (,%r11,8), %rdx #669.40 movq %rcx, 352(%rsp) # lea (%r11,%r11,4), %rcx #668.40 subq %r11, %rdx #669.40 addq %rsi, %rcx # addq %rsi, %rdx # lea (%rsi,%r11,8), %rsi # subq %r13, %rcx # lea (%r9,%rdi,4), %r11 # subq %r8, %r11 # xorl %edi, %edi # subq %r13, %rdx # subq %r13, %rsi # movq %r15, 88(%rsp) #657.9 subq %rbx, %r11 # xorl %r8d, %r8d # subq %r12, %rcx # movq %rcx, 368(%rsp) # subq %r12, %rdx # movq %rdx, 384(%rsp) # subq %r12, %rsi # movq %r10, 8(%rsp) # movq %rax, 16(%rsp) # vmovupd .L_2il0floatpacket.172(%rip), %xmm2 #664.38 movq %rsi, 392(%rsp) # xorl %esi, %esi # movq 272(%rsp), %rdx # movq 208(%rsp), %r10 # movq 280(%rsp), %r9 # movq 80(%rsp), %rax # movq 72(%rsp), %rcx # movq 88(%rsp), %r12 # movq 424(%rsp), %rbx # movq %r11, 96(%rsp) # # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r12 r14 r15 xmm2 ..B10.95: # Preds ..B10.95 ..B10.94 movq 152(%rsp), %r13 #659.28 lea (%rcx,%r15), %r11 #658.28 vmovsd (%r11), %xmm3 #658.28 addq $2, %r12 #657.9 vmovhpd (%r11,%rdx), %xmm3, %xmm1 #658.28 lea (%r15,%rdx,2), %r15 #657.9 addq %rdi, %r13 #659.28 vmovsd (%r13), %xmm4 #659.28 vmovhpd (%r13,%rbx), %xmm4, %xmm6 #659.28 movq 104(%rsp), %r13 #659.40 addq %r8, %r13 #659.40 vmovsd (%r13), %xmm5 #659.40 vmovhpd (%r13,%r10), %xmm5, %xmm7 #659.40 movq 328(%rsp), %r13 #660.28 vmulpd %xmm7, %xmm6, %xmm12 #659.38 addq %rdi, %r13 #660.28 vmovsd (%r13), %xmm8 #660.28 vmovhpd (%r13,%rbx), %xmm8, %xmm10 #660.28 movq 336(%rsp), %r13 #660.40 addq %r8, %r13 #660.40 vmovsd (%r13), %xmm9 #660.40 vmovhpd (%r13,%r10), %xmm9, %xmm11 #660.40 lea (%r14,%rdi), %r13 #661.28 vmovsd (%r13), %xmm14 #661.28 vmovhpd (%r13,%rbx), %xmm14, %xmm0 #661.28 movq 376(%rsp), %r13 #661.40 vmulpd %xmm11, %xmm10, %xmm13 #660.38 vaddpd %xmm13, %xmm12, %xmm3 #660.26 addq %r8, %r13 #661.40 vmovsd (%r13), %xmm15 #661.40 vmovhpd (%r13,%r10), %xmm15, %xmm14 #661.40 lea (%rax,%rsi), %r13 #658.11 vmulpd %xmm14, %xmm0, %xmm4 #661.38 vaddpd %xmm4, %xmm3, %xmm5 #661.26 vmulpd %xmm5, %xmm1, %xmm1 #658.11 vmovlpd %xmm1, (%r13) #658.11 vmovhpd %xmm1, (%r13,%r9) #658.11 vmovsd (%r11), %xmm6 #664.28 vmovhpd (%r11,%rdx), %xmm6, %xmm7 #664.28 movq 120(%rsp), %r11 #665.28 movq 360(%rsp), %r13 #665.40 vmulpd %xmm7, %xmm2, %xmm0 #664.38 addq %rdi, %r11 #665.28 vmovsd (%r11), %xmm8 #665.28 vmovhpd (%r11,%rbx), %xmm8, %xmm10 #665.28 lea (%r13,%r8), %r11 #665.40 movq 144(%rsp), %r13 #666.28 .byte 15 #665.40 .byte 31 #665.40 .byte 0 #665.40 vmovsd (%r11), %xmm9 #665.40 vmovhpd (%r11,%r10), %xmm9, %xmm11 #665.40 vmulpd %xmm11, %xmm10, %xmm1 #665.38 lea (%r13,%rdi), %r11 #666.28 movq 352(%rsp), %r13 #666.40 vmovsd (%r11), %xmm12 #666.28 vmovhpd (%r11,%rbx), %xmm12, 
%xmm15 #666.28 lea (%r13,%r8), %r11 #666.40 movq 112(%rsp), %r13 #667.28 vmovsd (%r11), %xmm13 #666.40 vmovhpd (%r11,%r10), %xmm13, %xmm12 #666.40 vmulpd %xmm12, %xmm15, %xmm3 #666.38 vaddpd %xmm3, %xmm1, %xmm8 #666.26 lea (%r13,%rdi), %r11 #667.28 movq 344(%rsp), %r13 #667.40 vmovsd (%r11), %xmm4 #667.28 vmovhpd (%r11,%rbx), %xmm4, %xmm6 #667.28 lea (%r13,%r8), %r11 #667.40 movq 136(%rsp), %r13 #668.28 vmovsd (%r11), %xmm5 #667.40 vmovhpd (%r11,%r10), %xmm5, %xmm7 #667.40 vmulpd %xmm7, %xmm6, %xmm9 #667.38 vaddpd %xmm9, %xmm8, %xmm15 #667.26 lea (%r13,%rdi), %r11 #668.28 movq 368(%rsp), %r13 #668.40 vmovsd (%r11), %xmm10 #668.28 vmovhpd (%r11,%rbx), %xmm10, %xmm13 #668.28 lea (%r13,%r8), %r11 #668.40 movq 160(%rsp), %r13 #669.28 vmovsd (%r11), %xmm11 #668.40 vmovhpd (%r11,%r10), %xmm11, %xmm14 #668.40 vmulpd %xmm14, %xmm13, %xmm1 #668.38 vaddpd %xmm1, %xmm15, %xmm7 #668.26 lea (%r13,%rdi), %r11 #669.28 movq 384(%rsp), %r13 #669.40 vmovsd (%r11), %xmm3 #669.28 vmovhpd (%r11,%rbx), %xmm3, %xmm5 #669.28 lea (%r13,%r8), %r11 #669.40 movq 128(%rsp), %r13 #670.28 vmovsd (%r11), %xmm4 #669.40 vmovhpd (%r11,%r10), %xmm4, %xmm6 #669.40 vmulpd %xmm6, %xmm5, %xmm8 #669.38 vaddpd %xmm8, %xmm7, %xmm13 #669.26 lea (%r13,%rdi), %r11 #670.28 movq 392(%rsp), %r13 #670.40 vmovsd (%r11), %xmm9 #670.28 lea (%rdi,%rbx,2), %rdi #657.9 vmovhpd (%r11,%rbx), %xmm9, %xmm11 #670.28 lea (%r13,%r8), %r11 #670.40 .byte 144 #664.11 movq 96(%rsp), %r13 #664.11 vmovsd (%r11), %xmm10 #670.40 lea (%r8,%r10,2), %r8 #657.9 vmovhpd (%r11,%r10), %xmm10, %xmm12 #670.40 vmulpd %xmm12, %xmm11, %xmm14 #670.38 vaddpd %xmm14, %xmm13, %xmm15 #670.26 vmulpd %xmm15, %xmm0, %xmm0 #664.11 lea (%r13,%rsi), %r11 #664.11 vmovlpd %xmm0, (%r11) #664.11 lea (%rsi,%r9,2), %rsi #657.9 vmovhpd %xmm0, (%r11,%r9) #664.11 cmpq 320(%rsp), %r12 #657.9 jb ..B10.95 # Prob 81% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r12 r14 r15 xmm2 ..B10.96: # Preds ..B10.95 movq 432(%rsp), %r11 #675.11 movq 400(%rsp), %r14 #676.40 movq 200(%rsp), %rbx # movq (%r11), %rdi #675.11 movq 64(%r11), %rsi #675.11 movq 56(%r11), %r12 #675.11 movq 56(%r14), %r15 #676.40 movq %rdi, 192(%rsp) #675.11 movq 88(%r11), %r8 #675.11 movq 80(%r11), %rdi #675.11 movq %rsi, 184(%rsp) #675.11 movq %r12, 280(%rsp) #675.11 movq 264(%rsp), %rcx # movq (%rsp), %r9 # movq 8(%rsp), %r10 # movq 16(%rsp), %rax # movq (%r14), %rsi #676.40 movq 64(%r14), %r12 #676.40 movq 88(%r14), %r13 #676.40 movq 80(%r14), %r11 #676.40 movq %r15, 208(%rsp) #676.40 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.97: # Preds ..B10.96 ..B10.156 movq 296(%rsp), %r15 # movq 320(%rsp), %r14 # imulq %r14, %r15 # movq %r15, 152(%rsp) # movq 424(%rsp), %r15 # imulq %r14, %r15 # movq %r15, 144(%rsp) # movq %rdx, %r15 # imulq %r14, %r15 # movq %r15, 160(%rsp) # movq 288(%rsp), %r15 # imulq %r14, %r15 # movq %r15, 136(%rsp) # cmpq %rcx, %r14 #657.9 jae ..B10.101 # Prob 3% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.98: # Preds ..B10.97 movq %r12, 24(%rsp) # movq %rdx, %r12 # imulq 408(%rsp), %r12 # movq 416(%rsp), %r14 # lea (,%r10,8), %r15 #659.28 movq %r13, 40(%rsp) # movq %rbx, 200(%rsp) # movq %rdi, 56(%rsp) # lea (%r14,%rdx), %r13 # subq %r12, %r13 # lea (%r10,%r10,4), %rdi #667.28 movq 424(%rsp), %r12 # addq %r13, 160(%rsp) # movq %r10, %r13 # imulq %r9, %r13 # movq %rcx, 264(%rsp) # lea (%rbx,%r12), %r14 # imulq %rax, %r12 # movq %r12, %rbx # movq %rdi, %rcx #667.28 subq %r10, %rbx # negq %rcx #667.28 negq %rbx # addq %r12, %rcx # addq %r14, %rbx # negq %rcx # subq 
%r13, %rbx # addq %r14, %rcx # movq %rsi, 48(%rsp) # lea (%r10,%r10), %rsi #668.28 movq %r11, 32(%rsp) # lea (%r10,%r10,2), %r11 #670.28 movq %rbx, 336(%rsp) # movq %r10, %rbx #659.28 negq %rsi # negq %r11 #670.28 subq %r15, %rbx #659.28 addq %r12, %rsi # addq %r12, %r11 # addq %r12, %rbx # negq %rsi # negq %r11 # negq %rbx # addq %r14, %rsi # addq %r14, %r11 # addq %r14, %rbx # subq %r13, %rcx # subq %r13, %rsi # movq %rcx, 80(%rsp) # subq %r13, %r11 # movq 176(%rsp), %rcx # subq %r13, %rbx # movq %rsi, 104(%rsp) # addq %r10, %rdi #669.28 movq %r11, 96(%rsp) # negq %rdi #669.28 movq %rbx, 72(%rsp) # addq %r12, %rdi # movq 216(%rsp), %rsi # negq %rdi # movq 168(%rsp), %r11 # negq %r15 # movq 296(%rsp), %rbx # addq %r14, %rdi # imulq %rcx, %rsi # imulq %rbx, %r11 # movq %r8, 64(%rsp) # lea (,%r10,4), %r8 #665.28 negq %r8 # addq %r12, %r15 # addq %r12, %r8 # subq %r13, %rdi # movq %rdi, 112(%rsp) # negq %r8 # movq 224(%rsp), %rdi # negq %r15 # addq %r14, %r8 # addq %r14, %r15 # addq %r11, %rsi # lea (%rcx,%rcx,2), %r11 #659.40 subq %r13, %r8 # subq %r13, %r15 # movq %r8, 88(%rsp) # lea (%rdi,%r11), %r8 # movq %r15, 128(%rsp) # lea (%rdi,%r11,2), %r15 # subq %rsi, %r8 # lea (%r10,%r10,8), %r11 #661.28 negq %r11 #661.28 addq %rbx, %r8 # addq %r11, %r12 # subq %rsi, %r15 # subq %r12, %r14 # lea (%rcx,%rcx,8), %r12 #661.40 addq %rdi, %r12 # subq %r13, %r14 # subq %rsi, %r12 # addq %rbx, %r15 # movq 240(%rsp), %r11 # addq %rbx, %r12 # movq %r8, 120(%rsp) # movq %r12, 352(%rsp) # movq 248(%rsp), %r12 # movq 232(%rsp), %r8 # movq 288(%rsp), %r13 # imulq %r11, %r12 # imulq %r13, %r8 # movq %r15, 344(%rsp) # addq %r8, %r12 # movq 256(%rsp), %r8 # lea (%r11,%r11,2), %r15 #658.11 addq %r8, %r15 # subq %r12, %r15 # addq %r13, %r15 # movq %r15, 392(%rsp) # lea (%rcx,%rdi), %r15 # subq %rsi, %r15 # addq %rbx, %r15 # movq %r15, 376(%rsp) # lea (%rdi,%rcx,2), %r15 # subq %rsi, %r15 # addq %rbx, %r15 # movq %r15, 368(%rsp) # lea (%rdi,%rcx,4), %r15 # subq %rsi, %r15 # addq %rbx, %r15 # movq %r15, 360(%rsp) # lea (%rcx,%rcx,4), %r15 #668.40 addq %rdi, %r15 # subq %rsi, %r15 # addq %rbx, %r15 # movq %r15, 384(%rsp) # lea (,%rcx,8), %r15 #669.40 subq %rcx, %r15 #669.40 lea (%rdi,%rcx,8), %rcx # addq %rdi, %r15 # subq %rsi, %rcx # subq %rsi, %r15 # lea (%r8,%r11,4), %rsi # subq %r12, %rsi # addq %rbx, %r15 # addq %rcx, %rbx # addq %rsi, %r13 # movq %rbx, 296(%rsp) # movq %r13, 288(%rsp) # movq %r14, 328(%rsp) #664.38 movq %rdx, 272(%rsp) #664.38 movq %r9, (%rsp) #664.38 movq %r10, 8(%rsp) #664.38 movq %rax, 16(%rsp) #664.38 movq %r15, 400(%rsp) # vmovsd .L_2il0floatpacket.171(%rip), %xmm1 #664.38 movq 128(%rsp), %r12 #664.38 movq 120(%rsp), %r11 #664.38 movq 72(%rsp), %r10 #664.38 movq 112(%rsp), %r14 #664.38 movq 80(%rsp), %r9 #664.38 movq 88(%rsp), %rax #664.38 movq 96(%rsp), %rdx #664.38 movq 104(%rsp), %rcx #664.38 movq 160(%rsp), %rsi #664.38 movq 136(%rsp), %r13 #664.38 movq 144(%rsp), %rbx #664.38 movq 152(%rsp), %rdi #664.38 movq 320(%rsp), %r8 #664.38 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 xmm1 ..B10.99: # Preds ..B10.99 ..B10.98 movq 344(%rsp), %r15 #660.38 incq %r8 #657.9 vmovsd (%rbx,%r12), %xmm3 #660.28 vmovsd (%rbx,%r10), %xmm2 #659.28 vmulsd (%rdi,%r15), %xmm3, %xmm5 #660.38 vmulsd (%rdi,%r11), %xmm2, %xmm4 #659.38 movq 328(%rsp), %r15 #661.28 vaddsd %xmm5, %xmm4, %xmm7 #660.26 vmovsd (%rbx,%r15), %xmm6 #661.28 movq 352(%rsp), %r15 #661.38 vmulsd (%rdi,%r15), %xmm6, %xmm8 #661.38 movq 392(%rsp), %r15 #658.11 vaddsd %xmm8, %xmm7, %xmm9 #661.26 vmulsd (%rsi), %xmm9, %xmm10 
#658.11 vmovsd %xmm10, (%r13,%r15) #658.11 movq 376(%rsp), %r15 #665.38 vmovsd (%rbx,%rax), %xmm11 #665.28 vmovsd (%rbx,%r9), %xmm15 #667.28 vmulsd (%rdi,%r15), %xmm11, %xmm13 #665.38 vmulsd (%rsi), %xmm1, %xmm0 #664.38 movq 336(%rsp), %r15 #666.28 vmovsd (%rbx,%rcx), %xmm4 #668.28 vmovsd (%rbx,%r14), %xmm7 #669.28 vmovsd (%rbx,%r15), %xmm12 #666.28 movq 368(%rsp), %r15 #666.38 vmovsd (%rbx,%rdx), %xmm10 #670.28 addq 272(%rsp), %rsi #657.9 vmulsd (%rdi,%r15), %xmm12, %xmm14 #666.38 movq 360(%rsp), %r15 #667.38 vaddsd %xmm14, %xmm13, %xmm2 #666.26 vmulsd (%rdi,%r15), %xmm15, %xmm3 #667.38 movq 384(%rsp), %r15 #668.38 vaddsd %xmm3, %xmm2, %xmm5 #667.26 vmulsd (%rdi,%r15), %xmm4, %xmm6 #668.38 movq 400(%rsp), %r15 #669.38 vaddsd %xmm6, %xmm5, %xmm8 #668.26 vmulsd (%rdi,%r15), %xmm7, %xmm9 #669.38 movq 296(%rsp), %r15 #670.38 vaddsd %xmm9, %xmm8, %xmm11 #669.26 vmulsd (%rdi,%r15), %xmm10, %xmm12 #670.38 movq 288(%rsp), %r15 #664.11 vaddsd %xmm12, %xmm11, %xmm13 #670.26 vmulsd %xmm13, %xmm0, %xmm0 #664.11 vmovsd %xmm0, (%r13,%r15) #664.11 addq 312(%rsp), %r13 #657.9 .byte 144 #657.9 addq 424(%rsp), %rbx #657.9 addq 304(%rsp), %rdi #657.9 cmpq 264(%rsp), %r8 #657.9 jb ..B10.99 # Prob 81% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 xmm1 ..B10.100: # Preds ..B10.87 ..B10.99 movq 200(%rsp), %rbx # movq 264(%rsp), %rcx # movq 24(%rsp), %r12 # movq 32(%rsp), %r11 # movq 40(%rsp), %r13 # movq 48(%rsp), %rsi # movq 56(%rsp), %rdi # movq 64(%rsp), %r8 # movq 272(%rsp), %rdx # movq (%rsp), %r9 # movq 8(%rsp), %r10 # movq 16(%rsp), %rax # # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.101: # Preds ..B10.85 ..B10.100 ..B10.97 cmpq $8, %rdx #676.40
At asm lines 505-576 you have full AVX-256 vector code, where the memory access is split into AVX-128 chunks to allow for non-32-byte alignment without the extreme performance penalties which would be incurred on Sandy Bridge with unaligned AVX-256 moves. It's still possible for this to use the full bandwidth of your L2 cache and memory.
At 948-1061 you have another version which appears to allow for total mis-alignment, with memory access split into 64-bit scalar chunks. If the alignment is so bad, there's no point in packing into more than AVX-128 for the floating point operations.
Both versions have scalar remainder loops.
You would want to check at run time that most of the work is done in the AVX-256 loop. This is not difficult with oprofile or VTune.
Did you check opt-report to see if there is any remark about this versioning?
I'm with Tim. Use VTune to see whether the preponderance of operations occurs in the ..B10.83 loop.
You appear to have three different computation loops, any of which may be entered depending on alignment and counts. You also have a peel loop and two residual loops.
Jim Dempsey
Thanks for your reply.
I think the vectorization report suggests the main loop body is vectorized with AVX-256:
src/ModNavierStokesRHS.f90(657): (col. 9) remark: SIMD LOOP WAS VECTORIZED.
src/ModNavierStokesRHS.f90(657): (col. 9) remark: loop was not vectorized: unsupported data type.
src/ModNavierStokesRHS.f90(657): (col. 9) warning #13379: loop was not vectorized with "simd"
src/ModNavierStokesRHS.f90(657): (col. 9) remark: SIMD LOOP WAS VECTORIZED.
I also want to add that the loop extent Nc is quite large. In my current case Nc = 110592.
The report may say it is vectorized, but there are also three code paths. Using VTune will show whether the fully vectorized loop is executed the majority of the time (or whether it is not used that much).
Jim Dempsey
