Hi,
I have a question about AVX instructions. I compiled my code using ifort 13 with -O2 and -xHost. I want to enable 256-bit wide AVX to perform four 64-bit floating-point operations per cycle.
Here is my first code piece:
623 !DIR$ SIMD
624 do ii = 1, Nc
625    ! diagonal components first
626    StrnRt(ii,1) = JAC(ii) * ( &
627        MT1(ii,1) * VelGrad1st(ii,1) &
628      + MT1(ii,2) * VelGrad1st(ii,3) )
...
640 end do
The assembly files show that the following instructions were generated for line 627:
vmulsd 8(%r8,%r14,8), %xmm1, %xmm3 #627.38
vmulpd %xmm6, %xmm5, %xmm11 #627.38
vmulpd %ymm5, %ymm4, %ymm10 #627.38
I understand why I got vmulsd. My question is why vmulpd %xmm6, %xmm5, %xmm11 was generated and what it stands for. I think vmulpd is an AVX instruction and should use ymm registers to give 256-bit wide vectorization.
For the second code piece:
643 !DIR$ SIMD
644 do ii = 1, Nc
645    ! diagonal components first
646    StrnRt(ii,1) = JAC(ii) * ( &
647        MT1(ii,1) * VelGrad1st(ii,1) &
648      + MT1(ii,2) * VelGrad1st(ii,4) &
649      + MT1(ii,3) * VelGrad1st(ii,7) )
...
685 end do
The assembly files show that the following instructions were generated for line 647:
vmulsd (%r12), %xmm4, %xmm6 #647.38
vmulpd %xmm11, %xmm10, %xmm0 #647.38
Here again I got vmulpd with xmm; I did not get vmulpd with ymm at all. I am worried that this code piece is only performing two 64-bit floating-point operations per cycle, rather than four.
I truly appreciate your help.
Best regards,
Wentao
In order to generate AVX instructions, you need -mAVX or -xHost.
When the compiler does not know the alignment of the arrays, it might revert to using xmm registers. A cache-line-split load, which may be permitted on some processor architectures, is often slower than performing two (or four) narrower operations in sequence. IOW, with the narrower operations the processor doesn't stall.
IIF the first dimension (Nc) is a multiple of your vector size .and. you know JAC, MT1 and VelGrad1st are vector aligned (for AVX), then try using:
!DIR$ SIMD VECTORLENGTH(4)
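For example, a minimal sketch on your first loop (check the exact directive spelling against your ifort 13 documentation, and remember that the alignment promise must actually hold, otherwise this is unsafe):

!DIR$ VECTOR ALIGNED
!DIR$ SIMD VECTORLENGTH(4)
do ii = 1, Nc
   ! diagonal components first
   StrnRt(ii,1) = JAC(ii) * ( &
       MT1(ii,1) * VelGrad1st(ii,1) &
     + MT1(ii,2) * VelGrad1st(ii,3) )
end do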
Jim Dempsey
Thank you for your reply.
Your explanations helped me understand the behavior in the first code piece.
I have also found out why I did not get AVX ymm in the second code piece: the loop body is too large (I only show part of it here), so there are probably not enough ymm registers. After I divided the big loop body into two smaller loops, AVX ymm instructions were generated for the second code piece.
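Schematically, the split was along these lines (the declarations and the exact grouping of statements here are only illustrative, not my real module code):

subroutine ComputeStrnRtSplit(Nc, JAC, MT1, VelGrad1st, StrnRt)
   implicit none
   integer, parameter :: rfreal = selected_real_kind(15)   ! illustrative kind
   integer,      intent(in)  :: Nc
   real(rfreal), intent(in)  :: JAC(Nc), MT1(Nc,9), VelGrad1st(Nc,9)
   real(rfreal), intent(out) :: StrnRt(Nc,6)
   integer :: ii

   ! first, smaller loop: diagonal components
   !DIR$ SIMD
   do ii = 1, Nc
      StrnRt(ii,1) = JAC(ii) * ( MT1(ii,1) * VelGrad1st(ii,1) &
                               + MT1(ii,2) * VelGrad1st(ii,4) &
                               + MT1(ii,3) * VelGrad1st(ii,7) )
      StrnRt(ii,2) = JAC(ii) * ( MT1(ii,4) * VelGrad1st(ii,2) &
                               + MT1(ii,5) * VelGrad1st(ii,5) &
                               + MT1(ii,6) * VelGrad1st(ii,8) )
   end do

   ! second, smaller loop: remaining components
   !DIR$ SIMD
   do ii = 1, Nc
      StrnRt(ii,3) = JAC(ii) * ( MT1(ii,7) * VelGrad1st(ii,3) &
                               + MT1(ii,8) * VelGrad1st(ii,6) &
                               + MT1(ii,9) * VelGrad1st(ii,9) )
      StrnRt(ii,4) = JAC(ii) * 0.5_rfreal * ( MT1(ii,4) * VelGrad1st(ii,1) &
                                            + MT1(ii,1) * VelGrad1st(ii,2) &
                                            + MT1(ii,5) * VelGrad1st(ii,4) &
                                            + MT1(ii,2) * VelGrad1st(ii,5) &
                                            + MT1(ii,6) * VelGrad1st(ii,7) &
                                            + MT1(ii,3) * VelGrad1st(ii,8) )
   end do
end subroutine ComputeStrnRtSplit

With fewer values live per iteration, the compiler was then willing to use full-width ymm operations.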
Best regards,
Wentao
Have you looked at the complete code of the loop?
On loops where the alignment is not known, there is usually extra code, called peel code, that runs the loop at smaller vector widths until alignment is attained; it then continues with the wider vector width. Could it be that you are examining the peel portion of the code?
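To illustrate what I mean by the peel/vector/remainder sections, here is a hand-written Fortran sketch of the decomposition the compiler performs (purely illustrative source, not compiler output; the real peel count is computed at run time from the actual addresses):

program peel_demo
   implicit none
   integer, parameter :: n = 1000, vl = 4
   real(8) :: a(n), b(n), c(n)
   integer :: ii, peel, nmain
   call random_number(a)
   call random_number(b)
   peel  = 3                              ! placeholder peel count
   nmain = peel + ((n - peel) / vl) * vl  ! last index covered by full vectors
   do ii = 1, peel                        ! peel section: scalar (vmulsd)
      c(ii) = a(ii) * b(ii)
   end do
   do ii = peel + 1, nmain, vl            ! main section: full-width vector (vmulpd %ymm)
      c(ii:ii+vl-1) = a(ii:ii+vl-1) * b(ii:ii+vl-1)
   end do
   do ii = nmain + 1, n                   ! remainder section: scalar again
      c(ii) = a(ii) * b(ii)
   end do
   print *, c(1), c(n)
end program peel_demo

A grep of the corresponding .s file shows scalar, 128-bit, and 256-bit multiplies all attributed to the same source line, which is what you are seeing.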
Jim Dempsey
Hi Jim,
Here is the code piece and corresponding assembly for line 647:
643 !DIR$ SIMD
644 do ii = 1, Nc
645    ! diagonal components first
646    StrnRt(ii,1) = JAC(ii) * ( &
647        MT1(ii,1) * VelGrad1st(ii,1) &
648      + MT1(ii,2) * VelGrad1st(ii,4) &
649      + MT1(ii,3) * VelGrad1st(ii,7) )
650
651    StrnRt(ii,2) = JAC(ii) * ( &
652        MT1(ii,4) * VelGrad1st(ii,2) &
653      + MT1(ii,5) * VelGrad1st(ii,5) &
654      + MT1(ii,6) * VelGrad1st(ii,8) )
655
656    StrnRt(ii,3) = JAC(ii) * ( &
657        MT1(ii,7) * VelGrad1st(ii,3) &
658      + MT1(ii,8) * VelGrad1st(ii,6) &
659      + MT1(ii,9) * VelGrad1st(ii,9) )
660
661 end do
login4$ grep '#647' ModNavierStokesRHS.s
movq 344(%rsp), %rbx #647.40
movq 88(%rbx), %rsi #647.40
movq %rsi, 112(%rsp) #647.40
movq %rsi, 160(%rsp) #647.40
movq (%rbx), %r10 #647.40
movq 64(%rbx), %r8 #647.40
movq 80(%rbx), %r9 #647.40
movq 56(%rbx), %rsi #647.40
movq %r10, 208(%rsp) #647.40
movq %r9, 104(%rsp) #647.40
movq %r8, 96(%rsp) #647.40
movq %rsi, 408(%rsp) #647.40
movq %rsi, 264(%rsp) #647.40
vmovsd 8(%rcx,%r11,8), %xmm0 #647.28
vmulsd 8(%rdx,%r11,8), %xmm0, %xmm2 #647.38
vmovupd 8(%r10,%r11,8), %xmm0 #647.28
vmovupd 8(%r14,%r11,8), %xmm1 #647.40
vinsertf128 $1, 24(%r10,%r11,8), %ymm0, %ymm2 #647.28
vinsertf128 $1, 24(%r14,%r11,8), %ymm1, %ymm3 #647.40
vmulpd %ymm3, %ymm2, %ymm8 #647.38
movq 280(%rsp), %r15 #647.28
vmovsd 8(%r15,%r11,8), %xmm0 #647.28
movq 272(%rsp), %r15 #647.38
vmulsd 8(%r15,%r11,8), %xmm0, %xmm2 #647.38
movq 296(%rsp), %r15 #647.28
addq %r8, %r15 #647.28
vmovsd (%r15), %xmm4 #647.28
vmovhpd (%r15,%rdx), %xmm4, %xmm6 #647.28
movq 152(%rsp), %r15 #647.40
addq %r9, %r15 #647.40
vmovsd (%r15), %xmm5 #647.40
vmovhpd (%r15,%r13), %xmm5, %xmm7 #647.40
vmulpd %xmm7, %xmm6, %xmm12 #647.38
movq 304(%rsp), %r15 #647.28
vmovsd (%rbx,%r15), %xmm0 #647.28
vmulsd (%rdi,%rcx), %xmm0, %xmm2 #647.38
The assembly lines for line 647 are a mixture of three things:
vmulsd (scalar)
vmulpd %xmm1, %xmm2, %xmm3
vmulpd %ymm1, %ymm2, %ymm3
I think the main loop body has been vectorized with AVX-256. The scalar and AVX-128 instructions appear here to deal with the peel and remainder loops.
Best regards,
Wentao
The code shown here is AVX-256 vectorized (on only one operation), but the memory accesses are effectively scalar, repacking data into 256-bit registers to allow for mis-alignment without penalty on early AVX platforms. If you could assure the compiler of 32-byte data alignment, you could get AVX-256 memory references. The usual ways of doing that are combinations of -align array32byte and !dir$ vector aligned or assume_aligned assertions. You aren't getting significant benefit from AVX in this code. Apparently, the compiler sees so many possible combinations of mis-alignment that it basically gives up on useful AVX.
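For example, a minimal sketch (directive spellings as documented for ifort; the 32-byte alignment itself still has to be guaranteed, e.g. by compiling with -align array32byte):

!DIR$ ASSUME_ALIGNED JAC:32, MT1:32, VelGrad1st:32, StrnRt:32
!DIR$ SIMD
do ii = 1, Nc
   StrnRt(ii,1) = JAC(ii) * ( &
       MT1(ii,1) * VelGrad1st(ii,1) &
     + MT1(ii,2) * VelGrad1st(ii,4) &
     + MT1(ii,3) * VelGrad1st(ii,7) )
end do

Note that for every column MT1(:,k) and VelGrad1st(:,k) to start aligned, the leading dimension also has to be a multiple of 4 doubles, as Jim pointed out earlier.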
If you are using 32-bit compile mode, you may be losing optimization due to not having enough independent pointer registers available. That would be a case where 64-bit mode could make a big difference.
The grep listing is no good for this purpose.
What you have above are code lines from all three potential sections of the compiled loop:
The peel section (a mix of scalars), the vectorized loop (note, there may be multiple of these, with one executed on a given machine for a particular alignment), then finally the residual section.
IOW, your listing is a mish-mash of the code of interest.
Do your grep for "6[456].[.]". Or simply look for the section yourself, and copy/paste the appropriate code.
This will get the full range of the loop (and a tad more).
Jim Dempsey
Thanks for your reply. I did not show the complete picture because I worried it would be too lengthy.
Here is the code:
656 !DIR$ SIMD
657 do ii = 1, Nc
658    StrnRt(ii,3) = JAC(ii) * ( &
659        MT1(ii,7) * VelGrad1st(ii,3) &
660      + MT1(ii,8) * VelGrad1st(ii,6) &
661      + MT1(ii,9) * VelGrad1st(ii,9) )
662
663    ! upper-half part of strain-rate tensor due to symmetry
664    StrnRt(ii,4) = JAC(ii) * 0.5_rfreal * ( &
665        MT1(ii,4) * VelGrad1st(ii,1) &
666      + MT1(ii,1) * VelGrad1st(ii,2) &
667      + MT1(ii,5) * VelGrad1st(ii,4) &
668      + MT1(ii,2) * VelGrad1st(ii,5) &
669      + MT1(ii,6) * VelGrad1st(ii,7) &
670      + MT1(ii,3) * VelGrad1st(ii,8) )
671 end do
Here are the corresponding assembly lines:
jb ..B10.63 # Prob 81% #644.9 # LOE rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 ..B10.64: # Preds ..B10.63 movq 432(%rsp), %r11 #658.11 movq 400(%rsp), %r14 #659.40 movq 200(%rsp), %rbx # movq (%r11), %rdi #658.11 movq 64(%r11), %rsi #658.11 movq 56(%r11), %r12 #658.11 movq 56(%r14), %r15 #659.40 movq %rdi, 192(%rsp) #658.11 movq 88(%r11), %r8 #658.11 movq 80(%r11), %rdi #658.11 movq %rsi, 184(%rsp) #658.11 movq %r12, 280(%rsp) #658.11 movq 264(%rsp), %rcx # movq (%rsp), %r9 # movq 8(%rsp), %r10 # movq 16(%rsp), %rax # movq (%r14), %rsi #659.40 movq 64(%r14), %r12 #659.40 movq 88(%r14), %r13 #659.40 movq 80(%r14), %r11 #659.40 movq %r15, 208(%rsp) #659.40 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.65: # Preds ..B10.64 ..B10.161 movq 152(%rsp), %r15 # movq 304(%rsp), %r14 # imulq %r14, %r15 # movq %r15, 136(%rsp) # movq 424(%rsp), %r15 # imulq %r14, %r15 # movq %r15, 128(%rsp) # movq %rdx, %r15 # imulq %r14, %r15 # movq %r15, 144(%rsp) # movq 256(%rsp), %r15 # imulq %r14, %r15 # movq %r15, 120(%rsp) # cmpq %rcx, %r14 #644.9 jae ..B10.69 # Prob 3% #644.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.66: # Preds ..B10.65 movq 416(%rsp), %r14 # movq %rdx, 272(%rsp) # movq %r13, 40(%rsp) # movq %r10, %r13 # movq %r11, 32(%rsp) # lea (%r14,%rdx), %r15 # imulq 408(%rsp), %rdx # imulq %r9, %r13 # subq %rdx, %r15 # movq 424(%rsp), %rdx # movq 168(%rsp), %r11 # movq %r8, 64(%rsp) # movq %rsi, 48(%rsp) # lea (%rbx,%rdx), %r14 # imulq %rax, %rdx # movq %r12, 24(%rsp) # movq 176(%rsp), %r12 # movq 160(%rsp), %rsi # movq 152(%rsp), %r8 # imulq %r11, %r12 # imulq %r8, %rsi # movq %rcx, 264(%rsp) # movq %rdx, %rcx # subq %r10, %rcx # addq %rsi, %r12 # negq %rcx # lea (,%r11,8), %rsi #649.40 addq %r14, %rcx # subq %r11, %rsi #649.40 movq %rdi, 56(%rsp) # subq %r13, %rcx # movq 216(%rsp), %rdi # movq %rax, 16(%rsp) # lea (%r10,%r10), %rax #648.28 movq %rcx, 72(%rsp) # lea (%r10,%r10,2), %rcx #649.28 negq %rax # negq %rcx #649.28 addq %rdx, %rax # addq %rdx, %rcx # movq %rbx, 200(%rsp) # lea (%rdi,%r11), %rbx # negq %rax # negq %rcx # addq %rdi, %rsi # subq %r12, %rbx # addq %r14, %rax # addq %r14, %rcx # subq %r12, %rsi # addq %r8, %rbx # movq %rbx, 80(%rsp) # subq %r13, %rax # movq 232(%rsp), %rbx # subq %r13, %rcx # addq %r8, %rsi # movq %rax, 88(%rsp) # movq %rcx, 328(%rsp) # movq %rsi, 320(%rsp) # movq 240(%rsp), %rcx # movq 224(%rsp), %rax # movq 256(%rsp), %rsi # imulq %rbx, %rcx # imulq %rsi, %rax # movq %r9, (%rsp) # lea (%rdi,%r11,4), %r9 # subq %r12, %r9 # addq %rax, %rcx # movq 248(%rsp), %rax # addq %r8, %r9 # movq %r9, 312(%rsp) # lea (,%r10,4), %r9 #652.28 negq %r9 # addq %r15, 144(%rsp) # lea (%rax,%rbx), %r15 # addq %rdx, %r9 # subq %rcx, %r15 # negq %r9 # addq %rsi, %r15 # movq %r15, 344(%rsp) # addq %r14, %r9 # subq %r13, %r9 # lea (%rdi,%r11,2), %r15 # subq %r12, %r15 # movq %r9, 112(%rsp) # lea (%r10,%r10,4), %r9 #653.28 addq %r8, %r15 # movq %r15, 336(%rsp) # movq %r9, %r15 #653.28 negq %r15 #653.28 addq %r10, %r9 #654.28 addq %rdx, %r15 # negq %r9 #654.28 negq %r15 # addq %r9, %rdx # addq %r14, %r15 # subq %rdx, %r14 # subq %r13, %r15 # subq %r13, %r14 # movq %r15, 104(%rsp) # lea (%r11,%r11,4), %r15 #653.40 addq %rdi, %r15 # lea (%rdi,%r11,8), %rdi # subq %r12, %r15 # subq %r12, %rdi # addq %r8, %r15 # addq %rdi, %r8 # movq %r8, 152(%rsp) # lea (%rax,%rbx,2), %r8 # subq %rcx, %r8 # movq %r15, 96(%rsp) # addq %r8, %rsi # movq %rsi, 256(%rsp) # movq %r10, 8(%rsp) # movq 152(%rsp), %r11 # movq 96(%rsp), %r9 # movq 104(%rsp), %r13 # movq 112(%rsp), 
%r8 # movq 88(%rsp), %r10 # movq 80(%rsp), %rax # movq 72(%rsp), %rdx # movq 144(%rsp), %rbx # movq 120(%rsp), %r12 # movq 128(%rsp), %rcx # movq 136(%rsp), %rsi # movq 304(%rsp), %rdi # # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 ..B10.67: # Preds ..B10.67 ..B10.66 movq 312(%rsp), %r15 #648.38 incq %rdi #644.9 vmovsd (%rcx,%r10), %xmm1 #648.28 vmovsd (%rcx,%rdx), %xmm0 #647.28 vmulsd (%rsi,%r15), %xmm1, %xmm3 #648.38 vmulsd (%rsi,%rax), %xmm0, %xmm2 #647.38 movq 328(%rsp), %r15 #649.28 vaddsd %xmm3, %xmm2, %xmm5 #648.26 vmovsd (%rcx,%r15), %xmm4 #649.28 movq 320(%rsp), %r15 #649.38 vmulsd (%rsi,%r15), %xmm4, %xmm6 #649.38 movq 344(%rsp), %r15 #646.11 vaddsd %xmm6, %xmm5, %xmm7 #649.26 vmulsd (%rbx), %xmm7, %xmm8 #646.11 vmovsd %xmm8, (%r12,%r15) #646.11 movq 336(%rsp), %r15 #652.38 vmovsd (%rcx,%r8), %xmm9 #652.28 vmovsd (%rcx,%r13), %xmm10 #653.28 vmulsd (%rsi,%r15), %xmm9, %xmm11 #652.38 vmulsd (%rsi,%r9), %xmm10, %xmm12 #653.38 vmovsd (%rcx,%r14), %xmm13 #654.28 vaddsd %xmm12, %xmm11, %xmm14 #653.26 vmulsd (%rsi,%r11), %xmm13, %xmm15 #654.38 movq 256(%rsp), %r15 #651.11 vaddsd %xmm15, %xmm14, %xmm0 #654.26 vmulsd (%rbx), %xmm0, %xmm1 #651.11 vmovsd %xmm1, (%r12,%r15) #651.11 addq 296(%rsp), %r12 #644.9 addq 272(%rsp), %rbx #644.9 addq 424(%rsp), %rcx #644.9 addq 288(%rsp), %rsi #644.9 cmpq 264(%rsp), %rdi #644.9 jb ..B10.67 # Prob 81% #644.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 ..B10.68: # Preds ..B10.55 ..B10.67 movq 200(%rsp), %rbx # movq 264(%rsp), %rcx # movq 24(%rsp), %r12 # movq 32(%rsp), %r11 # movq 40(%rsp), %r13 # movq 48(%rsp), %rsi # movq 56(%rsp), %rdi # movq 64(%rsp), %r8 # movq 272(%rsp), %rdx # movq (%rsp), %r9 # movq 8(%rsp), %r10 # movq 16(%rsp), %rax # # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.69: # Preds ..B10.53 ..B10.68 ..B10.65 movq 192(%rsp), %r14 #658.11 movq 184(%rsp), %r15 #658.11 movq %r14, 256(%rsp) #658.11 movq %r15, 232(%rsp) #658.11 movq 280(%rsp), %r14 #658.11 movq 208(%rsp), %r15 #659.40 movq %r8, 248(%rsp) #658.11 movq %rdi, 240(%rsp) #658.11 movq %r14, 288(%rsp) #658.11 movq %r14, 312(%rsp) # movq %rsi, 224(%rsp) #659.40 movq %r13, 216(%rsp) #659.40 movq %r11, 176(%rsp) #659.40 movq %r12, 168(%rsp) #659.40 movq %r15, 296(%rsp) #659.40 movq %r15, 304(%rsp) # cmpq $8, %rdx #657.9 jne ..B10.89 # Prob 50% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.70: # Preds ..B10.69 cmpq $8, 424(%rsp) #657.9 jne ..B10.89 # Prob 50% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.71: # Preds ..B10.70 cmpq $8, 208(%rsp) #657.9 jne ..B10.89 # Prob 50% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.72: # Preds ..B10.71 cmpq $8, 280(%rsp) #657.9 jne ..B10.92 # Prob 50% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.73: # Preds ..B10.72 cmpq $4, %rcx #657.9 jl ..B10.145 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.74: # Preds ..B10.73 movq 408(%rsp), %r14 #657.9 lea (,%r14,8), %r15 #657.9 negq %r15 #657.9 addq 416(%rsp), %r15 #657.9 movq %r15, 72(%rsp) #657.9 lea 8(%r15), %r14 #657.9 andq $31, %r14 #657.9 movl %r14d, 80(%rsp) #657.9 testl %r14d, %r14d #657.9 je ..B10.77 # Prob 50% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 
r10 r11 r12 r13 r14 esi edi r8d r11d r12d r13d r14d sil dil r8b r11b r12b r13b r14b ..B10.75: # Preds ..B10.74 testb $7, 80(%rsp) #657.9 jne ..B10.145 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 esi edi r8d r11d r12d r13d r14d sil dil r8b r11b r12b r13b r14b ..B10.76: # Preds ..B10.75 negl %r14d #657.9 addl $32, %r14d #657.9 shrl $3, %r14d #657.9 movl %r14d, 80(%rsp) #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 esi edi r8d r11d r12d r13d r14d sil dil r8b r11b r12b r13b r14b ..B10.77: # Preds ..B10.76 ..B10.74 lea 4(%r14), %r15d #657.9 cmpq %r15, %rcx #657.9 jl ..B10.145 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 esi edi r8d r11d r12d r13d r14d sil dil r8b r11b r12b r13b r14b ..B10.78: # Preds ..B10.77 movq %rcx, 264(%rsp) # movq %rdx, 272(%rsp) # movl %ecx, %edx #657.9 subl %r14d, %edx #657.9 andl $3, %edx #657.9 subl %edx, %ecx #657.9 movq %r10, %rdx # imulq %r9, %rdx # movl %ecx, 168(%rsp) #657.9 lea (,%rax,8), %rcx # negq %rcx # movq %rbx, 200(%rsp) # addq %rbx, %rcx # movl %r14d, %r14d #657.9 lea (%r10,%r10,2), %rbx #669.28 movq $0, 128(%rsp) #657.9 lea (%rbx,%rbx), %r15 #669.28 subq %rdx, %r15 # subq %rdx, %rbx # addq %rcx, %r15 # addq %rcx, %rbx # movq %r15, 216(%rsp) # lea (%r10,%r10,4), %r15 #667.28 subq %rdx, %r15 # addq %rcx, %r15 # movq %r15, 224(%rsp) # lea (,%r10,4), %r15 #665.28 subq %rdx, %r15 # addq %rcx, %r15 # movq %r15, 232(%rsp) # lea (%r10,%r10), %r15 #668.28 movq %rbx, 240(%rsp) # lea (%r10,%r10,8), %rbx #661.28 subq %rdx, %r15 # subq %rdx, %rbx # addq %rcx, %r15 # addq %rcx, %rbx # movq %r15, 96(%rsp) # lea (,%r10,8), %r15 #660.28 movq %rbx, 104(%rsp) # movq %r15, %rbx # subq %rdx, %rbx # subq %r10, %r15 #659.28 addq %rcx, %rbx # subq %rdx, %r15 # movq %rbx, 88(%rsp) # movq %r10, %rbx # subq %rdx, %rbx # movq %r13, %rdx # imulq %r11, %rdx # addq %rcx, %rbx # addq %rcx, %r15 # movq %rbx, 80(%rsp) # lea (%rdx,%r12,8), %rbx # movq %r15, 248(%rsp) # lea (%r11,%r11,2), %rdx #659.40 movq %r14, 136(%rsp) #657.9 lea (%rsi,%rdx), %rcx # subq %rbx, %rcx # lea (%rsi,%rdx,2), %r15 # movq %rcx, 112(%rsp) # lea (%r11,%r11,8), %rdx #661.40 addq %rsi, %rdx # lea (%r11,%rsi), %rcx # subq %rbx, %rdx # subq %rbx, %rcx # movq %rdx, 304(%rsp) # lea (%rsi,%r11,2), %rdx # subq %rbx, %rdx # subq %rbx, %r15 # movq %rdx, 288(%rsp) # lea (%r11,%r11,4), %rdx #668.40 movq %rcx, 296(%rsp) # lea (,%r11,8), %rcx #669.40 addq %rsi, %rdx # subq %r11, %rcx #669.40 subq %rbx, %rdx # addq %rsi, %rcx # movq %r15, 120(%rsp) # lea (%rsi,%r11,4), %r15 # movq %rdx, 312(%rsp) # lea (%rsi,%r11,8), %rdx # subq %rbx, %r15 # subq %rbx, %rcx # subq %rbx, %rdx # movq %r8, %rbx # imulq %rdi, %rbx # movq %rdx, 152(%rsp) # movq 184(%rsp), %rdx # movq %r15, 256(%rsp) # movq %rcx, 160(%rsp) # lea (%rdi,%rdi,2), %rcx #658.11 lea (%rbx,%rdx,8), %r15 # movq 192(%rsp), %rbx # addq %rbx, %rcx # subq %r15, %rcx # movq %rcx, 176(%rsp) # lea (%rbx,%rdi,4), %rdx # subq %r15, %rdx # movq %rdx, 144(%rsp) # movq 264(%rsp), %rcx #657.9 movq 200(%rsp), %rbx #657.9 movq 272(%rsp), %rdx #657.9 testq %r14, %r14 #657.9 jbe ..B10.82 # Prob 3% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.79: # Preds ..B10.78 movq %rbx, 200(%rsp) #664.38 movq %rcx, 264(%rsp) #664.38 movq %r12, 24(%rsp) #664.38 movq %r11, 32(%rsp) #664.38 movq %r13, 40(%rsp) #664.38 movq %rsi, 48(%rsp) #664.38 movq %rdi, 56(%rsp) #664.38 movq %r8, 64(%rsp) #664.38 movq %rdx, 272(%rsp) #664.38 movq %r9, (%rsp) 
#664.38 movq %r10, 8(%rsp) #664.38 movq %rax, 16(%rsp) #664.38 vmovsd .L_2il0floatpacket.171(%rip), %xmm1 #664.38 movq 312(%rsp), %r12 #664.38 movq 256(%rsp), %r11 #664.38 movq 288(%rsp), %r10 #664.38 movq 296(%rsp), %r14 #664.38 movq 304(%rsp), %r9 #664.38 movq 120(%rsp), %rax #664.38 movq 112(%rsp), %rdx #664.38 movq 80(%rsp), %rcx #664.38 movq 88(%rsp), %rbx #664.38 movq 96(%rsp), %rsi #664.38 movq 104(%rsp), %rdi #664.38 movq 128(%rsp), %r13 #664.38 movq 72(%rsp), %r8 #664.38 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 xmm1 ..B10.80: # Preds ..B10.80 ..B10.79 movq 248(%rsp), %r15 #659.28 vmovsd 8(%rbx,%r13,8), %xmm3 #660.28 vmulsd 8(%rax,%r13,8), %xmm3, %xmm5 #660.38 vmovsd 8(%r15,%r13,8), %xmm2 #659.28 vmulsd 8(%rdx,%r13,8), %xmm2, %xmm4 #659.38 vmovsd 8(%rdi,%r13,8), %xmm6 #661.28 vaddsd %xmm5, %xmm4, %xmm7 #660.26 vmulsd 8(%r9,%r13,8), %xmm6, %xmm8 #661.38 movq 176(%rsp), %r15 #658.11 vaddsd %xmm8, %xmm7, %xmm9 #661.26 vmulsd 8(%r8,%r13,8), %xmm9, %xmm10 #658.11 vmovsd %xmm10, 8(%r15,%r13,8) #658.11 movq 232(%rsp), %r15 #665.28 vmovsd 8(%rcx,%r13,8), %xmm12 #666.28 vmulsd 8(%r10,%r13,8), %xmm12, %xmm14 #666.38 vmulsd 8(%r8,%r13,8), %xmm1, %xmm0 #664.38 vmovsd 8(%r15,%r13,8), %xmm11 #665.28 vmulsd 8(%r14,%r13,8), %xmm11, %xmm13 #665.38 movq 224(%rsp), %r15 #667.28 vaddsd %xmm14, %xmm13, %xmm2 #666.26 vmovsd 8(%r15,%r13,8), %xmm15 #667.28 vmulsd 8(%r11,%r13,8), %xmm15, %xmm3 #667.38 movq 216(%rsp), %r15 #669.28 vaddsd %xmm3, %xmm2, %xmm5 #667.26 vmovsd 8(%rsi,%r13,8), %xmm4 #668.28 vmulsd 8(%r12,%r13,8), %xmm4, %xmm6 #668.38 vmovsd 8(%r15,%r13,8), %xmm7 #669.28 vaddsd %xmm6, %xmm5, %xmm8 #668.26 movq 160(%rsp), %r15 #669.38 vmulsd 8(%r15,%r13,8), %xmm7, %xmm9 #669.38 movq 240(%rsp), %r15 #670.28 vaddsd %xmm9, %xmm8, %xmm11 #669.26 vmovsd 8(%r15,%r13,8), %xmm10 #670.28 movq 152(%rsp), %r15 #670.38 vmulsd 8(%r15,%r13,8), %xmm10, %xmm12 #670.38 movq 144(%rsp), %r15 #664.11 vaddsd %xmm12, %xmm11, %xmm13 #670.26 vmulsd %xmm13, %xmm0, %xmm0 #664.11 vmovsd %xmm0, 8(%r15,%r13,8) #664.11 incq %r13 #657.9 cmpq 136(%rsp), %r13 #657.9 jb ..B10.80 # Prob 81% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 xmm1 ..B10.81: # Preds ..B10.80 movq 200(%rsp), %rbx # movq 264(%rsp), %rcx # movq 24(%rsp), %r12 # movq 32(%rsp), %r11 # movq 40(%rsp), %r13 # movq 48(%rsp), %rsi # movq 56(%rsp), %rdi # movq 64(%rsp), %r8 # movq 272(%rsp), %rdx # movq (%rsp), %r9 # movq 8(%rsp), %r10 # movq 16(%rsp), %rax # # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.82: # Preds ..B10.78 ..B10.81 movslq 168(%rsp), %r14 #657.9 movq %rbx, 200(%rsp) #657.9 movq %rcx, 264(%rsp) #657.9 movq %r12, 24(%rsp) #657.9 movq %r11, 32(%rsp) #657.9 movq %r13, 40(%rsp) #657.9 movq %rsi, 48(%rsp) #657.9 movq %rdi, 56(%rsp) #657.9 movq %r8, 64(%rsp) #657.9 movq %r14, 128(%rsp) #657.9 movq %rdx, 272(%rsp) #657.9 movq %r9, (%rsp) #657.9 movq %r10, 8(%rsp) #657.9 movq %rax, 16(%rsp) #657.9 vmovupd .L_2il0floatpacket.173(%rip), %ymm1 #664.38 movq 120(%rsp), %r12 #657.9 movq 112(%rsp), %r11 #657.9 movq 248(%rsp), %r10 #657.9 movq 80(%rsp), %r14 #657.9 movq 88(%rsp), %r9 #657.9 movq 96(%rsp), %rax #657.9 movq 104(%rsp), %rdx #657.9 movq 240(%rsp), %rcx #657.9 movq 232(%rsp), %rbx #657.9 movq 224(%rsp), %rsi #657.9 movq 216(%rsp), %rdi #657.9 movq 136(%rsp), %r13 #657.9 movq 72(%rsp), %r8 #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 ymm1 ..B10.83: # Preds ..B10.83 ..B10.82 vmovupd 8(%r10,%r13,8), %xmm2 #659.28 vmovupd 8(%r11,%r13,8), %xmm3 #659.40 vmovupd 8(%r9,%r13,8), %xmm6 
#660.28 vmovupd 8(%r12,%r13,8), %xmm7 #660.40 vmovupd 8(%rdx,%r13,8), %xmm12 #661.28 movq 304(%rsp), %r15 #661.40 vmovupd 8(%r15,%r13,8), %xmm13 #661.40 vinsertf128 $1, 24(%r10,%r13,8), %ymm2, %ymm4 #659.28 vinsertf128 $1, 24(%r11,%r13,8), %ymm3, %ymm5 #659.40 vinsertf128 $1, 24(%r9,%r13,8), %ymm6, %ymm8 #660.28 vinsertf128 $1, 24(%r12,%r13,8), %ymm7, %ymm9 #660.40 vmulpd %ymm5, %ymm4, %ymm10 #659.38 vmulpd %ymm9, %ymm8, %ymm11 #660.38 vaddpd %ymm11, %ymm10, %ymm0 #660.26 vinsertf128 $1, 24(%rdx,%r13,8), %ymm12, %ymm14 #661.28 vinsertf128 $1, 24(%r15,%r13,8), %ymm13, %ymm15 #661.40 vmulpd %ymm15, %ymm14, %ymm2 #661.38 vaddpd %ymm2, %ymm0, %ymm3 #661.26 vmulpd 8(%r8,%r13,8), %ymm3, %ymm4 #658.11 movq 176(%rsp), %r15 #658.11 vmovupd %xmm4, 8(%r15,%r13,8) #658.11 vextractf128 $1, %ymm4, 24(%r15,%r13,8) #658.11 movq 296(%rsp), %r15 #665.40 vmovupd 8(%rbx,%r13,8), %xmm5 #665.28 vmovupd 8(%r15,%r13,8), %xmm6 #665.40 vmovupd 8(%r14,%r13,8), %xmm9 #666.28 vmovupd 8(%rsi,%r13,8), %xmm15 #667.28 vmulpd 8(%r8,%r13,8), %ymm1, %ymm0 #664.38 vinsertf128 $1, 24(%r15,%r13,8), %ymm6, %ymm8 #665.40 movq 288(%rsp), %r15 #666.40 vmovupd 8(%r15,%r13,8), %xmm10 #666.40 vinsertf128 $1, 24(%rbx,%r13,8), %ymm5, %ymm7 #665.28 vmulpd %ymm8, %ymm7, %ymm13 #665.38 vmovupd 8(%rax,%r13,8), %xmm7 #668.28 vinsertf128 $1, 24(%r15,%r13,8), %ymm10, %ymm12 #666.40 movq 256(%rsp), %r15 #667.40 vmovupd 8(%r15,%r13,8), %xmm2 #667.40 vinsertf128 $1, 24(%r14,%r13,8), %ymm9, %ymm11 #666.28 vmulpd %ymm12, %ymm11, %ymm14 #666.38 vaddpd %ymm14, %ymm13, %ymm5 #666.26 vmovupd 8(%rdi,%r13,8), %xmm13 #669.28 vinsertf128 $1, 24(%r15,%r13,8), %ymm2, %ymm4 #667.40 movq 312(%rsp), %r15 #668.40 vmovupd 8(%r15,%r13,8), %xmm8 #668.40 vinsertf128 $1, 24(%rsi,%r13,8), %ymm15, %ymm3 #667.28 vmulpd %ymm4, %ymm3, %ymm6 #667.38 vaddpd %ymm6, %ymm5, %ymm11 #667.26 vmovupd 8(%rcx,%r13,8), %xmm5 #670.28 vinsertf128 $1, 24(%r15,%r13,8), %ymm8, %ymm10 #668.40 movq 160(%rsp), %r15 #669.40 vmovupd 8(%r15,%r13,8), %xmm14 #669.40 vinsertf128 $1, 24(%rax,%r13,8), %ymm7, %ymm9 #668.28 vmulpd %ymm10, %ymm9, %ymm12 #668.38 vaddpd %ymm12, %ymm11, %ymm3 #668.26 vinsertf128 $1, 24(%r15,%r13,8), %ymm14, %ymm2 #669.40 movq 152(%rsp), %r15 #670.40 vmovupd 8(%r15,%r13,8), %xmm6 #670.40 vinsertf128 $1, 24(%rdi,%r13,8), %ymm13, %ymm13 #669.28 vmulpd %ymm2, %ymm13, %ymm4 #669.38 vaddpd %ymm4, %ymm3, %ymm9 #669.26 vinsertf128 $1, 24(%rcx,%r13,8), %ymm5, %ymm7 #670.28 vinsertf128 $1, 24(%r15,%r13,8), %ymm6, %ymm8 #670.40 vmulpd %ymm8, %ymm7, %ymm10 #670.38 vaddpd %ymm10, %ymm9, %ymm11 #670.26 vmulpd %ymm11, %ymm0, %ymm0 #664.11 movq 144(%rsp), %r15 #664.11 vmovupd %xmm0, 8(%r15,%r13,8) #664.11 vextractf128 $1, %ymm0, 24(%r15,%r13,8) #664.11 addq $4, %r13 #657.9 cmpq 128(%rsp), %r13 #657.9 jb ..B10.83 # Prob 81% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 ymm1 ..B10.84: # Preds ..B10.83 movq 200(%rsp), %rbx # movq 264(%rsp), %rcx # movq 24(%rsp), %r12 # movq 32(%rsp), %r11 # movq 40(%rsp), %r13 # movq 48(%rsp), %rsi # movq 56(%rsp), %rdi # movq 64(%rsp), %r8 # movq 272(%rsp), %rdx # movq (%rsp), %r9 # movq 8(%rsp), %r10 # movq 16(%rsp), %rax # # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.85: # Preds ..B10.84 ..B10.145 movl 168(%rsp), %r14d #657.9 movq $0, 176(%rsp) #657.9 lea 1(%r14), %r15d #657.9 movslq %r15d, %r15 #657.9 cmpq %r15, %rcx #657.9 jb ..B10.101 # Prob 3% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.86: # Preds ..B10.85 movq %rdx, 272(%rsp) # movq %r10, %rdx # imulq %r9, %rdx # movq %r9, (%rsp) # lea 
(,%rax,8), %r14 # negq %r14 # lea (%r10,%r10,2), %r9 #670.28 addq %rbx, %r14 # movq %rbx, 200(%rsp) # movslq 168(%rsp), %r15 # movq %r13, 40(%rsp) # imulq %r11, %r13 # movq %r8, 64(%rsp) # lea (%r14,%r9), %rbx # subq %rdx, %rbx # lea (%r14,%r9,2), %r9 # subq %rdx, %r9 # imulq %rdi, %r8 # movq %rcx, 264(%rsp) # lea (%rbx,%r15,8), %rbx # movq %rbx, 224(%rsp) # lea (%r9,%r15,8), %rbx # movq %rbx, 232(%rsp) # lea (%r14,%r10,2), %r9 # subq %rdx, %r9 # movq %r12, 24(%rsp) # movq %r11, 32(%rsp) # movq %rsi, 48(%rsp) # movq %rdi, 56(%rsp) # lea (%r9,%r15,8), %rbx # movq %rbx, 240(%rsp) # lea (%r10,%r10,4), %r9 #667.28 addq %r14, %r9 # subq %rdx, %r9 # movq %r10, 8(%rsp) # movq %rax, 16(%rsp) # vmovsd .L_2il0floatpacket.171(%rip), %xmm1 #664.38 lea (%r9,%r15,8), %rbx # movq %rbx, 248(%rsp) # lea (%r10,%r14), %r9 # subq %rdx, %r9 # lea (%r9,%r15,8), %rbx # movq %rbx, 256(%rsp) # lea (%r14,%r10,4), %r9 # subq %rdx, %r9 # lea (%r9,%r15,8), %rbx # movq %rbx, 288(%rsp) # lea (%r10,%r10,8), %r9 #661.28 addq %r14, %r9 # subq %rdx, %r9 # lea (%r9,%r15,8), %rbx # movq %rbx, 72(%rsp) # lea (%r14,%r10,8), %r9 # subq %rdx, %r9 # lea (%r9,%r15,8), %rbx # movq %rbx, 80(%rsp) # lea (,%r10,8), %r9 #659.28 subq %r10, %r9 #659.28 addq %r9, %r14 # movq %rcx, %r9 #657.9 subq %rdx, %r14 # subq %r15, %r9 #657.9 movq %r9, 304(%rsp) #657.9 lea (%r14,%r15,8), %rdx # movq 408(%rsp), %r14 # movq %rdx, 296(%rsp) # lea (,%r14,8), %rbx # negq %rbx # addq 416(%rsp), %rbx # lea (%rbx,%r15,8), %rdx # movq %rdx, 88(%rsp) # lea (%rsi,%r11,8), %rdx # subq %r13, %rdx # lea (,%r12,8), %rbx # subq %rbx, %rdx # lea (%rdx,%r15,8), %r14 # movq %r14, 96(%rsp) # lea (,%r11,8), %rdx #669.40 subq %r11, %rdx #669.40 addq %rsi, %rdx # subq %r13, %rdx # subq %rbx, %rdx # lea (%rdx,%r15,8), %r9 # movq %r9, 104(%rsp) # movq 104(%rsp), %rcx # lea (%r11,%r11,4), %rdx #668.40 addq %rsi, %rdx # subq %r13, %rdx # subq %rbx, %rdx # lea (%rdx,%r15,8), %r14 # movq %r14, 112(%rsp) # lea (%rsi,%r11,4), %rdx # subq %r13, %rdx # subq %rbx, %rdx # lea (%rdx,%r15,8), %r9 # movq %r9, 120(%rsp) # movq 120(%rsp), %rax # lea (%rsi,%r11,2), %rdx # subq %r13, %rdx # subq %rbx, %rdx # lea (%rdx,%r15,8), %r14 # movq %r14, 128(%rsp) # lea (%r11,%rsi), %rdx # subq %r13, %rdx # subq %rbx, %rdx # lea (%rdx,%r15,8), %r9 # movq %r9, 136(%rsp) # lea (%r11,%r11,8), %rdx #661.40 addq %rsi, %rdx # subq %r13, %rdx # subq %rbx, %rdx # lea (%rdx,%r15,8), %r14 # movq %r14, 144(%rsp) # movq 144(%rsp), %r10 # lea (%r11,%r11,2), %r14 #660.40 lea (%rsi,%r14,2), %rdx # addq %rsi, %r14 # subq %r13, %rdx # subq %r13, %r14 # subq %rbx, %rdx # subq %rbx, %r14 # movq 88(%rsp), %rsi # lea (%rdx,%r15,8), %r9 # movq 184(%rsp), %rdx # movq %r9, 152(%rsp) # lea (%r14,%r15,8), %r13 # movq %r13, 312(%rsp) # movq 192(%rsp), %r13 # lea (,%rdx,8), %r14 # movq 152(%rsp), %r11 # lea (%rdi,%rdi,2), %rdx #658.11 addq %r13, %rdx # subq %r8, %rdx # lea (%r13,%rdi,4), %rbx # subq %r8, %rbx # subq %r14, %rdx # subq %r14, %rbx # movq 80(%rsp), %rdi # lea (%rdx,%r15,8), %r14 # lea (%rbx,%r15,8), %r9 # movq %r9, 160(%rsp) # movq %r14, 216(%rsp) # movq 160(%rsp), %r12 # movq 136(%rsp), %r14 # movq 128(%rsp), %r9 # movq 112(%rsp), %rdx # movq 96(%rsp), %rbx # movq 72(%rsp), %r8 # movq 176(%rsp), %r13 # # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 xmm1 ..B10.87: # Preds ..B10.87 ..B10.86 movq 296(%rsp), %r15 #659.28 vmovsd 8(%rdi,%r13,8), %xmm3 #660.28 vmulsd 8(%r11,%r13,8), %xmm3, %xmm5 #660.38 vmovsd 8(%r15,%r13,8), %xmm2 #659.28 movq 312(%rsp), %r15 #659.38 vmovsd 8(%r8,%r13,8), %xmm6 #661.28 vmulsd 
8(%r10,%r13,8), %xmm6, %xmm8 #661.38 vmulsd 8(%r15,%r13,8), %xmm2, %xmm4 #659.38 movq 216(%rsp), %r15 #658.11 vaddsd %xmm5, %xmm4, %xmm7 #660.26 vaddsd %xmm8, %xmm7, %xmm9 #661.26 vmulsd 8(%rsi,%r13,8), %xmm9, %xmm10 #658.11 vmovsd %xmm10, 8(%r15,%r13,8) #658.11 movq 288(%rsp), %r15 #665.28 vmulsd 8(%rsi,%r13,8), %xmm1, %xmm0 #664.38 vmovsd 8(%r15,%r13,8), %xmm11 #665.28 movq 256(%rsp), %r15 #666.28 vmulsd 8(%r14,%r13,8), %xmm11, %xmm13 #665.38 vmovsd 8(%r15,%r13,8), %xmm12 #666.28 vmulsd 8(%r9,%r13,8), %xmm12, %xmm14 #666.38 movq 248(%rsp), %r15 #667.28 vaddsd %xmm14, %xmm13, %xmm2 #666.26 vmovsd 8(%r15,%r13,8), %xmm15 #667.28 vmulsd 8(%rax,%r13,8), %xmm15, %xmm3 #667.38 movq 240(%rsp), %r15 #668.28 vaddsd %xmm3, %xmm2, %xmm5 #667.26 vmovsd 8(%r15,%r13,8), %xmm4 #668.28 vmulsd 8(%rdx,%r13,8), %xmm4, %xmm6 #668.38 movq 232(%rsp), %r15 #669.28 vaddsd %xmm6, %xmm5, %xmm8 #668.26 vmovsd 8(%r15,%r13,8), %xmm7 #669.28 vmulsd 8(%rcx,%r13,8), %xmm7, %xmm9 #669.38 movq 224(%rsp), %r15 #670.28 vaddsd %xmm9, %xmm8, %xmm11 #669.26 vmovsd 8(%r15,%r13,8), %xmm10 #670.28 vmulsd 8(%rbx,%r13,8), %xmm10, %xmm12 #670.38 vaddsd %xmm12, %xmm11, %xmm13 #670.26 vmulsd %xmm13, %xmm0, %xmm0 #664.11 vmovsd %xmm0, 8(%r12,%r13,8) #664.11 incq %r13 #657.9 cmpq 304(%rsp), %r13 #657.9 jb ..B10.87 # Prob 81% #657.9 jmp ..B10.100 # Prob 100% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 xmm1 ..B10.89: # Preds ..B10.69 ..B10.70 ..B10.71 cmpq $0, 208(%rsp) #657.9 je ..B10.156 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.90: # Preds ..B10.89 cmpq $0, 424(%rsp) #657.9 je ..B10.156 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.91: # Preds ..B10.90 testq %rdx, %rdx #657.9 je ..B10.156 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.92: # Preds ..B10.72 ..B10.91 cmpq $0, 280(%rsp) #657.9 je ..B10.156 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.93: # Preds ..B10.92 cmpq $2, %rcx #657.9 jl ..B10.156 # Prob 10% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 esi edi r8d r11d r12d r13d sil dil r8b r11b r12b r13b ..B10.94: # Preds ..B10.93 movq 416(%rsp), %r14 # xorl %r15d, %r15d #657.9 movq %rcx, 264(%rsp) # andl $-2, %ecx #657.9 movq %rdx, 272(%rsp) # movslq %ecx, %rcx #657.9 movq %rcx, 320(%rsp) #657.9 lea (%r14,%rdx), %rcx # imulq 408(%rsp), %rdx # imulq %r11, %r13 # imulq %rdi, %r8 # subq %rdx, %rcx # movq 424(%rsp), %rdx # movq %rbx, 200(%rsp) # movq %r9, (%rsp) # movq %rcx, 72(%rsp) # lea (%rbx,%rdx), %r14 # movq %r10, %rbx # lea (%r10,%r10,2), %rcx #670.28 imulq %rax, %rdx # imulq %r9, %rbx # movq %r10, %r9 # subq %rdx, %r14 # subq %rbx, %r9 # lea (%r10,%r10), %rdx #668.28 addq %r14, %r9 # subq %rbx, %rdx # movq %r9, 144(%rsp) # movq %rcx, %r9 # subq %rbx, %r9 # addq %r14, %rdx # addq %r14, %r9 # addq %rcx, %rcx #669.28 movq %r9, 128(%rsp) # lea (%r10,%r10,4), %r9 #667.28 subq %rbx, %r9 # subq %rbx, %rcx # addq %r14, %r9 # addq %r14, %rcx # movq %r9, 112(%rsp) # movq 208(%rsp), %r9 # imulq %r9, %r12 # movq %rdx, 136(%rsp) # lea (,%r10,4), %rdx #665.28 subq %rbx, %rdx # subq %r9, %r12 # addq %r14, %rdx # lea (%r11,%r11,2), %r9 #659.40 movq %rdx, 120(%rsp) # lea (,%r10,8), %rdx #659.28 movq %rcx, 160(%rsp) # movq %rdx, %rcx #659.28 subq %r10, 
%rcx #659.28 subq %rbx, %rdx # subq %rbx, %rcx # addq %r14, %rdx # addq %r14, %rcx # movq %rcx, 152(%rsp) # lea (%rsi,%r9), %rcx # movq %rdx, 328(%rsp) # lea (%rsi,%r9,2), %rdx # subq %r13, %rdx # lea (%r10,%r10,8), %r9 #661.28 subq %rbx, %r9 # lea (%r11,%r11,8), %rbx #661.40 addq %rsi, %rbx # subq %r12, %rdx # subq %r13, %rbx # subq %r13, %rcx # subq %r12, %rbx # subq %r12, %rcx # movq %rdx, 336(%rsp) # addq %r9, %r14 # movq %rbx, 376(%rsp) # movq 280(%rsp), %rdx # movq 184(%rsp), %rbx # imulq %rdx, %rbx # movq 192(%rsp), %r9 # subq %rdx, %rbx # movq %rcx, 104(%rsp) # lea (%rdi,%rdi,2), %rcx #658.11 addq %r9, %rcx # lea (%r11,%rsi), %rdx # subq %r13, %rdx # subq %r8, %rcx # subq %r12, %rdx # subq %rbx, %rcx # movq %rdx, 360(%rsp) # lea (%rsi,%r11,4), %rdx # movq %rcx, 80(%rsp) # lea (%rsi,%r11,2), %rcx # subq %r13, %rdx # subq %r13, %rcx # subq %r12, %rdx # subq %r12, %rcx # movq %rdx, 344(%rsp) # lea (,%r11,8), %rdx #669.40 movq %rcx, 352(%rsp) # lea (%r11,%r11,4), %rcx #668.40 subq %r11, %rdx #669.40 addq %rsi, %rcx # addq %rsi, %rdx # lea (%rsi,%r11,8), %rsi # subq %r13, %rcx # lea (%r9,%rdi,4), %r11 # subq %r8, %r11 # xorl %edi, %edi # subq %r13, %rdx # subq %r13, %rsi # movq %r15, 88(%rsp) #657.9 subq %rbx, %r11 # xorl %r8d, %r8d # subq %r12, %rcx # movq %rcx, 368(%rsp) # subq %r12, %rdx # movq %rdx, 384(%rsp) # subq %r12, %rsi # movq %r10, 8(%rsp) # movq %rax, 16(%rsp) # vmovupd .L_2il0floatpacket.172(%rip), %xmm2 #664.38 movq %rsi, 392(%rsp) # xorl %esi, %esi # movq 272(%rsp), %rdx # movq 208(%rsp), %r10 # movq 280(%rsp), %r9 # movq 80(%rsp), %rax # movq 72(%rsp), %rcx # movq 88(%rsp), %r12 # movq 424(%rsp), %rbx # movq %r11, 96(%rsp) # # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r12 r14 r15 xmm2 ..B10.95: # Preds ..B10.95 ..B10.94 movq 152(%rsp), %r13 #659.28 lea (%rcx,%r15), %r11 #658.28 vmovsd (%r11), %xmm3 #658.28 addq $2, %r12 #657.9 vmovhpd (%r11,%rdx), %xmm3, %xmm1 #658.28 lea (%r15,%rdx,2), %r15 #657.9 addq %rdi, %r13 #659.28 vmovsd (%r13), %xmm4 #659.28 vmovhpd (%r13,%rbx), %xmm4, %xmm6 #659.28 movq 104(%rsp), %r13 #659.40 addq %r8, %r13 #659.40 vmovsd (%r13), %xmm5 #659.40 vmovhpd (%r13,%r10), %xmm5, %xmm7 #659.40 movq 328(%rsp), %r13 #660.28 vmulpd %xmm7, %xmm6, %xmm12 #659.38 addq %rdi, %r13 #660.28 vmovsd (%r13), %xmm8 #660.28 vmovhpd (%r13,%rbx), %xmm8, %xmm10 #660.28 movq 336(%rsp), %r13 #660.40 addq %r8, %r13 #660.40 vmovsd (%r13), %xmm9 #660.40 vmovhpd (%r13,%r10), %xmm9, %xmm11 #660.40 lea (%r14,%rdi), %r13 #661.28 vmovsd (%r13), %xmm14 #661.28 vmovhpd (%r13,%rbx), %xmm14, %xmm0 #661.28 movq 376(%rsp), %r13 #661.40 vmulpd %xmm11, %xmm10, %xmm13 #660.38 vaddpd %xmm13, %xmm12, %xmm3 #660.26 addq %r8, %r13 #661.40 vmovsd (%r13), %xmm15 #661.40 vmovhpd (%r13,%r10), %xmm15, %xmm14 #661.40 lea (%rax,%rsi), %r13 #658.11 vmulpd %xmm14, %xmm0, %xmm4 #661.38 vaddpd %xmm4, %xmm3, %xmm5 #661.26 vmulpd %xmm5, %xmm1, %xmm1 #658.11 vmovlpd %xmm1, (%r13) #658.11 vmovhpd %xmm1, (%r13,%r9) #658.11 vmovsd (%r11), %xmm6 #664.28 vmovhpd (%r11,%rdx), %xmm6, %xmm7 #664.28 movq 120(%rsp), %r11 #665.28 movq 360(%rsp), %r13 #665.40 vmulpd %xmm7, %xmm2, %xmm0 #664.38 addq %rdi, %r11 #665.28 vmovsd (%r11), %xmm8 #665.28 vmovhpd (%r11,%rbx), %xmm8, %xmm10 #665.28 lea (%r13,%r8), %r11 #665.40 movq 144(%rsp), %r13 #666.28 .byte 15 #665.40 .byte 31 #665.40 .byte 0 #665.40 vmovsd (%r11), %xmm9 #665.40 vmovhpd (%r11,%r10), %xmm9, %xmm11 #665.40 vmulpd %xmm11, %xmm10, %xmm1 #665.38 lea (%r13,%rdi), %r11 #666.28 movq 352(%rsp), %r13 #666.40 vmovsd (%r11), %xmm12 #666.28 vmovhpd (%r11,%rbx), %xmm12, 
%xmm15 #666.28 lea (%r13,%r8), %r11 #666.40 movq 112(%rsp), %r13 #667.28 vmovsd (%r11), %xmm13 #666.40 vmovhpd (%r11,%r10), %xmm13, %xmm12 #666.40 vmulpd %xmm12, %xmm15, %xmm3 #666.38 vaddpd %xmm3, %xmm1, %xmm8 #666.26 lea (%r13,%rdi), %r11 #667.28 movq 344(%rsp), %r13 #667.40 vmovsd (%r11), %xmm4 #667.28 vmovhpd (%r11,%rbx), %xmm4, %xmm6 #667.28 lea (%r13,%r8), %r11 #667.40 movq 136(%rsp), %r13 #668.28 vmovsd (%r11), %xmm5 #667.40 vmovhpd (%r11,%r10), %xmm5, %xmm7 #667.40 vmulpd %xmm7, %xmm6, %xmm9 #667.38 vaddpd %xmm9, %xmm8, %xmm15 #667.26 lea (%r13,%rdi), %r11 #668.28 movq 368(%rsp), %r13 #668.40 vmovsd (%r11), %xmm10 #668.28 vmovhpd (%r11,%rbx), %xmm10, %xmm13 #668.28 lea (%r13,%r8), %r11 #668.40 movq 160(%rsp), %r13 #669.28 vmovsd (%r11), %xmm11 #668.40 vmovhpd (%r11,%r10), %xmm11, %xmm14 #668.40 vmulpd %xmm14, %xmm13, %xmm1 #668.38 vaddpd %xmm1, %xmm15, %xmm7 #668.26 lea (%r13,%rdi), %r11 #669.28 movq 384(%rsp), %r13 #669.40 vmovsd (%r11), %xmm3 #669.28 vmovhpd (%r11,%rbx), %xmm3, %xmm5 #669.28 lea (%r13,%r8), %r11 #669.40 movq 128(%rsp), %r13 #670.28 vmovsd (%r11), %xmm4 #669.40 vmovhpd (%r11,%r10), %xmm4, %xmm6 #669.40 vmulpd %xmm6, %xmm5, %xmm8 #669.38 vaddpd %xmm8, %xmm7, %xmm13 #669.26 lea (%r13,%rdi), %r11 #670.28 movq 392(%rsp), %r13 #670.40 vmovsd (%r11), %xmm9 #670.28 lea (%rdi,%rbx,2), %rdi #657.9 vmovhpd (%r11,%rbx), %xmm9, %xmm11 #670.28 lea (%r13,%r8), %r11 #670.40 .byte 144 #664.11 movq 96(%rsp), %r13 #664.11 vmovsd (%r11), %xmm10 #670.40 lea (%r8,%r10,2), %r8 #657.9 vmovhpd (%r11,%r10), %xmm10, %xmm12 #670.40 vmulpd %xmm12, %xmm11, %xmm14 #670.38 vaddpd %xmm14, %xmm13, %xmm15 #670.26 vmulpd %xmm15, %xmm0, %xmm0 #664.11 lea (%r13,%rsi), %r11 #664.11 vmovlpd %xmm0, (%r11) #664.11 lea (%rsi,%r9,2), %rsi #657.9 vmovhpd %xmm0, (%r11,%r9) #664.11 cmpq 320(%rsp), %r12 #657.9 jb ..B10.95 # Prob 81% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r12 r14 r15 xmm2 ..B10.96: # Preds ..B10.95 movq 432(%rsp), %r11 #675.11 movq 400(%rsp), %r14 #676.40 movq 200(%rsp), %rbx # movq (%r11), %rdi #675.11 movq 64(%r11), %rsi #675.11 movq 56(%r11), %r12 #675.11 movq 56(%r14), %r15 #676.40 movq %rdi, 192(%rsp) #675.11 movq 88(%r11), %r8 #675.11 movq 80(%r11), %rdi #675.11 movq %rsi, 184(%rsp) #675.11 movq %r12, 280(%rsp) #675.11 movq 264(%rsp), %rcx # movq (%rsp), %r9 # movq 8(%rsp), %r10 # movq 16(%rsp), %rax # movq (%r14), %rsi #676.40 movq 64(%r14), %r12 #676.40 movq 88(%r14), %r13 #676.40 movq 80(%r14), %r11 #676.40 movq %r15, 208(%rsp) #676.40 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.97: # Preds ..B10.96 ..B10.156 movq 296(%rsp), %r15 # movq 320(%rsp), %r14 # imulq %r14, %r15 # movq %r15, 152(%rsp) # movq 424(%rsp), %r15 # imulq %r14, %r15 # movq %r15, 144(%rsp) # movq %rdx, %r15 # imulq %r14, %r15 # movq %r15, 160(%rsp) # movq 288(%rsp), %r15 # imulq %r14, %r15 # movq %r15, 136(%rsp) # cmpq %rcx, %r14 #657.9 jae ..B10.101 # Prob 3% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.98: # Preds ..B10.97 movq %r12, 24(%rsp) # movq %rdx, %r12 # imulq 408(%rsp), %r12 # movq 416(%rsp), %r14 # lea (,%r10,8), %r15 #659.28 movq %r13, 40(%rsp) # movq %rbx, 200(%rsp) # movq %rdi, 56(%rsp) # lea (%r14,%rdx), %r13 # subq %r12, %r13 # lea (%r10,%r10,4), %rdi #667.28 movq 424(%rsp), %r12 # addq %r13, 160(%rsp) # movq %r10, %r13 # imulq %r9, %r13 # movq %rcx, 264(%rsp) # lea (%rbx,%r12), %r14 # imulq %rax, %r12 # movq %r12, %rbx # movq %rdi, %rcx #667.28 subq %r10, %rbx # negq %rcx #667.28 negq %rbx # addq %r12, %rcx # addq %r14, %rbx # negq %rcx # subq 
%r13, %rbx # addq %r14, %rcx # movq %rsi, 48(%rsp) # lea (%r10,%r10), %rsi #668.28 movq %r11, 32(%rsp) # lea (%r10,%r10,2), %r11 #670.28 movq %rbx, 336(%rsp) # movq %r10, %rbx #659.28 negq %rsi # negq %r11 #670.28 subq %r15, %rbx #659.28 addq %r12, %rsi # addq %r12, %r11 # addq %r12, %rbx # negq %rsi # negq %r11 # negq %rbx # addq %r14, %rsi # addq %r14, %r11 # addq %r14, %rbx # subq %r13, %rcx # subq %r13, %rsi # movq %rcx, 80(%rsp) # subq %r13, %r11 # movq 176(%rsp), %rcx # subq %r13, %rbx # movq %rsi, 104(%rsp) # addq %r10, %rdi #669.28 movq %r11, 96(%rsp) # negq %rdi #669.28 movq %rbx, 72(%rsp) # addq %r12, %rdi # movq 216(%rsp), %rsi # negq %rdi # movq 168(%rsp), %r11 # negq %r15 # movq 296(%rsp), %rbx # addq %r14, %rdi # imulq %rcx, %rsi # imulq %rbx, %r11 # movq %r8, 64(%rsp) # lea (,%r10,4), %r8 #665.28 negq %r8 # addq %r12, %r15 # addq %r12, %r8 # subq %r13, %rdi # movq %rdi, 112(%rsp) # negq %r8 # movq 224(%rsp), %rdi # negq %r15 # addq %r14, %r8 # addq %r14, %r15 # addq %r11, %rsi # lea (%rcx,%rcx,2), %r11 #659.40 subq %r13, %r8 # subq %r13, %r15 # movq %r8, 88(%rsp) # lea (%rdi,%r11), %r8 # movq %r15, 128(%rsp) # lea (%rdi,%r11,2), %r15 # subq %rsi, %r8 # lea (%r10,%r10,8), %r11 #661.28 negq %r11 #661.28 addq %rbx, %r8 # addq %r11, %r12 # subq %rsi, %r15 # subq %r12, %r14 # lea (%rcx,%rcx,8), %r12 #661.40 addq %rdi, %r12 # subq %r13, %r14 # subq %rsi, %r12 # addq %rbx, %r15 # movq 240(%rsp), %r11 # addq %rbx, %r12 # movq %r8, 120(%rsp) # movq %r12, 352(%rsp) # movq 248(%rsp), %r12 # movq 232(%rsp), %r8 # movq 288(%rsp), %r13 # imulq %r11, %r12 # imulq %r13, %r8 # movq %r15, 344(%rsp) # addq %r8, %r12 # movq 256(%rsp), %r8 # lea (%r11,%r11,2), %r15 #658.11 addq %r8, %r15 # subq %r12, %r15 # addq %r13, %r15 # movq %r15, 392(%rsp) # lea (%rcx,%rdi), %r15 # subq %rsi, %r15 # addq %rbx, %r15 # movq %r15, 376(%rsp) # lea (%rdi,%rcx,2), %r15 # subq %rsi, %r15 # addq %rbx, %r15 # movq %r15, 368(%rsp) # lea (%rdi,%rcx,4), %r15 # subq %rsi, %r15 # addq %rbx, %r15 # movq %r15, 360(%rsp) # lea (%rcx,%rcx,4), %r15 #668.40 addq %rdi, %r15 # subq %rsi, %r15 # addq %rbx, %r15 # movq %r15, 384(%rsp) # lea (,%rcx,8), %r15 #669.40 subq %rcx, %r15 #669.40 lea (%rdi,%rcx,8), %rcx # addq %rdi, %r15 # subq %rsi, %rcx # subq %rsi, %r15 # lea (%r8,%r11,4), %rsi # subq %r12, %rsi # addq %rbx, %r15 # addq %rcx, %rbx # addq %rsi, %r13 # movq %rbx, 296(%rsp) # movq %r13, 288(%rsp) # movq %r14, 328(%rsp) #664.38 movq %rdx, 272(%rsp) #664.38 movq %r9, (%rsp) #664.38 movq %r10, 8(%rsp) #664.38 movq %rax, 16(%rsp) #664.38 movq %r15, 400(%rsp) # vmovsd .L_2il0floatpacket.171(%rip), %xmm1 #664.38 movq 128(%rsp), %r12 #664.38 movq 120(%rsp), %r11 #664.38 movq 72(%rsp), %r10 #664.38 movq 112(%rsp), %r14 #664.38 movq 80(%rsp), %r9 #664.38 movq 88(%rsp), %rax #664.38 movq 96(%rsp), %rdx #664.38 movq 104(%rsp), %rcx #664.38 movq 160(%rsp), %rsi #664.38 movq 136(%rsp), %r13 #664.38 movq 144(%rsp), %rbx #664.38 movq 152(%rsp), %rdi #664.38 movq 320(%rsp), %r8 #664.38 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 xmm1 ..B10.99: # Preds ..B10.99 ..B10.98 movq 344(%rsp), %r15 #660.38 incq %r8 #657.9 vmovsd (%rbx,%r12), %xmm3 #660.28 vmovsd (%rbx,%r10), %xmm2 #659.28 vmulsd (%rdi,%r15), %xmm3, %xmm5 #660.38 vmulsd (%rdi,%r11), %xmm2, %xmm4 #659.38 movq 328(%rsp), %r15 #661.28 vaddsd %xmm5, %xmm4, %xmm7 #660.26 vmovsd (%rbx,%r15), %xmm6 #661.28 movq 352(%rsp), %r15 #661.38 vmulsd (%rdi,%r15), %xmm6, %xmm8 #661.38 movq 392(%rsp), %r15 #658.11 vaddsd %xmm8, %xmm7, %xmm9 #661.26 vmulsd (%rsi), %xmm9, %xmm10 
#658.11 vmovsd %xmm10, (%r13,%r15) #658.11 movq 376(%rsp), %r15 #665.38 vmovsd (%rbx,%rax), %xmm11 #665.28 vmovsd (%rbx,%r9), %xmm15 #667.28 vmulsd (%rdi,%r15), %xmm11, %xmm13 #665.38 vmulsd (%rsi), %xmm1, %xmm0 #664.38 movq 336(%rsp), %r15 #666.28 vmovsd (%rbx,%rcx), %xmm4 #668.28 vmovsd (%rbx,%r14), %xmm7 #669.28 vmovsd (%rbx,%r15), %xmm12 #666.28 movq 368(%rsp), %r15 #666.38 vmovsd (%rbx,%rdx), %xmm10 #670.28 addq 272(%rsp), %rsi #657.9 vmulsd (%rdi,%r15), %xmm12, %xmm14 #666.38 movq 360(%rsp), %r15 #667.38 vaddsd %xmm14, %xmm13, %xmm2 #666.26 vmulsd (%rdi,%r15), %xmm15, %xmm3 #667.38 movq 384(%rsp), %r15 #668.38 vaddsd %xmm3, %xmm2, %xmm5 #667.26 vmulsd (%rdi,%r15), %xmm4, %xmm6 #668.38 movq 400(%rsp), %r15 #669.38 vaddsd %xmm6, %xmm5, %xmm8 #668.26 vmulsd (%rdi,%r15), %xmm7, %xmm9 #669.38 movq 296(%rsp), %r15 #670.38 vaddsd %xmm9, %xmm8, %xmm11 #669.26 vmulsd (%rdi,%r15), %xmm10, %xmm12 #670.38 movq 288(%rsp), %r15 #664.11 vaddsd %xmm12, %xmm11, %xmm13 #670.26 vmulsd %xmm13, %xmm0, %xmm0 #664.11 vmovsd %xmm0, (%r13,%r15) #664.11 addq 312(%rsp), %r13 #657.9 .byte 144 #657.9 addq 424(%rsp), %rbx #657.9 addq 304(%rsp), %rdi #657.9 cmpq 264(%rsp), %r8 #657.9 jb ..B10.99 # Prob 81% #657.9 # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 r14 xmm1 ..B10.100: # Preds ..B10.87 ..B10.99 movq 200(%rsp), %rbx # movq 264(%rsp), %rcx # movq 24(%rsp), %r12 # movq 32(%rsp), %r11 # movq 40(%rsp), %r13 # movq 48(%rsp), %rsi # movq 56(%rsp), %rdi # movq 64(%rsp), %r8 # movq 272(%rsp), %rdx # movq (%rsp), %r9 # movq 8(%rsp), %r10 # movq 16(%rsp), %rax # # LOE rax rdx rcx rbx rbp rsi rdi r8 r9 r10 r11 r12 r13 ..B10.101: # Preds ..B10.85 ..B10.100 ..B10.97 cmpq $8, %rdx #676.40
At asm lines 505-576 you have full AVX-256 vector code, where the memory access is split into AVX-128 chunks to allow for non-32-byte alignment without the extreme performance penalties which would be incurred on Sandy Bridge with unaligned AVX-256 moves. It's still possible for this to use the full bandwidth of your L2 cache and memory.
At 948-1061 you have another version which appears to allow for total mis-alignment, with memory access split into 64-bit scalar chunks. If the alignment is so bad, there's no point in packing into more than AVX-128 for the floating point operations.
Both versions have scalar remainder loops.
You would want to check at run time that most of the work is done in the AVX-256 loop. This is not difficult with oprofile or VTune.
Did you check opt-report to see if there is any remark about this versioning?
I'm with Tim. Use VTune to see whether the preponderance of operations occurs in the ..B10.83 loop.
You appear to have three different computation loops, any of which may be entered depending on alignment and counts. You also have a peel loop and two residual loops.
Jim Dempsey
Thanks for your reply.
I think the vectorization report suggests the main loop body is vectorized with AVX-256:
src/ModNavierStokesRHS.f90(657): (col. 9) remark: SIMD LOOP WAS VECTORIZED.
src/ModNavierStokesRHS.f90(657): (col. 9) remark: loop was not vectorized: unsupported data type.
src/ModNavierStokesRHS.f90(657): (col. 9) warning #13379: loop was not vectorized with "simd"
src/ModNavierStokesRHS.f90(657): (col. 9) remark: SIMD LOOP WAS VECTORIZED.
I also want to add that the loop extent Nc is quite large. In my current case Nc = 110592.
The report may say it is vectorized, but there are also three code paths. Using VTune will show whether the fully vectorized loop is executed the majority of the time (or whether it is not used that much).
Jim Dempsey
