Hi
I was attempting to write a Fortran equivalent of the code provided in this vectorization tutorial:
https://cvw.cac.cornell.edu/vector/exercise_1
My attempt at a Fortran equivalent is below:
#define ARRAY_SIZE 1024
#define NUMBER_OF_TRIALS 1000000

module array
    !dir$ attributes align:64 :: a, b, c
    double precision, dimension(ARRAY_SIZE) :: a, b, c
end module

program main
    use array
    implicit none
    integer i, t
    double precision :: m = 1.0001

    ! Populate A and B arrays
    do i=1, ARRAY_SIZE
        b(i) = i
        a(i) = i + 1
    end do

    ! Perform an operation a number of times
    do t = 1, NUMBER_OF_TRIALS
        do i = 1, ARRAY_SIZE
            c(i) = c(i) + m*a(i) + b(i)
        end do
        ! I have also tried a version with this line
        ! c = c + m * a + b
    end do
end program
My question(s):
0) Is this code a near-equivalent of the 'simple.c' version in the link above?
1) Any further optimization recommendations?
2) The C version appears to always be faster, and I believe this is related to compound assignment (with '+='). Here's some reasoning: http://programmers.stackexchange.com/a/134136/69046
c[i] += m*a[i] + b[i];
Am I off base in thinking that this is the primary difference between the C and Fortran versions?
One issue you may have is that the compiler optimizer may notice that the results placed into array c are not used following the loop. That optimization may result in elimination of array c and of both loops (the initialization loop and the timed loop). To correct for this, insert something innocuous after the timed loop (outside the timed section), such as
IF(ISNAN(SUM(C))) PRINT *,"Found NaN"
If your input data are "orderly", the IF test won't print anything. If a NaN does turn up in array c, fix the input data so that it does not generate NaNs.
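For example, a minimal sketch of where such a check could go, reusing the loops from the original post:

    ! Timed section
    do t = 1, NUMBER_OF_TRIALS
        do i = 1, ARRAY_SIZE
            c(i) = c(i) + m*a(i) + b(i)
        end do
    end do
    ! Innocuous use of c, outside the timed section, so the optimizer
    ! cannot discard the array and its loops as dead code.
    if (isnan(sum(c))) print *, "Found NaN"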
Jim Dempsey
Thanks Jim,
I inserted the extra line, and it does not affect the compiler optimization.
Furthermore (for everyone), I should note that my timings are about identical between C and Fortran when I do use
c[i] = c[i] + m*a[i] + b[i];
C (no compound assignment) / Fortran:
simple_no_vec: 0.88
simple:        0.35
simple_avx:    0.23
C (compound assignment) only:
simple_no_vec: 0.69
simple:        0.16
simple_avx:    0.08
If you expect an advantage from the order of operations implied by the += operator, you can do the same thing with parentheses in Fortran. Both Intel Fortran and C will ignore the specified order, including the one implied by +=, unless you set the protect-parens option (included in ifort -standard-semantics).
As Jim pointed out, you can't draw any useful conclusions from your example.
In a realistic, similar situation you might have a point, due to the shortage of pointer registers in 32-bit mode.
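For reference, a minimal sketch of the parenthesized Fortran loop (the grouping is only honored when parentheses are protected, e.g. with the -standard-semantics option mentioned above):

    ! Groups m*a(i) + b(i) before adding it to c(i), mirroring the order
    ! of operations implied by C's compound assignment.
    do i = 1, ARRAY_SIZE
        c(i) = c(i) + (m*a(i) + b(i))
    end do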
The C program (Intel C++) should have produced the same assembly code for
c[i] += m*a[i] + b[i];
and
c[i] = c[i] + m*a[i] + b[i];
The fact that it differs "may" imply that one of the two used FMA and the other did not, though other instruction sequences may have been used.
A second factor may be that the loop unrolling differed. You will have to look at the disassembly (e.g., in VTune) to be sure of what the compiler did or did not do. (I assume the C program performed a 64-byte aligned allocation, as your Fortran example did.)
I caution you against drawing any lasting inferences about missed compiler optimization opportunities, since these are subject to change with any revision.
If you have a simple reproducer, please send it to Premier Support, including the compiler version and command-line options.
Jim Dempsey
Those two C versions are required to differ if standards-compliance options are set. It may be reasonable to assume no standards compliance, but then it is totally unreasonable to make generalizations about how Fortran and C compilers will act in general, particularly with respect to dead code.
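For concreteness, a minimal sketch of the grouping difference (illustrative values, not the benchmark data): C defines E1 op= E2 as E1 = E1 op (E2), so the compound form groups the sum as c + (m*a + b), while the spelled-out form groups it as (c + m*a) + b, and floating-point addition is not associative:

    program grouping_demo
        implicit none
        double precision :: c0, x, b0
        c0 = 1.0d0
        x  = 1.0d-16        ! stands in for m*a(i)
        b0 = 1.0d-16
        ! With parentheses protected (e.g. -standard-semantics) so the
        ! compiler does not reassociate, the two results differ in the
        ! last bit: the first rounds up to 1 + epsilon, the second stays 1.
        print *, c0 + (x + b0)   ! compound-assignment grouping
        print *, (c0 + x) + b0   ! left-to-right grouping
    end program grouping_demo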
Tim,
>>Those two C versions are required to differ if standards-compliance options are set.
Other than when FMA may or may not be used, what in the standards indicates that the two statements are not to be implemented equivalently?
Note: except when array c[:] is an array of atomics or volatiles, or where i is volatile or atomic, "c +=" and "c = c +" should be implementable equivalently.
Jim
Is there some reason this couldn't have been written entirely in Fortran, without the preprocessor #define lines? I translated it to something that would compile:
module array
    implicit none
    integer, parameter :: ARRAY_SIZE = 1024
    integer, parameter :: NUMBER_OF_TRIALS = 1000000
    !dir$ attributes align:64 :: a, b, c
    double precision, dimension(ARRAY_SIZE) :: a, b, c
end module

program main
    use array
    implicit none
    integer i, t
    double precision :: m = 1.0001d0

    ! Populate A and B arrays
    do i=1, ARRAY_SIZE
        b(i) = i
        a(i) = i + 1
    end do

    ! Perform an operation a number of times
    do t = 1, NUMBER_OF_TRIALS
        do i = 1, ARRAY_SIZE
            c(i) = c(i) + m*a(i) + b(i)
        end do
        ! I have also tried a version with this line
        ! c = c + m * a + b
    end do
end program
The inner loop looked like:
.B1.6::                         ; Preds .B1.6 .B1.5
        vmovupd     ymm1, YMMWORD PTR [imagerel(ARRAY_mp_A)+rcx+rdx*8]        ;26.29
        lea         r8, QWORD PTR [rcx+rdx*8]                                 ;26.13
        vmovupd     ymm3, YMMWORD PTR [imagerel(ARRAY_mp_A)+32+rcx+rdx*8]     ;26.29
        vmovupd     ymm5, YMMWORD PTR [imagerel(ARRAY_mp_A)+64+rcx+rdx*8]     ;26.29
        vfmadd213pd ymm1, ymm0, YMMWORD PTR [imagerel(ARRAY_mp_C)+rcx+rdx*8]    ;26.25
        vfmadd213pd ymm3, ymm0, YMMWORD PTR [imagerel(ARRAY_mp_C)+32+rcx+rdx*8] ;26.25
        vfmadd213pd ymm5, ymm0, YMMWORD PTR [imagerel(ARRAY_mp_C)+64+rcx+rdx*8] ;26.25
        vaddpd      ymm2, ymm1, YMMWORD PTR [imagerel(ARRAY_mp_B)+rcx+rdx*8]    ;26.13
        vaddpd      ymm4, ymm3, YMMWORD PTR [imagerel(ARRAY_mp_B)+32+rcx+rdx*8] ;26.13
        vaddpd      ymm1, ymm5, YMMWORD PTR [imagerel(ARRAY_mp_B)+64+rcx+rdx*8] ;26.13
        vmovupd     YMMWORD PTR [imagerel(ARRAY_mp_C)+r8], ymm2                 ;26.13
        vmovupd     ymm2, YMMWORD PTR [imagerel(ARRAY_mp_A)+96+rcx+rdx*8]       ;26.29
        vmovupd     YMMWORD PTR [imagerel(ARRAY_mp_C)+32+r8], ymm4              ;26.13
        vmovupd     YMMWORD PTR [imagerel(ARRAY_mp_C)+64+r8], ymm1              ;26.13
        vfmadd213pd ymm2, ymm0, YMMWORD PTR [imagerel(ARRAY_mp_C)+96+rcx+rdx*8] ;26.25
        vaddpd      ymm3, ymm2, YMMWORD PTR [imagerel(ARRAY_mp_B)+96+rcx+rdx*8] ;26.13
        add         rdx, 16                                                    ;25.9
        vmovupd     YMMWORD PTR [imagerel(ARRAY_mp_C)+96+r8], ymm3              ;26.13
        cmp         rdx, 1024                                                  ;25.9
        jb          .B1.6           ; Prob 99%                                 ;25.9
(Compiled with /QxCORE-AVX2.) It looks normal, with m in ymm0 and indeed an FMA. What does this inner loop look like from your C compiler?
Compile options: -O3 -S -fsource-asm -c
# mark_description "Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.3.187 Build 2"
###         ! Perform an operation a number of times
###         do t = 1, NUMBER_OF_TRIALS;
###         do i = 1, ARRAY_SIZE
###             c(i) = c(i) + m*a(i) + b(i)
        movaps    .L_2il0floatpacket.0(%rip), %xmm0              #27.20
        xorl      %eax, %eax                                     #25.5
                                # LOE rbx r12 r13 r14 r15 eax xmm0
..B1.5:                         # Preds ..B1.7 ..B1.4
        xorl      %edx, %edx                                     #26.9
                                # LOE rdx rbx r12 r13 r14 r15 eax xmm0
..B1.6:                         # Preds ..B1.6 ..B1.5
        movaps    array_mp_a_(,%rdx,8), %xmm1                    #27.29
        movaps    16+array_mp_a_(,%rdx,8), %xmm2                 #27.29
        movaps    32+array_mp_a_(,%rdx,8), %xmm3                 #27.29
        movaps    48+array_mp_a_(,%rdx,8), %xmm4                 #27.29
        mulpd     %xmm0, %xmm1                                   #27.28
        mulpd     %xmm0, %xmm2                                   #27.28
        mulpd     %xmm0, %xmm3                                   #27.28
        mulpd     %xmm0, %xmm4                                   #27.28
        addpd     array_mp_c_(,%rdx,8), %xmm1                    #27.25
        addpd     16+array_mp_c_(,%rdx,8), %xmm2                 #27.25
        addpd     32+array_mp_c_(,%rdx,8), %xmm3                 #27.25
        addpd     48+array_mp_c_(,%rdx,8), %xmm4                 #27.25
        addpd     array_mp_b_(,%rdx,8), %xmm1                    #27.13
        addpd     16+array_mp_b_(,%rdx,8), %xmm2                 #27.13
        addpd     32+array_mp_b_(,%rdx,8), %xmm3                 #27.13
        addpd     48+array_mp_b_(,%rdx,8), %xmm4                 #27.13
        movaps    %xmm1, array_mp_c_(,%rdx,8)                    #27.13
        movaps    %xmm2, 16+array_mp_c_(,%rdx,8)                 #27.13
        movaps    %xmm3, 32+array_mp_c_(,%rdx,8)                 #27.13
        movaps    %xmm4, 48+array_mp_c_(,%rdx,8)                 #27.13
        addq      $8, %rdx                                       #26.9
        cmpq      $1024, %rdx                                    #26.9
        jb        ..B1.6        # Prob 99%                       #26.9
                                # LOE rdx rbx r12 r13 r14 r15 eax xmm0
..B1.7:                         # Preds ..B1.6
        incl      %eax                                           #25.5
        cmpl      $1000000, %eax                                 #25.5
        jb        ..B1.5        # Prob 99%                       #25.5
                                # LOE rbx r12 r13 r14 r15 eax xmm0
..B1.8:                         # Preds ..B1.7
###         end do
###         !c = c + (m * a + b)
With the parentheses:
###         ! Perform an operation a number of times
###         do t = 1, NUMBER_OF_TRIALS;
###         do i = 1, ARRAY_SIZE
###             c(i) = c(i) + (m*a(i) + b(i))
        movsd     .L_2il0floatpacket.0(%rip), %xmm0              #27.20
        pxor      %xmm1, %xmm1                                   #27.13
                                # LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.5:                         # Preds ..B1.7 ..B1.4
        movsd     array_mp_a_(,%rax,8), %xmm2                    #27.30
        movaps    %xmm1, %xmm9                                   #27.13
        mulsd     %xmm0, %xmm2                                   #27.29
        pxor      %xmm10, %xmm10                                 #27.13
        movaps    %xmm1, %xmm8                                   #27.13
        movaps    %xmm1, %xmm7                                   #27.13
        movaps    %xmm1, %xmm6                                   #27.13
        movaps    %xmm1, %xmm5                                   #27.13
        movaps    %xmm1, %xmm4                                   #27.13
        movaps    %xmm1, %xmm3                                   #27.13
        xorl      %edx, %edx                                     #25.5
        addsd     array_mp_b_(,%rax,8), %xmm2                    #27.35
        unpcklpd  %xmm2, %xmm2                                   #27.35
                                # LOE rax rbx r12 r13 r14 r15 edx xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9 xmm10
..B1.6:                         # Preds ..B1.6 ..B1.5
        addl      $16, %edx                                      #25.5
        addpd     %xmm2, %xmm10                                  #27.13
        addpd     %xmm2, %xmm9                                   #27.13
        addpd     %xmm2, %xmm8                                   #27.13
        addpd     %xmm2, %xmm7                                   #27.13
        addpd     %xmm2, %xmm6                                   #27.13
        addpd     %xmm2, %xmm5                                   #27.13
        addpd     %xmm2, %xmm4                                   #27.13
        addpd     %xmm2, %xmm3                                   #27.13
        cmpl      $1000000, %edx                                 #25.5
        jb        ..B1.6        # Prob 99%                       #25.5
                                # LOE rax rbx r12 r13 r14 r15 edx xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9 xmm10
..B1.7:                         # Preds ..B1.6
        addpd     %xmm9, %xmm10                                  #27.13
        addpd     %xmm7, %xmm8                                   #27.13
        addpd     %xmm5, %xmm6                                   #27.13
        addpd     %xmm3, %xmm4                                   #27.13
        addpd     %xmm8, %xmm10                                  #27.13
        addpd     %xmm4, %xmm6                                   #27.13
        addpd     %xmm6, %xmm10                                  #27.13
        movaps    %xmm10, %xmm2                                  #27.13
        unpckhpd  %xmm10, %xmm2                                  #27.13
        addsd     %xmm2, %xmm10                                  #27.13
        addsd     array_mp_c_(,%rax,8), %xmm10                   #27.13
        movsd     %xmm10, array_mp_c_(,%rax,8)                   #27.13
        incq      %rax                                           #26.9
        cmpq      $1024, %rax                                    #26.9
        jb        ..B1.5        # Prob 99%                       #26.9
                                # LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.8:                         # Preds ..B1.7
###         end do
###         !c = c + (m * a + b)
###         end do
I don't know how to interpret this; can someone spell it out further?
C version (the 'compound assignment', or rather, a different order of operations?):
###     /* Perform an operation a number of times */
###     for (t=0; t < NUMBER_OF_TRIALS; t++) {
###         for (i=0; i < ARRAY_SIZE; i++) {
###             c[i] += m*a[i] + b[i];
        pxor      %xmm1, %xmm1                                   #27.4
                                # LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.4:                         # Preds ..B1.6 ..B1.3
        movsd     a(,%rax,8), %xmm2                              #27.15
        movaps    %xmm1, %xmm9                                   #27.4
        mulsd     %xmm0, %xmm2                                   #27.15
        pxor      %xmm10, %xmm10                                 #27.4
        movaps    %xmm1, %xmm8                                   #27.4
        movaps    %xmm1, %xmm7                                   #27.4
        movaps    %xmm1, %xmm6                                   #27.4
        movaps    %xmm1, %xmm5                                   #27.4
        movaps    %xmm1, %xmm4                                   #27.4
        movaps    %xmm1, %xmm3                                   #27.4
        xorl      %edx, %edx                                     #25.5
        addsd     b(,%rax,8), %xmm2                              #27.22
        unpcklpd  %xmm2, %xmm2                                   #27.22
                                # LOE rax rbx r12 r13 r14 r15 edx xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9 xmm10
..B1.5:                         # Preds ..B1.5 ..B1.4
        addl      $16, %edx                                      #25.5
        addpd     %xmm2, %xmm10                                  #27.4
        addpd     %xmm2, %xmm9                                   #27.4
        addpd     %xmm2, %xmm8                                   #27.4
        addpd     %xmm2, %xmm7                                   #27.4
        addpd     %xmm2, %xmm6                                   #27.4
        addpd     %xmm2, %xmm5                                   #27.4
        addpd     %xmm2, %xmm4                                   #27.4
        addpd     %xmm2, %xmm3                                   #27.4
        cmpl      $1000000, %edx                                 #25.5
        jb        ..B1.5        # Prob 99%                       #25.5
                                # LOE rax rbx r12 r13 r14 r15 edx xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9 xmm10
..B1.6:                         # Preds ..B1.5
        addpd     %xmm9, %xmm10                                  #27.4
        addpd     %xmm7, %xmm8                                   #27.4
        addpd     %xmm5, %xmm6                                   #27.4
        addpd     %xmm3, %xmm4                                   #27.4
        addpd     %xmm8, %xmm10                                  #27.4
        addpd     %xmm4, %xmm6                                   #27.4
        addpd     %xmm6, %xmm10                                  #27.4
        movaps    %xmm10, %xmm2                                  #27.4
        unpckhpd  %xmm10, %xmm2                                  #27.4
        addsd     %xmm2, %xmm10                                  #27.4
        addsd     c(,%rax,8), %xmm10                             #27.4
        movsd     %xmm10, c(,%rax,8)                             #27.4
        incq      %rax                                           #26.3
        cmpq      $1024, %rax                                    #26.3
        jb        ..B1.4        # Prob 99%                       #26.3
                                # LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.7:                         # Preds ..B1.6
###         }
###     }
The compiler is gaming your benchmark in the last two versions. It sees that it can switch the order of the loops, and so it avoids a lot of loads and stores. Try somehow reading or writing a random element of array c in the outer loop, or even calling a do-nothing subroutine with c as the actual argument (and make sure the compiler can't see that the subroutine actually does nothing, for example by dynamically linking it).
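A minimal sketch of that idea (the name 'dummy' and the file layout are only illustrative): build the routine into a separate, dynamically linked library so the optimizer cannot see its empty body.

    ! dummy.f90 -- compile separately (e.g. into a shared library)
    subroutine dummy(c, n)
        implicit none
        integer, intent(in) :: n
        double precision, intent(inout) :: c(n)
        ! Intentionally empty: its only purpose is to make c appear used.
    end subroutine dummy

Then call it from the outer loop of the benchmark:

    do t = 1, NUMBER_OF_TRIALS
        do i = 1, ARRAY_SIZE
            c(i) = c(i) + m*a(i) + b(i)
        end do
        call dummy(c, ARRAY_SIZE)   ! opaque call defeats the loop interchange
    end do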
I did that (called a 'dummy' subroutine, in a dynamically linked library, with c as the actual argument), and it prevented the loop interchange. I also learned to set -qopt-report to 2 or higher so that I could see the diagnostic messages about loop interchange.
Thanks
So did that make the performance results more consistent between the C and Fortran versions? Do you really have an old computer that doesn't have AVX, let alone FMA?
The parentheses do not (abstractly) require a change in the order of operations; however, in Fortran, adding the parentheses around the subexpression requires the use of a temporary (in this case, it added overhead and elided the vectorization).
Jim Dempsey
For this test, I worked on a Xeon E5-2670.
Looks like your processor has AVX but not FMA. You should be looking for a compiler switch that changes those %xmm* registers to %ymm* registers because that should give you roughly 2X the throughput since your arrays fit in cache.
Thanks, I've primarily been getting used to using '-xHost' for compiles. It appears that it changes the registers to %ymm. Do I need to explicitly call out '-fma' as a compiler option?
You can try -fma, but my guess is that it would cause an undefined-opcode fault at runtime on your machine.
