Solved: I have tested this now myself

Alexander_K_2 · ‎09-11-2015

Hi,

I have the following 1-D PIC Code from the Livermore Fortran Kernels. The following options with the compiler version 15.0.3 where used: ifort -O3 -mavx -align array64byte -ansi-alias -qopt-report=5 livermore.f

               fw= 1.000d0
c
 1014 DO   141  k= 1,n
            VX(k)= 0.0d0
            XX(k)= 0.0d0
            IX(k)= INT(  GRD(k))
            XI(k)= REAL( IX(k))
           EX1(k)= EX   ( IX(k))
          DEX1(k)= DEX  ( IX(k))
 141  CONTINUE
c
      DO   142  k= 1,n
            VX(k)= VX(k) + EX1(k) + (XX(k) - XI(k))*DEX1(k)
            XX(k)= XX(k) + VX(k)  + FLX
            IR(k)= XX(k)
            RX(k)= XX(k) - IR(k)
            IR(k)= MOD2N(  IR(k),2048) + 1
            XX(k)= RX(k) + IR(k)
 142  CONTINUE
c
      DO  14    k= 1,n
      RH(IR(k)  )= RH(IR(k)  ) + fw - RX(k)
      RH(IR(k)+1)= RH(IR(k)+1) + RX(k)
  14  CONTINUE
c
c...................
      IF( TEST(14) .GT. 0) GO TO 1014

The compiler merges the three inner loops into one. This is fine since they have the same length and also the operations do not conflict. However the vectorization fails because of a dependence betwwen RH in the last two lines.

  LOOP BEGIN at livermore_cray.f(2207,12)
      remark #25045: Fused Loops: ( 2207 2216 2225 )

      remark #15344: loop was not vectorized: vector dependence prevents vectorization
      remark #15346: vector dependence: assumed OUTPUT dependence between rh line 2226 and rh line 2227
      remark #15346: vector dependence: assumed OUTPUT dependence between rh line 2227 and rh line 2226
      remark #25015: Estimate of max trip count of loop=1001
   LOOP END

But the indices are definitely different because IR(k) and IR(k)+1 cannot be the same value. The only thing I can image is, that between two iterations these operations may be in conflict. But why is vectorization not possible? Shouldn't the iterations be somewhat independent even with loop unrolling?

jimdempseyatthecove · ‎09-11-2015

As I see it the first two loops should have been fused, but not all three. Insert a !DIR$ NOFUSION in front of the third loop.

The third loop has undeterminable indexing order. While k is determinable to be in sequence, IR(k) is not.

Jim Dempsey

View solution in original post

jimdempseyatthecove · ‎09-11-2015

As I see it the first two loops should have been fused, but not all three. Insert a !DIR$ NOFUSION in front of the third loop.

The third loop has undeterminable indexing order. While k is determinable to be in sequence, IR(k) is not.

Jim Dempsey

jimdempseyatthecove · ‎09-11-2015

Also, the third loop can be vectorized with a vector length of 2 whereas the first two loops may have been vectorized with a longer vector length.

!DIR$ NOFUSION
!DIR$ SIMD VECTORLENGTH(2)
do 14 k=1,n

Jim Dempsey

jimdempseyatthecove · ‎09-11-2015

Or:

      DO 14    k= 1,n
      TEMP(1) = fw - RX(k)
      TEMP(2) = RX(k)
      !DIR$ SIMD
      RH(IR(k):IR(k)+1)= RH(IR(k):IR(k)+1) + TEMP(1:2)
14 CONTINUE

Jim Dempsey

Martyn_C_Intel · ‎09-11-2015

I like Jim's first reply best. The last loop is a scatter; RH(IR(k)) and RH(IR(k)+1) may be different within a single iteration, but RH(IR(k)) from two different iterations may both point to the same memory location. The compiler cannot vectorize this simply and safely, at least for instruction sets up to and including Intel AVX2.

I don't think the compiler should be fusing this loop with the earlier ones. We'll see what the developers say. Using a NOFUSION directive is a good workaround - that's the sort of thing it was designed for.

Whilst it may be technically possible to create and vectorize an inner loop over k and k+1, I doubt that would be worthwhile. Better to keep things simple.

TimP · ‎09-11-2015

My version from long ago (when ifort still had SSE1 vectorization), with the newly required directive:

C***********************************************************************
C*** KERNEL 14 1-D PIC Particle In Cell
C***********************************************************************
C
C
do

! vectorizable by SSE2, not SSE1
                  ix(:n)= grd(:n)
C               DO k= 1,n
C                 vx(k)= ex(ix(k))-ix(k)*dex(ix(k))
C                 ir(k)= vx(k)+flx
C               END DO
              vx(:n)= ex(ix(:n))-ix(:n)*dex(ix(:n))
              ir(:n)= vx(:n)+flx
! vectorizable by SSE2, not SSE1
                  rx(:n)= vx(:n)+flx-ir(:n)
! vectorizable by SSE1
                  ir(:n)= iand(ir(:n),2047)+1
! vectorizable by SSE2, not SSE1
                  xx(:n)= rx(:n)+ir(:n)

!dir$ no fusion ! needed for 2015 ifort at -O3
                DO k= 1,n
                      rh(ir(k))= rh(ir(k))+fw-rx(k)
                      rh(ir(k)+1)= rh(ir(k)+1)+rx(k)
                END DO
                IF(test(14) <= 0)              EXIT

                END DO

At -O3 AVX2, ifort 16.0 reports the bad fusion which ends up preventing vectorization beyond the first 2 assignments, until the NO FUSION directive is added before the non-vectorizable loop. Then the rest is fused into a single loop with expected vector speedup of 4.5.

When the NO FUSION directive was introduced several years ago, to replace a usage of the less descriptive DISTRIBUTE POINT, it wasn't well advertised. This case shows yet one more instance where the compiler no longer pays attention to the consequence of fusion and needs the directive, as a consequence of setting -O3.

I have discussed before the equivalent use of !$omp simd to place a barrier against fusion. Here it would require explicitly fusing the first 2 original old style loops into a single omp simd loop; then the compiler would not attempt to merge in the remaining loop. The advantage I see is that it accomplishes the same thing without Intel proprietary directives.

If I write

rh(ir(k):ir(k)+1)= rh(ir(k):ir(k)+1)+[fw,0.]-[1,-1.]*rx(k:k+1)

Apparently the compiler is concerned about the case where indexing may go out of bounds. An unaligned vector of 2 elements isn't worth much anyway. If I force it by simd directive, as Jim suggested, it reports expected speedup of 0.6 (a loss).

jimdempseyatthecove · ‎09-12-2015

>>If I force it by simd directive, as Jim suggested, it reports expected speedup of 0.6 (a loss).

What is the actual speedup?

If the underlying architecture supports unaligned load/store, then the SIMDized loop on average over random index pairs (ir(k):ir(K)+1) will have half aligned on two element boundary. (2x + 0.6x) / 2 or 1.3x speedup.

It is easy enough to run the test as opposed to speculate.

Jim Dempsey

jimdempseyatthecove · ‎09-12-2015

>>use of !$omp simd to place a barrier against fusion... it accomplishes the same thing without Intel proprietary directives.

But then your code breaks when compiled without -Qopenmp.

The correct way is to either not fuse the loops or provide a compiler directive to subvert loop fusion.

If a non-Intel compiler does not support the !DIR$ NOFUSION then by convention the directive should be considered a comment (possibly with warning), and if further that compiler fuses the loops and breaks the code, then it is the fault of the compiler for fusing the loops.

Jim Dempsey

TimP · ‎09-12-2015

I don't have a Linux avx box but will check on Windows. Other kernels of this benchmark need openmp or openmp-simd so I don't see that as an objection.

TimP · ‎09-12-2015

According to the report which the benchmark issues for the full timing loop including both the vectorized and non-vectorized parts, the vectorization of the fully vectorizable part gives me just under twice the performance of non-vector version. Forcing the additional local vector length 2 vectorization reduces performance 8% from peak. These are single precision AVX2 Windows numbers. I believe the double precision is considered more important.

Alexander_K_2 · ‎09-13-2015

Thanks for the comments so far. I will test it when I am back in the office on Monday.

jimdempseyatthecove · ‎09-13-2015

TimP,

Thanks for making the test run. Can you re-run using REAL(8).

The SSE instructions do not have a mask load/store so you cannot do two of the four single precision lanes. Double precision will not require a mask load/store

The AVX instruction set has masked load/store for single precision... however the mask is not specifiable using an immediate (iow it requires the use of an ymm register loaded from memory).

It is not until you get to KNC and AVX512 that you have the _mm512_kmov

Vector length of 2 for floats will be problematic (performance wise) until you have available

a) masked load/store with immediate specifying mask
b) masked gather/scatter (with immediate specifying stride 1 and mask)

Jim Dempsey

TimP · ‎09-13-2015

In the double precision case, AVX2 vectorization of the code up to !dir$ NO FUSION gives me a 60% overall speedup for the medium size case. The compiler's prediction for the vectorized portion is a speedup of 1.9. The full size case begins to be limited by memory issues on my laptop. In the code displayed by -S, there is both an AVX-256 and an AVX-128 version of the loop. -align:aray32byte suppresses the AVX-256 version, which apparently will not be used unless array alignment is less than 16-byte. Note that due to the gather operations, 32-byte array alignments don't allow for much 32-byte access.

Although the compiler's prediction of performance of the double precision length 2 inner loop AVX2-128 vectorization makes it look like a wash, the actual performance is poor. AVX-128 eliminates need for a mask, and masked stores are slow, but that doesn't appear to be a useful key to outsmart the compiler.

I haven't been able to install the current VTune on my Haswell Windows platform, although the beta version was OK. Advisor isn't capturing this kernel in its survey report unless I build with -align:array32byte. Then (without Qunroll) it quotes a speedup larger than the compiler prediction. With Qunroll it captures only the remainder loop (which spend almost no time, as should be the case.) Surprisingly, the alignment option ruins the performance of the MKL dgemm call in another part of the benchmark.

Alexander_K_2 · ‎09-14-2015

I have tested this now myself. I only used AVX and not AVX2. I was also using only ifort 15 not 16 like Tim did.

First, I run a test without the suggested modifications. According to the benchmark output this baseline test resulted in 1570.9681 MFlops with the long DO Span (1001). All other results are also the long span.

When I added only the nofusion directive, the first two loops are vectorized while the last one is not. The reported estimated speedup for the first two loops is 2.76. The reported MFlops are now 2049.5236 which is a speedup of about factor 1.3.

Forcing the vectorization of the last loop with vectorlength 2 resulted in an estimated speedup of 1,19. But the actual MFlops are now with 1967.9301 slightly below the nofusion variant. This is roughly the same result as Tim got.

jimdempseyatthecove · ‎09-14-2015

Thanks for running the tests. It is good to know to be aware that code that can vectorize using vector lengths shorter than vector width do not always run faster than scalar code given current state of the compiler optimizations. This might be a good test case for the Intel optimization team to look at. It is not unusual for code to contain shorter vectors than the full vector width. In this case, but in a lot of cases it will be three. Calculating the distance between two point (stored as AOS) comes to mind.

It would be interesting to revisit this on architectures that support the __mmask8/__mmask16 instruction formats as well as an immediate supplying the mask.

Jim Dempsey

TimP · ‎09-14-2015

The alignment option degrades LFK kernels 9 and 18. This is off the original topic, but it's annoying to find that an option which helps more than it should on the cited case is degrading results overall (even if it's due to code which is written so as to be unfavorable to alignment). On kernel 18, I was using omp do collapse in order to be able to benefit from a larger number of threads, but (not surprisingly) collapse degrades performance on my present dual core CPU (and the benchmark is hardly substantial enough to justify more than 2 cores). The data alignment option worsened performance regardless of threading strategy.

Kernel 9 (and 10) don't benefit significantly from vectorization because the subscripts are backwards, so I'm surprised at the sensitivity to alignment.

Kernel 2 is vectorizable, but I don't get nearly as good vector performance as with a version which is written for greater register locality. It uses only 6 xmm registers even then.

I hadn't seen much before of ifort automatically choosing AVX-128. opt-report4 notates such loops with "vector length 2." It's likely due to the small maximum trip count set internally by the compiler according to the declared array sizes. The compiler sees that the loops aren't big enough to use AVX-256. Not surprisingly, compiler predicted vector speedups are less than 2, and when Advisor reports more than that, I don't believe it. As to the cited kernel 14, the compiler hasn't continued to generate the AVX-128 version, and now I have an Advisor run capturing the AVX2-256 (where the time is spent, quoted as vector speedup of 3.4) with AVX-128 remainder

Why does the code not vectorize