- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have the following 1-D PIC Code from the Livermore Fortran Kernels. The following options with the compiler version 15.0.3 where used: ifort -O3 -mavx -align array64byte -ansi-alias -qopt-report=5 livermore.f
fw= 1.000d0 c 1014 DO 141 k= 1,n VX(k)= 0.0d0 XX(k)= 0.0d0 IX(k)= INT( GRD(k)) XI(k)= REAL( IX(k)) EX1(k)= EX ( IX(k)) DEX1(k)= DEX ( IX(k)) 141 CONTINUE c DO 142 k= 1,n VX(k)= VX(k) + EX1(k) + (XX(k) - XI(k))*DEX1(k) XX(k)= XX(k) + VX(k) + FLX IR(k)= XX(k) RX(k)= XX(k) - IR(k) IR(k)= MOD2N( IR(k),2048) + 1 XX(k)= RX(k) + IR(k) 142 CONTINUE c DO 14 k= 1,n RH(IR(k) )= RH(IR(k) ) + fw - RX(k) RH(IR(k)+1)= RH(IR(k)+1) + RX(k) 14 CONTINUE c c................... IF( TEST(14) .GT. 0) GO TO 1014
The compiler merges the three inner loops into one. This is fine since they have the same length and also the operations do not conflict. However the vectorization fails because of a dependence betwwen RH in the last two lines.
LOOP BEGIN at livermore_cray.f(2207,12) remark #25045: Fused Loops: ( 2207 2216 2225 ) remark #15344: loop was not vectorized: vector dependence prevents vectorization remark #15346: vector dependence: assumed OUTPUT dependence between rh line 2226 and rh line 2227 remark #15346: vector dependence: assumed OUTPUT dependence between rh line 2227 and rh line 2226 remark #25015: Estimate of max trip count of loop=1001 LOOP END
But the indices are definitely different because IR(k) and IR(k)+1 cannot be the same value. The only thing I can image is, that between two iterations these operations may be in conflict. But why is vectorization not possible? Shouldn't the iterations be somewhat independent even with loop unrolling?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
As I see it the first two loops should have been fused, but not all three. Insert a !DIR$ NOFUSION in front of the third loop.
The third loop has undeterminable indexing order. While k is determinable to be in sequence, IR(k) is not.
Jim Dempsey
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
As I see it the first two loops should have been fused, but not all three. Insert a !DIR$ NOFUSION in front of the third loop.
The third loop has undeterminable indexing order. While k is determinable to be in sequence, IR(k) is not.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Also, the third loop can be vectorized with a vector length of 2 whereas the first two loops may have been vectorized with a longer vector length.
!DIR$ NOFUSION
!DIR$ SIMD VECTORLENGTH(2)
do 14 k=1,n
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Or:
DO 14 k= 1,n
TEMP(1) = fw - RX(k)
TEMP(2) = RX(k)
!DIR$ SIMD
RH(IR(k):IR(k)+1)= RH(IR(k):IR(k)+1) + TEMP(1:2)
14 CONTINUE
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I like Jim's first reply best. The last loop is a scatter; RH(IR(k)) and RH(IR(k)+1) may be different within a single iteration, but RH(IR(k)) from two different iterations may both point to the same memory location. The compiler cannot vectorize this simply and safely, at least for instruction sets up to and including Intel AVX2.
I don't think the compiler should be fusing this loop with the earlier ones. We'll see what the developers say. Using a NOFUSION directive is a good workaround - that's the sort of thing it was designed for.
Whilst it may be technically possible to create and vectorize an inner loop over k and k+1, I doubt that would be worthwhile. Better to keep things simple.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
My version from long ago (when ifort still had SSE1 vectorization), with the newly required directive:
C***********************************************************************
C*** KERNEL 14 1-D PIC Particle In Cell
C***********************************************************************
C
C
do
! vectorizable by SSE2, not SSE1
ix(:n)= grd(:n)
C DO k= 1,n
C vx(k)= ex(ix(k))-ix(k)*dex(ix(k))
C ir(k)= vx(k)+flx
C END DO
vx(:n)= ex(ix(:n))-ix(:n)*dex(ix(:n))
ir(:n)= vx(:n)+flx
! vectorizable by SSE2, not SSE1
rx(:n)= vx(:n)+flx-ir(:n)
! vectorizable by SSE1
ir(:n)= iand(ir(:n),2047)+1
! vectorizable by SSE2, not SSE1
xx(:n)= rx(:n)+ir(:n)
!dir$ no fusion ! needed for 2015 ifort at -O3
DO k= 1,n
rh(ir(k))= rh(ir(k))+fw-rx(k)
rh(ir(k)+1)= rh(ir(k)+1)+rx(k)
END DO
IF(test(14) <= 0) EXIT
END DO
At -O3 AVX2, ifort 16.0 reports the bad fusion which ends up preventing vectorization beyond the first 2 assignments, until the NO FUSION directive is added before the non-vectorizable loop. Then the rest is fused into a single loop with expected vector speedup of 4.5.
When the NO FUSION directive was introduced several years ago, to replace a usage of the less descriptive DISTRIBUTE POINT, it wasn't well advertised. This case shows yet one more instance where the compiler no longer pays attention to the consequence of fusion and needs the directive, as a consequence of setting -O3.
I have discussed before the equivalent use of !$omp simd to place a barrier against fusion. Here it would require explicitly fusing the first 2 original old style loops into a single omp simd loop; then the compiler would not attempt to merge in the remaining loop. The advantage I see is that it accomplishes the same thing without Intel proprietary directives.
If I write
rh(ir(k):ir(k)+1)= rh(ir(k):ir(k)+1)+[fw,0.]-[1,-1.]*rx(k:k+1)
Apparently the compiler is concerned about the case where indexing may go out of bounds. An unaligned vector of 2 elements isn't worth much anyway. If I force it by simd directive, as Jim suggested, it reports expected speedup of 0.6 (a loss).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>If I force it by simd directive, as Jim suggested, it reports expected speedup of 0.6 (a loss).
What is the actual speedup?
If the underlying architecture supports unaligned load/store, then the SIMDized loop on average over random index pairs (ir(k):ir(K)+1) will have half aligned on two element boundary. (2x + 0.6x) / 2 or 1.3x speedup.
It is easy enough to run the test as opposed to speculate.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>use of !$omp simd to place a barrier against fusion... it accomplishes the same thing without Intel proprietary directives.
But then your code breaks when compiled without -Qopenmp.
The correct way is to either not fuse the loops or provide a compiler directive to subvert loop fusion.
If a non-Intel compiler does not support the !DIR$ NOFUSION then by convention the directive should be considered a comment (possibly with warning), and if further that compiler fuses the loops and breaks the code, then it is the fault of the compiler for fusing the loops.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
According to the report which the benchmark issues for the full timing loop including both the vectorized and non-vectorized parts, the vectorization of the fully vectorizable part gives me just under twice the performance of non-vector version. Forcing the additional local vector length 2 vectorization reduces performance 8% from peak. These are single precision AVX2 Windows numbers. I believe the double precision is considered more important.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the comments so far. I will test it when I am back in the office on Monday.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
TimP,
Thanks for making the test run. Can you re-run using REAL(8).
The SSE instructions do not have a mask load/store so you cannot do two of the four single precision lanes. Double precision will not require a mask load/store
The AVX instruction set has masked load/store for single precision... however the mask is not specifiable using an immediate (iow it requires the use of an ymm register loaded from memory).
It is not until you get to KNC and AVX512 that you have the _mm512_kmov
Vector length of 2 for floats will be problematic (performance wise) until you have available
a) masked load/store with immediate specifying mask
b) masked gather/scatter (with immediate specifying stride 1 and mask)
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
In the double precision case, AVX2 vectorization of the code up to !dir$ NO FUSION gives me a 60% overall speedup for the medium size case. The compiler's prediction for the vectorized portion is a speedup of 1.9. The full size case begins to be limited by memory issues on my laptop. In the code displayed by -S, there is both an AVX-256 and an AVX-128 version of the loop. -align:aray32byte suppresses the AVX-256 version, which apparently will not be used unless array alignment is less than 16-byte. Note that due to the gather operations, 32-byte array alignments don't allow for much 32-byte access.
Although the compiler's prediction of performance of the double precision length 2 inner loop AVX2-128 vectorization makes it look like a wash, the actual performance is poor. AVX-128 eliminates need for a mask, and masked stores are slow, but that doesn't appear to be a useful key to outsmart the compiler.
I haven't been able to install the current VTune on my Haswell Windows platform, although the beta version was OK. Advisor isn't capturing this kernel in its survey report unless I build with -align:array32byte. Then (without Qunroll) it quotes a speedup larger than the compiler prediction. With Qunroll it captures only the remainder loop (which spend almost no time, as should be the case.) Surprisingly, the alignment option ruins the performance of the MKL dgemm call in another part of the benchmark.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have tested this now myself. I only used AVX and not AVX2. I was also using only ifort 15 not 16 like Tim did.
First, I run a test without the suggested modifications. According to the benchmark output this baseline test resulted in 1570.9681 MFlops with the long DO Span (1001). All other results are also the long span.
When I added only the nofusion directive, the first two loops are vectorized while the last one is not. The reported estimated speedup for the first two loops is 2.76. The reported MFlops are now 2049.5236 which is a speedup of about factor 1.3.
Forcing the vectorization of the last loop with vectorlength 2 resulted in an estimated speedup of 1,19. But the actual MFlops are now with 1967.9301 slightly below the nofusion variant. This is roughly the same result as Tim got.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for running the tests. It is good to know to be aware that code that can vectorize using vector lengths shorter than vector width do not always run faster than scalar code given current state of the compiler optimizations. This might be a good test case for the Intel optimization team to look at. It is not unusual for code to contain shorter vectors than the full vector width. In this case, but in a lot of cases it will be three. Calculating the distance between two point (stored as AOS) comes to mind.
It would be interesting to revisit this on architectures that support the __mmask8/__mmask16 instruction formats as well as an immediate supplying the mask.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The alignment option degrades LFK kernels 9 and 18. This is off the original topic, but it's annoying to find that an option which helps more than it should on the cited case is degrading results overall (even if it's due to code which is written so as to be unfavorable to alignment). On kernel 18, I was using omp do collapse in order to be able to benefit from a larger number of threads, but (not surprisingly) collapse degrades performance on my present dual core CPU (and the benchmark is hardly substantial enough to justify more than 2 cores). The data alignment option worsened performance regardless of threading strategy.
Kernel 9 (and 10) don't benefit significantly from vectorization because the subscripts are backwards, so I'm surprised at the sensitivity to alignment.
Kernel 2 is vectorizable, but I don't get nearly as good vector performance as with a version which is written for greater register locality. It uses only 6 xmm registers even then.
I hadn't seen much before of ifort automatically choosing AVX-128. opt-report4 notates such loops with "vector length 2." It's likely due to the small maximum trip count set internally by the compiler according to the declared array sizes. The compiler sees that the loops aren't big enough to use AVX-256. Not surprisingly, compiler predicted vector speedups are less than 2, and when Advisor reports more than that, I don't believe it. As to the cited kernel 14, the compiler hasn't continued to generate the AVX-128 version, and now I have an Advisor run capturing the AVX2-256 (where the time is spent, quoted as vector speedup of 3.4) with AVX-128 remainder

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page