DIV by 0.0, vectorization and optimization

jimdempseyatthecove · ‎01-23-2015

It is not unusual for physics simulation programs to use unit vectors.
Physics simulation programs tend to have large sets of data and can benefit significantly through vectorization.

In these types of programs it may be rare, but not impossible, to have a separation of particles of 0.0, thus making the statement(s) that produces the unit vector containing a divide by 0.0. In the cases where separation is 0.0, you do not want to abort the program, instead you usually want to do something else, such as produce a unit vector of 0.0, or alternately a random unit vector.

A defensive measure is to insert a test for 0.0, and when found, perform the alternate substitution.

The problem with this is the test quashes vectorization.

ab(index,1:3) = b(index,1:3) - a(index,1:3)
separation(index) = sqrt(ab(index,1)**2 + ab(index,2)**2 + ab(index,3)**2)
if(separation(index) .eq. 0.0) then
  uv(index,1:3) = 0.0
else
  uv(index,1:3) = ab(index,1:3) / separation(index)
endif

What I would like to suggest is for the compiler optimization gurus to recognize the code sequence is a divide by 0.0 defensive code sequence with a known substitution, in this case 0.0, but it could be someOtherValue(1:3).

Then, when determined safe, have the loop disable divide by zero fault an substitute

uv(index,1:3) = ab(index,1:3) / separation(index)
if(uv(index,1) .eq. Infinity) uv(index,1) = Substitutio(1)
if(uv(index,2) .eq. Infinity) uv(index,2) = Substitutio(2)
if(uv(index,3) .eq. Infinity) uv(index,3) = Substitutio(3)

While the above looks ugly in source code, it can use masked moves (amongst temporary vector registers)

The above optimization removes code branching and thus improves probability of vectorizing loops containing defensive code for divide by 0.

Jim Dempsey

TimP · ‎01-23-2015

Writing such conditionals as MERGE is more likely to promote simd masked move code generation.

jimdempseyatthecove · ‎01-23-2015

Tim,

The problem is, for the most part, a divide by zero aborts the program.

If the MERGE were to perform the test for 0.0 then take one of two code paths, then this would quash vectorization.

If the MERGE were to perform evaluation for both code paths, then use masked moves dependent on test, then it would generate a divide by 0.0 fault. Meaning you would have to explicitly toggle the fpdz flag.

Though explicitly manipulating the isn't a bad thing. Something like this:

logical :: was_divide_by_zero
...
! save /0.0 flag
call iee_get_flag(iee_divide_by_zero, was_divide_by_zero)
! set /0.0 flag to not generate fault
call iee_set_flag(iee_divide_by_zero, .false.)

  ab(1:n,1:3) = b(1:n,1:3) - a(1:n,1:3)
  separation(1:n) = sqrt(ab(1:n,1)**2 + ab(1:n,2)**2 + ab(1:n,3)**2)
  uv(1:n,1:3) = ab(1:n,1:3) / separation(1:n)
  uv(1:n,1:3) = merge(uv(1:n,1:3), 0.0, separation(1:n) == 0.0)

call iee_set_flag(iee_divide_by_zero, was_divide_by_zero)
...

Or possibly:

logical :: was_divide_by_zero
...
! save /0.0 flag
call iee_get_flag(iee_divide_by_zero, was_divide_by_zero)
! set /0.0 flag to not generate fault
call iee_set_flag(iee_divide_by_zero, .false.)

  ab(1:n,1:3) = b(1:n,1:3) - a(1:n,1:3)
  separation(1:n) = sqrt(ab(1:n,1)**2 + ab(1:n,2)**2 + ab(1:n,3)**2)
  uv(1:n,1:3) = merge(ab(1:n,1:3) / separation(1:n), 0.0, separation(1:n) == 0.0)

call iee_set_flag(iee_divide_by_zero, was_divide_by_zero)

Jim Dempsey

andrew_4619 · ‎01-23-2015

Tim Prince wrote:

Writing such conditionals as MERGE is more likely to promote simd masked move code generation.

And another thing I have learned today "MERGE", very useful thanks Tim, BTW Jim I always use the NORM2 intrinsic for vector magnitudes, perhaps more optimised, perhaps no difference but less typing and a cleaner looking code....

TimP · ‎01-23-2015

simd masked vector code isn't well suited to capture of exceptions under USE ieee_arithmetic. I would expect this to be one of those situations for which the documented requirement of setting /fp:strict applies, and that may prevent vectorization. However, I haven't done any testing of such situations.

Aggressive vectorization would produce Inf for the result of division by 0. then silently proceed with masked replacement. When you said you were looking for a way to enable masked moves, I assumed you didn't care about recording exceptions.

andrew_4619 · ‎01-23-2015

What about:

ab(index,1:3) = b(index,1:3) - a(index,1:3)
separation(index) = norm2( ab(index,1:3) )
if(separation(index) .eq. 0.0) separation(index)=1.0
uv(index,1:3) = ab(index,1:3) / separation(index)

TimP · ‎01-23-2015

Along this line,

uv(index,1:3) = ab(index,1:3) / merge(huge(separation(index)),separation(index),separation(index)==0)

appears to offer closer to the original request while avoiding the production of an Inf intermediate result.

If the objective is to extend the replacement to more cases where overflow might occur, the condition might be changed, e.g. to

abs(separation(index)) < tiny(separation(index)) but that might amount to the same thing if building with -ftz (which is no longer as likely a choice as it was before introduction of Sandy Bridge CPU).

jimdempseyatthecove · ‎01-23-2015

That would work too (I actually do this elsewhere in my code). Although in those places I use a temporary as the separation is used elsewhere:

ab(index,1:3) = b(index,1:3) - a(index,1:3)
separation(index) = norm2( ab(index,1:3) )
if(separation(index) .eq. 0.0)
  divisor = 1.0 ! works because ab(index,1:3) == 0.0
else
  divisor = separation(index)
uv(index,1:3) = ab(index,1:3) / divisor

Thanks for the reminder.

I will check out using norm2. I want to be sure that it does not use the fast sqrt instructions (with less precision).

sqrt is a vectorizable intrinsic. I am not sure if it follows /Qimf-precision:high. Can you shed some light on this Tim?

Also, is there a way to define to the compiler the precision of a specific intrinsic.

*** Hypothetical sketch code:

!DEC$ sqrt:precision=push(high) ! push current, set to high
result = sqrt(value)
!DEC$ sqrt:precision=pop() ! restore to prior precision

Jim Dempsey

Steven_L_Intel1 · ‎01-23-2015

There is not that fine-grained control of intrinsics, though I suppose you could write your own "my_sqrt" routine that was a VECTOR routine (for vectorization) and compile that with the specific option you want. I am pretty sure that vectorized sqrt is affected by /Qimf-precision.

mecej4 · ‎01-23-2015

Jim, have you considered another way of expressing the algorithm which the compiler appears to vectorize nicely? In the following example, the WHERE construct is used to exclude division by zero-valued elements of sep:

subroutine dist(a,b,ab,sep,uv,n)
integer n
real :: ab(n,3),a(n,3),b(n,3),uv(n,3),sep(n)

ab = b - a
sep = sqrt(ab(:,1)**2 + ab(:,2)**2 + ab(:,3)**2)

uv(:,1:3)=0.0
where(sep .ne. 0.0)
   uv(:,1) = uv(:,1)/sep
   uv(:,2) = uv(:,2)/sep
   uv(:,3) = uv(:,3)/sep
end where
return
end subroutine dist

I did a cursory visual inspection of the pseudo-assembler output of the compiler when run with /fpe0 /Qxhost on a Haswell i5 laptop. The only jcc instructions that I saw were for controlling loops, and the rest of the code appeared to be nicely vectorized. The readability of the Fortran is helped by the presence of the WHERE construct, as well.

If you think that setting all of uv to zero at first is wasteful, the ELSE WHERE feature could be used instead. Likewise, the ".ne. 0.0" could be replaced by ".le. TINY(0.0)".

TimP · ‎01-23-2015

Under /Qprec-sqrt- (the default), vectorized sqrt is under the control of /Qimf-precision and the like, as Steve mentioned. The default is nearly IEEE precision (except for corner cases), but if you want IEEE sqrt (for both vectorized and not), you should set /Qprec-sqrt. This option has little adverse effect on performance under 128-bit simd since Harpertown nor under AVX since Ivy Bridge. The possible reduced precision of sqrt should not show up except under vectorization, as it is considered as a "throughput" optimization (allowing the pipeline to keep running on independent calculations).

I wouldn't expect the use of norm2 intrinsic to change directly the implementation of sqrt. The odd thing about norm2 is the (optional) recommendation of the standard that it include protection against over- or under-flow, which would be expected to take significantly longer and possibly incur different roundoff behavior. The way this might be done is by intermediate promotion from single to double, and with additional conditional choices for double. Ifort uses these methods for complex abs() when /Qcomplex-limited-range (or /fp:fast=2) is not set. Unfortunately, /Qvec-report doesn't show when the careful treatment prevents full vectorization.

Apparently, the poor performance of IEEE sqrt on current MIC has led to the retention of the same defaults of /Qprec-div- and /Qprec-sqrt- which were introduced for Pentium III and 4. I haven't seen any indication whether this will change in the next MIC architecture.

jimdempseyatthecove · ‎01-23-2015

Mecej4

The problem with your suggestion is, in order to provide for divide by zero protection a branch must be made (else the /sep faults).

the IF(sep == 0.0) sep = 1 fixes the issue without a branch.

assume ab(1,1:3) is a non-null vector, and ab(2,1:3) is a null vector and the remainder are not. Assuming AVX and REAL(8) the internal small vectors are 4-wide, ab(1:4,1), ab(1:4,2) and ab(1:4,3) cannot take both paths in the same instruction sequence. What the programmer needs to do (until the compiler writers can do this) is to produce a code sequence that converts to an instruction sequence that vectorizes without branch (and without /0.0)

sep(1:4) = sqrt(ab(1:4,1)**2 +ab(1:4,2)**2 + ab(1:4,3)**2)

! now determine uv(1:4,1:3) with no branch and without /0.0
sep(1:4) = where(sep(1:4),1.0_8, sep(1:4) == 0.0_8)
! the where should vectorize the compare and then use a masked move all as vectors

uv(1:4,1) = ab(1:4,1) / sep(1:4) ! or uv(1:4,1:3) = ab(1:4,1:3) / sep(1:4) when compiler permits
uv(1:4,2) = ab(1:4,2) / sep(1:4) ! when ab(2,1:3) is a null vector
uv(1:4,3) = ab(1:4,3) / sep(1:4)

uv(1:4,1:3) = where(uv(1:4,1:3), substitute(1:4), sep(1:4) /= 0.0_8)

(you may have to do the above for each of the last dimension individually)

*** you would let the compiler choose the small vector width, the above is for illustrative purposes **

app4619 has the correct solution (concept)... produce a 0.0/1.0.

Jim Dempsey