Missed optimization opportunity ANY(array .eq. 0.0)

jimdempseyatthecove · 02-10-2016

It is a common occurrence to test a result array for conditions.

vector_mod = mod(vector_num, vector_i)
!dir$ if(.true.)
  if(ANY(vector_mod .eq. 0)) return
!dir$ else
  do j=1,vector_length
    if(vector_mod(j) .eq. 0) return
  end do
!dir$ endif

Here either the ANY intrinsic or the short loop performs a relational operation between an array and a scalar.

IVF V16.0 update 1 on Windows generates scalar code for both !dir$ branches.
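For anyone who wants to inspect the generated code themselves, the fragment above can be wrapped as a self-contained test case. This is only a sketch: the module and function names are made up here, and whether either form vectorizes depends on the compiler version, options, and target.

```fortran
module any_zero_forms
  implicit none
contains
  ! Form 1: ANY over an array-versus-scalar comparison.
  pure logical function has_factor_any(vector_num, vector_i) result(found)
    integer, intent(in) :: vector_num(:), vector_i(:)
    integer :: vector_mod(size(vector_num))
    vector_mod = mod(vector_num, vector_i)
    found = any(vector_mod == 0)
  end function has_factor_any

  ! Form 2: explicit short loop with an early exit.
  pure logical function has_factor_loop(vector_num, vector_i) result(found)
    integer, intent(in) :: vector_num(:), vector_i(:)
    integer :: vector_mod(size(vector_num))
    integer :: j
    vector_mod = mod(vector_num, vector_i)
    found = .false.
    do j = 1, size(vector_mod)
      if (vector_mod(j) == 0) then
        found = .true.
        return
      end if
    end do
  end function has_factor_loop
end module any_zero_forms
```

Compiling this with the vectorization report enabled shows which of the two forms, if either, the compiler vectorizes.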

On the MIC, _mm512_mask_cmpeq_epi32_mask (which returns a __mmask16) and related instructions are available and could make quick work of this determination using vectors.

As for usefulness, it is not unusual for code to contain:

a) Defensive code to detect a potential divide by 0.0: ANY(array .eq. 0.0)
b) Convergence code to detect convergence: ANY(array .lt. bingo) or ANY(abs(array) .lt. bingo)

As an additional request:

c) ANY(isNaN(realArray))

This should be vectorized and should not call for_is_nan_s_ (or d).

You might want to extend this to an intrinsic isNormal (inlined and vectorized).
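For reference, the standard-conforming spelling of these two tests uses the elemental functions ieee_is_nan and ieee_is_normal from the intrinsic module ieee_arithmetic. The sketch below only shows the source form; the wrapper names are invented for illustration, and whether a given compiler inlines and vectorizes these calls is exactly the open question raised here.

```fortran
module ieee_checks
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_is_normal
  implicit none
contains
  ! True if at least one element is a NaN.
  pure logical function any_nan(a)
    real, intent(in) :: a(:)
    any_nan = any(ieee_is_nan(a))
  end function any_nan

  ! True if every element is a normal number or a zero
  ! (no NaN, no infinity, no subnormal).
  pure logical function all_normal(a)
    real, intent(in) :: a(:)
    all_normal = all(ieee_is_normal(a))
  end function all_normal
end module ieee_checks
```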

When I write simulation code that contains convergence routines and/or may produce 3D vector lengths of 0.0, I must insert defensive code to test for unusual (exception) conditions, and these tests typically do not vectorize (and are not inlined). Are there others on this forum who are annoyed by the lack of vectorization in this area, and can you estimate to what extent it impacts your performance?

Jim Dempsey

jimdempseyatthecove · 02-10-2016

For example take the isNaN:

The standard single precision (32-bit) NaN would be: s111 1111 1xxx xxxx xxxx xxxx xxxx xxxx where s is the sign (most often ignored in applications) and x is non-zero (the value zero encodes infinities). Therefore, an isNaN test that also catches infinities would be:

s111 1111 1xxx xxxx xxxx xxxx xxxx xxxx
        bitwise OR with
1000 0000 0111 1111 1111 1111 1111 1111
       bitwise EQ  (the vcmpd)
1111 1111 1111 1111 1111 1111 1111 1111

The above is fully vectorizable.
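The OR-and-compare test above can be written in standard Fortran by reinterpreting the bits with TRANSFER. A minimal sketch, assuming real32 is IEEE 754 single precision; the module, function, and constant names are illustrative, and nothing here guarantees that a particular compiler vectorizes it.

```fortran
module nan_or_inf_bits
  use, intrinsic :: iso_fortran_env, only: int32, real32
  implicit none
  ! Exponent field of an IEEE single: 0111 1111 1000 0000 ... 0000
  integer(int32), parameter :: exp_mask = int(z'7F800000', int32)
  ! Sign and mantissa bits: 1000 0000 0111 1111 ... 1111
  integer(int32), parameter :: or_mask = not(exp_mask)
contains
  ! True if any element has an all-ones exponent (NaN or infinity).
  pure logical function any_nan_or_inf(a)
    real(real32), intent(in) :: a(:)
    integer(int32) :: bits(size(a))
    bits = transfer(a, bits)                ! reinterpret, no conversion
    ! After the OR, only an all-ones exponent yields all 32 bits set (-1).
    any_nan_or_inf = any(ior(bits, or_mask) == -1_int32)
  end function any_nan_or_inf
end module nan_or_inf_bits
```

An equivalent formulation is iand(bits, exp_mask) == exp_mask; the OR form is used here only because it mirrors the bit pattern shown above.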

Jim Dempsey

Masrul · 04-29-2016

Jim,

I did not follow everything you said, but I have a query: if a computationally expensive loop contains a conditional statement (I know it might prevent optimization opportunities), can such code still take advantage of vectorization (or any other form of optimization) on KNC?

! Pseudo-code  
logical, allocatable:: check(:)
real,allocatable::x(:),y(:),z(:)
integer,parameter::natom
allocate(check(natom))
allocate(x(natom),y(natom),z(natom))
do i=1,natom-1
    do j=i+1,natom
        if(check)then
            dx=x(i)-x(j)
            dy=y(i)-y(j)
            dz=z(i)-z(j)
            !some computation will be following.............
        else 
            continue
        end if  
    end do
end do

--Masrul

TimP · 04-29-2016

any(array==0) may work the same as minval(abs(array))==0, and the latter has had simd instructions suitable for Fortran since SSE and SSE2. A sometimes important consideration is to convince the compiler not to generate a local array. Still the C and C++ compilers can't agree on how to do it. I myself filed a feature request years ago on some simple any() vectorization which has been implemented.
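To make the two spellings concrete, here is a small sketch with both side by side. The function names are invented for illustration. The two agree for arrays without NaNs; how MINVAL treats NaN elements is processor dependent, so they should not be assumed interchangeable when NaNs can occur.

```fortran
module zero_test_forms
  implicit none
contains
  ! Direct form: is any element equal to zero?
  pure logical function has_zero_any(a)
    real, intent(in) :: a(:)
    has_zero_any = any(a == 0.0)
  end function has_zero_any

  ! Reduction form: the smallest magnitude is zero.
  pure logical function has_zero_minval(a)
    real, intent(in) :: a(:)
    has_zero_minval = minval(abs(a)) == 0.0
  end function has_zero_minval
end module zero_test_forms
```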

ifort generally has good flexibility about alternate vectorizable forms for the same operation, as Jim seems to want. It's reasonable to hope that whichever form appears most readable (probably not mine!) will optimize. In my experience with gfortran, more often than not, I could find only one form which could optimize fully, and I've even found that form to change in a few cases between gfortran versions. I continue to file ifort PR cases where ifort requires specific syntax to optimize when gfortran is more friendly about accepting alternates.

In the case of conditionals, the issue of "protects exception" is to be avoided as much as possible, by not performing arithmetic operations inside a conditional, with Fortran MERGE becoming an important tool. Even when VECTOR ALWAYS or omp simd are available to assure the compiler that you don't care about handling those exceptions, the generated code tends to be inefficient when you (inadvertently?) request speculative operations.
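As an illustration of the MERGE approach, here is a sketch of a reciprocal that never divides by zero. The routine name and the choice of 0.0 as the result for zero inputs are assumptions made for the example. The division is performed unconditionally on operands that have already been made safe, so there is no exception for a conditional to protect.

```fortran
module merge_demo
  implicit none
contains
  ! r(i) = 1/x(i), or 0.0 where x(i) is zero, without a branch
  ! around the division.
  pure subroutine safe_reciprocal(x, r)
    real, intent(in)  :: x(:)
    real, intent(out) :: r(size(x))
    real :: d(size(x))
    d = merge(1.0, x, x == 0.0)        ! replace zero divisors by 1.0
    r = merge(0.0, 1.0 / d, x == 0.0)  ! division itself is unconditional
  end subroutine safe_reciprocal
end module merge_demo
```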

Masrul's example has enough discrepancy vs. valid syntax, and enough left unspecified, that I'm not certain what is intended. ifort performs many vector optimizations on block if ... endif or where() even though they may depend on evaluation of both branches or on masked move store. Either way, on the (over-simplified) face of it, the best expected gain is like 2x for 4 lanes, or only half of the expected vectorization gain for unconditional vectorizable operations. With some practice, I've been able to get Intel Parallel Advisor to display reasonable numbers about vector speedup by loop.

I'm working on an application now which has important loops containing conditionals on the value of the loop counter, so they can be vectorized better by splitting the loops to remove that conditional. Still I end up with a few ugly constructs like

#if __AVX__ || __MIC__
!$omp simd private(........)
#endif

because the private list helps the compiler eliminate some potential dependencies across loop boundaries, but the omp simd doesn't allow for the compiler to decide whether there are enough lanes to benefit from vectorization.

It may be interesting that omp simd uses private in a complementary way from omp parallel. In the latter, private is needed for correctness.
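A toy sketch of the loop-splitting idea combined with omp simd private follows. It is not taken from the application mentioned above; the routine names and the computation are invented, and the preprocessor guard is left out so the file compiles without fpp. Without OpenMP enabled, the directive is just a comment.

```fortran
module split_loop_demo
  implicit none
contains
  ! Original shape: a conditional on the loop counter inside the loop.
  pure subroutine diff_branchy(a, b)
    real, intent(in)  :: a(:)
    real, intent(out) :: b(size(a))
    integer :: i
    do i = 1, size(a)
      if (i == 1) then
        b(i) = a(i)
      else
        b(i) = 0.5 * (a(i) - a(i-1))
      end if
    end do
  end subroutine diff_branchy

  ! Split shape: the special first iteration is peeled off, and the
  ! remaining loop body is unconditional.
  subroutine diff_split(a, b)
    real, intent(in)  :: a(:)
    real, intent(out) :: b(size(a))
    real :: t
    integer :: i
    if (size(a) >= 1) b(1) = a(1)
    !$omp simd private(t)
    do i = 2, size(a)
      t = a(i) - a(i-1)   ! t is a per-iteration temporary
      b(i) = 0.5 * t
    end do
  end subroutine diff_split
end module split_loop_demo
```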

jimdempseyatthecove · 04-30-2016

Masrul,

In some situations I produce unit vectors where point A and point B may be collocated (IOW expect a divide by 0). There I make a programming decision as to whether I wish to produce a unit vector of {0.0, 0.0, 0.0} or a randomly pointed unit vector. The former can be accomplished by (pseudo code):

lengthSquared = dx**2 + dy**2 + dz**2
if(lengthSquared .eq. 0.0) lengthSquared = 1.0 ! IOW produce {0.0 / 1.0, 0.0 / 1.0, 0.0 / 1.0}
sqrtLengthSquared = sqrt(lengthSquared)
ux = dx / sqrtLengthSquared
uy = dy / sqrtLengthSquared
uz = dz / sqrtLengthSquared

The above produces a null unit vector using conditional move.

Note, I omitted array indexing for simplicity of the pseudo code.

The random unit vector substitution is a little more involved but can be done using conditional move.
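With array indexing restored, both variants can be sketched with MERGE. This is one possible arrangement, not necessarily how the original code is written: the routine name is invented, and fx, fy, fz are hypothetical caller-supplied fallback directions. Passing zeros gives the null unit vector; passing pre-generated random unit vectors gives the random substitution. How those random directions are produced is outside the sketch.

```fortran
module unit_vectors
  implicit none
contains
  pure subroutine unit_or_fallback(dx, dy, dz, fx, fy, fz, ux, uy, uz)
    real, intent(in)  :: dx(:), dy(:), dz(:)
    real, intent(in)  :: fx(:), fy(:), fz(:)   ! hypothetical fallback vectors
    real, intent(out) :: ux(size(dx)), uy(size(dx)), uz(size(dx))
    real    :: len2(size(dx)), inv(size(dx))
    logical :: degenerate(size(dx))

    len2 = dx**2 + dy**2 + dz**2
    degenerate = len2 == 0.0
    ! Substitute 1.0 for zero lengths so the division is always safe.
    inv = 1.0 / sqrt(merge(1.0, len2, degenerate))
    ! Select the fallback only where the points are collocated.
    ux = merge(fx, dx * inv, degenerate)
    uy = merge(fy, dy * inv, degenerate)
    uz = merge(fz, dz * inv, degenerate)
  end subroutine unit_or_fallback
end module unit_vectors
```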

Jim Dempsey