In the latest release of "Parallel Universe" you state that the following code is vectorized efficiently thanks to AVX-512 instructions:
nb = 0
do ia=1, na              ! line 23
   if (a(ia) > 0.) then
      nb = nb + 1        ! dependency
      b(nb) = a(ia)      ! compress
   endif
enddo
Using Fortran 90 intrinsics, one would equivalently write:
n = count( a(1:na) > 0. )
b(1:n) = pack( a(1:na), mask=( a(1:na) > 0. ) )
My question is whether your implementation of the intrinsics is always optimized better than the equivalent old-school (F77) code, and is thus to be preferred. Related to this, I read that icc and ifort share the same back-end, and hence perform similar optimizations. But the array-based syntax of F90 provides the compiler with a lot of information about data (in)dependence, which should help it optimize better. Is this correct? I am no expert, so I apologize in advance for being approximate. Thank you.
It goes without saying that "always" doesn't apply in this context. Still, it's an interesting example. The obstacle to optimizing combinations of intrinsics is the need for loop fusion to avoid repeated memory accesses. In my experience, ifort has difficulty in such contexts, while other compilers almost never accomplish it.
If the intrinsics produce only partial vectorization, that may still be an improvement, or an indication that directives such as VECTOR ALWAYS or OMP SIMD with REDUCTION and LASTPRIVATE clauses are worth trying.
Actually, it's the other way round: the compiler sometimes finds it easier to optimize explicit Fortran 77 DO loops than the loops implied by array notation. One reason is that a Fortran 90 array descriptor includes a stride, so the compiler may have to allow for a general array section with non-unit stride. Another is that the loops implied by array notation have slightly different semantics from DO loops. This may cause the compiler to generate temporary array copies to ensure that reads on the RHS are independent of writes on the LHS. Directives can be applied to guide compiler optimization of DO loops, but in many cases these cannot be applied to array notation. Finally, as Tim indicated, your code above contains two loops over A, not one. The compiler optimizer may merge simple array assignments to reduce the number of loops, improving data reuse and reducing overhead, but I'm doubtful that it could merge calls to intrinsics.
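To illustrate the temporary-copy point, here is a hypothetical sketch (names and sizes are purely illustrative) of an assignment whose RHS overlaps its LHS:

```fortran
program overlap_demo
   implicit none
   integer, parameter :: n = 8
   integer :: i
   real :: a(n), c(n)

   a = [(real(i), i = 1, n)]
   c = a

   ! Array notation: the RHS must behave as if fully evaluated
   ! before any element of the LHS is stored, so the compiler
   ! may introduce a temporary copy of a(1:n-1).
   a(2:n) = a(1:n-1) * 2.0

   ! Equivalent F77-style loop: running the index downward makes
   ! the independence explicit, and no temporary is needed.
   do i = n, 2, -1
      c(i) = c(i-1) * 2.0
   enddo

   print *, all(a == c)   ! both variants produce the same result
end program overlap_demo
```

Whether the temporary is actually materialized depends on how much the compiler can prove about the overlap at compile time.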
I haven't measured it, but I would expect the Fortran 77 code above to be faster than your intrinsic version. Intrinsics have to be general and allow for a wider variety of inputs than explicit code. It's sometimes better to call an intrinsic with a whole-array argument than with an array section, e.g. SUM(A) is more likely to get vectorized than SUM(A(1:NA)).
Intrinsics may be faster for more complex functions and operations, such as matrix multiplication, where the compiler may have an opportunity to make additional optimizations.
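As a sketch of that last point (sizes and names are illustrative), compare the MATMUL intrinsic with a naive triple loop:

```fortran
program matmul_demo
   implicit none
   integer, parameter :: n = 256
   real :: a(n,n), b(n,n), c1(n,n), c2(n,n)
   integer :: i, j, k

   call random_number(a)
   call random_number(b)

   ! Intrinsic: the compiler is free to dispatch this to a
   ! blocked, vectorized (or library) implementation.
   c1 = matmul(a, b)

   ! Naive triple loop: same result, but usually far slower
   ! unless the compiler recognizes and rewrites the pattern.
   c2 = 0.
   do j = 1, n
      do k = 1, n
         do i = 1, n
            c2(i,j) = c2(i,j) + a(i,k) * b(k,j)
         enddo
      enddo
   enddo

   print *, maxval(abs(c1 - c2))   ! small; differs only by rounding
end program matmul_demo
```

The two results agree only up to floating-point rounding, since the intrinsic may sum the products in a different order.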
You should run some tests. The first code segment (the DO loop) has an additional requirement over the second: the value of nb must be known (preserved) after the loop.
When you run your tests, try a third variant:
block
   integer :: nb
   nb = 0
   do ia=1, na              ! line 23
      if (a(ia) > 0.) then
         nb = nb + 1        ! dependency
         b(nb) = a(ia)      ! compress
      endif
   enddo
end block
In the above, nb need not be preserved (but then you won't know the extent to which b was filled, which you could still obtain with COUNT).
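A minimal timing harness for such tests might look like the following sketch (array size, data distribution, and names are illustrative assumptions, and CPU_TIME resolution may require larger arrays or repeated runs):

```fortran
program compress_test
   implicit none
   integer, parameter :: na = 10000000
   real, allocatable  :: a(:), b(:)
   integer :: ia, nb, n
   real :: t0, t1

   allocate(a(na), b(na))
   call random_number(a)
   a = a - 0.5                    ! mix of positive and negative values

   ! Variant 1: F77-style DO loop, nb preserved after the loop
   call cpu_time(t0)
   nb = 0
   do ia = 1, na
      if (a(ia) > 0.) then
         nb = nb + 1
         b(nb) = a(ia)
      endif
   enddo
   call cpu_time(t1)
   print *, 'DO loop:    ', t1 - t0, ' s, nb =', nb

   ! Variant 2: F90 intrinsics (two passes over a)
   call cpu_time(t0)
   n = count(a(1:na) > 0.)
   b(1:n) = pack(a(1:na), mask=(a(1:na) > 0.))
   call cpu_time(t1)
   print *, 'intrinsics: ', t1 - t0, ' s, n  =', n
end program compress_test
```

The BLOCK variant above can be added as a third case in the same program; comparing the generated vectorization reports (e.g. with the compiler's optimization-report options) is as informative as the raw timings.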