Intel® Fortran Compiler

Optimization question: Using temporary variables instead of direct array access

zp3
Beginner

Hi,

Lately I've seen a piece of code from a famous scientific article, which is known for its speed. In it the author always uses temporary variables to keep accesses to array elements to a minimum. For example, in an inner loop for a recurrence formula he uses:

...
real :: temp
real :: p(:)
...
do i = 2, m
    temp = foo(temp)
    ! ... some read access to 'temp' for other variables ...
    p(i) = temp
end do
...

I personally would prefer:

...
real :: p(:)
...
do i = 2, m
    p(i) = foo(p(i-1))
    ! ... some read access to 'p(i)' for other variables ...
end do
...

So therefore my question: is the first version really faster than the second one?

I myself have always believed in the compiler's ability to optimize such simple things automatically (regarding memory and register management), but now I'm a bit unsure, since the first version comes from a very established author...

Thanks for your advice!

mecej4
Honored Contributor III

Code fragments in articles are usually written to be simple, clear and readable, unless the article is about the execution speed of the fragments.

Not enough information is provided to answer. Why not get rid of p(:) and references to p? Or is p needed elsewhere in the program?

...
real :: temp
... ! initialize 'temp'
do i = 2, m
    temp = foo(temp)
    ! ... some read access to 'temp' for other variables ...
end do
...

 

TimP
Honored Contributor III

If carried to an extreme, a large number of temporary variables may ruin ifort optimization, particularly in an OpenMP context, where they must each be designated private.  In simple cases, it should make no difference.  So the policy of writing the code to be legible and maintainable ought to be satisfactory, at least until proven otherwise.
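
For illustration, a minimal sketch of what I mean (the names here are placeholders, not from your code): every hand-introduced temporary in an OpenMP loop has to appear in the private clause, or the threads will race on a single shared copy.

real :: temp1, temp2
!$omp parallel do private(temp1, temp2)
do i = 1, m
    temp1 = b(i) * scale      ! hand-introduced temporary...
    temp2 = temp1 + c(i)      ! ...and another
    a(i)  = temp1 * temp2     ! both must be per-thread copies
end do
!$omp end parallel do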

For compilations in debug mode, with optimization disabled by default, explicitly combining common references into temporaries could improve performance, so it is a way of reducing your dependence on compiler optimization.
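
E.g., a hedged sketch of what combining common references looks like (placeholder names): in a debug build each repeated x(i,j) reference compiles to its own load, which the hand-hoisted temporary avoids.

! As written: three separate loads of x(i,j) in a debug build
y(i,j) = x(i,j)**2 + a*x(i,j) + b/x(i,j)

! Combined into a temporary: one load, even without optimization
t = x(i,j)
y(i,j) = t**2 + a*t + b/t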

zp3
Beginner

Thanks for your answers,

@mecej4: Sorry, I thought it would be clear that p(:) is the resulting vector of the calculation; otherwise it would clearly make no sense to allocate memory for no reason. As the stated problem is of a recursive nature, nothing can be gained from parallelization, so yes, speed is probably an important criterion.

@Tim Prince: Of course I meant the case where optimization flags are turned on in the compiler. So your opinion matches mine: making the code more complicated merely out of the belief that the compiler may not optimize as desired is generally a bad thing to do. I just thought that in this special case the recursive access pattern p(i)=foo(p(i-1)) might prevent the compiler from automatically introducing the appropriate temporary variable 'temp'...

TimP
Honored Contributor III
Optimization of the cases you mention may be reported as scalar replacement in the compiler's optimization report (opt-report).
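
Roughly speaking (a sketch of the transformation, not actual compiler output), scalar replacement turns your second version into your first: the value carried between iterations stays in a register instead of being reloaded from p(i-1) on every pass.

! What scalar replacement effectively makes of p(i) = foo(p(i-1))
temp = p(1)
do i = 2, m
    temp = foo(temp)
    p(i) = temp
end do
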
jimdempseyatthecove
Honored Contributor III

>>As the stated problem is of a recursive nature, nothing can be gained from parallelization...

When you look for optimization opportunities, see if you can replace a recursive solution with an iterative solution, then see if the iterative solution can be parallelized. Or... you might be pleasantly surprised to find that the iterative solution is much faster than the recursive one. Case in point:

// Serial recursive method to calculate the Fibonacci series
long SerialFib( long n )
{
    if( n < 2 )
        return n;
    else
        return SerialFib(n-1) + SerialFib(n-2);
}

// Serial non-recursive method to calculate the Fibonacci series
// *** Note: several orders of magnitude faster than all
// *** recursive techniques.
long SerialFib2( long n )
{
    if( n < 2 )
        return n;
    long fib0 = 1;  // most recent
    long fib1 = 1;  // 1 prior to most recent
    long fib2 = 0;  // 2 prior to most recent
    long i;
    for( i = 2; i < n; ++i )
    {
        fib2 = fib1;
        fib1 = fib0;
        fib0 = fib1 + fib2;
    }
    return fib0;
}
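
For completeness, the same iterative technique sketched in Fortran (a minimal translation of the above, assuming the default integer kind is wide enough for the chosen n):

! Serial non-recursive Fibonacci, same technique as SerialFib2
function SerialFib2(n) result(fib0)
    implicit none
    integer, intent(in) :: n
    integer :: fib0, fib1, fib2, i
    if (n < 2) then
        fib0 = n
        return
    end if
    fib0 = 1   ! most recent
    fib1 = 1   ! 1 prior to most recent
    do i = 2, n-1
        fib2 = fib1        ! shift the recurrence window
        fib1 = fib0
        fib0 = fib1 + fib2
    end do
end function SerialFib2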

Jim Dempsey

TimP
Honored Contributor III

Fortran compilers nowadays are good at scalar replacement for such recurrences, and don't need the help of those explicit temporaries.  C++ compilers (and Fortran compilers based more closely on them) vary in this respect, with Microsoft performing better than g++ or even Intel C++, so programmers may pick up this habit from past experience with other compilers.  An intermediate possibility, seen on some compilers not so long ago, is optimization of the recursion between automatically unrolled loop iterations, which might realize most of the potential gain.

As Jim pointed out, loop-carried recursion isn't as expensive as function-call recursion, even with tail-recursion optimization.

In some cases loop-carried recursion may be used to deal with cyclic boundary conditions, which inherently require scalar temporaries, and where vectorization is possible by peeling.  Some of those cases can be handled by the Intel proprietary !dir$ simd firstprivate directive, which is not covered by OpenMP 4.0.  Nor does an attachment to Fortran 90 array syntax help with optimization, e.g.

a(:n)= (b(:n)+cshift(b(:n),1)+cshift(b(:n),2))/3.

This may be vectorized by rewriting it with explicit peeling and no scalar temporaries, but the loop-recursive version with scalar temporaries and the Intel directive is faster.  The pipelined style of compilation for Itanium didn't require peeling.
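
For concreteness, a sketch of the explicit-peeling rewrite of that statement (assuming a and b do not overlap): the main loop contains no wraparound references and vectorizes, and the two cyclic iterations are peeled off at the end.

do i = 1, n-2
    a(i) = (b(i) + b(i+1) + b(i+2)) / 3.
end do
a(n-1) = (b(n-1) + b(n) + b(1)) / 3.
a(n)   = (b(n)   + b(1) + b(2)) / 3.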
