optimization do loop

DataScientist · 09-15-2014

I am wondering whether, in the following DO loop, introducing the new INTEGER variable upper_index in place of the two identical calculations x_index(i,l) + N benefits performance. I have already timed this loop against the original, and the runtime ratio indicates a performance gain of about 1.40.

  DO i = 1,N
    upper_index = x_index(i,l)+N
    coef(i) = 0.5 * vz(x_index(i,l),l)
    g2(i) = 2. * ( a(upper_index,k,l) - a(x_index(i,l),k,l) ) / ( dz(x_index(i,l)) * (1.0 + az1(x_index(i,l)) ) )
    IF (s(x_index(i,l),l) <= p1 .OR. s(upper_index,l) <= p1) g2(i)=0.0
  END DO

This loop is nested in loops over K and L.

However, when I tested the same idea in the following simple loop, I did not see a meaningful difference in the speed of the code.

do i = 1,nloop1
  do j = 1,nloop2
    k = i+j
    variable(k) = k
  end do
end do

Which of the two observations above can I trust?

Also, I noticed that an IF-ELSE IF construct is noticeably less efficient than multiple IF statements in a DO loop. Is this observation correct? I am using the Intel Fortran compiler 2015 with the O2 optimization flag.

Thank you in advance,

TimP · 09-15-2014

It's not unusual, in my experience, that a compiler can benefit from your help in simplifying any but the simplest loops.

Particularly in cases of auto-vectorization, replacing

if (condition) then
  ...
else
  ...
end if

by

if (condition) ...
if (.not. condition) ...

may be helpful (assuming that you don't insist on short-circuit evaluation and are willing to have multiple cases evaluated speculatively). I wouldn't normally go so far as to replace an IF block with multiple IFs.
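A minimal sketch of the two forms, to check that they give the same result before comparing their speed. The array names, the test data and the condition x(i) > 0.0 are hypothetical and only serve the illustration; whether the second form actually vectorizes better depends on the compiler and options, so it is worth looking at the compiler's vectorization report rather than assuming it.

```fortran
program if_forms
  implicit none
  integer, parameter :: n = 8          ! hypothetical problem size
  real    :: x(n), y_block(n), y_split(n)
  integer :: i

  ! Hypothetical test data: alternating negative and positive values.
  do i = 1, n
    x(i) = real(i) * (-1.0)**i
  end do

  ! Form 1: a single IF/ELSE block.
  do i = 1, n
    if (x(i) > 0.0) then
      y_block(i) = 2.0 * x(i)
    else
      y_block(i) = 0.0
    end if
  end do

  ! Form 2: two one-line IFs with complementary conditions.
  do i = 1, n
    if (x(i) > 0.0)         y_split(i) = 2.0 * x(i)
    if (.not. (x(i) > 0.0)) y_split(i) = 0.0
  end do

  ! Both forms assign every element exactly once, so they should agree.
  print *, 'forms agree: ', all(y_block == y_split)
end program if_forms
```

The rewrite is only safe when both branches can be evaluated for every element without side effects or errors (for example, no division by a value that the condition was guarding against).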

mecej4 · 09-15-2014

The second loop does a lot of repeated assignments to the same locations. It could be replaced by

do i=2,nloop1+nloop2
   variable(i)=i
end do

For nloop1 = nloop2 = 100000, the original nested loop took 4.7 s of CPU time on an i7-2720. The single-loop version took less time than the resolution of the cpu_time() function (the compiler could even have done away with the loop; I did not check).
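As a quick check that the single loop really is equivalent, here is a small sketch that runs both forms and compares the arrays. The sizes and the name `reference` are hypothetical, and the sizes are kept small so the nested form finishes quickly.

```fortran
program collapse_check
  implicit none
  integer, parameter :: nloop1 = 300, nloop2 = 200   ! hypothetical sizes
  integer :: variable(nloop1 + nloop2), reference(nloop1 + nloop2)
  integer :: i, j, k

  variable  = 0
  reference = 0

  ! Nested form: nloop1*nloop2 assignments, most of them repeats.
  do i = 1, nloop1
    do j = 1, nloop2
      k = i + j
      variable(k) = k
    end do
  end do

  ! Single-loop form: each element from 2 to nloop1+nloop2 is written once.
  do i = 2, nloop1 + nloop2
    reference(i) = i
  end do

  ! Element 1 is never written by either form and stays 0 in both arrays.
  print *, 'same result: ', all(variable == reference)
end program collapse_check
```

The nested form performs nloop1*nloop2 stores against nloop1+nloop2-1 for the single loop, which is where the large timing difference comes from.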

Perhaps you could come up with more realistic sections of an algorithm that are more worthy of optimization. Secondly, compilers are so adept at optimizing these days that one should think twice before trying to help them with loop invariants by doing manual code rewrites.

DataScientist · 09-15-2014

Thanks Tim & mecej4. I think I was not clear enough in my question, and so it created some confusion. The question was to compare the performance of this code:

call cpu_time(tstart)
do i = 1,nloop1
  do j = 1,nloop2
    variable(i+j) = i+j
  end do
end do
call cpu_time(tend)
write(*,*) tend-tstart

with the following,

call cpu_time(tstart)
do i = 1,nloop1
  do j = 1,nloop2
    k = i+j
    variable(k) = k
  end do
end do
call cpu_time(tend)
write(*,*) tend-tstart

The only difference between the two is the assignment k = i + j, with k then used in place of i + j wherever it appears in the loop. So, even though both loops do repeated assignments to the same locations, they should differ in performance only where K is used in place of I+J. I did this simple test to see whether this change can benefit performance in more complicated code, like the very first snippet that I posted above.

IanH · 09-15-2014

In your latest examples, the value of `variable` is never used. If the compiler is clever enough (and often with these sorts of examples ... it is) it will recognise that and then decide that it doesn't actually need to execute the assignment to `variable`. It may then realize that the loops aren't doing anything useful, and with progressive analysis replace them with simple assignments to i, j [and k]. It may then recognize that i, j and k aren't being used, and get rid of them too. All up, your code could end up being...

call cpu_time(tstart)
call cpu_time(tend)
write (*,*) tend-tstart

which is possibly not all that interesting to test.
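One common way around this, sketched below under the assumption that a checksum of the array is an acceptable stand-in for "using" the result, is to print something derived from `variable` after the timed region. The loop sizes are hypothetical. This only keeps the final stores alive; the compiler is still free to simplify the loops themselves, so the timing says nothing certain about the unoptimized loop structure.

```fortran
program timing_sketch
  implicit none
  integer, parameter :: nloop1 = 20000, nloop2 = 20000   ! hypothetical sizes
  integer :: variable(nloop1 + nloop2)
  integer :: i, j, k
  real    :: tstart, tend

  variable = 0

  call cpu_time(tstart)
  do i = 1, nloop1
    do j = 1, nloop2
      k = i + j
      variable(k) = k
    end do
  end do
  call cpu_time(tend)

  ! Printing a value derived from the array makes the stores observable,
  ! so the assignments cannot all be discarded as dead code.
  print *, 'elapsed (s): ', tend - tstart
  print *, 'checksum:    ', sum(variable)
end program timing_sketch
```

Swapping the loop body for `variable(i+j) = i+j` and comparing the two timings over several runs gives a like-for-like comparison of the two variants.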

As stated elsewhere, different code will result in different optimization outcomes. From an outsider's point of view it is very difficult to predict what the compiler's optimizer will do, beyond simple cases, and what it does may change with version and other compile options. So if a bit of code is so important that you care deeply that it be optimized most effectively, you need to measure that specific code, in a manner that tests that specific code in a realistic fashion. In your opening post you say you've done that measurement... so you know the answer to your question.

But I would be getting pretty desperate to be embarking on hand optimization like this. As a very good general rule, write the code that is easiest to write/easiest to understand/easiest to maintain, and make optimization a problem for the optimizer. Is your code easier to write/understand/maintain with the intermediate variable? That's your call.

John_Campbell · 09-15-2014

You should look again at mecej4's comment, as his approach reduces the number of loop trips in your second loop by a factor on the order of N.

I am suggesting a change to your first loop, only to make it a bit clearer to read. (Most, if not all, optimising compilers would do this anyway.) I also changed the IF test so that g2 is only calculated when the test is false. Without knowing the probability of the IF test being true, I don't know whether this is a significant improvement or whether it may hinder auto-vectorisation.

  DO i = 1,N
    ix      = x_index(i,l)
    coef(i) = 0.5 * vz(ix,l)
    IF (s(ix,  l) <= p1 .OR.   &
        s(ix+N,l) <= p1) then
       g2(i) = 0.0
    else
       g2(i) = 2.0 * ( a(ix+n,k,l) - a(ix,k,l) )     &
             / ( dz(ix) * (1.0 + az1(ix) ) )
    end if
  END DO

I am not sure if Tim's comment on auto-vectorisation applies, as I am not sure that multiple calculations of g2 could be vectorised, given that ix can vary for each cycle of DO I. If the g2 calculation can be vectorised and there is a low probability that the IF test is true, then any coding that assists the vectorisation would help.

John