Faster loops

shukur · ‎09-16-2008

Hi all,

I need a little help with my Fortran code.
The code itself is very simple, but the loops take a long time.
I'd be grateful if anybody could suggest ways to reorganize the loops to make them faster.

I use the Intel compiler, but the coding is mostly F77-style.
Below is the subroutine code that is very slow. I am really interested in whether it can be made faster by changing the order of the loops, using other kinds of loops, or using features of Intel Fortran.

do i=1,nth1/2
  th0=th(i)
  i2=nth1-i+1
  call do_cur_plm(th0,lmax1,dummy_plm)
  do l=0,lmax1
    is=1
    if(real(l/2.)-int(l/2).ne.0.) is=-1
    do m=0,l
      m2=m+1
      th1=dummy_plm(l,m)
      th2=th1*is
      is=-is
      do min=0,59
        k=min+1
        z1=ft_images(k,i,m2)*th1
        z2=ft_images(k,i2,m2)*th2
        ss(min,l,m)=ss(min,l,m)+z1+z2
      enddo
    enddo
  enddo
enddo

TimP · ‎09-16-2008

Nothing you show here should prevent optimization of the inner loop with normal compiler options (e.g. -xW, which is default for 64-bit ifort and for ifort 11.0). When you set -opt-report, what does the compiler say?
You might get better cache behavior if you arranged the arrays so that the next-to-innermost loop (the m loop) doesn't increment the last subscript, but you don't give enough information to show that.
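To illustrate the layout point, here is a minimal sketch. Fortran stores arrays in column-major order, so the last subscript has the largest stride in memory. The array sizes, the names ss_ml, f and w, and the swapped (min, m, l) index order are all assumptions for illustration, not taken from the original code. Whether the change actually helps depends on the real array sizes, which aren't shown.

```fortran
program layout_sketch
  implicit none
  integer, parameter :: nmin = 60, lmax = 4   ! small, hypothetical sizes
  real :: ss(0:nmin-1, 0:lmax, 0:lmax)        ! original order: (min, l, m)
  real :: ss_ml(0:nmin-1, 0:lmax, 0:lmax)     ! assumed alternative: (min, m, l)
  real :: f(nmin, 0:lmax)                     ! stand-in for ft_images(:, i, m+1)
  real :: w
  integer :: l, m

  call random_number(f)
  ss = 0.0
  ss_ml = 0.0

  do l = 0, lmax
     do m = 0, l
        w = real(l + 1) / real(m + 1)         ! stand-in for dummy_plm(l, m)
        ! Original layout: each step in m jumps nmin*(lmax+1) elements.
        ss(:, l, m) = ss(:, l, m) + f(:, m) * w
        ! Alternative layout: each step in m jumps only nmin elements.
        ss_ml(:, m, l) = ss_ml(:, m, l) + f(:, m) * w
     end do
  end do

  ! Both layouts hold the same numbers, just at different positions.
  do l = 0, lmax
     do m = 0, l
        if (any(ss(:, l, m) /= ss_ml(:, m, l))) error stop "layouts disagree"
     end do
  end do
  print *, "layouts agree"
end program layout_sketch
```

With the alternative order, consecutive m iterations touch adjacent 60-element columns, so the next-to-innermost loop walks memory contiguously.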
Your scheme for initializing the variable "is" alternately to 1 or -1 is too complicated, but you don't show enough to guess whether that is a problem. Even if you simplify it, I don't see how you could permit the compiler to optimize by swapping the l and m loops.
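As one possible simplification of the sign logic: in the posted loops, "is" starts at +1 for even l and -1 for odd l, then flips on every step in m, so its value is +1 when l+m is even and -1 when l+m is odd. The sketch below checks a closed form against the original scheme. The name is_simple and the bound lmax = 8 are assumptions made only for this check.

```fortran
program sign_sketch
  implicit none
  integer, parameter :: lmax = 8   ! hypothetical bound for the check
  integer :: l, m, is, is_simple

  do l = 0, lmax
     ! Original scheme from the question: +1 for even l, -1 for odd l ...
     is = 1
     if (real(l/2.) - int(l/2) /= 0.) is = -1
     do m = 0, l
        ! Closed form: +1 when l+m is even, -1 when l+m is odd.
        is_simple = 1 - 2*mod(l + m, 2)
        if (is /= is_simple) error stop "sign mismatch"
        is = -is   ! ... and flipped on every step in m
     end do
  end do
  print *, "signs match"
end program sign_sketch
```

The closed form needs no floating-point test and does not carry a value from one m iteration to the next.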

jimdempseyatthecove · ‎09-17-2008

Shukur,

When you check the code (Disassembly window), is the do min=0,59 loop vectorized, and is it unrolled to some degree?

If vectorization and some level of unrolling are not both (completely) present, try helping the compiler out by removing the temporary variables.

do min=0,59
 ss(min,l,m)= ss(min,l,m) &
 & + ft_images(min+1,i,m2)*th1 &
 & + ft_images(min+1,i2,m2)*th2
enddo

Granted, the compiler's optimizer should be able to figure this out for you, but if you use k, z1, or z2 outside the loop (e.g., in a later loop), the compiler might not eliminate the temporaries from the generated code.
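The same update can also be written with array sections instead of an explicit loop. This is only a sketch of an equivalent formulation, not something tested on the original code: fa and fb are hypothetical stand-ins for ft_images(:,i,m2) and ft_images(:,i2,m2), and the th1 and th2 values are made up.

```fortran
program section_sketch
  implicit none
  real :: ss_loop(0:59), ss_sec(0:59)
  real :: fa(60), fb(60)   ! stand-ins for ft_images(:,i,m2) and ft_images(:,i2,m2)
  real :: th1, th2
  integer :: k

  call random_number(fa)
  call random_number(fb)
  th1 = 0.75
  th2 = -0.75
  ss_loop = 1.0
  ss_sec = 1.0

  ! Loop form without temporaries, as in the post above.
  do k = 0, 59
     ss_loop(k) = ss_loop(k) + fa(k+1)*th1 + fb(k+1)*th2
  end do

  ! Array-section form: one statement over all 60 elements.
  ss_sec(0:59) = ss_sec(0:59) + fa(1:60)*th1 + fb(1:60)*th2

  if (any(abs(ss_loop - ss_sec) > 1.0e-6)) error stop "results differ"
  print *, "results agree"
end program section_sketch
```

Whether the section form compiles to faster code than the loop depends on the compiler and options; it mainly makes the absence of loop-carried dependencies explicit.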

Jim Dempsey

shukur · ‎09-17-2008

Thanks Jim,

After the changes you suggested, it is about 5-10% faster.

I am using
-g -w -c -O4
options to compile. Are there other options that might help?

Sorry for the stupid question; I am really a beginner.

Thank you in advance.

jimdempseyatthecove · ‎09-18-2008

Shukur,

There are no stupid questions - only stupid answers...

Now for your next lesson in optimization. Experiment with the following:

! insert this in your main code
! up where you declare variables

interface
subroutine do_min(ss, ft_images1, th1, ft_images2, th2)
 ! dummy names must match the argument list
 real :: ss(0:59), ft_images1(0:59), ft_images2(0:59)
 real :: th1, th2
end subroutine do_min
end interface
...
! replace the do min=0,59 loop with

 call do_min(ss(0:59,l,m), ft_images(1:60,i,m2), th1, ft_images(1:60,i2,m2), th2)


...
! create new subroutine (stick at bottom of source file with main code)

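subroutine do_min(ss, ft_images1, th1, ft_images2, th2)
 ! ss and both ft_images slices arrive as contiguous 60-element vectors
 real :: ss(0:59), ft_images1(0:59), ft_images2(0:59)
 real :: th1, th2
 do min=0,59
 ! ss is rank 1 here, so it is indexed by min only
 ss(min)= ss(min) &
 & + ft_images1(min)*th1 &
 & + ft_images2(min)*th2
 enddo
end subroutine do_min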
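
For experimenting with this outside the full application, a self-contained sketch could look like the following. It places do_min in a module so that no separate interface block is needed, which is a swapped-in choice rather than what the post above describes. The module name, the array sizes, the index values, and the th1 and th2 values are all assumptions for illustration.

```fortran
module do_min_mod
  implicit none
contains
  subroutine do_min(ss, ft_images1, th1, ft_images2, th2)
    real, intent(inout) :: ss(0:59)
    real, intent(in)    :: ft_images1(0:59), ft_images2(0:59)
    real, intent(in)    :: th1, th2
    integer :: min
    do min = 0, 59
       ss(min) = ss(min) + ft_images1(min)*th1 + ft_images2(min)*th2
    end do
  end subroutine do_min
end module do_min_mod

program do_min_sketch
  use do_min_mod
  implicit none
  integer, parameter :: nth = 4, nm = 3   ! hypothetical sizes
  real :: ft_images(60, nth, nm), ss(0:59, 0:2, 0:2), expected(0:59)
  real :: th1, th2
  integer :: i, i2, l, m, m2

  call random_number(ft_images)
  ss = 0.0
  i = 1; i2 = nth; l = 2; m = 1; m2 = m + 1   ! made-up index values
  th1 = 0.5; th2 = -0.5                       ! made-up weights

  ! Reference result computed with whole-column array expressions.
  expected = ss(:, l, m) + ft_images(:, i, m2)*th1 + ft_images(:, i2, m2)*th2

  ! Same call shape as in the post: contiguous first-dimension sections.
  call do_min(ss(0:59, l, m), ft_images(1:60, i, m2), th1, &
              ft_images(1:60, i2, m2), th2)

  if (any(abs(ss(:, l, m) - expected) > 1.0e-6)) error stop "do_min mismatch"
  print *, "do_min ok"
end program do_min_sketch
```

Because each actual argument is a contiguous section along the first dimension, it can be associated with the explicit-shape dummy arrays without the compiler having to make a copy.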

Jim Dempsey