Intel® Fortran Compiler

Faster loops

shukur
Beginner

Hi all,

I need a little help with my Fortran code.
The code itself is very simple, but the loops take a long time.
I'll be grateful if anybody can give some ideas on how to reorganize the loops to make them faster.

I use the Intel compiler, but the coding is mostly F77-style.
Below is the subroutine that is very slow. I am really interested in whether it can be made faster by changing the order of the loops, using other types of loops, or using capabilities of Intel Fortran.


do i=1,nth1/2
  th0=th(i)
  i2=nth1-i+1
  call do_cur_plm(th0,lmax1,dummy_plm)
  do l=0,lmax1
    is=1
    if(real(l/2.)-int(l/2).ne.0.) is=-1
    do m=0,l
      m2=m+1
      th1=dummy_plm(l,m)
      th2=th1*is
      is=-is
      do min=0,59
        k=min+1
        z1=ft_images(k,i,m2)*th1
        z2=ft_images(k,i2,m2)*th2
        ss(min,l,m)=ss(min,l,m)+z1+z2
      enddo
    enddo
  enddo
enddo



 

TimP
Honored Contributor III
Nothing you show here should prevent optimization of the inner loop with normal compiler options (e.g. -xW, which is default for 64-bit ifort and for ifort 11.0). When you set -opt-report, what does the compiler say?
You might get an improvement in cache behavior if you arranged the arrays so that the next-to-innermost loop doesn't increment the last subscript, but you don't give enough information to show that.
Your scheme for initializing "is" alternately to 1 or -1 is too complicated, but you don't show enough to guess whether that is a problem. Even if you simplify it, I don't see that you could permit the compiler to optimize by swapping the l and m loops.
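
As an illustration of the simplification mentioned above: in the posted loop, "is" starts at +1 for even l and -1 for odd l and then flips once per m, so the factor applied to th1 works out to (-1)**(l+m). The sketch below checks that closed form against the original scheme for small l. The names is_orig and is_simple are made up for the example, and whether this changes run time at all is not something the thread establishes.

```fortran
program parity_sign_check
  implicit none
  integer :: l, m, is_orig, is_simple

  do l = 0, 8
    ! Original scheme from the question: +1 for even l, -1 for odd l
    is_orig = 1
    if (real(l/2.) - int(l/2) .ne. 0.) is_orig = -1

    do m = 0, l
      ! Closed form (hypothetical replacement): (-1)**(l+m), no running state
      is_simple = 1 - 2*mod(l + m, 2)
      if (is_orig /= is_simple) error stop "sign mismatch"
      is_orig = -is_orig   ! the original code flips the sign after each m
    end do
  end do

  print *, "all signs match"
end program parity_sign_check
```

With the closed form, th2 could be written as th1*(1 - 2*mod(l+m,2)) without carrying "is" from one m iteration to the next.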
jimdempseyatthecove
Honored Contributor III

Shukur,

When you check the code (Disassembly window), is the do min=0,59 loop vectorized? Is it also unrolled to some degree?

If vectorization and some level of unrolling are not both (completely) present, then try helping the compiler out by removing the temporary variables.

do min=0,59
  ss(min,l,m) = ss(min,l,m) &
    & + ft_images(min+1,i,m2)*th1 &
    & + ft_images(min+1,i2,m2)*th2
enddo

Granted, the compiler's optimizer should be able to figure this out for you, but if you use k, z1, z2 outside the loop (e.g. in a later loop), then the compiler might not optimize the temporaries out of the generated code.

Jim Dempsey

shukur
Beginner
Thanks Jim,

After the changes you mentioned it is about 5-10% faster.

I am using
-g -w -c -O4
options to compile. Are there other options that might help?

Sorry for the stupid question, I am really a beginner.

Thank you in advance.
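
A speedup figure like the 5-10% above is easier to trust when the loop is timed in isolation. The following is only a sketch of one way to do that with the standard system_clock intrinsic; the arrays a and b, the weights, and the repeat count nrep are placeholders and not taken from the original program.

```fortran
program time_loop
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 60, nrep = 200000   ! nrep: arbitrary repeat count
  real :: ss(0:n-1), a(n), b(n)
  real :: th1, th2
  integer(int64) :: t0, t1, rate
  integer :: k, rep

  call random_number(a)
  call random_number(b)
  ss = 0.0
  th1 = 1.0e-6    ! placeholder weights
  th2 = -0.5e-6

  call system_clock(t0, rate)
  if (rate <= 0) error stop "no system clock available"

  do rep = 1, nrep
    do k = 0, n-1
      ss(k) = ss(k) + a(k+1)*th1 + b(k+1)*th2
    end do
  end do

  call system_clock(t1)

  ! Print a result so the compiler cannot discard the loop as dead code
  print *, "checksum:", sum(ss)
  print *, "elapsed seconds:", real(t1 - t0) / real(rate)
end program time_loop
```

Timing each loop variant this way several times, with the same compiler options, gives a rough basis for comparison.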


jimdempseyatthecove
Honored Contributor III

Shukur,

There are no stupid questions - only stupid answers...

Now for your next lesson in optimization. Experiment with the following

! insert this in your main code
! up where you declare variables
interface
  subroutine do_min(ss, ft_images1, th1, ft_images2, th2)
    real :: ss(0:59), ft_images1(0:59), ft_images2(0:59)
    real :: th1, th2
  end subroutine do_min
end interface
...
! replace the "do min=0,59" loop with
call do_min(ss(0:59,l,m), ft_images(1:60,i,m2), th1, ft_images(1:60,i2,m2), th2)

...
! create new subroutine (stick at bottom of source file with main code)
subroutine do_min(ss, ft_images1, th1, ft_images2, th2)
  ! each argument is one contiguous 60-element column of the caller's array
  real :: ss(0:59), ft_images1(0:59), ft_images2(0:59)
  real :: th1, th2
  integer :: min
  do min=0,59
    ss(min) = ss(min) &
      & + ft_images1(min)*th1 &
      & + ft_images2(min)*th2
  enddo
end subroutine do_min
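
For anyone who wants to try the idea in isolation, here is a minimal self-contained sketch that compares the helper against the inlined loop. It makes several assumptions not taken from the thread: the arrays are real, the sizes nth and lmax are made up, the weights are arbitrary, and do_min is written as an internal procedure (after "contains") so that everything compiles as a single file without a separate interface block. The loop index is named k rather than min only to avoid shadowing the MIN intrinsic.

```fortran
program do_min_demo
  implicit none
  integer, parameter :: nmin = 60, nth = 4, lmax = 3   ! nth, lmax: made-up sizes
  real :: ft_images(nmin, nth, lmax+1)
  real :: ss_ref(0:nmin-1, 0:lmax, 0:lmax), ss_new(0:nmin-1, 0:lmax, 0:lmax)
  real :: th1, th2
  integer :: i, i2, l, m, m2, k

  call random_number(ft_images)
  ss_ref = 0.0
  ss_new = 0.0
  th1 = 0.75      ! arbitrary test weights
  th2 = -0.75
  i = 1
  i2 = nth - i + 1

  do l = 0, lmax
    do m = 0, l
      m2 = m + 1
      ! reference: the inlined loop without temporaries
      do k = 0, nmin-1
        ss_ref(k,l,m) = ss_ref(k,l,m) + ft_images(k+1,i,m2)*th1 &
                                      + ft_images(k+1,i2,m2)*th2
      end do
      ! same update through the helper; each section is a contiguous column
      call do_min(ss_new(0:nmin-1,l,m), ft_images(1:nmin,i,m2), th1, &
                  ft_images(1:nmin,i2,m2), th2)
    end do
  end do

  if (maxval(abs(ss_new - ss_ref)) > 1.0e-6) error stop "results differ"
  print *, "helper matches inlined loop"

contains

  subroutine do_min(ss, ft_images1, th1, ft_images2, th2)
    real, intent(inout) :: ss(0:59)
    real, intent(in)    :: ft_images1(0:59), ft_images2(0:59)
    real, intent(in)    :: th1, th2
    integer :: k
    do k = 0, 59
      ss(k) = ss(k) + ft_images1(k)*th1 + ft_images2(k)*th2
    end do
  end subroutine do_min

end program do_min_demo
```

Inside the helper the compiler sees three one-dimensional arrays with fixed bounds, which is the simplification the suggestion above is aiming for. Whether it actually runs faster than the inlined loop would have to be measured.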

Jim Dempsey
