Intel® Fortran Compiler

Inline array syntax speedup or slowdown?

umar
Beginner
I have eliminated all the major loops in my program in favor of the Fortran 90
inline array syntax, e.g.

do is = 1, 2
pout(:,:,:,is) = pout(:,:,:,is) - dbmasq(:,:,:,2)*pswk(:,:,:,is)
end do

where pout and pswk are complex. Previously, I had loops over ix, iy, iz (there are quite a
few such loops, around a hundred). After doing this I actually see a slowdown. The test run I was
using went from 23 minutes to 27 minutes (I repeated both runs many times with nothing
else running). I presumed that this conversion would result in some speedup, since the compiler
could convert these assignments to more efficient loops. Is my thinking wrong? Is there a better way
to optimize the above loop?
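
For reference, the explicit loops I replaced had this general form (a sketch of one of them):

do is = 1, 2
   do iz = 1, ncolz
      do iy = 1, ncoly
         do ix = 1, ncolx
            pout(ix,iy,iz,is) = pout(ix,iy,iz,is) - dbmasq(ix,iy,iz,2)*pswk(ix,iy,iz,is)
         end do
      end do
   end do
end do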

Thanks,
Sammy
TimP
Honored Contributor III
Comparison of the vec-report outputs should help locate differences resulting from the change of syntax. In my experience, the most common performance problem would be the case where you have multiple assignments in a single DO loop which is vectorized effectively, but when you split the loop into f90 array syntax assignments, fusion (or lack of it) doesn't produce as efficient code. There are also bugs in recent compilers where DOT_PRODUCT and the like don't vectorize when they should.
In the loop you have quoted, my guess would be that you should write the inner loop as a rank 1 stride 1 assignment
pout(:,i,j,is) = pout(:,i,j,is) - dbmasq(:,i,j,2)*pswk(:,i,j,is)
in case the compiler hasn't sorted out on which subscript it can vectorize efficiently. I doubt that a conclusive answer could be given without at least the vec-report2 result.
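
That is, something like the following (a sketch, with i and j as explicit loops over the second and third dimensions):

do is = 1, 2
   do j = 1, ncolz
      do i = 1, ncoly
         pout(:,i,j,is) = pout(:,i,j,is) - dbmasq(:,i,j,2)*pswk(:,i,j,is)
      end do
   end do
end do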
mriedman
Novice
The best you can possibly hope for from array syntax is that it is as fast as an explicit loop. I have never seen it be faster, but I have often seen it be slower.

One side issue of array syntax arises when you pass non-contiguous array chunks to subroutines. It looks good in the source code but causes the compiler to pass the data through hidden temporary arrays, with the associated copy overhead.

So for critical code sections I tend to avoid array syntax, whereas in non-critical sections it certainly improves readability.
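
A minimal sketch of the effect (old_work and the arrays here are hypothetical; old_work has an explicit-shape, F77-style dummy argument, which is the case where the copy is forced):

program copy_demo
   implicit none
   real :: a(100,100)
   a = 0.0
   call old_work(a(:,1), 100)   ! contiguous column: passed directly, no copy
   call old_work(a(1,:), 100)   ! strided row: hidden temporary, copy-in/copy-out
contains
   subroutine old_work(v, n)    ! hypothetical routine with an explicit-shape dummy
      integer, intent(in) :: n
      real, intent(inout) :: v(n)
      v = v + 1.0
   end subroutine old_work
end program copy_demo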
TimP
Honored Contributor III
Quoting mriedman

One side issue of array syntax arises when you pass non-contiguous array chunks to subroutines. It looks good in the source code but causes the compiler to pass the data through hidden temporary arrays, with the associated copy overhead.

That's a good point. ifort uses a temporary array not only for non-contiguous data, but whenever it fails to resolve issues of possible overlap between the right- and left-hand sides, even in cases where most other vectorizing Fortran compilers find no difficulty. Even matching arrays on both sides might be a problem with multiple-rank assignments.
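
A minimal sketch of the kind of assignment where that question arises (a and b are hypothetical arrays):

real :: a(1000), b(1000)
! The right-hand side references the same array as the left-hand side,
! shifted by one element.  The compiler must either prove it can evaluate
! the elements in a safe order or build a temporary copy of the right-hand
! side first; when it cannot resolve this, ifort falls back to the temporary.
a(2:1000) = a(1:999) + b(2:1000)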
Tim_Gallagher
New Contributor II
For what it's worth, our lab tried the same thing (removing explicit loops) and found the result to be much slower, even where no temporary arrays were being made. This was true with every compiler we tested, except the older-style Cray vector machines.

Tim
umar
Beginner
Wouldn't it then make sense for compiler writers to work on "fixing" this problem? Since Fortran
now has this syntax as part of the standard, people should not be forced to write loops where they
could perfectly well use this shorthand. At the very least it should match the timing of a
well-written looped version; otherwise there should be a speedup.

I have used v77to90 from VAST to translate loops into inline syntax, so if that translation can
be done, the reverse should be possible too. In my opinion this is a bug.
TimP
Honored Contributor III
Many of us have spent a lot of time advocating improvements in the performance of f90 array syntax. ifort has special directives which may be useful and which apply to array assignments as well as DO loops, such as
!dir$ vector aligned
(asserts the assignment begins with aligned data)
and
!dir$ distribute point
to prevent fusion of array assignments which involve incompatible alignments.
The usual problem is getting the compiler to recognize opportunities for useful fusion of array assignments. You must set -O3 to tell it to attempt this, and add distribute point directives where you want to prevent it.
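
Roughly like this (a sketch with placeholder arrays a through e; each directive applies to the assignment that follows it):

!dir$ vector aligned
a = a + c*b      ! asserts that a, b and c all start on aligned boundaries
!dir$ distribute point
d = d + c*e      ! keep this assignment as a separate loop rather than fusing it with the one above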
umar
Beginner
I am not an expert on alignment, but I do use -O3 -xHost when compiling. I am attaching the
subroutine that takes the most time. Where would you put those directives?
Thanks


subroutine hpsi(ncolx, ncoly, ncolz, upotq, bmassq, xkinpq, hsigmaq, cqq, dcqq, bmunuq,&
dbmuq, dbmasq, dxknpq, der1x, der1y, der1z, der2x, der2y, der2z, eshift,&
pinn, pout, pswk, itimrev)
!-----------------------------------------------------------------------
! hpsi: program to form the product of h on the
! state vector pinn and return the result in pout
!-----------------------------------------------------------------------
implicit none
integer, parameter :: wp = kind(1.0D0)
!-----------------------------------------------
! A r g u m e n t s
!-----------------------------------------------
integer , intent(in) :: ncolx
integer , intent(in) :: ncoly
integer , intent(in) :: ncolz
integer , intent(in) :: itimrev
real(wp) , intent(in) :: eshift
real(wp) , intent(in) :: upotq(ncolx,ncoly,ncolz)
real(wp) , intent(in) :: bmassq(ncolx,ncoly,ncolz)
real(wp) , intent(in) :: xkinpq(ncolx,ncoly,ncolz,3)
real(wp) , intent(in) :: hsigmaq(ncolx,ncoly,ncolz,3)
real(wp) , intent(in) :: cqq(ncolx,ncoly,ncolz,3)
real(wp) , intent(in) :: dcqq(ncolx,ncoly,ncolz,3,3)
real(wp) , intent(in) :: bmunuq(ncolx,ncoly,ncolz,3,3)
real(wp) , intent(in) :: dbmuq(ncolx,ncoly,ncolz,3)
real(wp) , intent(in) :: dbmasq(ncolx,ncoly,ncolz,3)
real(wp) , intent(in) :: dxknpq(ncolx,ncoly,ncolz)
real(wp) , intent(in) :: der1x(ncolx,ncolx)
real(wp) , intent(in) :: der1y(ncoly,ncoly)
real(wp) , intent(in) :: der1z(ncolz,ncolz)
real(wp) , intent(in) :: der2x(ncolx,ncolx)
real(wp) , intent(in) :: der2y(ncoly,ncoly)
real(wp) , intent(in) :: der2z(ncolz,ncolz)
complex(wp) , intent(inout) :: pout(ncolx,ncoly,ncolz,2)
complex(wp) :: pinn(ncolx,ncoly,ncolz,2)
complex(wp) :: pswk(ncolx,ncoly,ncolz,2)
!-----------------------------------------------
! L o c a l P a r a m e t e r s
!-----------------------------------------------
complex(wp), parameter :: eye = (0.0_wp,1.0_wp)
!-----------------------------------------------------------------------
! diagonal part of h times psi (U_q part)
!-----------------------------------------------------------------------
pout(:,:,:,1) = (upotq(:,:,:) - eshift)*pinn(:,:,:,1)
pout(:,:,:,2) = (upotq(:,:,:) - eshift)*pinn(:,:,:,2)
! diagonal part of I_q term ( -i/2[Del.I_q]psi ) (time-odd)
if(itimrev == 0) then
pout(:,:,:,1) = pout(:,:,:,1) - eye/2.0_wp*dxknpq(:,:,:)*pinn(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) - eye/2.0_wp*dxknpq(:,:,:)*pinn(:,:,:,2)
! this is the big sigma_q term (time-odd)
pout(:,:,:,1) = pout(:,:,:,1) + (hsigmaq(:,:,:,1)*pinn(:,:,:,2) - &
eye*hsigmaq(:,:,:,2)*pinn(:,:,:,2) + &
hsigmaq(:,:,:,3)*pinn(:,:,:,1))
pout(:,:,:,2) = pout(:,:,:,2) + (hsigmaq(:,:,:,1)*pinn(:,:,:,1) + &
eye*hsigmaq(:,:,:,2)*pinn(:,:,:,1) - &
hsigmaq(:,:,:,3)*pinn(:,:,:,2))
endif
! this is the -i/2[Del.B_q]psi (tensor) part of the h*psi
pout(:,:,:,1) = pout(:,:,:,1) - eye/2.0_wp*(dbmuq(:,:,:,1)*pinn(:,:,:,2) - &
eye*dbmuq(:,:,:,2)*pinn(:,:,:,2) + &
dbmuq(:,:,:,3)*pinn(:,:,:,1))
pout(:,:,:,2) = pout(:,:,:,2) - eye/2.0_wp*(dbmuq(:,:,:,1)*pinn(:,:,:,1) + &
eye*dbmuq(:,:,:,2)*pinn(:,:,:,1) - &
dbmuq(:,:,:,3)*pinn(:,:,:,2))
!-----------------------------------------------------------------------
! terms which require d/dx psi
!-----------------------------------------------------------------------
call cmulx (ncolx, ncoly, ncolz, der1x, pinn, pswk, 0)
! part of the effective mass term
pout(:,:,:,1) = pout(:,:,:,1) - dbmasq(:,:,:,1)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) - dbmasq(:,:,:,1)*pswk(:,:,:,2)
! the other part of the I_q term ( -eye*I_q*Dpsi term ) (time-odd)
if(itimrev == 0) then
pout(:,:,:,1) = pout(:,:,:,1) - eye*xkinpq(:,:,:,1)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) - eye*xkinpq(:,:,:,1)*pswk(:,:,:,2)
endif
! this is the -eye*B_q.Dsigma (tensor) part of h*psi
pout(:,:,:,1) = pout(:,:,:,1) - eye*(bmunuq(:,:,:,1,1)*pswk(:,:,:,2) - &
eye*bmunuq(:,:,:,1,2)*pswk(:,:,:,2) + &
bmunuq(:,:,:,1,3)*pswk(:,:,:,1))
pout(:,:,:,2) = pout(:,:,:,2) - eye*(bmunuq(:,:,:,1,1)*pswk(:,:,:,1) + &
eye*bmunuq(:,:,:,1,2)*pswk(:,:,:,1) - &
bmunuq(:,:,:,1,3)*pswk(:,:,:,2))
! this is the [Del.sigma.C_q]Del psi part of the C_q term (time-odd)
if(itimrev == 0) then
pout(:,:,:,1) = pout(:,:,:,1) + (dcqq(:,:,:,1,1) - eye*dcqq(:,:,:,1,2))*pswk(:,:,:,2) + &
dcqq(:,:,:,1,3)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) + (dcqq(:,:,:,1,1) + eye*dcqq(:,:,:,1,2))*pswk(:,:,:,1) - &
dcqq(:,:,:,1,3)*pswk(:,:,:,2)
endif
!-----------------------------------------------------------------------
! terms which require d/dy psi
!-----------------------------------------------------------------------
call cmuly (ncolx, ncoly, ncolz, der1y, pinn, pswk, 0)
! part of the effective mass term
pout(:,:,:,1) = pout(:,:,:,1) - dbmasq(:,:,:,2)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) - dbmasq(:,:,:,2)*pswk(:,:,:,2)
! part of -eye*I_q*Dpsi term (time-odd)
if(itimrev == 0) then
pout(:,:,:,1) = pout(:,:,:,1) - eye*xkinpq(:,:,:,2)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) - eye*xkinpq(:,:,:,2)*pswk(:,:,:,2)
endif
! this is the -eye*B_q.Dsigma (tensor) part of h*psi
pout(:,:,:,1) = pout(:,:,:,1) - eye*(bmunuq(:,:,:,2,1)*pswk(:,:,:,2) - &
eye*bmunuq(:,:,:,2,2)*pswk(:,:,:,2) + &
bmunuq(:,:,:,2,3)*pswk(:,:,:,1))
pout(:,:,:,2) = pout(:,:,:,2) - eye*(bmunuq(:,:,:,2,1)*pswk(:,:,:,1) + &
eye*bmunuq(:,:,:,2,2)*pswk(:,:,:,1) - &
bmunuq(:,:,:,2,3)*pswk(:,:,:,2))
! this is the [Del.sigma.C_q]Del psi part of the C_q term (time-odd)
if(itimrev == 0) then
pout(:,:,:,1) = pout(:,:,:,1) + (dcqq(:,:,:,2,1) - eye*dcqq(:,:,:,2,2))*pswk(:,:,:,2) + &
dcqq(:,:,:,2,3)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) + (dcqq(:,:,:,2,1) + eye*dcqq(:,:,:,2,2))*pswk(:,:,:,1) - &
dcqq(:,:,:,2,3)*pswk(:,:,:,2)
endif
!-----------------------------------------------------------------------
! terms which require d/dz psi
!-----------------------------------------------------------------------
call cmulz (ncolx, ncoly, ncolz, der1z, pinn, pswk, 0)
! part of the effective mass term
pout(:,:,:,1) = pout(:,:,:,1) - dbmasq(:,:,:,3)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) - dbmasq(:,:,:,3)*pswk(:,:,:,2)
! part of the -eye*I_q*Dpsi term (time-odd)
if(itimrev == 0) then
pout(:,:,:,1) = pout(:,:,:,1) - eye*xkinpq(:,:,:,3)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) - eye*xkinpq(:,:,:,3)*pswk(:,:,:,2)
endif
! this is the -eye*B_q.Dsigma (tensor) part of h*psi
pout(:,:,:,1) = pout(:,:,:,1) - eye*(bmunuq(:,:,:,3,1)*pswk(:,:,:,2) - &
eye*bmunuq(:,:,:,3,2)*pswk(:,:,:,2) + &
bmunuq(:,:,:,3,3)*pswk(:,:,:,1))
pout(:,:,:,2) = pout(:,:,:,2) - eye*(bmunuq(:,:,:,3,1)*pswk(:,:,:,1) + &
eye*bmunuq(:,:,:,3,2)*pswk(:,:,:,1) - &
bmunuq(:,:,:,3,3)*pswk(:,:,:,2))
! this is the [Del.sigma.C_q]Del psi part of the C_q term (time-odd)
if(itimrev == 0) then
pout(:,:,:,1) = pout(:,:,:,1) + (dcqq(:,:,:,3,1) - eye*dcqq(:,:,:,3,2))*pswk(:,:,:,2) + &
dcqq(:,:,:,3,3)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) + (dcqq(:,:,:,3,1) + eye*dcqq(:,:,:,3,2))*pswk(:,:,:,1) - &
dcqq(:,:,:,3,3)*pswk(:,:,:,2)
endif
!-----------------------------------------------------------------------
! terms which require (d/dx)**2 psi
!-----------------------------------------------------------------------
call cmulx (ncolx, ncoly, ncolz, der2x, pinn, pswk, 0)
! Effective mass, kinetic energy term
pout(:,:,:,1) = pout(:,:,:,1) - bmassq(:,:,:)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) - bmassq(:,:,:)*pswk(:,:,:,2)
! This is the second part of C_q term (time-odd)
if(itimrev == 0) then
pout(:,:,:,1) = pout(:,:,:,1) + (cqq(:,:,:,1) - eye*cqq(:,:,:,2))*pswk(:,:,:,2) + &
cqq(:,:,:,3)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) + (cqq(:,:,:,1) + eye*cqq(:,:,:,2))*pswk(:,:,:,1) - &
cqq(:,:,:,3)*pswk(:,:,:,2)
endif
!-----------------------------------------------------------------------
! terms which require (d/dy)**2 psi
!-----------------------------------------------------------------------
call cmuly (ncolx, ncoly, ncolz, der2y, pinn, pswk, 0)
! Effective mass, kinetic energy term
pout(:,:,:,1) = pout(:,:,:,1) - bmassq(:,:,:)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) - bmassq(:,:,:)*pswk(:,:,:,2)
! This is the second part of C_q term (time-odd)
if(itimrev == 0) then
pout(:,:,:,1) = pout(:,:,:,1) + (cqq(:,:,:,1) - eye*cqq(:,:,:,2))*pswk(:,:,:,2) + &
cqq(:,:,:,3)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) + (cqq(:,:,:,1) + eye*cqq(:,:,:,2))*pswk(:,:,:,1) - &
cqq(:,:,:,3)*pswk(:,:,:,2)
endif
!-----------------------------------------------------------------------
! terms which require (d/dz)**2 psi
!-----------------------------------------------------------------------
call cmulz (ncolx, ncoly, ncolz, der2z, pinn, pswk, 0)
! Effective mass, kinetic energy term
pout(:,:,:,1) = pout(:,:,:,1) - bmassq(:,:,:)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) - bmassq(:,:,:)*pswk(:,:,:,2)
! This is the second part of C_q term (time-odd)
if(itimrev == 0) then
pout(:,:,:,1) = pout(:,:,:,1) + (cqq(:,:,:,1) - eye*cqq(:,:,:,2))*pswk(:,:,:,2) + &
cqq(:,:,:,3)*pswk(:,:,:,1)
pout(:,:,:,2) = pout(:,:,:,2) + (cqq(:,:,:,1) + eye*cqq(:,:,:,2))*pswk(:,:,:,1) - &
cqq(:,:,:,3)*pswk(:,:,:,2)
endif
!
return
end subroutine hpsi


TimP
Honored Contributor III
First, you should get the vec-report before making any modification.
Comparing it with the report after modification shows whether the directives accomplished anything.
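
Something like (hpsi.f90 being a hypothetical name for the file holding the routine you posted):

ifort -O3 -xHost -vec-report2 -c hpsi.f90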

Then, for each array assignment where you know all the arrays are 16-byte aligned, you could add a preceding
!dir$ vector aligned
It looks like this could work only if ncolx, ncoly, ncolz are all multiples of 2.

If you replace /2.0_wp with *0.5_wp, you avoid a dependence on the option -no-prec-div (which you are apparently using, by default). I don't know why this transformation is forbidden under the preferable option -prec-div.
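
Applied to the first assignment in hpsi, that would look something like this (assuming the alignment condition holds):

!dir$ vector aligned
pout(:,:,:,1) = (upotq(:,:,:) - eshift)*pinn(:,:,:,1)

and, with the division replaced by a multiplication by the reciprocal:

pout(:,:,:,1) = pout(:,:,:,1) - eye*0.5_wp*dxknpq(:,:,:)*pinn(:,:,:,1)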
umar
Beginner
After examining the vectorization reports I have come to these conclusions:

1. Expressions involving complex arrays do not vectorize (most of this routine).
2. The speed gain when things were explicit loops (rather than inline array syntax) must
come from loop unrolling or things of that nature.
3. If I do the painstaking job of splitting the arrays into separate real and imaginary parts, things do vectorize.

I will try to do (3) and report the speedup.
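
For the effective-mass term, for example, the split would look roughly like this (pout_re, pout_im, pswk_re, pswk_im being new real arrays I would have to introduce):

! pout_re/pout_im and pswk_re/pswk_im are hypothetical real and imaginary copies of pout and pswk
pout_re(:,:,:,1) = pout_re(:,:,:,1) - bmassq(:,:,:)*pswk_re(:,:,:,1)
pout_im(:,:,:,1) = pout_im(:,:,:,1) - bmassq(:,:,:)*pswk_im(:,:,:,1)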
TimP
Honored Contributor III
Vectorization of complex requires, at a minimum, -msse3. That should be inherent in -xhost. I could believe that multi-rank operations with complex might not optimize.

I've never seen a difference in default unrolling between array syntax and DO loops. Maybe you mean when multi-rank assignments are involved, in which case I could believe your finding.
Sometimes, DO loops will take e.g. !dir$ unroll(4) to specify the amount of unrolling.
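
For instance (a sketch using the arrays from your routine, inside the enclosing iy, iz loops):

!dir$ unroll(4)
do ix = 1, ncolx
   pout(ix,iy,iz,1) = pout(ix,iy,iz,1) - bmassq(ix,iy,iz)*pswk(ix,iy,iz,1)
end do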

You would have to compare your asm code to see what the differences are, if it's not simply a question of vectorization and distribution.