- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

inline array syntax, e.g.

do is = 1, 2

pout(:,:,:,is) = pout(:,:,:,is) - dbmasq(:,:,:,2)*pswk(:,:,:,is)

end do

where pout and pswk are complex. Previously, I had loops over ix,iy,iz (there are quite a

few loops, like hundred). After doing this I actually see a slowdown. The test run I was

using went from 23 minutes to 27 minutes (repeated many time both runs with nothing

else running). I presumed that this conversion will result in some speedup since the compiler

could convert these to more efficient loops. Is my thinking wrong? Is there a better way

to optimize the above loop?

Thanks,

Sammy

Link Copied

10 Replies

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

In the loop you have quoted, my guess would be that you should write the inner loop as a rank 1 stride 1 assignment

pout(:,i,j,is) = pout(:,i,j,is) - dbmasq(:,i,j,2)*pswk(:,i,j,is)

in case the compiler hasn't sorted out on which subscript it can vectorize efficiently. I doubt that a conclusive answer could be given without at least the vec-report2 result.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

One side issue of array syntax arises when you pass non-contiguous array chunksto subroutines. It looks good in the source code butcauses the compiler to pass the data throughhidden temporary arrays with the associated copy overhead.

So for critical code sections I tend to avoid array syntax whereas in uncritical sections it certainly improves readability.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Quoting mriedman

One side issue of array syntax arises when you pass non-contiguous array chunksto subroutines. It looks good in the source code butcauses the compiler to pass the data throughhidden temporary arrays with the associated copy overhead.

One side issue of array syntax arises when you pass non-contiguous array chunksto subroutines. It looks good in the source code butcauses the compiler to pass the data throughhidden temporary arrays with the associated copy overhead.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Tim

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

now has this syntax as standard people should not be forced to write loops where they can

perfectly use this shorthand. It should at least give the same timing compared to the looped

version if the original was perfectly written, otherwise there should be a speedup.

I have used v77to90 from VAST to translate loops into inline syntax, so if this translation can

be done the reverse should be possible too. In my opinion this is a bug.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

!dir$ vector aligned

(asserts the assignment begins with aligned data)

and

!dir$ distribute point

to prevent fusion of array assignments which involve incompatible alignments.

The usual problem is for the compiler to recognize opportunities for useful fusion of array assignments. You must set -O3 to tell it to attempt this, and set those distribute point directives where you want to prevent it.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

subroutines which take the most amount of time. Where would you put those statements?

Thanks

subroutine hpsi(ncolx, ncoly, ncolz, upotq, bmassq, xkinpq, hsigmaq, cqq, dcqq, bmunuq,&

dbmuq, dbmasq, dxknpq, der1x, der1y, der1z, der2x, der2y, der2z, eshift,&

pinn, pout, pswk, itimrev)

!-----------------------------------------------------------------------

! hpsi: program to form the product of h on the

! state vector pinn and return the result in pout

!-----------------------------------------------------------------------

implicit none

integer, parameter :: wp = kind(1.0D0)

!-----------------------------------------------

! A r g u m e n t s

!-----------------------------------------------

integer , intent(in) :: ncolx

integer , intent(in) :: ncoly

integer , intent(in) :: ncolz

integer , intent(in) :: itimrev

real(wp) , intent(in) :: eshift

real(wp) , intent(in) :: upotq(ncolx,ncoly,ncolz)

real(wp) , intent(in) :: bmassq(ncolx,ncoly,ncolz)

real(wp) , intent(in) :: xkinpq(ncolx,ncoly,ncolz,3)

real(wp) , intent(in) :: hsigmaq(ncolx,ncoly,ncolz,3)

real(wp) , intent(in) :: cqq(ncolx,ncoly,ncolz,3)

real(wp) , intent(in) :: dcqq(ncolx,ncoly,ncolz,3,3)

real(wp) , intent(in) :: bmunuq(ncolx,ncoly,ncolz,3,3)

real(wp) , intent(in) :: dbmuq(ncolx,ncoly,ncolz,3)

real(wp) , intent(in) :: dbmasq(ncolx,ncoly,ncolz,3)

real(wp) , intent(in) :: dxknpq(ncolx,ncoly,ncolz)

real(wp) , intent(in) :: der1x(ncolx,ncolx)

real(wp) , intent(in) :: der1y(ncoly,ncoly)

real(wp) , intent(in) :: der1z(ncolz,ncolz)

real(wp) , intent(in) :: der2x(ncolx,ncolx)

real(wp) , intent(in) :: der2y(ncoly,ncoly)

real(wp) , intent(in) :: der2z(ncolz,ncolz)

complex(wp) , intent(inout) :: pout(ncolx,ncoly,ncolz,2)

complex(wp) :: pinn(ncolx,ncoly,ncolz,2)

complex(wp) :: pswk(ncolx,ncoly,ncolz,2)

!-----------------------------------------------

! L o c a l P a r a m e t e r s

!-----------------------------------------------

complex(wp), parameter :: eye = (0.0_wp,1.0_wp)

!-----------------------------------------------------------------------

! diagonal part of h times psi (U_q part)

!-----------------------------------------------------------------------

pout(:,:,:,1) = (upotq(:,:,:) - eshift)*pinn(:,:,:,1)

pout(:,:,:,2) = (upotq(:,:,:) - eshift)*pinn(:,:,:,2)

! diagonal part of I_q term ( -i/2[Del.I_q]psi ) (time-odd)

if(itimrev == 0) then

pout(:,:,:,1) = pout(:,:,:,1) - eye/2.0_wp*dxknpq(:,:,:)*pinn(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) - eye/2.0_wp*dxknpq(:,:,:)*pinn(:,:,:,2)

! this is the big sigma_q term (time-odd)

pout(:,:,:,1) = pout(:,:,:,1) + (hsigmaq(:,:,:,1)*pinn(:,:,:,2) - &

eye*hsigmaq(:,:,:,2)*pinn(:,:,:,2) + &

hsigmaq(:,:,:,3)*pinn(:,:,:,1))

pout(:,:,:,2) = pout(:,:,:,2) + (hsigmaq(:,:,:,1)*pinn(:,:,:,1) + &

eye*hsigmaq(:,:,:,2)*pinn(:,:,:,1) - &

hsigmaq(:,:,:,3)*pinn(:,:,:,2))

endif

! this is the -i/2[Del.B_q]psi (tensor) part of the h*psi

pout(:,:,:,1) = pout(:,:,:,1) - eye/2.0_wp*(dbmuq(:,:,:,1)*pinn(:,:,:,2) - &

eye*dbmuq(:,:,:,2)*pinn(:,:,:,2) + &

dbmuq(:,:,:,3)*pinn(:,:,:,1))

pout(:,:,:,2) = pout(:,:,:,2) - eye/2.0_wp*(dbmuq(:,:,:,1)*pinn(:,:,:,1) + &

eye*dbmuq(:,:,:,2)*pinn(:,:,:,1) - &

dbmuq(:,:,:,3)*pinn(:,:,:,2))

!-----------------------------------------------------------------------

! terms which require d/dx psi

!-----------------------------------------------------------------------

call cmulx (ncolx, ncoly, ncolz, der1x, pinn, pswk, 0)

! part of the effective mass term

pout(:,:,:,1) = pout(:,:,:,1) - dbmasq(:,:,:,1)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) - dbmasq(:,:,:,1)*pswk(:,:,:,2)

! the other part of the I_q term ( -eye*I_q*Dpsi term ) (time-odd)

if(itimrev == 0) then

pout(:,:,:,1) = pout(:,:,:,1) - eye*xkinpq(:,:,:,1)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) - eye*xkinpq(:,:,:,1)*pswk(:,:,:,2)

endif

! this is the -eye*B_q.Dsigma (tensor) part of h*psi

pout(:,:,:,1) = pout(:,:,:,1) - eye*(bmunuq(:,:,:,1,1)*pswk(:,:,:,2) - &

eye*bmunuq(:,:,:,1,2)*pswk(:,:,:,2) + &

bmunuq(:,:,:,1,3)*pswk(:,:,:,1))

pout(:,:,:,2) = pout(:,:,:,2) - eye*(bmunuq(:,:,:,1,1)*pswk(:,:,:,1) + &

eye*bmunuq(:,:,:,1,2)*pswk(:,:,:,1) - &

bmunuq(:,:,:,1,3)*pswk(:,:,:,2))

! this is the [Del.sigma.C_q]Del psi part of the C_q term (time-odd)

if(itimrev == 0) then

pout(:,:,:,1) = pout(:,:,:,1) + (dcqq(:,:,:,1,1) - eye*dcqq(:,:,:,1,2))*pswk(:,:,:,2) + &

dcqq(:,:,:,1,3)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) + (dcqq(:,:,:,1,1) + eye*dcqq(:,:,:,1,2))*pswk(:,:,:,1) - &

dcqq(:,:,:,1,3)*pswk(:,:,:,2)

endif

!-----------------------------------------------------------------------

! terms which require d/dy psi

!-----------------------------------------------------------------------

call cmuly (ncolx, ncoly, ncolz, der1y, pinn, pswk, 0)

! part of the effective mass term

pout(:,:,:,1) = pout(:,:,:,1) - dbmasq(:,:,:,2)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) - dbmasq(:,:,:,2)*pswk(:,:,:,2)

! part of -eye*I_q*Dpsi term (time-odd)

if(itimrev == 0) then

pout(:,:,:,1) = pout(:,:,:,1) - eye*xkinpq(:,:,:,2)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) - eye*xkinpq(:,:,:,2)*pswk(:,:,:,2)

endif

! this is the -eye*B_q.Dsigma (tensor) part of h*psi

pout(:,:,:,1) = pout(:,:,:,1) - eye*(bmunuq(:,:,:,2,1)*pswk(:,:,:,2) - &

eye*bmunuq(:,:,:,2,2)*pswk(:,:,:,2) + &

bmunuq(:,:,:,2,3)*pswk(:,:,:,1))

pout(:,:,:,2) = pout(:,:,:,2) - eye*(bmunuq(:,:,:,2,1)*pswk(:,:,:,1) + &

eye*bmunuq(:,:,:,2,2)*pswk(:,:,:,1) - &

bmunuq(:,:,:,2,3)*pswk(:,:,:,2))

! this is the [Del.sigma.C_q]Del psi part of the C_q term (time-odd)

if(itimrev == 0) then

pout(:,:,:,1) = pout(:,:,:,1) + (dcqq(:,:,:,2,1) - eye*dcqq(:,:,:,2,2))*pswk(:,:,:,2) + &

dcqq(:,:,:,2,3)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) + (dcqq(:,:,:,2,1) + eye*dcqq(:,:,:,2,2))*pswk(:,:,:,1) - &

dcqq(:,:,:,2,3)*pswk(:,:,:,2)

endif

!-----------------------------------------------------------------------

! terms which require d/dz psi

!-----------------------------------------------------------------------

call cmulz (ncolx, ncoly, ncolz, der1z, pinn, pswk, 0)

! part of the effective mass term

pout(:,:,:,1) = pout(:,:,:,1) - dbmasq(:,:,:,3)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) - dbmasq(:,:,:,3)*pswk(:,:,:,2)

! part of the -eye*I_q*Dpsi term (time-odd)

if(itimrev == 0) then

pout(:,:,:,1) = pout(:,:,:,1) - eye*xkinpq(:,:,:,3)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) - eye*xkinpq(:,:,:,3)*pswk(:,:,:,2)

endif

! this is the -eye*B_q.Dsigma (tensor) part of h*psi

pout(:,:,:,1) = pout(:,:,:,1) - eye*(bmunuq(:,:,:,3,1)*pswk(:,:,:,2) - &

eye*bmunuq(:,:,:,3,2)*pswk(:,:,:,2) + &

bmunuq(:,:,:,3,3)*pswk(:,:,:,1))

pout(:,:,:,2) = pout(:,:,:,2) - eye*(bmunuq(:,:,:,3,1)*pswk(:,:,:,1) + &

eye*bmunuq(:,:,:,3,2)*pswk(:,:,:,1) - &

bmunuq(:,:,:,3,3)*pswk(:,:,:,2))

! this is the [Del.sigma.C_q]Del psi part of the C_q term (time-odd)

if(itimrev == 0) then

pout(:,:,:,1) = pout(:,:,:,1) + (dcqq(:,:,:,3,1) - eye*dcqq(:,:,:,3,2))*pswk(:,:,:,2) + &

dcqq(:,:,:,3,3)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) + (dcqq(:,:,:,3,1) + eye*dcqq(:,:,:,3,2))*pswk(:,:,:,1) - &

dcqq(:,:,:,3,3)*pswk(:,:,:,2)

endif

!-----------------------------------------------------------------------

! terms which require (d/dx)**2 psi

!-----------------------------------------------------------------------

call cmulx (ncolx, ncoly, ncolz, der2x, pinn, pswk, 0)

! Effective mass, kinetic energy term

pout(:,:,:,1) = pout(:,:,:,1) - bmassq(:,:,:)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) - bmassq(:,:,:)*pswk(:,:,:,2)

! This is the second part of C_q term (time-odd)

if(itimrev == 0) then

pout(:,:,:,1) = pout(:,:,:,1) + (cqq(:,:,:,1) - eye*cqq(:,:,:,2))*pswk(:,:,:,2) + &

cqq(:,:,:,3)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) + (cqq(:,:,:,1) + eye*cqq(:,:,:,2))*pswk(:,:,:,1) - &

cqq(:,:,:,3)*pswk(:,:,:,2)

endif

!-----------------------------------------------------------------------

! terms which require (d/dy)**2 psi

!-----------------------------------------------------------------------

call cmuly (ncolx, ncoly, ncolz, der2y, pinn, pswk, 0)

! Effective mass, kinetic energy term

pout(:,:,:,1) = pout(:,:,:,1) - bmassq(:,:,:)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) - bmassq(:,:,:)*pswk(:,:,:,2)

! This is the second part of C_q term (time-odd)

if(itimrev == 0) then

pout(:,:,:,1) = pout(:,:,:,1) + (cqq(:,:,:,1) - eye*cqq(:,:,:,2))*pswk(:,:,:,2) + &

cqq(:,:,:,3)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) + (cqq(:,:,:,1) + eye*cqq(:,:,:,2))*pswk(:,:,:,1) - &

cqq(:,:,:,3)*pswk(:,:,:,2)

endif

!-----------------------------------------------------------------------

! terms which require (d/dz)**2 psi

!-----------------------------------------------------------------------

call cmulz (ncolx, ncoly, ncolz, der2z, pinn, pswk, 0)

! Effective mass, kinetic energy term

pout(:,:,:,1) = pout(:,:,:,1) - bmassq(:,:,:)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) - bmassq(:,:,:)*pswk(:,:,:,2)

! This is the second part of C_q term (time-odd)

if(itimrev == 0) then

pout(:,:,:,1) = pout(:,:,:,1) + (cqq(:,:,:,1) - eye*cqq(:,:,:,2))*pswk(:,:,:,2) + &

cqq(:,:,:,3)*pswk(:,:,:,1)

pout(:,:,:,2) = pout(:,:,:,2) + (cqq(:,:,:,1) + eye*cqq(:,:,:,2))*pswk(:,:,:,1) - &

cqq(:,:,:,3)*pswk(:,:,:,2)

endif

!

return

end subroutine hpsi

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Comparing after modification, you would see whether the directives accomplished anything.

Then, for each array assignment where you know all the arrays are 16-byte aligned, you could add a preceding

!dir$ vector aligned

It looks like this could work only if ncolx, ncoly, ncolz are all multiples of 2.

If you would replace /2.0_wp by *0.5_wp, you would avoid a dependence on the option -no-prec-div (which apparently you are using, by default). I don't know why this transformation is forbidden by the preferable option -prec-div.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

1. Expressions involving complex arrays do not vectorize (most of this routine).

2. The simple speed gain when things were explicit loops (not array inline syntax) must

be coming from loop unrolling or things of that nature.

3. If I do the painstaking job of breaking real and imaginary parts up, things do vectorize.

I will try to do (3) and report the speedup.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I've never seen a difference in default unroll between array syntax and DO loops. Maybe you mean with the usage of multi-rank assignments, where I could believe your finding.

Sometimes, DO loops will take e.g. !dir$ unroll(4) to specify amount of unroll.

You would have to compare your asm code to see what the differences are, if it's not simply a question of vectorization and distribution.

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page