I am optimizing a Fortran code that spends essentially all of its time in the 90-line self-contained example below. The critical part of the example is the innermost loop (line 57, marked with the comment "!The critical loop is here!"), and I am struggling to get aligned access to my arrays there. No matter what I do, the optimization report (attached) keeps telling me the arrays are unaligned, both for MIC offload and for Intel AVX compilation. I am using the latest version of Parallel Studio on Windows with Visual Studio 2012, and I have tried to align the arrays both with compiler switches (/align:array64byte) and with directives (!DIR$ ATTRIBUTES ALIGN : 64 ::).
Any help in getting this code properly vectorized would be greatly appreciated!
real*8,dimension(-50:50) :: RAll
real*8,dimension(-200:100) :: LambdaAll,LambdaAll2
real*8,dimension(-200:100) :: Ri,Ei,Fi,Hi,Un,Ui
real*8,dimension(-200:100,1:20) :: RefAll,UAll,E2
real*8,dimension(1:20) :: FiltResp
!dir$ attributes offload : mic :: LambdaAll,LambdaAll2,Ri,Ei,Fi,Hi,Un,Ui,RefAll,Uall,E2,Filtresp,RAll
end module mData
!dir$ attributes offload : mic :: DoComputations
integer, intent(in) :: iNoModels
integer :: I2,I,J,IJMinCalc,IJMaxCalc,NoModels,t,k,Models
real*8 :: SMy,Rs,ki2,Exparg,Nom,Denom,NLayM,time,E
real*8 :: Sigma(30),Thick(30), Timebegin,TimeEnd,Val,Kn2
!Loop over models
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(NoModels,NLayM,Thick,Sigma,ijmincalc,ijmaxcalc)
!Loop over times - 50
do I = ijmincalc,ijmaxcalc
Val = E**(I)*0.1d0
lambdaAll2(I) = Val*Val
!Loop over frequencies - 16
! start from the lowest layer
kn2 = Smy*real(sigma(NLayM))
Un(J) = sqrt(LambdaAll2(J)+kn2)
Fi(J) = 0
do I2=NLayM-1,1,-1 ! this loop calculates from N-1 to 1 going upward
rs = SMy*(sigma(I2)-sigma(I2+1))
ki2 = Smy*real(sigma(I2))
!The critical loop is here!
Ui(J) = sqrt(LambdaAll2(J)+ki2)
Hi(J) = Ui(J)+Un(J)
Ri(J) = rs/(Hi(J)*Hi(J))
exparg = -2.d0*ui(j)*Thick(I2)
Ei(J) = exp(exparg)
nom = (Ei(J)*(Ri(J)+Fi(J)))
denom = (1.d0+Ri(J)*Fi(J))
Fi(J) = nom/denom
Un(J) = Ui(J)
!$OMP END DO
!$OMP END PARALLEL
end subroutine DoComputations
end module mComputations
print *,'CPU execution:'
print *,'Xeon offload:'
!DIR$ OFFLOAD BEGIN TARGET(mic:0)
!DIR$ END OFFLOAD
end program kernelopt
If I did the arithmetic correctly, aligning the base of array Ui to a 256-bit or 512-bit boundary guarantees that the critical loop (which starts operating on Ui(-55)) will *not* be aligned. Ui(-55) lies 145 elements, or 1160 bytes, past Ui(-200), and 1160 leaves a remainder of 8 when divided by either 32 or 64. So for both 256-bit and 512-bit vectors, if Ui(-200) is placed on an aligned address, Ui(-55) falls on the second 8-byte slot of a vector.
I don't know whether it is possible to request that an array be deliberately misaligned so that the target starting point within the array is aligned.
In this case, padding the declaration to Ui(-207:100) ensures that the beginning of the array and Ui(-55) have the same alignment with respect to both 256-bit and 512-bit vector addresses, because Ui(-55) then sits 152 elements (1216 bytes = 19 x 64) past the base.
Similar considerations apply to the other arrays used in the critical loop.
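The offset arithmetic above can be checked with a few lines of Fortran. This is only a sketch: the program name and the variables elem_bytes, loop_start, and lower_bounds are made up for illustration, and it assumes 8-byte real*8 elements and a base address that is itself 64-byte aligned.

```fortran
program alignment_offsets
  implicit none
  integer, parameter :: elem_bytes = 8          ! size of one real*8 element
  integer, parameter :: loop_start = -55        ! first index touched by the critical loop
  integer, parameter :: lower_bounds(2) = [-200, -207]  ! original and padded lower bounds
  integer :: k, offset_bytes

  do k = 1, size(lower_bounds)
     ! Byte distance from the (assumed aligned) first element to Ui(loop_start)
     offset_bytes = (loop_start - lower_bounds(k)) * elem_bytes
     print '(a,i5,a,i5,a,i3,a,i3)', 'lower bound ', lower_bounds(k), &
          ': offset ', offset_bytes, ' bytes, mod 32 = ', mod(offset_bytes, 32), &
          ', mod 64 = ', mod(offset_bytes, 64)
  end do
end program alignment_offsets
```

For the lower bound -200 both remainders come out as 8 (misaligned by one element), and for -207 both come out as 0.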
Thank you very much, John! Now I get it, and based on your comment I am able to get aligned access using the !DIR$ VECTOR ALIGNED directive. I am also seeing a very sizable performance improvement.
As an experiment, I tried changing the indexing to start from 1 instead of -55 and redeclared the array as Ui(307) rather than Ui(-207:100). This modification gives another ~15% performance improvement over padding the array down to -207. Is there any explanation for what is causing this?
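For readers following along, a minimal sketch of what the padded declaration combined with the two directives might look like is given here. It is not the original code: the module name mPaddedData, the upper bound jmax, and the placeholder values are assumptions, only a fragment of the critical loop is shown, and the offload attributes are left out. Compilers other than Intel's treat the !DIR$ lines as comments.

```fortran
module mPaddedData
  implicit none
  ! Lower bound padded from -200 to -207 so that index -55 sits
  ! 152 elements (1216 bytes = 19*64) past the aligned base.
  real*8, dimension(-207:100) :: LambdaAll2, Ui, Un, Hi
  !DIR$ ATTRIBUTES ALIGN : 64 :: LambdaAll2, Ui, Un, Hi
end module mPaddedData

program aligned_loop_sketch
  use mPaddedData
  implicit none
  integer, parameter :: jmin = -55   ! loop start mentioned in the thread
  integer, parameter :: jmax = 8     ! hypothetical upper bound, for illustration only
  real*8 :: ki2
  integer :: J

  ki2 = 0.5d0          ! placeholder value
  LambdaAll2 = 1.0d0   ! placeholder data
  Un = 1.0d0           ! placeholder data

  ! Assert to the compiler that every array reference in the loop is aligned.
  ! This is only safe because jmin maps to a 64-byte boundary in each array.
  !DIR$ VECTOR ALIGNED
  do J = jmin, jmax
     Ui(J) = sqrt(LambdaAll2(J) + ki2)
     Hi(J) = Ui(J) + Un(J)
  end do

  print *, Hi(jmin), Hi(jmax)
end program aligned_loop_sketch
```

Note that VECTOR ALIGNED is a promise rather than a request: if the loop start does not actually land on an aligned address, the generated code can fault, so the padding and the directive have to be kept in sync.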
I suppose that eliminating the unused padding from your arrays may improve cache effectiveness. Confirming that would need more detailed analysis, e.g. counting cache and TLB hits, misses, fills, and evictions in VTune. It may not be worth the effort when you can get good results with the more straightforward method.