Hi
I am optimizing a Fortran code that spends essentially all of its time in the 90-line self-contained example below. The critical part of the example is the innermost loop (line 57, marked "!The critical loop is here!"), and I am struggling to get aligned access to my arrays there. No matter what I do, the optimization report (attached) keeps telling me the arrays are unaligned, both for MIC offload and for Intel AVX compilation. I am using the latest version of Parallel Studio on Windows with Visual Studio 2012, and I have tried to align the arrays both with compiler switches (/align:array64byte) and with directives (!DIR$ ATTRIBUTES ALIGN : 64 ::).
Any help in getting this code properly vectorized will be greatly appreciated!
Regards,
C
module mData
  real*8,dimension(-50:50) :: RAll
  real*8,dimension(-200:100) :: LambdaAll,LambdaAll2
  real*8,dimension(-200:100) :: Ri,Ei,Fi,Hi,Un,Ui
  real*8,dimension(-200:100,1:20) :: RefAll,UAll,E2
  real*8,dimension(1:20) :: FiltResp
  !dir$ attributes offload : mic :: LambdaAll,LambdaAll2,Ri,Ei,Fi,Hi,Un,Ui,RefAll,Uall,E2,Filtresp,RAll
  !$OMP THREADPRIVATE(LambdaAll,LambdaAll2,Ri,Ei,Fi,Hi,Un,RefAll,Uall,E2,Filtresp,RAll)
end module mData
module mComputations
contains
  !dir$ attributes offload : mic :: DoComputations
  subroutine DoComputations(iNoModels)
    use omp_lib
    use mData
    implicit none
    integer, intent(in) :: iNoModels
    integer :: I2,I,J,IJMinCalc,IJMaxCalc,NoModels,t,k,Models
    real*8 :: SMy,Rs,ki2,Exparg,Nom,Denom,NLayM,time,E
    real*8 :: Sigma(30),Thick(30), Timebegin,TimeEnd,Val,Kn2
    NLayM=30
    Sigma(:)=0.1
    Thick(:)=2.5
    ijMinCalc=-55
    ijMaxCalc=16
    ! Variables
    TimeBegin=omp_get_wtime()
    !Loop over models
    NoModels=iNoModels
    !$OMP PARALLEL DEFAULT(PRIVATE) SHARED(NoModels,NLayM,Thick,Sigma,ijmincalc,ijmaxcalc)
    !$OMP DO
    do Models=1,NoModels
      !Loop over times - 50
      do t=1,30
        time=log(2d0)/(1e-6*10**((t-1d0)/10d0))
        E=10**0.1
        !DEC$ SIMD
        do I = ijmincalc,ijmaxcalc
          Val = E**(I)*0.1d0
          lambdaAll2(I) = Val*Val
        enddo
        !Loop over frequencies - 16
        do k=1,16
          SMy=4*3.14e-7*time
          ! start from the lowest layer
          kn2 = Smy*real(sigma(NLayM))
          !DEC$ SIMD
          do J=ijmincalc,ijmaxcalc
            Un(J) = sqrt(LambdaAll2(J)+kn2)
            Fi(J) = 0
          enddo
          do I2=NLayM-1,1,-1 ! this loop calculates from N-1 to 1 going upward
            rs = SMy*(sigma(I2)-sigma(I2+1))
            ki2 = Smy*real(sigma(I2))
            !DEC$ SIMD
            do J=ijmincalc,ijmaxcalc
              !The critical loop is here!
              Ui(J) = sqrt(LambdaAll2(J)+ki2)
              Hi(J) = Ui(J)+Un(J)
              Ri(J) = rs/(Hi(J)*Hi(J))
              exparg = -2.d0*ui(j)*Thick(I2)
              Ei(J) = exp(exparg)
              nom = (Ei(J)*(Ri(J)+Fi(J)))
              denom = (1.d0+Ri(J)*Fi(J))
              Fi(J) = nom/denom
              Un(J) = Ui(J)
            end do
          end do
        enddo
      end do
    end do
    !$OMP END DO
    !$OMP END PARALLEL
    TimeEnd=omp_get_wtime()
    print *,'Models/s=',NoModels*1d0/(TimeEnd-TimeBegin)
  end subroutine DoComputations
end module mComputations
program kernelopt
  use mComputations
  implicit none
  print *,'CPU execution:'
  call omp_set_num_threads(8)
  call DoComputations(224*8)
  print *,'Xeon offload:'
  !DIR$ OFFLOAD BEGIN TARGET(mic:0)
  call omp_set_num_threads(224)
  call DoComputations(224*100)
  !DIR$ END OFFLOAD
end program kernelopt
If I did the arithmetic correctly, aligning the base of array Ui to a 256-bit or 512-bit boundary guarantees that the critical loop (which starts operating on Ui(-55)) will *not* be aligned. Ui(-55) lies 145 elements, or 1160 bytes, past Ui(-200), and 1160 leaves a remainder of 8 modulo both 32 and 64. So for both 256-bit and 512-bit vectors, if Ui(-200) is placed on an aligned address, Ui(-55) maps to the second 8-byte slot in the vector.
I don't know whether it is possible to request that an array be deliberately misaligned so that the target starting point within the array is aligned.
In this case, padding the array to Ui(-207:100) ensures that the beginning of the array and Ui(-55) have the same alignment with respect to both 256-bit and 512-bit vector addresses: Ui(-55) is then 152 elements, or 1216 bytes (19 x 64), past the base.
Similar considerations apply to the other arrays used in the critical loop.
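The arithmetic above can be checked with a few lines of Fortran. This is only a sketch: the program name and the variables (elem_bytes, loop_start, lower_bounds) are made up for illustration, and it assumes 8-byte real*8 elements and an array base that itself sits on a 64-byte boundary.

```fortran
program alignment_offsets
  implicit none
  integer, parameter :: elem_bytes = 8      ! assumed size of one real*8 element
  integer, parameter :: loop_start = -55    ! ijMinCalc in the example
  integer :: lower_bounds(2), lb, i, offset_bytes

  lower_bounds = [-200, -207]               ! original and padded lower bounds
  do i = 1, size(lower_bounds)
    lb = lower_bounds(i)
    ! Byte distance from the first array element to element loop_start
    offset_bytes = (loop_start - lb) * elem_bytes
    print '(a,i5,a,i5,a,i3,a,i3)', 'lower bound', lb, ': offset', offset_bytes, &
          ' bytes, mod 32 =', mod(offset_bytes, 32), ', mod 64 =', mod(offset_bytes, 64)
  end do
end program alignment_offsets
```

For the lower bound -200 the offset is 1160 bytes, with remainder 8 for both 32 and 64. For -207 it is 1216 bytes, with remainder 0 for both.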
Thank you very much John! Now I get it and based on your comment I am able to get aligned access using the !DIR$ VECTOR ALIGNED directive. I am also seeing a very sizable performance improvement.
As an experiment, I changed the indexing to start from 1 instead of -55 and redeclared the array as Ui(307) rather than Ui(-207:100). This modification gives another ~15% performance improvement over padding the array down to -207. Is there any explanation for what is causing this?
C
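For readers following along, the padded variant described above might look roughly like the sketch below. It is not the poster's actual code: the module and program names and the value of ki2 are invented for illustration, only three of the arrays are shown, and the OpenMP and offload attributes from the original example are left out. The two !DIR$ directives are the Intel ones named in the thread; other compilers treat them as comments.

```fortran
module padded_data
  implicit none
  ! Lower bound padded from -200 to -207 so that element -55 sits
  ! 152*8 = 1216 bytes (a multiple of 64) past the array base.
  real*8, dimension(-207:100) :: Ui, Un, LambdaAll2
  ! Intel directive: request 64-byte alignment of each array base
  !DIR$ ATTRIBUTES ALIGN : 64 :: Ui, Un, LambdaAll2
end module padded_data

program aligned_loop
  use padded_data
  implicit none
  integer :: J
  real*8 :: ki2

  ki2 = 1.0d-3            ! illustrative value only
  LambdaAll2 = 1.0d0
  Un = 0.0d0

  ! Intel directive: assert that every array access in the next loop
  ! starts on an aligned address. The assertion is not checked.
  !DIR$ VECTOR ALIGNED
  do J = -55, 16
    Ui(J) = sqrt(LambdaAll2(J) + ki2)
    Un(J) = Ui(J)
  end do

  print *, Ui(-55), Un(16)
end program aligned_loop
```

Note that !DIR$ VECTOR ALIGNED is a promise to the compiler rather than a request. If the first element the loop touches is not actually on an aligned address, the aligned vector instructions can fault at run time, so the padding and the directive have to be kept consistent with the loop's starting index.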
I suppose that eliminating empty padding from your arrays may improve cache effectiveness. Confirming that would need more detailed analysis, e.g. counting cache and TLB hits, misses, fills and evictions in VTune. It may not be worth the effort when you can get good results by the more straightforward method.
Ok - thanks ...
