Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Problem in aligning Fortran arrays in simple code example

PKM
Beginner

Hi

I am optimizing a Fortran code that spends essentially all of its time in the 90-line, self-contained example below. The critical part of the example is the innermost loop (marked with the comment "!The critical loop is here!"), and I am struggling to get aligned access to my arrays there. No matter what I do, the optimization report (attached) keeps telling me the arrays are unaligned, both for MIC offload and for Intel AVX compilation. I am using the latest version of Parallel Studio on Windows with Visual Studio 2012, and I have tried to align the arrays both with compiler switches (/align:array64byte) and with directives (!DIR$ ATTRIBUTES ALIGN : 64 ::).

Any help in getting this code properly vectorized will be greatly appreciated!

Regards,

C

module mData
      real*8,dimension(-50:50)          :: RAll
      real*8,dimension(-200:100)        :: LambdaAll,LambdaAll2
      real*8,dimension(-200:100)        :: Ri,Ei,Fi,Hi,Un,Ui
      real*8,dimension(-200:100,1:20)   :: RefAll,UAll,E2
      real*8,dimension(1:20)            :: FiltResp
!dir$ attributes offload : mic :: LambdaAll,LambdaAll2,Ri,Ei,Fi,Hi,Un,Ui,RefAll,Uall,E2,Filtresp,RAll
!$OMP THREADPRIVATE(LambdaAll,LambdaAll2,Ri,Ei,Fi,Hi,Un,RefAll,Uall,E2,Filtresp,RAll)   
    end module mData
    
    module mComputations
    contains
!dir$ attributes offload : mic :: DoComputations
    subroutine DoComputations(iNoModels)
    use omp_lib
    use mData
    implicit none
    integer, intent(in) :: iNoModels
    integer :: I2,I,J,IJMinCalc,IJMaxCalc,NoModels,t,k,Models
    real*8  :: SMy,Rs,ki2,Exparg,Nom,Denom,NLayM,time,E
    real*8  :: Sigma(30),Thick(30), Timebegin,TimeEnd,Val,Kn2   
    NLayM=30
    Sigma(:)=0.1
    Thick(:)=2.5
    ijMinCalc=-55
    ijMaxCalc=16
    ! Variables
    TimeBegin=omp_get_wtime()
    !Loop over models
    NoModels=iNoModels
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(NoModels,NLayM,Thick,Sigma,ijmincalc,ijmaxcalc)
!$OMP DO
    do Models=1,NoModels
      !Loop over times - 50
      do t=1,30
        time=log(2d0)/(1e-6*10**((t-1d0)/10d0))  
        E=10**0.1
!DEC$ SIMD
        do I = ijmincalc,ijmaxcalc
          Val = E**(I)*0.1d0
          lambdaAll2(I) = Val*Val
        enddo
      !Loop over frequencies - 16
        do k=1,16
          SMy=4*3.14e-7*time  
          ! start from the lowest layer
          kn2 = Smy*real(sigma(NLayM))
!DEC$ SIMD
          do J=ijmincalc,ijmaxcalc
            Un(J) = sqrt(LambdaAll2(J)+kn2)
            Fi(J) = 0
          enddo         
          do I2=NLayM-1,1,-1 ! this loop calculates from N-1 to 1 going upward
            rs = SMy*(sigma(I2)-sigma(I2+1))
            ki2 = Smy*real(sigma(I2))
!DEC$ SIMD
            do J=ijmincalc,ijmaxcalc
              !The critical loop is here!
              Ui(J) = sqrt(LambdaAll2(J)+ki2)
              Hi(J) = Ui(J)+Un(J)
              Ri(J) = rs/(Hi(J)*Hi(J))
              exparg = -2.d0*ui(j)*Thick(I2)
              Ei(J) = exp(exparg)
              nom = (Ei(J)*(Ri(J)+Fi(J)))
              denom = (1.d0+Ri(J)*Fi(J))
              Fi(J) = nom/denom
              Un(J) = Ui(J)       
            end do  
          end do               
        enddo
      end do
    end do
!$OMP END DO
!$OMP END PARALLEL    
    TimeEnd=omp_get_wtime()
    print *,'Models/s=',NoModels*1d0/(TimeEnd-TimeBegin)
    end subroutine DoComputations
    end module mComputations  
      
    program kernelopt
    use mComputations
    implicit none
    print *,'CPU execution:'
    call omp_set_num_threads(8)
    call DoComputations(224*8)
    print *,'Xeon offload:'
!DIR$ OFFLOAD BEGIN TARGET(mic:0)
    call omp_set_num_threads(224)
    call DoComputations(224*100)
!DIR$ END OFFLOAD
    end program kernelopt

McCalpinJohn
Honored Contributor III

If I did the arithmetic correctly, aligning the base of array Ui to a 256-bit or 512-bit boundary guarantees that the critical loop (which starts operating on Ui(-55)) will *not* be aligned. For both 256-bit and 512-bit vectors, if Ui(-200) is placed on an aligned address, then Ui(-55) sits 145 elements (1160 bytes) further on, which maps to the second 8-byte slot in the vector.
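The arithmetic above can be checked with a few lines of integer math. This is only a sketch: the program and variable names are made up for illustration, and it assumes 8-byte reals (real*8) and a base address that is already on a vector boundary.

```fortran
program offset_check
  implicit none
  ! All names here are illustrative; they do not appear in the original code.
  integer, parameter :: elem_bytes  = 8     ! size of one real*8 element
  integer, parameter :: lower_bound = -200  ! declared lower bound of Ui
  integer, parameter :: loop_start  = -55   ! first index touched by the critical loop
  integer :: offset_bytes

  ! Distance in bytes from Ui(lower_bound) to Ui(loop_start)
  offset_bytes = (loop_start - lower_bound) * elem_bytes

  print *, 'byte offset of Ui(-55):  ', offset_bytes            ! 145*8 = 1160
  print *, 'offset mod 32 (256-bit): ', mod(offset_bytes, 32)   ! remainder 8
  print *, 'offset mod 64 (512-bit): ', mod(offset_bytes, 64)   ! remainder 8
end program offset_check
```

A remainder of 8 in both cases means Ui(-55) is one element past a vector boundary whenever Ui(-200) itself is aligned.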

I don't know if it is possible to request that an array be specifically misaligned so that the target starting point within the array is aligned.

In this case, padding the array to Ui(-207:100) will ensure that the beginning of the array and Ui(-55) have the same alignment with respect to both 256-bit and 512-bit vector addresses.
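One way to arrive at the -207 lower bound is to step downward from -200 until the byte distance to Ui(-55) is a multiple of the vector width. The sketch below does that search for 64-byte vectors (which also covers 32-byte ones); the names are hypothetical and it again assumes 8-byte elements.

```fortran
program find_padding
  implicit none
  ! Illustrative names only; not taken from the original code.
  integer, parameter :: elem_bytes = 8    ! real*8
  integer, parameter :: loop_start = -55  ! first index used by the critical loop
  integer, parameter :: vec_bytes  = 64   ! 512-bit vector; 64-byte alignment implies 32-byte
  integer :: lb

  ! Try lower bounds -200, -201, ... until Ui(loop_start) shares the base alignment.
  do lb = -200, -210, -1
    if (mod((loop_start - lb) * elem_bytes, vec_bytes) == 0) then
      print *, 'lower bound', lb, 'puts Ui(-55) on a 64-byte boundary'
      exit
    end if
  end do
end program find_padding
```

The search stops at -207, where the distance is 152 elements, i.e. 1216 bytes or exactly 19 vectors of 64 bytes.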

Similar considerations apply to the other arrays used in the critical loop.

PKM
Beginner

Thank you very much, John! Now I get it, and based on your comment I am able to get aligned access using the !DIR$ VECTOR ALIGNED directive. I am also seeing a very sizable performance improvement.
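For readers following along, a reduced sketch of what this combination might look like is shown here. It is not the actual modified code: the arrays are local rather than in module mData, only part of the loop body is kept, and the initial values are placeholders. The !DIR$ lines are Intel Fortran directives; other compilers will most likely treat them as plain comments.

```fortran
program aligned_loop_sketch
  implicit none
  ! Lower bound padded to -207 so that element -55 shares the base alignment.
  integer, parameter :: lo = -207, hi = 100
  real*8 :: Ui(lo:hi), Un(lo:hi), LambdaAll2(lo:hi)
!DIR$ ATTRIBUTES ALIGN : 64 :: Ui, Un, LambdaAll2
  integer :: J
  real*8  :: ki2

  ! Placeholder data, chosen only so the loop has something to compute.
  ki2 = 1.0d-3
  LambdaAll2 = 1.0d0
  Un = 0.0d0

  ! Asserts to the compiler that every array access in the loop is aligned.
!DIR$ VECTOR ALIGNED
  do J = -55, 16
    Ui(J) = sqrt(LambdaAll2(J) + ki2)
    Un(J) = Ui(J)
  end do

  print *, Un(-55), Un(16)
end program aligned_loop_sketch
```

Note that VECTOR ALIGNED is an assertion rather than a request, so it should only be used once the starting element really is on a vector boundary, as the padding to -207 is meant to ensure.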

As an experiment, I tried changing the indexing to start from 1 instead of -55 and redeclared the array as Ui(307) rather than Ui(-207:100). This modification gives another ~15% performance gain over padding the array down to -207. Is there any explanation for what is causing this?

C

TimP
Honored Contributor III

I suppose that eliminating empty padding from your arrays may improve cache effectiveness. Confirming this would need more detailed analysis, e.g. counting cache and TLB hits, misses, fills, and evictions in VTune. It may not be worth the effort when you can get good results with the more straightforward method.

PKM
Beginner

Ok - thanks ...
