Solved: declared type and the issue of contiguous memory

AThar2 · ‎01-14-2019

Hello

I know this has been covered before, but I am specifically looking for answers to two questions. First, have a look at the following code:

      program test

      implicit none

            type :: data
               integer            :: ias     = -7777
               integer            :: is       = 0
               integer            :: ic       = 1
               real               :: x1(3)    = 0.
               real               :: xx(3)    = 0.
               real               :: u1(3)    = 0.    !x1
               real               :: uu(3)    = 0.    !x2
               real               :: d1       = 0.
               real               :: dd       = 0.
               real               :: m1       = 0.
               real               :: mm       = 0.
               real               :: q        = 0.
               real               :: nd       = 0.
               real               :: dtr      = -1.
               real               :: ydef     = 0.
               real               :: dydefdt  = 0.
               real               :: uflcf(3) = 0.
               real               :: elapst   = 0.
               real               :: dist     = 0.
               real               :: dt       = 1.e-3
               real               :: ta2ud     = 0.
               real               :: eddyt    = huge(1e99)
               real               :: usssa(3) = 0.
               real               :: qq2        = 0.    ! -
               real               :: nsda       = 0.    ! -
               real               :: ddtqq      = -1.   ! -
            end type data
      integer(kind=8) :: i, n
      type(data), allocatable      :: dat(:)
      real :: rmd, rand,t1,t2, omp_get_wtime
      real, allocatable::x(:,:)
      n = 50*10**6
      allocate(dat(n))
      allocate(x(3,n))
      rmd = rand(2.0)
      t1 = omp_get_wtime()
      do i = 1, n
          dat(i)%x1(1) = rmd**2 - exp(real(i))
          dat(i)%x1(2) = dat(i)%x1(1) - exp(real(i))
          dat(i)%x1(3) = dat(i)%x1(2) - exp(real(i))
       !  x(1,i) = rmd**2 - exp(real(i))
       !  x(2,i)= x(1,i)  - exp(real(i))
       !  x(3,i)=x(2,i)- exp(real(i))
      enddo


      t2 = omp_get_wtime()
      print*, t2-t1

      end program

Now, the difference in t2-t1 between the x(1,i) = ... version and the dat(i)%x1(1:3) version is quite large, which comes down to how the data is laid out in memory and how the cache is accessed.

First, I don't understand why the printed t2-t1, only in the dat(i)%.. case, can be much smaller than the time measured with a physical stopwatch. The printed value sometimes seemed quite a bit smaller than the run felt, and when I timed it with my stopwatch I did observe a difference.

Secondly, and most importantly, if we for some reason are reluctant to move away from the declared type style, is there really no way the compiler can be tricked into considering those members of the declared type as individually contiguous?
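
On the first question (the printed time disagreeing with a stopwatch), one thing that may be worth checking is the declaration `real :: ... omp_get_wtime` in the code above. The OpenMP specification defines `omp_get_wtime` as returning double precision, so declaring it as default real gives the compiler the wrong result type, and the printed difference may then be meaningless. The following is only a minimal sketch of the timing pattern with the interface taken from the `omp_lib` module; the program name and the smaller `n` are illustrative choices, and it assumes the code is built with OpenMP enabled (for example `-qopenmp` with Intel Fortran or `-fopenmp` with gfortran).

```fortran
program timing_sketch
   use omp_lib, only : omp_get_wtime   ! explicit interface: result is double precision
   implicit none
   integer, parameter :: n = 10**6     ! illustrative size, smaller than in the post
   double precision   :: t1, t2        ! must match the function result type
   real, allocatable  :: x(:,:)
   integer :: i

   allocate( x(3,n) )

   t1 = omp_get_wtime()
   do i = 1, n
      x(1,i) = log(real(i))
      x(2,i) = x(1,i) - log(real(i))
      x(3,i) = x(2,i) - log(real(i))
   end do
   t2 = omp_get_wtime()

   print *, "elapsed (s): ", t2 - t1
   print *, "x(1,n) = ", x(1,n)        ! use the result so the loop is not optimized away
end program timing_sketch
```

Whether this accounts for the discrepancy seen in the original program is an assumption; it only removes one known source of error from the measurement.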

FortranFan · ‎01-16-2019

AT90 wrote:
.. most importantly, if we for some reason are reluctant to move away from the declared type style, is there really no way the compiler can be tricked into considering those members of the declared type as individually contiguous? ..
.. how is this achievable in Fortran without being verbose?
.. isn't there a more elegant way?

@AT90,

Would you agree that what's "elegant" is in the eye of the beholder?! In situations like this, I've often suggested that readers consider a facility introduced in standard Fortran with the 2003 revision, the parameterized derived type (PDT):

https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-parameterized-derived-type-declarations#82788EA3-1C9C-4C06-AAF1-4269F412C03C

With this facility, one can adopt a design which can, with some understanding and consideration, be used with both the "Array of Structs" case that you are using now as well as with the "Struct of Arrays" case which is what will be recommended when better performance via faster processing of in-memory data is desired.

Take a look at this somewhat modified example of the code you show in the original post and please report here if this is "elegant" enough for you!

module kinds_m
   use, intrinsic :: iso_fortran_env, only : I8 => int64
   implicit none
   integer, parameter :: R4 = selected_real_kind( p=6 )
   integer, parameter :: R8 = selected_real_kind( p=12 )
end module

module cpu_m
   use kinds_m, only : I8, R8
   implicit none
contains
   subroutine cpu_t( time )
      !.. Argument list
      real(R8), intent(inout) :: time
      !.. Local variables
      integer(I8) :: tick
      integer(I8) :: rate
      call system_clock (tick, rate)
      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )
      return
   end subroutine cpu_t
end module

module data_m

   use kinds_m, only : R4, R8
   use cpu_m, only : cpu_t

   implicit none
   
   type :: data_t(K,N)
      integer, kind :: K = R4
      integer, len :: N = 1
      real(kind=K) :: x1(3,N)
   end type data_t

   real(R8), protected, public :: init_time = 0.0_r8

   interface init_data
      module procedure init_data_scalar
      module procedure init_data_rank1
   end interface

contains

   subroutine init_data_rank1( dat, val )

      ! Arguments
      type(data_t(K=R4,N=*)), intent(inout) :: dat(:)
      real(R4), intent(in)                  :: val

      ! Local variables
      real(kind=R8) :: start_time
      real(kind=R8) :: end_time
      integer :: i

      if (dat(1)%N /= 1 ) return !<== this specific routine can be for N == 1 only

      call cpu_t( start_time )
      do i = 1, size(dat)
         dat(i)%x1(1,1) = val**2 - log(real(i))
         dat(i)%x1(2,1) = dat(i)%x1(1,1) - log(real(i))
         dat(i)%x1(3,1) = dat(i)%x1(2,1) - log(real(i))
      end do
      call cpu_t( end_time )

      init_time = end_time - start_time

      return

   end subroutine init_data_rank1

   subroutine init_data_scalar( dat, val )

      ! Arguments
      type(data_t(K=R4,N=*)), intent(inout) :: dat
      real(R4), intent(in)                  :: val

      ! Local variables
      real(kind=R8) :: start_time
      real(kind=R8) :: end_time
      integer :: i

      if (dat%N <= 1 ) return !<== this specific routine can be for N > 1 only

      call cpu_t( start_time )
      do i = 1, dat%N
         dat%x1(1,i) = val**2 - log(real(i))
         dat%x1(2,i) = dat%x1(1,i) - log(real(i))
         dat%x1(3,i) = dat%x1(2,i) - log(real(i))
      end do
      call cpu_t( end_time )

      init_time = end_time - start_time

      return

   end subroutine init_data_scalar

end module data_m

program test
   
   use kinds_m, only : R4, I8
   use data_m, only : data_t, init_data, init_time
   
   implicit none
   
   ! Local variables
   real(R4) :: rnd
   character(len=*), parameter :: fmtv = "(g0,t12,g0)"
   character(len=*), parameter :: fmtt = "(g0,f10.3)"
   
   call random_number( rnd )
   
   blk_AoS: block
      integer(kind=I8) :: n
      type(data_t(K=R4,N=1)), allocatable :: dat(:)
      print *, "Case: Array of Structs (AoS) "
      n = 50*10**6
      allocate( dat(n) )
      call init_data( dat, val=rnd )
      print fmtt, "Init time (seconds): ", init_time
      print *, "Output some values"
      print fmtv, "i", "x1(1,i)"
      print fmtv, 1, dat(1)%x1(1,1)
      print fmtv, n, dat(n)%x1(1,1)
   end block blk_AoS
   print *
   blk_SoA: block
      integer(kind=I8) :: n
      type(data_t(K=R4,N=:)), allocatable :: dat
      print *, "Case: Struct of Arrays (SoA) "
      n = 50*10**6
      allocate( data_t(K=R4,N=n) :: dat )
      call init_data( dat, val=rnd )
      print fmtt, "Init time (seconds): ", init_time
      print *, "Output some values"
      print fmtv, "i", "x1(1,i)"
      print fmtv, 1, dat%x1(1,1)
      print fmtv, n, dat%x1(1,n)
   end block blk_SoA
   
   stop
   
end program

Upon one execution of this example on a Windows platform, the output is:

 Case: Array of Structs (AoS)
Init time (seconds):      8.538
 Output some values
i          x1(1,i)
1          .1537321E-12
50000000   -17.72753

 Case: Struct of Arrays (SoA)
Init time (seconds):      0.147
 Output some values
i          x1(1,i)
1          .1537321E-12
50000000   -17.72753

As you will expect, the elapsed time for the "Struct of Arrays" case is significantly shorter than for the "Array of Structs" case. Note that the results are identical in the two cases, which shows it's a valid comparison.

You can take the above example and extend it as appropriate with your derived type for 'data' and check whether this is something you can use in your actual code.
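
If PDTs are not an option (compiler support for them may vary), a plain derived type whose components are allocatable arrays gives a similar "Struct of Arrays" layout, with each component stored contiguously on its own. The sketch below is only an illustration under that assumption: the names `particles_m`, `particles_t` and `particles_alloc` are hypothetical, and only three of the components from the original `data` type (`ias`, `x1`, `d1`) are carried over.

```fortran
module particles_m
   implicit none

   ! Struct of Arrays: one allocatable array per former scalar/vector member
   type :: particles_t
      integer, allocatable :: ias(:)
      real,    allocatable :: x1(:,:)    ! shape (3,n)
      real,    allocatable :: d1(:)
   end type particles_t

contains

   subroutine particles_alloc( p, n )
      type(particles_t), intent(out) :: p
      integer, intent(in)            :: n
      ! source= supplies the defaults used in the original type definition
      allocate( p%ias(n),  source=-7777 )
      allocate( p%x1(3,n), source=0.0 )
      allocate( p%d1(n),   source=0.0 )
   end subroutine particles_alloc

end module particles_m

program soa_sketch
   use particles_m, only : particles_t, particles_alloc
   implicit none
   type(particles_t) :: p
   integer :: i

   call particles_alloc( p, 1000 )
   do i = 1, size(p%d1)
      p%x1(:,i) = real(i)          ! consecutive i touch adjacent memory
   end do
   print *, "ias(1) = ", p%ias(1), " x1(1,1000) = ", p%x1(1,1000)
end program soa_sketch
```

The access syntax changes from `dat(i)%x1(1)` to `p%x1(1,i)`, which is the main cost of this design when existing code is written against the "Array of Structs" form.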

P.S.> By the way, what's up with your instruction "exp(real(i))" when you are using default real kind with a range of about 1E38 in Intel Fortran? What do you expect to happen when 'i' becomes big with the large data sizes you are using? For illustration purposes, the log function has been used in place of the exponential one.
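
To make the overflow concern concrete, here is a small sketch that searches for the first integer argument at which `exp` no longer fits in default real. It assumes default real is IEEE single precision and that floating-point traps are not enabled, so an overflow yields +Infinity instead of aborting the run; the program name is illustrative.

```fortran
program exp_overflow_sketch
   implicit none
   integer :: i
   real    :: y

   print *, "huge(1.0)      = ", huge(1.0)
   print *, "log(huge(1.0)) = ", log(huge(1.0))   ! largest argument exp() can take

   ! Find the first integer i for which exp(real(i)) exceeds the range of default real
   do i = 1, 200
      y = exp(real(i))
      if (y > huge(y)) exit      ! true once y has overflowed to +Infinity
   end do
   print *, "exp(real(i)) first overflows at i = ", i
end program exp_overflow_sketch
```

Since log(huge(1.0)) is about 88.7 for single precision, the loop should stop at i = 89, which means nearly all of the 50 million iterations in the original loop work with overflowed values.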


AThar2 · ‎03-09-2019

Thanks Jim.

So are you saying that it is not a big deal whether the data is 64-byte or 32-byte aligned?

Also, do you know why the optimisation report says that x, y, ... are unaligned?

jimdempseyatthecove · ‎03-10-2019

I do not know why the report says that (for do concurrent). One hypothesis would be an oversight in the compiler optimization whereby it assumes that if this is not aligned, then any/all of its members have unknown alignment (as may be the case for an array with specified dimensions). Another possible hypothesis is that your do concurrent has exceeded some number of entities the optimizer keeps track of for possible alignment determination.

Have you tried the OpenMP method?

RE: 64-byte alignment vs 32-byte alignment

With "random" allocations, roughly half of the 64-byte aligned allocations would lose 32 bytes unnecessarily compared with 32-byte alignment. The sample code listed above has 9 aligned allocations. Say 5 of them lose 32 bytes each (160 bytes in total) out of total allocations of 9 x 50 million elements. That is a loss of 160 bytes against 450 million elements of allocated memory. Do you really need to conserve this amount of memory?
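
The arithmetic above can be written out as a short calculation. This is only a sketch: it assumes each of the 9 arrays holds 50 million default reals of 4 bytes each (the element size is an assumption, not something stated in the post), and that 5 of the allocations lose 32 bytes.

```fortran
program padding_sketch
   use, intrinsic :: iso_fortran_env, only : int64, real64
   implicit none
   integer(int64), parameter :: n_arrays   = 9_int64
   integer(int64), parameter :: n_elems    = 50000000_int64
   integer(int64), parameter :: elem_bytes = 4_int64            ! assumption: default real
   integer(int64), parameter :: lost_bytes = 5_int64 * 32_int64 ! 5 allocations x 32 bytes
   integer(int64) :: total_bytes

   total_bytes = n_arrays * n_elems * elem_bytes

   print *, "total bytes allocated: ", total_bytes
   print *, "bytes lost to padding: ", lost_bytes
   print *, "fraction lost:         ", real(lost_bytes, real64) / real(total_bytes, real64)
end program padding_sketch
```

Under these assumptions the padding is 160 bytes out of 1.8 billion, i.e. well under one part in a million.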

Jim Dempsey

AThar2 · ‎03-10-2019

Hello Jim,

Thanks for the reply. Yes, I did try the OpenMP method, and I got the same message in the optimisation report. At the beginning it says the loop cannot be parallelised due to an assumed dependency, but a few lines below it says the loop has been VECTORIZED. A bit confusing, I would say. The aligned/unaligned issue remained the same in both cases.

However, if I explicitly write ` !DIR$ ASSUME_ALIGNED this%x(1):32 `, that makes them aligned. But if I put the directive ` !DIR$ ATTRIBUTES ALIGN:32::x` in the declaration of the type structure, as shown in quote #40, it does not help.

jimdempseyatthecove · ‎03-11-2019

Confusing, I agree.

Jim Dempsey