Solved: Quote:AT90 wrote: - Page 2

AThar2 · ‎01-14-2019

Hello

I know this a has been covered before, but I am specifically looking for two answers to my question. First Have a look at the following code

      program test

      implicit none

            type :: data
               integer            :: ias     = -7777
               integer            :: is       = 0
               integer            :: ic       = 1
               real               :: x1(3)    = 0.
               real               :: xx(3)    = 0.
               real               :: u1(3)    = 0.    !x1
               real               :: uu(3)    = 0.    !x2
               real               :: d1       = 0.
               real               :: dd       = 0.
               real               :: m1       = 0.
               real               :: mm       = 0.
               real               :: q        = 0.
               real               :: nd       = 0.
               real               :: dtr      = -1.
               real               :: ydef     = 0.
               real               :: dydefdt  = 0.
               real               :: uflcf(3) = 0.
               real               :: elapst   = 0.
               real               :: dist     = 0.
               real               :: dt       = 1.e-3
               real               :: ta2ud     = 0.
               real               :: eddyt    = huge(1e99)
               real               :: usssa(3) = 0.
               real               :: qq2        = 0.    ! -
               real               :: nsda       = 0.    ! -
            real               :: ddtqq      = -1.   ! -
      end type data
      integer(kind=8) :: i, n
      type(data), allocatable      :: dat(:)
      real :: rmd, rand,t1,t2, omp_get_wtime
      real, allocatable::x(:,:)
      n = 50*10**6
      allocate(dat(n))
      allocate(x(3,n))
      rmd = rand(2.0)
      t1 = omp_get_wtime()
      do i = 1, n
          dat(i)%x1(1) = rmd**2 - exp(real(i))
          dat(i)%x1(2) = dat(i)%x1(1) - exp(real(i))
          dat(i)%x1(3) = dat(i)%x1(2) - exp(real(i))
       !  x(1,i) = rmd**2 - exp(real(i))
       !  x(2,i)= x(1,i)  - exp(real(i))
       !  x(3,i)=x(2,i)- exp(real(i))
      enddo


      t2 = omp_get_wtime()
      print*, t2-t1

      end program

Now the difference between doing x(1,i) = ** and dat(i)% x1(1:3) is quite a lot when comparing t2-t1, which is because of the way memory is stored and cache access.

I firstly don't understand that printed t2-t1 exclusively for the case with dat(i)%.. can be way smaller then when compared to a physical stop watch. I felt sometime that the t2-t1 was quite smaller than what it felt, so when trying to use my stop watch I observed a a difference.

Secondly, and most importantly, if we for some reason are reluctant to go away from the declared type style, is there really no way the compiler can be tricked to consider those members of the declared type as continguous individually ?

FortranFan · ‎01-16-2019

AT90 wrote:
.. most importantly, if we for some reason are reluctant to go away from the declared type style, is there really no way the compiler can be tricked to consider those members of the declared type as continguous individually ? ..
.. how is this achievable is this in Fortran without being verbose.
.. isnt there a more elegant way?

@AT90,

Would you admit what's "elegant" is in the eyes of the beholder?! With situations like this, I've often suggested to readers to consider the facility introduced in standard Fortran with the 2003 revision which is parameterized derived type (PDT):

https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-parameterized-derived-type-declarations#82788EA3-1C9C-4C06-AAF1-4269F412C03C

With this facility, one can adopt a design which can, with some understanding and consideration, be used with both the "Array of Structs" case that you are using now as well as with the "Struct of Arrays" case which is what will be recommended when better performance via faster processing of in-memory data is desired.

Take a look at this somewhat modified example of the code you show in the original post and please report here if this is "elegant" enough for you!

module kinds_m
   use, intrinsic :: iso_fortran_env, only : I8 => int64
   implicit none
   integer, parameter :: R4 = selected_real_kind( p=6 )
   integer, parameter :: R8 = selected_real_kind( p=12 )
end module

module cpu_m
   use kinds_m, only : I8, R8
   implicit none
contains
   subroutine cpu_t( time )
      !.. Argument list
      real(R8), intent(inout) :: time
      !.. Local variables
      integer(I8) :: tick
      integer(I8) :: rate
      call system_clock (tick, rate)
      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )
      return
   end subroutine cpu_t
end module

module data_m

   use kinds_m, only : R4, R8
   use cpu_m, only : cpu_t

   implicit none
   
   type :: data_t(K,N)
      integer, kind :: K = R4
      integer, len :: N = 1
      real(kind=K) :: x1(3,N)
   end type data_t

   real(R8), protected, public :: init_time = 0.0_r8

   interface init_data
      module procedure init_data_scalar
      module procedure init_data_rank1
   end interface

contains

   subroutine init_data_rank1( dat, val )

      ! Arguments
      type(data_t(K=R4,N=*)), intent(inout) :: dat(:)
      real(R4), intent(in)                  :: val

      ! Local variables
      real(kind=R8) :: start_time
      real(kind=R8) :: end_time
      integer :: i

      if (dat(1)%N /= 1 ) return !<== this specific routine can be for N == 1 only

      call cpu_t( start_time )
      do i = 1, size(dat)
         dat(i)%x1(1,1) = val**2 - log(real(i))
         dat(i)%x1(2,1) = dat(i)%x1(1,1) - log(real(i))
         dat(i)%x1(3,1) = dat(i)%x1(2,1) - log(real(i))
      end do
      call cpu_t( end_time )

      init_time = end_time - start_time

      return

   end subroutine init_data_rank1

   subroutine init_data_scalar( dat, val )

      ! Arguments
      type(data_t(K=R4,N=*)), intent(inout) :: dat
      real(R4), intent(in)                  :: val

      ! Local variables
      real(kind=R8) :: start_time
      real(kind=R8) :: end_time
      integer :: i

      if (dat%N <= 1 ) return !<== this specific routine can be for N > 1 only

      call cpu_t( start_time )
      do i = 1, dat%N
         dat%x1(1,i) = val**2 - log(real(i))
         dat%x1(2,i) = dat%x1(1,i) - log(real(i))
         dat%x1(3,i) = dat%x1(2,i) - log(real(i))
      end do
      call cpu_t( end_time )

      init_time = end_time - start_time

      return

   end subroutine init_data_scalar

end module data_m

program test
   
   use kinds_m, only : R4, I8
   use data_m, only : data_t, init_data, init_time
   
   implicit none
   
   ! Local variables
   real(R4) :: rnd
   character(len=*), parameter :: fmtv = "(g0,t12,g0)"
   character(len=*), parameter :: fmtt = "(g0,f10.3)"
   
   call random_number( rnd )
   
   blk_AoS: block
      integer(kind=I8) :: n
      type(data_t(K=R4,N=1)), allocatable :: dat(:)
      print *, "Case: Array of Structs (AoS) "
      n = 50*10**6
      allocate( dat(n) )
      call init_data( dat, val=rnd )
      print fmtt, "Init time (seconds): ", init_time
      print *, "Output some values"
      print fmtv, "i", "x1(1,i)"
      print fmtv, 1, dat(1)%x1(1,1)
      print fmtv, n, dat(n)%x1(1,1)
   end block blk_AoS
   print *
   blk_SoA: block
      integer(kind=I8) :: n
      type(data_t(K=R4,N=:)), allocatable :: dat
      print *, "Case: Struct of Arrays (SoA) "
      n = 50*10**6
      allocate( data_t(K=R4,N=n) :: dat )
      call init_data( dat, val=rnd )
      print fmtt, "Init time (seconds): ", init_time
      print *, "Output some values"
      print fmtv, "i", "x1(1,i)"
      print fmtv, 1, dat%x1(1,1)
      print fmtv, n, dat%x1(1,n)
   end block blk_SoA
   
   stop
   
end program

Upon one execution of this example on Windows platform, the output is:

 Case: Array of Structs (AoS)
Init time (seconds):      8.538
 Output some values
i          x1(1,i)
1          .1537321E-12
50000000   -17.72753

 Case: Struct of Arrays (SoA)
Init time (seconds):      0.147
 Output some values
i          x1(1,i)
1          .1537321E-12
50000000   -17.72753
Press any key to continue . . .

As you will expect, the CPU time with the "Struct of Arrays" case is significantly faster than the "Array of Structs" case. Note the results are identical in the 2 cases to show you it's a valid comparison.

You can take the above example and extend it as appropriate with your derived type for 'data' and check whether this is something you can use in your actual code.

P.S.> By the way, what's up with your instruction "exp(real(i))" when you are using default real kind with a range of about 1E38 in Intel Fortran? What do you expect to happen 'i' becomes big with the large data sizes you are using? For illustration purposes, the log function has been used in place of the exponential one.

View solution in original post

jimdempseyatthecove · ‎02-16-2019

>>I really just want to understand why the compiler cannot vectorize the middle loop of the latest I have shown
>>how is that related to my question?

Understanding what vectorization by the CPU entails, may help you understand how to write code that is favorable to vectorization.

>> and why the compiler says that the stride is unknown

It is not clear to me why the compiler could not discern that tt%...(i) references are contiguous. They clearly are (to us readers). This may be a case that during optimization, that the number of registers required exceeded the number available, and that the reason for failure to vectorize was misattributed to unknown stride as opposed to other reason (register pressure). Breaking the one loop into three loops (or one per array) will shed light on the issue.

Jim Dempsey

AThar2 · ‎02-16-2019

But should the logic not be that if the loop over tt2(i)% ... can be vectorized (according to the report) then why cannot the tt%, both have same sizes, they just happen to be stored in different data structures, right?

AThar2 · ‎02-16-2019

I must admit that I have got the impression that the implementation of parametrized declared type is a bit dodgy. Can you recognize this by your experience?

I actually spent an appreciable amount of time to move from declared type to parametrized declared types to increase efficiency, and I am getting a bit frustrated because it seems that I just should have stayed where I was.

Juergen_R_R · ‎02-16-2019

I am a bit surprised by this last statement: I would have expected that going from declared types to parameterized types might make code a lot simpler and better structured, but I am not sure that I'd expected it to become more efficient.

AThar2 · ‎02-16-2019

This is what I thought. Going to parametrized type would essentially make each member contiguous in memory while the declared type would not with respect to a member of tht type.

I basically learnt that from the initial answers in the thread

Steve_Lionel · ‎02-16-2019

The issue I see with the parameterized type in post #13 is that the dimensions of the arrays are all deferred, so the sizes of the individual components are not known until runtime. This makes vectorization much more difficult. It isn't really a matter of contiguity.

FortranFan · ‎02-16-2019

@AT90,

Try the following code with the shown compile and link instructions and note down the CPU time with the PDT approach averaged over a reasonable number of executions. Then report here if you are able to do better than that CPU time with any other approach.

What you should take away from the code if you need to understand all the computations thoroughly and structure your code accordingly and lend as much hand to the compiler while improving the code readability; don't expect the compiler to do all the optimizations for you. And if your code is structured well, you can consume PDTs as well OR BETTER than any other approach including say one where data are structured in ordinary array containers or with ordinary derived types, etc. But the onus is mainly on the coder.

module kinds_m
   use, intrinsic :: iso_fortran_env, only : I8 => int64
   implicit none
   integer, parameter :: R4 = selected_real_kind( p=6 )
   integer, parameter :: R8 = selected_real_kind( p=12 )
end module

module cpu_m
   use kinds_m, only : I8, R8
   implicit none
contains
   subroutine cpu_t( time )
      !.. Argument list
      real(R8), intent(inout) :: time
      !.. Local variables
      integer(I8) :: tick
      integer(I8) :: rate
      call system_clock (tick, rate)
      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )
      return
   end subroutine cpu_t
end module

module aux_m

   use kinds_m, only : R4, R8
   use cpu_m, only : cpu_t

   implicit none

   interface calc_aux
      module procedure calc_aux_R4
      module procedure calc_aux_R8
   end interface

   real(R8), protected, public :: calc_time = 0.0_r8

contains

   subroutine calc_aux_R8( x, y, z, xx, yy, zz, xxx, yyy, zzz, aux )

      ! Argument list
      real(kind=R8), contiguous, intent(inout) :: x(:)
      real(kind=R8), contiguous, intent(inout) :: y(:)
      real(kind=R8), contiguous, intent(inout) :: z(:)
      real(kind=R8), contiguous, intent(inout) :: xx(:)
      real(kind=R8), contiguous, intent(inout) :: yy(:)
      real(kind=R8), contiguous, intent(inout) :: zz(:)
      real(kind=R8), contiguous, intent(inout) :: xxx(:)
      real(kind=R8), contiguous, intent(inout) :: yyy(:)
      real(kind=R8), contiguous, intent(inout) :: zzz(:)
      real(kind=R8), intent(out)               :: aux

      include 'calc_aux.f90'

      return

   end subroutine

   subroutine calc_aux_R4( x, y, z, xx, yy, zz, xxx, yyy, zzz, aux )

      ! Argument list
      real(kind=R4), contiguous, intent(inout) :: x(:)
      real(kind=R4), contiguous, intent(inout) :: y(:)
      real(kind=R4), contiguous, intent(inout) :: z(:)
      real(kind=R4), contiguous, intent(inout) :: xx(:)
      real(kind=R4), contiguous, intent(inout) :: yy(:)
      real(kind=R4), contiguous, intent(inout) :: zz(:)
      real(kind=R4), contiguous, intent(inout) :: xxx(:)
      real(kind=R4), contiguous, intent(inout) :: yyy(:)
      real(kind=R4), contiguous, intent(inout) :: zzz(:)
      real(kind=R4), intent(out)               :: aux

      include 'calc_aux.f90'

      return

   end subroutine

end module aux_m

module pdt_m

   use kinds_m, only : R4, R8
   use cpu_m, only : cpu_t
   use aux_m, only : caux => calc_aux

   implicit none

   type :: pdt_t(K,N)
      integer, kind :: K = R4
      integer, len :: N = 1
      real(kind=K) :: x(N)
      real(kind=K) :: y(N)
      real(kind=K) :: z(N)
      real(kind=K) :: xx(N)
      real(kind=K) :: yy(N)
      real(kind=K) :: zz(N)
      real(kind=K) :: xxx(N)
      real(kind=K) :: yyy(N)
      real(kind=K) :: zzz(N)
   contains
      private
      procedure, pass(this) :: calc_aux_R4
      procedure, pass(this) :: calc_aux_R8
      generic, public :: calc_aux => calc_aux_R4, calc_aux_R8
   end type pdt_t

   real(R8), protected, public :: calc_time = 0.0_r8

contains

   subroutine calc_aux_R4( this, aux )

      ! Argument list
      class(pdt_t(K=R4,N=*)), intent(inout) :: this
      real(kind=R4), intent(out)            :: aux

      ! Local variables
      real(R8) :: t1, t2

      call cpu_t( time=t1 )
      aux = 0.0_R4
      call caux( this%x, this%y, this%z, this%xx, this%yy, this%zz, this%xxx, this%yyy, this%zzz, aux )
      call cpu_t( time=t2 )

      return

   end subroutine

   subroutine calc_aux_R8( this, aux )

      ! Argument list
      class(pdt_t(K=R8,N=*)), intent(inout) :: this
      real(kind=R8), intent(out)            :: aux

      ! Local variables
      real(R8) :: t1, t2

      call cpu_t( time=t1 )
      aux = 0.0_R8
      call caux( this%x, this%y, this%z, this%xx, this%yy, this%zz, this%xxx, this%yyy, this%zzz, aux )
      call cpu_t( time=t2 )

      calc_time = t2 - t1

      return

   end subroutine

end module pdt_m

module dt_m

   use kinds_m, only : R8
   use cpu_m, only : cpu_t

   implicit none

   type :: dt_t
      real(kind=R8) :: x
      real(kind=R8) :: y
      real(kind=R8) :: z
      real(kind=R8) :: xx
      real(kind=R8) :: yy
      real(kind=R8) :: zz
      real(kind=R8) :: xxx
      real(kind=R8) :: yyy
      real(kind=R8) :: zzz
   end type dt_t

   real(R8), protected, public :: calc_time = 0.0_r8

contains

   subroutine calc_aux( this, aux )

      ! Argument list
      type(dt_t), intent(inout)  :: this(:)
      real(kind=R8), intent(out) :: aux

      ! Local variables
      integer :: i
      real(R8) :: t1, t2

      call cpu_t( time=t1 )
      aux = 0.0_R8
      do i = 1, size(this)

         this(i)%x = 56*real(i)-1000
         aux = aux  + this(i)%x

         this(i)%y = 56*real(i)-1000*log(real(i))
         aux = aux  - this(i)%y

         this(i)%z = 56*real(i)-1000*log(real(i+1))
         aux = aux  + this(i)%z

         this(i)%xx = 56*real(i)-1000*log(real(i+10))
         aux = aux  - this(i)%xx

         this(i)%yy = 56*real(i)-1000*log(real(i+3))
         aux = aux  + this(i)%yy

         this(i)%zz = 56*real(i)-1000*log(real(i+2))
         aux = aux  - this(i)%zz

         this(i)%xxx = 56*real(i)-1000*log(real(i+5))
         aux = aux  + this(i)%xxx

         this(i)%yyy = 56*real(i)-1000*log(real(i+4))
         aux = aux  - this(i)%yyy

         this(i)%zzz = 56*real(i)-1000*log(real(i+9))
         aux = aux  - this(i)%zzz

      end do
      call cpu_t( time=t2 )

      calc_time = t2 - t1

      return

   end subroutine

end module dt_m

program p

   !dir$ if defined (rbytes)
   !dir$ else
      !dir$ define rbytes = 64
   !dir$ end if
   !dir$ if (rbytes == 32)
   use kinds_m, only : WP => R4
   !dir$ else
   use kinds_m, only : WP => R8
   !dir$ end if

   implicit none

   integer, parameter :: N = 50000000

   blk1: block
      use pdt_m, only : pdt_t, calc_time
      type(pdt_t(K=WP,N=:)), allocatable :: pdt
      real(WP) :: aux
      allocate( pdt_t(K=WP,N=N) :: pdt )
      call pdt%calc_aux( aux )
      print *, "Block 1: PDT"
      print *, "aux = ", aux
      print "(g0,g10.3,g0)", "Calc time = ", calc_time, " seconds"
      print *
   end block blk1

   blk2: block
      use dt_m, only : dt_t, calc_aux, calc_time
      type(dt_t), allocatable :: dt(:)
      real(WP) :: aux
      allocate( dt(N) )
      call calc_aux( dt, aux )
      print *, "Block 2: Derived Type"
      print *, "aux = ", aux
      print "(g0,g10.3,g0)", "Calc time = ", calc_time, " seconds"
      print *
   end block blk2

   blk3: block
      use aux_m, only : calc_aux, calc_time
      real(WP), allocatable :: x(:), y(:), z(:), xx(:), yy(:), zz(:), xxx(:), yyy(:), zzz(:)
      real(WP) :: aux
      allocate( x(N), y(N), z(N), xx(N), yy(N), zz(N), xxx(N), yyy(N), zzz(N) )
      call calc_aux( x, y, z, xx, yy, zz, xxx, yyy, zzz, aux )
      print *, "Block 3: Arrays"
      print *, "aux = ", aux
      print "(g0,g10.3,g0)", "Calc time = ", calc_time, " seconds"
   end block blk3

   stop

end program

Include file 'calc_aux.f90':

      ! Local variables
      integer :: i
      real(R8) :: t1, t2

      call cpu_t( time=t1 )
      do concurrent ( i = 1:size(x) )
         x(i) = 56*real(i)-1000
         y(i) = 56*real(i)-1000*log(real(i))
         z(i) = 56*real(i)-1000*log(real(i+1))
         xx(i) = 56*real(i)-1000*log(real(i+10))
         yy(i) = 56*real(i)-1000*log(real(i+3))
         zz(i) = 56*real(i)-1000*log(real(i+2))
         xxx(i) = 56*real(i)-1000*log(real(i+5))
         yyy(i) = 56*real(i)-1000*log(real(i+4))
         zzz(i) = 56*real(i)-1000*log(real(i+9))
      end do      
      aux = sum(x) - sum(y) + sum(z) - sum(xx) + sum(yy) - sum(zz) + sum(xxx) - sum(yyy) - sum(zzz)
      call cpu_t( time=t2 )

      calc_time = t2 - t1

Compilation and Linking:

C:\temp>ifort /standard-semantics /warn:all /stand /O3 /Qparallel /Qdiag-disable:7025 p.f90
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 
19.0.1.144 Build 20181018
Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.16.27026.1
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\temp>

Execution result:

C:\temp>p.exe
 Block 1: PDT
 aux =  -6.999837968078546E+016
Calc time =  0.499     seconds

 Block 2: Derived Type
 aux =  -6.999837968078546E+016
Calc time =  0.852     seconds

 Block 3: Arrays
 aux =  -6.999837968078546E+016
Calc time =  0.499     seconds

AThar2 · ‎02-16-2019

@FortranFan Thanks so much. I will need a bit of time to study the code and think how to adopt this to a much larger and complex code. But that is definately a way to start. I have one question on this, would you reckon variations in the performance if say I did not use the memembers in consecutive order. E.g. Let say I dont necessairly begin x,y,z,xx etc. I gues that would not matter with your example now due to the way you right out the calc_aux routine which takes arbitrar arrays.

@Steve Lion,

thanks for the reply. This is a new info. I did not know that these are deferred. I thought they just worked like allocatables. So I guess based on your comments, they dont work like allocatables. So can I roughly speaking say that if the parametrized are not used a bit more carefully I would end up make the program slowet rather than faster compared to a regular declared type? And here I mean being a bit more explicit to the compiler that these arrays are contiguous and vectorizable. ? I would really like to hear your say on this

Steve_Lionel · ‎02-16-2019

Your tt and tt2 are VERY different. tt2 is an array of items of type test2. tt is a scalar of type test1 that contains multiple arrays of run-time determined size. The memory layout will be very different, and the compiler has to look at the size of each component array at run-time.

AThar2 · ‎02-17-2019

@Steve Lionel

How is that different from allocatable arrays? Wouldn’t compiler also check its array size. Yes, these types are guaranteed to be contiguous. But so should be the items of test1 right?

Also, if that is not correct, can you then explain me the purpose of having this feature. I clearly thought it to be a way to structure your code but also overcome the issue you have with e cases like test2 where each member of the type has non-unite stride memory so that they are not cache friendly, while the parametrized case is friendly in that regard.

AThar2 · ‎02-17-2019

@FortranFan, I have tested the code and I indeed get similar results as you on my Mac with Core i5@2.6GHz. However, (and this is my motivation for opening this thread again actually), when running it on my Linux station with Intel Xeon@3.6GHz. I get all three cases to be running at same speed.
The latest code I showed where I compared the three different cases I did, despite the issue with vectorization, get the pdt to run much faster than declared type on my mac. But, on the Linux it was the reverse scenario; I got the pdt to run MUCH slower. Your code just made it run as fast as the declared type on the Linux.
PS: Don't know if this is going to help; but the third block/loop with no types ran with my and your code at same speed as declared type on the Linux, but on Mac both codes showed that the declared type was much slower than the other two.
This is a bit discouraging I must admit, since I ultimately would be running the real cases on big linux machines.

@jim mentioned the registers which could be the cause, how can I compare the registers on both machines. I would think the Linux has a bigger one, just because the processor is stronger, larger cache size etc, and if my Mac's register is not exhausted how could that be on the Linux. I could also be completely wrong on the assumption that the register on my Linux is larger.

FortranFan · ‎02-17-2019

AT90 wrote:
@FortranFan Thanks so much. I will need a bit of time to study the code and think how to adopt this to a much larger and complex code. But that is definately a way to start. I have one question on this, would you reckon variations in the performance if say I did not use the memembers in consecutive order. E.g. Let say I dont necessairly begin x,y,z,xx etc. I gues that would not matter with your example now due to the way you right out the calc_aux routine which takes arbitrar arrays.
..

@AT90,

Once you take a close look at the lengthy code I posted in Quote #28, you will realize most of it is just a certain coding style/practice with "modern Fortran" (see these references) based rearrangement of what you posted in Quote #13. Basically it's something I find easier to read and follow and hope other readers will too.

The main aspect you would have realized though is the focus on the calculations, the 'dummy' ones I presume you utilized in the 3 DO loops with '56*real(i)-1000*log..', etc. in 9 arrays. Given how you presented them, these calculations are effectively independent of array element order and follow the basic scatter/gather pattern you will come across in any introductory text to parallel programming. In your first DO loop with array object 'tt2' of type 'test2' i.e., a derived type with scalar components, you (are required to) follow the 'classic' sequential programming paradigm and which works ok and compilers such as Intel only have several decades of experience starting with Fortran 90 with optimizing such loops and which you notice in your results. Note with this approach, the aspect of your interest with CONTIGUOUS data does not come into play. However with the other 2 loops, you still follow, needlessly in my opinion, the above same sequential approach which is really not taking advantage in any way of the CONTIGUOUS data arrangement relative to the calculations you are attempting. With the 3rd DO loop though involving ordinary arrays, again a compiler such as Intel has more than several decades of experience with optimization. However things are not so straightfoward with the 2nd loop, whether you use PDTs or derived types with ALLOCATABLE components, both presenting CONTIGUOUS data arrangement. You will find the compiler(s) need extra help in terms of how you write the code, especially if you want them to give you further benefits of optimizing code involving such data structures. And here, it's best if you investigate further by yourself but the chances are you will find both PDTs and derived types with ALLOCATABLE components suffer about the same if you write your code in the way you did in Quote #13. Or that they both work well if you refactor your code taking into account the specific characteristics of your computations, as shown in the include file in Quote #28 as part of the calc_aux subroutine.

My view then with PDTs vs derived types with ALLOCATABLE components is the choice really boils down to 1) quality of implementation of PDTs (e.g., PDTs run into too many bugs in gfortran) and 2) writeability and readability of coding certain data structures; PDTs can be great in this aspect as shown with flexible SoA and AoS pattern in Quote #9.

Nonetheless you need to study your options well and analyze your code and computing requirements thoroughly before investing too much time in an option all too quickly only to realize later that it was inadequate for your needs.

FortranFan · ‎02-17-2019

AT90 wrote:
@FortranFan, I have tested the code and I indeed get similar results as you on my Mac with Core i5@2.6GHz. However, (and this is my motivation for opening this thread again actually), when running it on my Linux station with Intel Xeon@3.6GHz. I get all three cases to be running at same speed.
The latest code I showed where I compared the three different cases I did, despite the issue with vectorization, get the pdt to run much faster than declared type on my mac. But, on the Linux it was the reverse scenario; I got the pdt to run MUCH slower. Your code just made it run as fast as the declared type on the Linux.
PS: Don't know if this is going to help; but the third block/loop with no types ran with my and your code at same speed as declared type on the Linux, but on Mac both codes showed that the declared type was much slower than the other two.
This is a bit discouraging I must admit, since I ultimately would be running the real cases on big linux machines.

@jim mentioned the registers which could be the cause, how can I compare the registers on both machines. I would think the Linux has a bigger one, just because the processor is stronger, larger cache size etc, and if my Mac's register is not exhausted how could that be on the Linux. I could also be completely wrong on the assumption that the register on my Linux is larger.

You may want to request support toward your Linux issues with Intel at their Online Service Center: https://supporttickets.intel.com/servicecenter?lang=en-US

jimdempseyatthecove · ‎02-17-2019

AT90,

Comments on FortranFan's excellent post #28:

blk1, using pdt_t produces a single pdt with contiguous arrays x(N), ... zzz(N)
Thus X(1), X(2), X(3), ... are contiguous and favor vectorization
call pdt%calc_aux calls caux with the individual arrays within this%
Thus, with exception to the overhead of this call, executes the same code as does blk3 (direct arrays)

blk2, using dt_t, a (contiguous) array of dt_t's, with discontiguous dt(1)%x, dt(2)%x, ...
Thus dt(1)%x, dt(2)%x, ... is incompatible with vectorization (other than with scatter/gather)
Note, vectorization of a member (e.g. X, or Y, ...) with scatter/gather can have degraded performance to that of scalar code.
A redeaming value though is this(i)%x is contiguous with this(i)%y, is condiguouse with this%(i)...
thus presenting a high probability of L1 cache line hit
The specific computation chosen by the test code could (for this test program) could potentially
be optimized by the compiler such that the (for AVX/AVX2), using R8 types, with vector width of 4 doubles,
vector containing x, y, z, xx
vector containing yy, zz, xxx, yyy
scalar containing zzz
However in practice, the computations may not be as favorable to this test program's optimization...
as well as the compiler optimization might not consider this optimization.

blk3, using directly allocatable arrays produce contiguous arrays X, Y, ... zzz
Thus X(1), X(2), X(3), ... are contiguous and favor vectorization
call pdt%calc_aux calls caux with the individual arrays X, Y, ... zzz
Thus, without the overhead of the wrapper call as done by blk1

Additionally

calc_aux.f90 is using DO CONCURRENT(i = 1,size(x))

Recall in my post #16 that register pressure can be reduced by using multiple DO loops. DO concurrent instructs the compiler that it is permissible to automatically perform this conversion for you. However, a potential disadvantage is the output array had to be reprocessed after the do concurrent to produce the sum value into aux. Benchmarking would show you if using DO CONCURRENT or manually breaking the loop into multiple loops is more effective.

Jim Dempsey

AThar2 · ‎02-17-2019

@Jim Thanks for the comment. Does exploitation of L1 cache and vectorization not go hand in hand. Also, is the code in Block 1 not more cache friendly then the one in block 2?

And is that not the case generally for pdt comapred to derived types

again thanks to you all, really fruitfull!

I have been in contact with support team regarding the discrepancy I mentioned between Linux and mac and now when you clearified that the compiler can freely split the loop, it makes the problem related to the register less likely.

jimdempseyatthecove · ‎02-17-2019

If your perspective is from the viewpoint of both blocks being executed using scalar (single) floating point instructions, then the two blocks will have similar L1 cache hit performances.

However, if you consider should block 1 be vectorized, and block 2 not, then consider that

per vector loop iteration, block 1 will have a lower L1 cache hit ratio, however throughput will be faster due to the vector width processing 4 (R8) elements at a time for AVX/AVX2 or 8 (R8) elements at a time for AVX512. For your sample code, using AVX/AVX2 vector code (256 bits), L1 cache hit ratio would be approximately 50% (2 vectors per cache line). On the other hand, using scalar code, assuming R8 types, you would attain approximately 87.5 (7/8) L1 cache hit ratio. Although the hit ratio is better using scalar due to the throughput being less for scalar because the scalar issuing 4x more data fetches and 4x more instruction fetches than vector (double that for AVX512).

If performance is a high priority, then you may need to consider exchanging some elegance in source code off against gaining elegance in machine code generated by the compiler.

If your coding project is just starting, and if code performance is paramount, then it will behoove you to spend some time with data layout of your anticipated heavy computational requirements. IOW construct representative test code from the bottom-up. Then using that information, design your data layout such that it favors performance.

If you are re-coding an existing set of code, then you might be able to look at, and if necessary, reorganize data layout to yield better performance. This article I wrote in 2016 may provide some insight to you.

Jim Dempsey

AThar2 · ‎02-17-2019

Thanks @jim,

jimdempseyatthecove wrote:
If your perspective is from the viewpoint of both blocks being executed using scalar (single) floating point instructions, then the two blocks will have similar L1 cache hit performances.
Will block 1 not have better L1 cache because memory is contiguous (e.g. provided we do something similar to what @FortranFan showed us). I remember you describing very nicely once about the L1 cache issues with declared types versus arrays. Here with block 1 we are almost imitating the scenario with having arrays.

jimdempseyatthecove wrote:
per vector loop iteration, block 1 will have a lower L1 cache hit ratio, however throughput will be faster due to the vector width processing 4 (R8) elements at a time for AVX/AVX2 or 8 (R8) elements at a time for AVX512. For your sample code, using AVX/AVX2 vector code (256 bits), L1 cache hit ratio would be approximately 50% (2 vectors per cache line). On the other hand, using scalar code, assuming R8 types, you would attain approximately 87.5 (7/8) L1 cache hit ratio. Although the hit ratio is better using scalar due to the throughput being less for scalar because the scalar issuing 4x more data fetches and 4x more instruction fetches than vector (double that for AVX512).
So here is my understanding: When we send, say, 4 elements for array xx(:) at a time to vectorize, the L1 cache might only be able to store 6 at a time. Hence, the last two elements will be wasted so in this case we have a hit ratio of around 66%. Is that how we should understand it?
Based on your experience, is it possible to tell beforehand which directions a programmer should go in regards with intense computations involving up to and above 1k cores and multiple nodes.
I guess it is all question about whether to go for better L1 hit ratio or vectorization. ?

Finally, if pdt (block1) is more L1 cache friendly (my first question), one could go with that and once turning on the computation decide wether to go most vectorization or being mostly L1 friendly - while if I went for declared type the option regarding L1 cache is no longer as viable.

Thanks a lot for referring that paper. Will be my first thing to do tomorrow.

jimdempseyatthecove · ‎02-18-2019

>>Will block 1 not have better L1 cache because memory is contiguous...

Blocks 1 and 3 (in scalar execution) will be cache friendly to X(1), X(2), ..., ZZZ(N) access patterns. Meaning you can loop on X, then Y, then X, ... .OR. loop on X, Y, Z, then loop on XX, YY, ZZ, then loop on XXX, YYY, ZZZ

Block 2 (in scalar execution, will be cache friendly for dt(1)%x, dt(1)%y, dt(1)%z, ... dt(1)%zzz, through dt(N)%zzz. Meaning one loop with all members accessed in each iteration.

Blocks 1 and 3 are favorable to vectorization, whereas Block 2 is not for non-scatter/gather.

There is no/negligible performance optimization difference between block 1 and block 2 seeing that the compute loop is the same. the only difference is in the argument preparation for the CALL to the compute function. The CALL for block 1 may require each argument reference to include an integer add instruction (address of pdt + offset to X, address of pdt + offset to Y, ...) a trivial amount of work. Note, due to the arrays for both pdt and standard arrays not being directed as being aligned, the alignment of each array is a "luck of the draw". Alignment differences can affect performance by a few percentage points.

>>So here is my understanding: When we send, say, 4 elements for array xx(:) at a time to vectorize, the L1 cache might only be able to store 6 at a time.

Your understanding of cache is incorrect. It might be beneficial to look at the Wikipedia Article Cache Hierarchy for general information, and How Memory Is Accessed from the IDZ articles. For a more in-depth description as it pertains to Intel CPUs.

A quickie description is (not applicable to all CPUs)

CPUs generally have 2 or 3 cache levels, some/few have 4 levels.:
L1 (~4 or so clock tick access) is faster than L2 (~12 or so clock tick access) is faster than L3 (if present, ~100 or so clock tick access), is faster than RAM access.
L1 is typically 32KB, L2 varies but often is 256 KB, L3 varies often in range of 4MB to 32MB.
Bytes are not cached, rather collections of contiguous bytes are cached, typically in 64 byte, 64-byte aligned data
The cache holds not only the data, but it also holds the address and a tag (and status bits).
The tag is constructed from the address and is typically a reduced (hash) number of bits to facilitate faster search time.
A cache miss at a given level will require a fetch/store from/to a higher level cache or main memory
Advanced cache designs, will detect patterns of access such as streaming loads or fixed stride loads and perform a hardware pre-fetch of the data from higher level cache and/or RAM

Consider what this (may) mean to FortranFan's post #28

N=50 million, the number of properties (arrays) you have is 9, thus:
R4 data the total data is 9 * 4 * 50E6 = 1.8GB
R8 data the total data is 9 * 8 * 50E6 = 3.6GB

Both scenarios easily exceed the capacity of the last level cache and thus should the major loop (DO I=1,N) be executed within an outer loop, the data will not be located in cache.

Let's now look at an example loop (not from your sample code)

do i=1, N ! N = 50 million
  x(i) = x(i) + SomeValue
end do

Assume X is a real of type R4, therefore a 64-byte cache line can contain 16 4-byte R4's

In scalar mode the loop will have:

N/16 memory fetches
N - N/16 L1 cache hits
N adds
N/16 memory stores (assuming write combining by hardware)

In vector mode, using AVX or AVX2

N/16 memory fetches
N/8 - N/16 L1 cache hits
N/8 adds
N/16 memory stores (assuming write combining by hardware)

In vector mode using AVX512 (Skylake...)

N/16 memory fetches
0 L1 cache hits
N/16 adds
N/16 memory stores (assuming write combining by hardware)

Note (strongly) that the wider the vector the lower the hit ratio.... but also the lower the number of hits required. Too often a noob programmer is taught in CS "increase the cach hit ratio", and "attain perfect scaling ratio". Unfortunately this can at times go against "attain best performance". In real-world applications, you are not graded on the first two goals, rather you are graded on the last goal.

I will leave it an exercise for you to extend the operations by latencies.

Additional information. Recall that CPU cache architecture (most Intel) have hardware pre-fetchers. These implementations vary on processor design, but have a limited number of pre-fetchers. The number will vary depending on CPU, but it may be a good assumption that none of them have 9 pre-fetchers. Therefore, it may behoove you to experiment with structure your code to determine the most effective number of member variable (arrays) to process in each loop.

>>Based on your experience, is it possible to tell beforehand which directions a programmer should go in regards with intense computations involving up to and above 1k cores and multiple nodes

Now you present a completely different runtime requirements and beyond the scope of this thread. For the example code stated in this thread, a single node would be faster than any multi-node solution.

>>Finally, if pdt (block1) is more L1 cache friendly (my first question), one could go with that and once turning on the computation decide wether to go most vectorization or being mostly L1 friendly - while if I went for declared type the option regarding L1 cache is no longer as viable.

The computation used in block 1 and block 3 are the same. It should relatively easy to experiment with using a pdt and an analog of pdt dt, where dt declares the arrays x, y, ... zzz as ALLOCATABLE together with an attribute that the allocation be aligned to cache line (64 bytes). This would require to have an init function to perform the allocations and a finish function to deallocate the arrays.

As you saw in the cache comments the goal isn't necessarily L1 cache friendly, rather it is the sum of the memory fetches together with sum of cache hits.

Jim Dempsey

AThar2 · ‎03-09-2019

Hello again

@jimdempseyatthecove : I have made more tests to @FortranFan's code. I have two questions.

1)

First, if you look at the optimisation report, it says that the arrays in 'calc_aux.f90` has unaligned access. That does not make sense since we are declaring them as contiguous. I have tried to compile it with -align array32byte but that did not change it. I tried to follow some of the guidlines in this article https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization - but it did not help.

I then tried to do the alternative ( i have attached the two files of the complete code (it is jsut an extension of FortranFan's code:

   type :: soa
      !DIR$ ATTRIBUTES ALIGN:32::x   ! That does not help in the optimisation report when looping
      real, allocatable :: x(:)
      !DIR$ ATTRIBUTES ALIGN:32::y
      real, allocatable :: y(:)
      !DIR$ ATTRIBUTES ALIGN:32::z
      real, allocatable :: z(:)
      real, allocatable :: xx(:)
      real, allocatable :: yy(:)
      real, allocatable :: zz(:)
      real, allocatable :: xxx(:)
      real, allocatable :: yyy(:)
      real, allocatable :: zzz(:)
   contains
      private
      procedure, pass(this) :: calc_aux_R8_soa
      generic, public :: calc_aux => calc_aux_R8_soa
   end type soa

   subroutine calc_aux_R8_soa( this, aux )
      implicit none
      class(soa), intent(inout) :: this
      real(kind=R8), intent(out)            :: aux
      real(R8) :: t1, t2
      integer :: i

      call cpu_t( time=t1 )
      aux = 0.0_R8
      !DIR$ ASSUME_ALIGNED this%x(1):32    ! - that helped!!!! 
      !DIR$ ASSUME_ALIGNED this%y(1):32
      !DIR$ ASSUME_ALIGNED this%z(1):32
      !DIR$ ASSUME_ALIGNED this%xx(1):32
      !DIR$ ASSUME_ALIGNED this%yy(1):32
      !DIR$ ASSUME_ALIGNED this%zz(1):32
      !DIR$ ASSUME_ALIGNED this%xxx(1):32
      !DIR$ ASSUME_ALIGNED this%yyy(1):32
      !DIR$ ASSUME_ALIGNED this%zzz(1):32

      do concurrent( i = 1: size(this% x) )

         this%x(i) = 56*real(i)-1000
         aux = aux  + this%x(i)

         this%y(i) = 56*real(i)-1000*log(real(i))
         aux = aux  - this%y(i)

         this%z(i) = 56*real(i)-1000*log(real(i+1))
         aux = aux  + this%z(i)

         this%xx(i) = 56*real(i)-1000*log(real(i+10))
         aux = aux  - this%xx(i)

         this%yy(i) = 56*real(i)-1000*log(real(i+3))
         aux = aux  + this%yy(i)

         this%zz(i) = 56*real(i)-1000*log(real(i+2))
         aux = aux  - this%zz(i)

         this%xxx(i) = 56*real(i)-1000*log(real(i+5))
         aux = aux  + this%xxx(i)

         this%yyy(i) = 56*real(i)-1000*log(real(i+4))
         aux = aux  - this%yyy(i)

         this%zzz(i) = 56*real(i)-1000*log(real(i+9))
         aux = aux  - this%zzz(i)

      end do
      call cpu_t( time=t2 )
      calc_time = t2 - t1

      return
   end subroutine calc_aux_R8_soa

The latter bit did not show unaligned access when I had the "DIR" directives in the subroutine. When I declare them inside the type itself it did not make any change.

Also, can you explain why the optimisation report says that the loop (shown above) is not parallelized because of a dependency. However, a few lines below it says the loop is VECTORIZED?

AThar2 · ‎03-09-2019

My second question :

How do I check which vectorization machine I have. I have looked online and the closest thing I come to is running 'cat /proc/cpuinfo` on linux. When I look at my flags I see many things, among other, avx512f , avx, avx2, avx512dq. I initially compiled with `-xcore-avx512` however I realized that made it slow. When I did `-xcore-avx2` it became much better. Hence, I also had to change my Assumed aligned from 64 to 32.

Is there away to make my assumed aligned byte value defined at compilation time so I do not need to change it whenever I change machine, and how do we best read these flags to make sure we compile with the correct/appropriate flags.

Thanks a lot !

jimdempseyatthecove · ‎03-09-2019

>>Is there away to make my assumed aligned byte value defined at compilation time so I do not need to change it whenever I change machine

Define it as an environment variable, and use that in your make file... or align at 64 (cache line size) this won't have any performance difference, and a very small memory loss if any.

A (potential) problem with vectorization of the code in #40 is DO CONCURRENT does not have a REDUCTION clause. As such, what you are doing with aux may be problematic. Try using OpenMP for concurrency

... ! you setup OpenMP
!$omp parallel do simd reduction(+:aux)
do i = 1, size(this% x) 

      this%x(i) = 56*real(i)-1000
      aux = aux  + this%x(i)

      this%y(i) = 56*real(i)-1000*log(real(i))
      aux = aux  - this%y(i)

      this%z(i) = 56*real(i)-1000*log(real(i+1))
      aux = aux  + this%z(i)

      this%xx(i) = 56*real(i)-1000*log(real(i+10))
      aux = aux  - this%xx(i)

      this%yy(i) = 56*real(i)-1000*log(real(i+3))
      aux = aux  + this%yy(i)

      this%zz(i) = 56*real(i)-1000*log(real(i+2))
      aux = aux  - this%zz(i)

      this%xxx(i) = 56*real(i)-1000*log(real(i+5))
      aux = aux  + this%xxx(i)

      this%yyy(i) = 56*real(i)-1000*log(real(i+4))
      aux = aux  - this%yyy(i)

      this%zzz(i) = 56*real(i)-1000*log(real(i+9))
      aux = aux  - this%zzz(i)

   end do
!$omp end parallel do simd
...

Jim Dempsey

declared type and the issue of contiguous memory