Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.
28383 Discussions

declared type and the issue of contiguous memory

AThar2
Beginner
2,808 Views

Hello 

 

I know this a has been covered before, but I am specifically looking for two answers to my question. First Have a look at the following code 

 

 

      program test

      implicit none

            type :: data
               integer            :: ias     = -7777
               integer            :: is       = 0
               integer            :: ic       = 1
               real               :: x1(3)    = 0.
               real               :: xx(3)    = 0.
               real               :: u1(3)    = 0.    !x1
               real               :: uu(3)    = 0.    !x2
               real               :: d1       = 0.
               real               :: dd       = 0.
               real               :: m1       = 0.
               real               :: mm       = 0.
               real               :: q        = 0.
               real               :: nd       = 0.
               real               :: dtr      = -1.
               real               :: ydef     = 0.
               real               :: dydefdt  = 0.
               real               :: uflcf(3) = 0.
               real               :: elapst   = 0.
               real               :: dist     = 0.
               real               :: dt       = 1.e-3
               real               :: ta2ud     = 0.
               real               :: eddyt    = huge(1e99)
               real               :: usssa(3) = 0.
               real               :: qq2        = 0.    ! -
               real               :: nsda       = 0.    ! -
            real               :: ddtqq      = -1.   ! -
      end type data
      integer(kind=8) :: i, n
      type(data), allocatable      :: dat(:)
      real :: rmd, rand,t1,t2, omp_get_wtime
      real, allocatable::x(:,:)
      n = 50*10**6
      allocate(dat(n))
      allocate(x(3,n))
      rmd = rand(2.0)
      t1 = omp_get_wtime()
      do i = 1, n
          dat(i)%x1(1) = rmd**2 - exp(real(i))
          dat(i)%x1(2) = dat(i)%x1(1) - exp(real(i))
          dat(i)%x1(3) = dat(i)%x1(2) - exp(real(i))
       !  x(1,i) = rmd**2 - exp(real(i))
       !  x(2,i)= x(1,i)  - exp(real(i))
       !  x(3,i)=x(2,i)- exp(real(i))
      enddo


      t2 = omp_get_wtime()
      print*, t2-t1

      end program

 

 

Now the difference between doing x(1,i) = ** and dat(i)% x1(1:3) is quite a lot when comparing t2-t1, which is because of the way memory is stored and cache access. 

I firstly don't understand that printed t2-t1 exclusively for the case with dat(i)%.. can be way smaller then when compared to a physical stop watch. I felt sometime that the t2-t1 was quite smaller than what it felt, so when trying to use my stop watch I observed a a difference.  

Secondly, and most importantly, if we for some reason are reluctant to go away from the declared type style, is there really no way the compiler can be tricked to consider those members of the declared type as continguous individually ? 

 

 

0 Kudos
1 Solution
FortranFan
Honored Contributor II
2,725 Views

AT90 wrote:

.. most importantly, if we for some reason are reluctant to go away from the declared type style, is there really no way the compiler can be tricked to consider those members of the declared type as continguous individually ? ..

.. how is this achievable is this in Fortran without being verbose. 

.. isnt there a more elegant way?

@AT90,

Would you admit what's "elegant" is in the eyes of the beholder?!  With situations like this, I've often suggested to readers to consider the facility introduced in standard Fortran with the 2003 revision which is parameterized derived type (PDT):

https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-parameterized-derived-type-declarations#82788EA3-1C9C-4C06-AAF1-4269F412C03C

With this facility, one can adopt a design which can, with some understanding and consideration, be used with both the "Array of Structs" case that you are using now as well as with the "Struct of Arrays" case which is what will be recommended when better performance via faster processing of in-memory data is desired.

Take a look at this somewhat modified example of the code you show in the original post and please report here if this is "elegant" enough for you!

module kinds_m
   use, intrinsic :: iso_fortran_env, only : I8 => int64
   implicit none
   integer, parameter :: R4 = selected_real_kind( p=6 )
   integer, parameter :: R8 = selected_real_kind( p=12 )
end module

module cpu_m
   use kinds_m, only : I8, R8
   implicit none
contains
   subroutine cpu_t( time )
      !.. Argument list
      real(R8), intent(inout) :: time
      !.. Local variables
      integer(I8) :: tick
      integer(I8) :: rate
      call system_clock (tick, rate)
      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )
      return
   end subroutine cpu_t
end module

module data_m

   use kinds_m, only : R4, R8
   use cpu_m, only : cpu_t

   implicit none
   
   type :: data_t(K,N)
      integer, kind :: K = R4
      integer, len :: N = 1
      real(kind=K) :: x1(3,N)
   end type data_t

   real(R8), protected, public :: init_time = 0.0_r8

   interface init_data
      module procedure init_data_scalar
      module procedure init_data_rank1
   end interface

contains

   subroutine init_data_rank1( dat, val )

      ! Arguments
      type(data_t(K=R4,N=*)), intent(inout) :: dat(:)
      real(R4), intent(in)                  :: val

      ! Local variables
      real(kind=R8) :: start_time
      real(kind=R8) :: end_time
      integer :: i

      if (dat(1)%N /= 1 ) return !<== this specific routine can be for N == 1 only

      call cpu_t( start_time )
      do i = 1, size(dat)
         dat(i)%x1(1,1) = val**2 - log(real(i))
         dat(i)%x1(2,1) = dat(i)%x1(1,1) - log(real(i))
         dat(i)%x1(3,1) = dat(i)%x1(2,1) - log(real(i))
      end do
      call cpu_t( end_time )

      init_time = end_time - start_time

      return

   end subroutine init_data_rank1

   subroutine init_data_scalar( dat, val )

      ! Arguments
      type(data_t(K=R4,N=*)), intent(inout) :: dat
      real(R4), intent(in)                  :: val

      ! Local variables
      real(kind=R8) :: start_time
      real(kind=R8) :: end_time
      integer :: i

      if (dat%N <= 1 ) return !<== this specific routine can be for N > 1 only

      call cpu_t( start_time )
      do i = 1, dat%N
         dat%x1(1,i) = val**2 - log(real(i))
         dat%x1(2,i) = dat%x1(1,i) - log(real(i))
         dat%x1(3,i) = dat%x1(2,i) - log(real(i))
      end do
      call cpu_t( end_time )

      init_time = end_time - start_time

      return

   end subroutine init_data_scalar

end module data_m
program test
   
   use kinds_m, only : R4, I8
   use data_m, only : data_t, init_data, init_time
   
   implicit none
   
   ! Local variables
   real(R4) :: rnd
   character(len=*), parameter :: fmtv = "(g0,t12,g0)"
   character(len=*), parameter :: fmtt = "(g0,f10.3)"
   
   call random_number( rnd )
   
   blk_AoS: block
      integer(kind=I8) :: n
      type(data_t(K=R4,N=1)), allocatable :: dat(:)
      print *, "Case: Array of Structs (AoS) "
      n = 50*10**6
      allocate( dat(n) )
      call init_data( dat, val=rnd )
      print fmtt, "Init time (seconds): ", init_time
      print *, "Output some values"
      print fmtv, "i", "x1(1,i)"
      print fmtv, 1, dat(1)%x1(1,1)
      print fmtv, n, dat(n)%x1(1,1)
   end block blk_AoS
   print *
   blk_SoA: block
      integer(kind=I8) :: n
      type(data_t(K=R4,N=:)), allocatable :: dat
      print *, "Case: Struct of Arrays (SoA) "
      n = 50*10**6
      allocate( data_t(K=R4,N=n) :: dat )
      call init_data( dat, val=rnd )
      print fmtt, "Init time (seconds): ", init_time
      print *, "Output some values"
      print fmtv, "i", "x1(1,i)"
      print fmtv, 1, dat%x1(1,1)
      print fmtv, n, dat%x1(1,n)
   end block blk_SoA
   
   stop
   
end program

Upon one execution of this example on Windows platform, the output is:

 Case: Array of Structs (AoS)
Init time (seconds):      8.538
 Output some values
i          x1(1,i)
1          .1537321E-12
50000000   -17.72753

 Case: Struct of Arrays (SoA)
Init time (seconds):      0.147
 Output some values
i          x1(1,i)
1          .1537321E-12
50000000   -17.72753
Press any key to continue . . .

As you will expect, the CPU time with the "Struct of Arrays" case is significantly faster than the "Array of Structs" case.  Note the results are identical in the 2 cases to show you it's a valid comparison.

You can take the above example and extend it as appropriate with your derived type for 'data' and check whether this is something you can use in your actual code.

P.S.> By the way, what's up with your instruction "exp(real(i))" when you are using default real kind with a range of about 1E38 in Intel Fortran?  What do you expect to happen 'i' becomes big with the large data sizes you are using?  For illustration purposes, the log function has been used in place of the exponential one.

 

View solution in original post

0 Kudos
45 Replies
jimdempseyatthecove
Honored Contributor III
1,403 Views

Try this:

t1 = omp_get_wtime()
do i = 1, n
  dat(i)%x1(1) = rmd**2 - sqrt(real(i))
  dat(i)%x1(2) = dat(i)%x1(1) - sqrt(real(i))
  dat(i)%x1(3) = dat(i)%x1(2) - sqrt(real(i))
enddo
t2 = omp_get_wtime()
print*, t2-t1

t1 = omp_get_wtime()
do i = 1, n
  x(1,i) = rmd**2 - sqrt(real(i))
  x(2,i)= x(1,i)  - sqrt(real(i))
  x(3,i)=x(2,i)- sqrt(real(i))
enddo
t2 = omp_get_wtime()
print*, t2-t1

! code to force compiler to .NOT. remove code
t1 = 0.0
do i = 1, n
  t1 = t1 + dat(i)%x1(1)
  t1 = t1 + dat(i)%x1(2)
  t1 = t1 + dat(i)%x1(3)
  t1 = t1 + x(1,i)
  t1 = t1 + x(2,i)
  t1 = t1 + x(3,i)
enddo
if(t1 == 0) print *,"not going to happen"

It is important to insert some use of the data produced such that the compiler does not remove the code due to results not being used.
(this is similar to dead code elimination - dead results elimination)

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,403 Views

Exp is not vectorizable, sqrt is  vectorizable. Also sqrt is heavy weight. You might also consider removing sqrt and test just use real(i)

Jim Dempsey

0 Kudos
AThar2
Beginner
1,403 Views

Thanks for the reply Jim!,

Is there away to make each member of the declared type a continuous array. 
What I mean is if it is possible to make something like having that dat(:)% x(1) is contiguous occupying the first XX memory addresses and then dat(:)% x(2) occupying the next XX memory addresses  etc..

 

As I understand it now is that dat(1) has all its members occupying an X amount of memory where X is the memory of all its members at dat(1) and then dat(2) for the next etc. 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,403 Views

>>Is there away to make each member of the declared type a continuous array.

Not that I know of. At some time in one of my earlier posts I suggested that it would be nice to introduce an aggregate type that would do just that. There would still be some issues that would have to be worked out, for example for those user desiring to allocate individual nodes. Perhaps the way to deal with this is to use an attribute on the array descriptor

type(data), aggregate, allocatable,      :: dat(:)  ! *** aggregate is proposed

Then:

dat(i)%x1(1) = rmd**2 -sqrt(real(i))

is converted into:

dat%x1(1,i) = rmd**2 -sqrt(real(i)) ! dat as aggregate with implicit repositioned rank

I think this would be a good language extension. (Integrating with debugger may be an issue)

Jim Dempsey

0 Kudos
Steve_Lionel
Honored Contributor III
1,403 Views

What you want is a "Structure of Arrays" instead of "Array of Structures". This is a classic refactoring for exactly the purpose you want.

0 Kudos
AThar2
Beginner
1,403 Views

Hi Steve 

yes this is in practice what I want. But, how is this achievable is this in Fortran without being verbose.

Could write

 

type dat

real, allocatable :: c(:)

...

end type

 

then just declare type as on instance and allocate all its members with the wirgt dimension.

but isnt there a more elegant way?

0 Kudos
Steve_Lionel
Honored Contributor III
1,403 Views

There is no more elegant way, sorry. I'm not sure there is in any language.

0 Kudos
FortranFan
Honored Contributor II
2,726 Views

AT90 wrote:

.. most importantly, if we for some reason are reluctant to go away from the declared type style, is there really no way the compiler can be tricked to consider those members of the declared type as continguous individually ? ..

.. how is this achievable is this in Fortran without being verbose. 

.. isnt there a more elegant way?

@AT90,

Would you admit what's "elegant" is in the eyes of the beholder?!  With situations like this, I've often suggested to readers to consider the facility introduced in standard Fortran with the 2003 revision which is parameterized derived type (PDT):

https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-parameterized-derived-type-declarations#82788EA3-1C9C-4C06-AAF1-4269F412C03C

With this facility, one can adopt a design which can, with some understanding and consideration, be used with both the "Array of Structs" case that you are using now as well as with the "Struct of Arrays" case which is what will be recommended when better performance via faster processing of in-memory data is desired.

Take a look at this somewhat modified example of the code you show in the original post and please report here if this is "elegant" enough for you!

module kinds_m
   use, intrinsic :: iso_fortran_env, only : I8 => int64
   implicit none
   integer, parameter :: R4 = selected_real_kind( p=6 )
   integer, parameter :: R8 = selected_real_kind( p=12 )
end module

module cpu_m
   use kinds_m, only : I8, R8
   implicit none
contains
   subroutine cpu_t( time )
      !.. Argument list
      real(R8), intent(inout) :: time
      !.. Local variables
      integer(I8) :: tick
      integer(I8) :: rate
      call system_clock (tick, rate)
      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )
      return
   end subroutine cpu_t
end module

module data_m

   use kinds_m, only : R4, R8
   use cpu_m, only : cpu_t

   implicit none
   
   type :: data_t(K,N)
      integer, kind :: K = R4
      integer, len :: N = 1
      real(kind=K) :: x1(3,N)
   end type data_t

   real(R8), protected, public :: init_time = 0.0_r8

   interface init_data
      module procedure init_data_scalar
      module procedure init_data_rank1
   end interface

contains

   subroutine init_data_rank1( dat, val )

      ! Arguments
      type(data_t(K=R4,N=*)), intent(inout) :: dat(:)
      real(R4), intent(in)                  :: val

      ! Local variables
      real(kind=R8) :: start_time
      real(kind=R8) :: end_time
      integer :: i

      if (dat(1)%N /= 1 ) return !<== this specific routine can be for N == 1 only

      call cpu_t( start_time )
      do i = 1, size(dat)
         dat(i)%x1(1,1) = val**2 - log(real(i))
         dat(i)%x1(2,1) = dat(i)%x1(1,1) - log(real(i))
         dat(i)%x1(3,1) = dat(i)%x1(2,1) - log(real(i))
      end do
      call cpu_t( end_time )

      init_time = end_time - start_time

      return

   end subroutine init_data_rank1

   subroutine init_data_scalar( dat, val )

      ! Arguments
      type(data_t(K=R4,N=*)), intent(inout) :: dat
      real(R4), intent(in)                  :: val

      ! Local variables
      real(kind=R8) :: start_time
      real(kind=R8) :: end_time
      integer :: i

      if (dat%N <= 1 ) return !<== this specific routine can be for N > 1 only

      call cpu_t( start_time )
      do i = 1, dat%N
         dat%x1(1,i) = val**2 - log(real(i))
         dat%x1(2,i) = dat%x1(1,i) - log(real(i))
         dat%x1(3,i) = dat%x1(2,i) - log(real(i))
      end do
      call cpu_t( end_time )

      init_time = end_time - start_time

      return

   end subroutine init_data_scalar

end module data_m
program test
   
   use kinds_m, only : R4, I8
   use data_m, only : data_t, init_data, init_time
   
   implicit none
   
   ! Local variables
   real(R4) :: rnd
   character(len=*), parameter :: fmtv = "(g0,t12,g0)"
   character(len=*), parameter :: fmtt = "(g0,f10.3)"
   
   call random_number( rnd )
   
   blk_AoS: block
      integer(kind=I8) :: n
      type(data_t(K=R4,N=1)), allocatable :: dat(:)
      print *, "Case: Array of Structs (AoS) "
      n = 50*10**6
      allocate( dat(n) )
      call init_data( dat, val=rnd )
      print fmtt, "Init time (seconds): ", init_time
      print *, "Output some values"
      print fmtv, "i", "x1(1,i)"
      print fmtv, 1, dat(1)%x1(1,1)
      print fmtv, n, dat(n)%x1(1,1)
   end block blk_AoS
   print *
   blk_SoA: block
      integer(kind=I8) :: n
      type(data_t(K=R4,N=:)), allocatable :: dat
      print *, "Case: Struct of Arrays (SoA) "
      n = 50*10**6
      allocate( data_t(K=R4,N=n) :: dat )
      call init_data( dat, val=rnd )
      print fmtt, "Init time (seconds): ", init_time
      print *, "Output some values"
      print fmtv, "i", "x1(1,i)"
      print fmtv, 1, dat%x1(1,1)
      print fmtv, n, dat%x1(1,n)
   end block blk_SoA
   
   stop
   
end program

Upon one execution of this example on Windows platform, the output is:

 Case: Array of Structs (AoS)
Init time (seconds):      8.538
 Output some values
i          x1(1,i)
1          .1537321E-12
50000000   -17.72753

 Case: Struct of Arrays (SoA)
Init time (seconds):      0.147
 Output some values
i          x1(1,i)
1          .1537321E-12
50000000   -17.72753
Press any key to continue . . .

As you will expect, the CPU time with the "Struct of Arrays" case is significantly faster than the "Array of Structs" case.  Note the results are identical in the 2 cases to show you it's a valid comparison.

You can take the above example and extend it as appropriate with your derived type for 'data' and check whether this is something you can use in your actual code.

P.S.> By the way, what's up with your instruction "exp(real(i))" when you are using default real kind with a range of about 1E38 in Intel Fortran?  What do you expect to happen 'i' becomes big with the large data sizes you are using?  For illustration purposes, the log function has been used in place of the exponential one.

 

0 Kudos
AThar2
Beginner
1,403 Views

@FortranFan This is great!. Thanks a lot for this suggestion, wasn't aware of it before now. 

Two question:

 

1) I prefer to compile with desired integer and real byte, so I could avoid specifying the kind in the declared type and let the default work it out. 

2) I see you use assumed shape in the subroutine call 

 

type(data_t(K=R4,N=*)), intent(inout) :: dat(:)

 

Could I do 

type(data_t(K=R4,N=size_data_t)), intent(inout) :: dat(:)

 

where size_data_t is an "integer, intent(in)" in that routine?

0 Kudos
AThar2
Beginner
1,408 Views

Never mind the above, I just tested it myself and both seem to be accepted by the compiler

0 Kudos
FortranFan
Honored Contributor II
1,408 Views

@AT60,

Glad you tested out the code with your changes too.

Re: your point above with compilation using desired integer and real 'bytes', portability can be an issue.  Fortran standard is a 'friend' here, it has gone to great length to help in this aspect, and coders will generally find it beneficial to employ standard language features and to minimize compiler-option driven programs.

Re: your point, yes you can do so as you found out.  The standard states, "The length type parameter values of a present actual argument shall agree with the corresponding ones of the dummy argument that are not assumed"

0 Kudos
AThar2
Beginner
1,408 Views

@FortranFan 

I realized something with parametrized declared type which is a bit unexpected to me. 

It seems like the fortran compiler is unaware of the memory stride of such types. I realized that by writing out the optimization report and realized the compiler refuses to vectorize it at first hand. After when I force it by having a $omp simd it writes out that the type is non-stride (like the normal declared type) and that the compiler is unaware of its stride. At least in a regular declared type, you have the compiler knowing the jumps in memory. 

An example of the code is given below. Compile first without -iopenmp to realize that the middle loop (parametrized) type is not vectorized, and after try with openmp to observe what the report is now telling. 

 

program test
        implicit none
        type test1(nn)
           integer, len :: nn
           real :: x(nn),y(nn),z(nn), xx(nn),yy(nn), zz(nn)
           real :: xxx(nn),yyy(nn),zzz(nn), xxxx(nn),yyyy(nn), zzzz(nn)
        end type
        type test2
          real :: x,y,z, xx,yy, zz
          real :: xxx,yyy,zzz, xxxx,yyyy, zzzz
          real :: xxx1,yyy2,zzz2, xxxx2,yyyy2, zzzz2
          real :: xxx4,yyy4,zzz4, xxxx4,yyyy4, zzzz23

       end type
       character(8) :: a
       integer :: i
       integer, parameter :: size = 5e7
       type(test1(nn=:)), allocatable :: tt
       type(test2), allocatable ::tt2(:)
       !real(8) :: t1,t2, omp_get_wtime

       real, allocatable :: x(:),y(:),z(:), xx(:),yy(:), zz(:)
       real, allocatable :: xxx(:),yyy(:),zzz(:), xxxx(:),yyyy(:), zzzz(:)

       real ::  aux

       integer(8) :: t1,t2,clock_rate,dum
       allocate(x(size),y(size),z(size), xx(size),yy(size), zz(size), &
              xxx(size),yyy(size),zzz(size), xxxx(size),yyyy(size), zzzz(size))

       allocate(tt2(size))
       allocate(test1(nn=size)::tt)

       aux = 0
       call system_clock ( t1, clock_rate, dum )

        do i = 1, size

           tt2(i)% x = 56*real(i)-1000
           aux = aux  + tt2(i)% x
           tt2(i)% y = 56*real(i)-1000*log(real(i))
           aux = aux - tt2(i)% y

           tt2(i)% z = 56*real(i)-1000*log(real(i+1))
           aux = aux  + tt2(i)% z

           tt2(i)% xx = 56*real(i)-1000*log(real(i+10))
           aux = aux  - tt2(i)% xx

           tt2(i)% yy = 56*real(i)-1000*log(real(i+3))
           aux = aux  + tt2(i)% yy

           tt2(i)% zz = 56*real(i)-1000*log(real(i+2))
           aux = aux  - tt2(i)% zz

           tt2(i)% xxx = 56*real(i)-1000*log(real(i+5))
           aux = aux  + tt2(i)% xxx

           tt2(i)% yyy = 56*real(i)-1000*log(real(i+4))
           aux = aux  - tt2(i)% yyy

           tt2(i)% zzz = 56*real(i)-1000*log(real(i+9))
           aux = aux  - tt2(i)% zzz


      enddo
      call system_clock ( t2, clock_rate, dum )
      print*, real(t2-t1)/real(clock_rate)
      print*, "Fakaart", aux;
      aux =0

      t1=0
      t2=0
      call system_clock ( t1, clock_rate, dum )
!$omp simd
      do i = 1, size
              tt% x(i) = 56*real(i)-1000
              aux = aux  + tt% x(i)

              tt% y(i) = 56*real(i)-1000*log(real(i))
              aux = aux  - tt% y(i)

              tt% z(i) = 56*real(i)-1000*log(real(i+1))
              aux = aux  + tt% z(i)

              tt% xx(i) = 56*real(i)-1000*log(real(i+10))
              aux = aux  - tt% xx(i)

              tt% yy(i) = 56*real(i)-1000*log(real(i+3))
              aux = aux  + tt% yy(i)

              tt% zz(i) = 56*real(i)-1000*log(real(i+2))
              aux = aux  - tt% zz(i)


              tt% xxx(i) = 56*real(i)-1000*log(real(i+5))
              aux = aux  + tt% xxx(i)

              tt% yyy(i) = 56*real(i)-1000*log(real(i+4))
              aux = aux  - tt% yyy(i)

              tt% zzz(i) = 56*real(i)-1000*log(real(i+9))
              aux = aux  - tt% zzz(i)

       enddo
!$omp end simd

       call system_clock ( t2, clock_rate, dum )
       print*, real(t2-t1)/real(clock_rate)
       print*, "Fakaart", aux

       aux=0
 t1=0
t2=0


      call system_clock ( t1, clock_rate, dum )
       do i = 1, size

          x(i) = 56*real(i)-1000
          aux =aux + x(i)
          y(i) = 56*real(i)-1000*log(real(i))
          aux =aux - y(i)

          z(i) = 56*real(i)-1000*log(real(i+1))
          aux = aux+ z(i)

           xx(i) =56*real(i)-1000*log(real(i+10))
           aux = aux - xx(i)

           yy(i) = 56*real(i)-1000*log(real(i+3))
           aux = aux+ yy(i)

           zz(i) = 56*real(i)-1000*log(real(i+2))
           aux = aux - zz(i)

           xxx(i) = 56*real(i)-1000*log(real(i+5))
           aux = aux+ xxx(i)

           yyy(i) = 56*real(i)-1000*log(real(i+4))
           aux = aux - yyy(i)

           zzz(i) = 56*real(i)-1000*log(real(i+9))
           aux = aux - zzz(i)

       enddo
       call system_clock ( t2, clock_rate, dum )
      print*, real(t2-t1)/real(clock_rate)
      print*, "Fakaart", aux


     end program

Does anybody else in the forum know why this is happening.

 

A serious question: If this is something expected and how things should be - Why would I even consider using parametrized type? 

And would you advise me to use auxiliary contiguous pointers which points to each member of the parametrized type - I would imagine that you in that way tweak the compiler to treat the members as contiguous  in memory?

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,408 Views

The structure of the test code above is not structured to support vectorization. The reason for this is (may be) due to aux not being a candidate for reduction. You can explicitly request reduction, however you will have to choose either: +:aux or -:aux *** and write your aux expressions accordingly. e.g. with +:aux use aux=aux + (-y(i))

The compiler should be able to accept either + or - for +:reduction but it is not stated as doing so.

Jim Dempsey

0 Kudos
AThar2
Beginner
1,408 Views

Thanks for the comment.

Two tings:

have an aux variable for two reasons,

1) I just to confirm that all three loops are doing the same.

2) I was told that some compilers might be smart enought to see that the members, xx,yy etc., are not being used further hence it does not carry the work. So by having an aux to later print it, the compiler should do the work

 

Regarding your comment on reduction.

My concern is not really associated to that. My concern is more to the comparison with the different data structures I have shown. Running with a "normal" type, the optimisation report shows that it can vectorize my loop and it knows about the stride structure.

However, when dealing with parametrized types, there are two tings:

1) The report does say that it CANNOT vectorize the loop

2) When I then enable omp simd, I see that the compiler even does not know the stride structure of that type

This being said, I would have imagined that the compiler " knows " the members of the parametrized type are contiguous in memory and after all that vectorization should be done straight forwardly as like having an allocatable array (seen in the last loop)

 

The whole concern started when I tested these three loops on my Linux station and realized that the parametrized declared type "loop" was the least efficient loop!

This is something I found very dubious. Am I missing something here

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,408 Views

Additional thing to consider. If the code generation is to SSE, and possibly to AVX, the register pressure of those loops may exceed the number of registers available to provide good vectorization. See what happens when you reduce the number of statements:

do i = 1, size

  x(i) = 56*real(i)-1000
  aux =aux + x(i)

  y(i) = 56*real(i)-1000*log(real(i))
  aux =aux - y(i)

  z(i) = 56*real(i)-1000*log(real(i+1))
  aux = aux+ z(i)

end do

do i = 1, size

  xx(i) =56*real(i)-1000*log(real(i+10))
  aux = aux - xx(i)

  yy(i) = 56*real(i)-1000*log(real(i+3))
  aux = aux+ yy(i)

  zz(i) = 56*real(i)-1000*log(real(i+2))
  aux = aux - zz(i)

end do

do i = 1, size

  xxx(i) = 56*real(i)-1000*log(real(i+5))
  aux = aux+ xxx(i)

  yyy(i) = 56*real(i)-1000*log(real(i+4))
  aux = aux - yyy(i)

  zzz(i) = 56*real(i)-1000*log(real(i+9))
  aux = aux - zzz(i)

enddo

Jim Dempsey

0 Kudos
AThar2
Beginner
1,408 Views

I am afraid I am not sure what SSE and AVX stand for or are.

Also, that does not explain the issue, does it? I am running three loop over three different data structures. Two are being vectorized and one cannot be vectorized because the compiler thinks there is a dependency. But clearly the three loops are almost identical, only the difference is that one is declared type (arrays of structure), while the other, parametrized type.

 

I really just want to understand why the compiler cannot vectorize the middle loop of the latest I have shown, and why the compiler says that the stride is unknown. Should not it be clear that test% xx(:) is contiguous, is that not what we want to achieve with parametrized types?

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,408 Views

>>I am afraid I am not sure what SSE and AVX stand for or are

A basic understanding of MMX, SSE, AVX and AVX512 is required in order to understand what "vectorization" means in the realm of computation. This instructions are in a class of Single Instruction, Multiple Data (aka SIMD)

MMX is 64 bits wide (e.g. 8 bytes, 4 words, 2 dwords, 1 qword)
SSE is 128 bits wide (widens and adds 4 floats, and 2 doubles plus more instructions)
AVX/AVX2 are 256 bits wide (widens to 8 floats and 4 doubles plus more instructions) 
AVX512(and variants) are 512 bits wide (widens to 16 floats and 8 doubles plus more instructions)

If, as an example, your system supported AVX (and the compiler generated AVX vectorized instructions), the generated loop could perform SIMD vectors of 8 floats per iteration. the loop being converted to equivalent of:

do i = 1, size, 8
  x(i:i+7) = 56*real(i:i+7)-1000
  aux(1:8) = aux(1:8) + x(i:i+7)
  ...

Or more visually informative:

temp_aux(1:8) = /0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0/
temp_reals(1:8) = /1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0/
do i = 1, size, 8
  x(i:i+7) = 56*temp_real(1:8)-1000
  temp_aux(1:8) = temp_aux(1:8) + x(i:i+7)
  ...
  temp_real(1:8) = temp_real(1:8) + 8.0
end do
aux = sum(temp_aux)

Where single instructions operate on 8 floats per operation.

Each of the instruction sets have different numbers of SIMD registers

MMX has 8 registers (integer types only)
SSE, AVX, AVX2 have 16 registers (integer and floating point types)
AVX512  has 32 registers

Depending on the code in the loop, sometimes it can be better to split a loop into multiple loops such that the code better fits within the available registers. The compiler optimization can at times perform the loop splitting for you and sometimes it is not savvy enough to do so.

Jim Dempsey

0 Kudos
AThar2
Beginner
1,408 Views

Thanks for letting me know this. I will  catch up on these instructions. However, I don't see what this would have to do with my original question? 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,408 Views

And a little more in the abstract:

temp_aux(1:8) = /0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0/ ! register
temp_reals(1:8) = /1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0/ ! register
temp_56(1:8) = /56.0, 56.0, 56.0, 56.0, 56.0, 56.0, 56.0, 56.0/ ! register
temp_1000(1:8) = /1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0/ ! register
temp_8(1:8) = /8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0/ ! register
do i = 1, size, 8
  scratch(1:8) = temp_56(1:8) ! one instruction
  scratch(1:8) = scratch(1:8) * temp_real(1:8) ! one instruction
  scratch(1:8) = scratch(1:8) - temp_1000(1:8) ! one instruction
  x(i:i+7) = scratch(1:8) ! one instruction
  temp_aux(1:8) = temp_aux(1:8) + scratch(1:8) ! one instruction
  ...
  temp_real(1:8) = temp_real(1:8) + temp_8(1:8) ! one instruction
end do
aux = sum(temp_aux) ! possibly as few as 3 instructions

Jim Dempsey

0 Kudos
AThar2
Beginner
1,396 Views

Jim, thanks very much. It is indeed nice to know this, however, how is that related to my question?

0 Kudos
Reply