Hello,
I know this has been covered before, but I am specifically looking for answers to two questions. First, have a look at the following code:
```fortran
program test
   implicit none
   type :: data
      integer :: ias = -7777
      integer :: is = 0
      integer :: ic = 1
      real :: x1(3) = 0.
      real :: xx(3) = 0.
      real :: u1(3) = 0.   !x1
      real :: uu(3) = 0.   !x2
      real :: d1 = 0.
      real :: dd = 0.
      real :: m1 = 0.
      real :: mm = 0.
      real :: q = 0.
      real :: nd = 0.
      real :: dtr = -1.
      real :: ydef = 0.
      real :: dydefdt = 0.
      real :: uflcf(3) = 0.
      real :: elapst = 0.
      real :: dist = 0.
      real :: dt = 1.e-3
      real :: ta2ud = 0.
      real :: eddyt = huge(1e99)
      real :: usssa(3) = 0.
      real :: qq2 = 0.     ! -
      real :: nsda = 0.    ! -
      real :: ddtqq = -1.  ! -
   end type data
   integer(kind=8) :: i, n
   type(data), allocatable :: dat(:)
   real :: rmd, rand, t1, t2, omp_get_wtime
   real, allocatable :: x(:,:)

   n = 50*10**6
   allocate(dat(n))
   allocate(x(3,n))
   rmd = rand(2.0)
   t1 = omp_get_wtime()
   do i = 1, n
      dat(i)%x1(1) = rmd**2 - exp(real(i))
      dat(i)%x1(2) = dat(i)%x1(1) - exp(real(i))
      dat(i)%x1(3) = dat(i)%x1(2) - exp(real(i))
      ! x(1,i) = rmd**2 - exp(real(i))
      ! x(2,i)= x(1,i) - exp(real(i))
      ! x(3,i)=x(2,i)- exp(real(i))
   enddo
   t2 = omp_get_wtime()
   print*, t2-t1
end program
```
Now, the difference in t2-t1 between the x(1,i) = ... version and the dat(i)%x1(1:3) version is quite large, which is because of the way the data is laid out in memory and accessed through the cache.
First, I don't understand why the printed t2-t1, exclusively for the dat(i)%... case, can be far smaller than the time measured with a physical stopwatch. I sometimes felt that t2-t1 was quite a bit smaller than the time that had actually passed, and when I checked with my stopwatch I did observe a difference.
Second, and most importantly: if for some reason we are reluctant to move away from the derived type style, is there really no way the compiler can be tricked into treating each of those members of the derived type as contiguous on its own?
AT90 wrote: .. most importantly, if we for some reason are reluctant to move away from the derived type style, is there really no way the compiler can be tricked into treating those members of the derived type as contiguous individually? ..
.. how is this achievable in Fortran without being verbose?
.. isn't there a more elegant way?
@AT90,
Would you agree that what's "elegant" is in the eye of the beholder?! In situations like this, I've often suggested that readers consider a facility introduced in the Fortran 2003 standard: the parameterized derived type (PDT).
With this facility, one can adopt a design which, with some understanding and consideration, can be used both for the "Array of Structs" case that you are using now and for the "Struct of Arrays" case, which is what is usually recommended when better performance through faster processing of in-memory data is desired.
Take a look at this somewhat modified example of the code you show in the original post and please report here if this is "elegant" enough for you!
```fortran
module kinds_m

   use, intrinsic :: iso_fortran_env, only : I8 => int64

   implicit none

   integer, parameter :: R4 = selected_real_kind( p=6 )
   integer, parameter :: R8 = selected_real_kind( p=12 )

end module

module cpu_m

   use kinds_m, only : I8, R8

   implicit none

contains

   subroutine cpu_t( time )

      !.. Argument list
      real(R8), intent(inout) :: time

      !.. Local variables
      integer(I8) :: tick
      integer(I8) :: rate

      call system_clock (tick, rate)

      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )

      return

   end subroutine cpu_t

end module

module data_m

   use kinds_m, only : R4, R8
   use cpu_m, only : cpu_t

   implicit none

   type :: data_t(K,N)
      integer, kind :: K = R4
      integer, len :: N = 1
      real(kind=K) :: x1(3,N)
   end type data_t

   real(R8), protected, public :: init_time = 0.0_r8

   interface init_data
      module procedure init_data_scalar
      module procedure init_data_rank1
   end interface

contains

   subroutine init_data_rank1( dat, val )

      ! Arguments
      type(data_t(K=R4,N=*)), intent(inout) :: dat(:)
      real(R4), intent(in) :: val

      ! Local variables
      real(kind=R8) :: start_time
      real(kind=R8) :: end_time
      integer :: i

      if (dat(1)%N /= 1 ) return !<== this specific routine can be for N == 1 only

      call cpu_t( start_time )
      do i = 1, size(dat)
         dat(i)%x1(1,1) = val**2 - log(real(i))
         dat(i)%x1(2,1) = dat(i)%x1(1,1) - log(real(i))
         dat(i)%x1(3,1) = dat(i)%x1(2,1) - log(real(i))
      end do
      call cpu_t( end_time )

      init_time = end_time - start_time

      return

   end subroutine init_data_rank1

   subroutine init_data_scalar( dat, val )

      ! Arguments
      type(data_t(K=R4,N=*)), intent(inout) :: dat
      real(R4), intent(in) :: val

      ! Local variables
      real(kind=R8) :: start_time
      real(kind=R8) :: end_time
      integer :: i

      if (dat%N <= 1 ) return !<== this specific routine can be for N > 1 only

      call cpu_t( start_time )
      do i = 1, dat%N
         dat%x1(1,i) = val**2 - log(real(i))
         dat%x1(2,i) = dat%x1(1,i) - log(real(i))
         dat%x1(3,i) = dat%x1(2,i) - log(real(i))
      end do
      call cpu_t( end_time )

      init_time = end_time - start_time

      return

   end subroutine init_data_scalar

end module data_m

program test

   use kinds_m, only : R4, I8
   use data_m, only : data_t, init_data, init_time

   implicit none

   ! Local variables
   real(R4) :: rnd
   character(len=*), parameter :: fmtv = "(g0,t12,g0)"
   character(len=*), parameter :: fmtt = "(g0,f10.3)"

   call random_number( rnd )

   blk_AoS: block

      integer(kind=I8) :: n
      type(data_t(K=R4,N=1)), allocatable :: dat(:)

      print *, "Case: Array of Structs (AoS) "
      n = 50*10**6
      allocate( dat(n) )

      call init_data( dat, val=rnd )

      print fmtt, "Init time (seconds): ", init_time
      print *, "Output some values"
      print fmtv, "i", "x1(1,i)"
      print fmtv, 1, dat(1)%x1(1,1)
      print fmtv, n, dat(n)%x1(1,1)

   end block blk_AoS

   print *

   blk_SoA: block

      integer(kind=I8) :: n
      type(data_t(K=R4,N=:)), allocatable :: dat

      print *, "Case: Struct of Arrays (SoA) "
      n = 50*10**6
      allocate( data_t(K=R4,N=n) :: dat )

      call init_data( dat, val=rnd )

      print fmtt, "Init time (seconds): ", init_time
      print *, "Output some values"
      print fmtv, "i", "x1(1,i)"
      print fmtv, 1, dat%x1(1,1)
      print fmtv, n, dat%x1(1,n)

   end block blk_SoA

   stop

end program
```
Upon one execution of this example on a Windows platform, the output is:
```
Case: Array of Structs (AoS)
Init time (seconds):      8.538
Output some values
i          x1(1,i)
1          .1537321E-12
50000000   -17.72753

Case: Struct of Arrays (SoA)
Init time (seconds):      0.147
Output some values
i          x1(1,i)
1          .1537321E-12
50000000   -17.72753
Press any key to continue . . .
```
As you would expect, the initialization is significantly faster in the "Struct of Arrays" case than in the "Array of Structs" case (0.147 s versus 8.538 s in this run). Note that the results are identical in the two cases, which shows that it's a valid comparison.
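To make the layout difference concrete, here is a minimal sketch, not taken from the code above. It assumes a hypothetical, much reduced record type `rec_t` standing in for the large type of the original post. With a type like that, the `x1` values of consecutive array elements are separated by all the other components, whereas in a plain `(3,n)` array consecutive columns sit next to each other in memory.

```fortran
program layout_sketch
   implicit none

   ! Hypothetical record type, far smaller than the one in the original post.
   type :: rec_t
      integer :: id    = 0
      real    :: x1(3) = 0.
      real    :: u1(3) = 0.
   end type rec_t

   integer, parameter :: n = 4
   type(rec_t) :: aos(n)    ! Array of Structs: x1 is interleaved with id and u1
   real        :: soa(3,n)  ! SoA-style storage for x1 alone: columns are adjacent

   ! storage_size returns the size of one array element in bits.
   ! Distance in bytes between x1 of element i and x1 of element i+1:
   print *, "AoS stride (bytes):", storage_size(aos) / 8
   print *, "SoA stride (bytes):", 3 * storage_size(soa) / 8
end program layout_sketch
```

The larger the record, the more of each cache line is filled with components that a loop over `x1` never touches.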
You can take the above example and extend it as appropriate with your derived type for 'data' and check whether this is something you can use in your actual code.
P.S. By the way, what's up with your expression "exp(real(i))" when you are using the default real kind, which has a range of about 1E38 in Intel Fortran? What do you expect to happen when 'i' becomes large, as it does with the data sizes you are using? For illustration purposes, the log function has been used in place of the exponential one.
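The range limit can be checked directly. The following is a small sketch under the assumption that default real is the usual 32-bit IEEE kind; it prints the largest representable value and the largest loop index for which `exp(real(i))` still fits.

```fortran
program exp_range_sketch
   implicit none
   real    :: xmax
   integer :: imax

   ! exp(x) stays representable in default real only while x <= log(huge(1.0)).
   xmax = log(huge(1.0))
   imax = floor(xmax)

   print *, "huge(1.0)      =", huge(1.0)
   print *, "log(huge(1.0)) =", xmax
   print *, "exp(real(i)) is representable only for i <=", imax
end program exp_range_sketch
```

For every larger index in a 50-million-iteration loop the exponential overflows, so the stored values carry no useful information.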
Thanks Jim.
So are you saying that it is not a big deal whether the data is 64-byte or 32-byte aligned?
Also, do you know why the optimisation report says that x, y, ... are unaligned?
I do not know why the report says that (for do concurrent). One hypothesis would be an oversight in the compiler optimization whereby it assumes if this is not aligned, then any/all of its members have unknown alignment. (as may be the case for an array with specified dimensions). Another possible hypothesis is that your do concurrent has exceeded some number of entities the optimizer keeps track of for possible alignment determination.
Have you tried the OpenMP method?
RE: 64-byte alignment vs 32-byte alignment
With "random" allocation addresses, roughly half of the 64-byte-aligned allocations would lose 32 bytes unnecessarily compared with 32-byte alignment. For the sample code listed above you have 9 aligned allocations. Say 5 of them lose 32 bytes each (160 bytes in total) out of total allocations of 9 x 50 million, or a loss of 160 in 450 million of the allocated memory. Do you really need to conserve this amount of memory?
Jim Dempsey
Hello Jim ,
Thanks for the reply. Yes, I did try the OpenMP method, and I got the same message in the optimisation report. At the beginning it says the loop cannot be parallelised due to an assumed dependency, but a few lines below it says the loop has been VECTORIZED. A bit confusing, I would say. The aligned/unaligned issue remained the same in both cases.
However, if I explicitly write ` !DIR$ ASSUME_ALIGNED this%x(1):32 `, the report shows them as aligned. But if I put the directive ` !DIR$ ATTRIBUTES ALIGN:32::x` in the declaration of the type structure, as shown in quote #40, it does not help.
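For readers following along, a minimal sketch of where the two directives sit is given below. The type `particles_t` and the routine `scale_x` are hypothetical names, not the code from this thread. Both directives are Intel-specific, and other compilers treat them as ordinary comments. `ASSUME_ALIGNED` is a promise to the optimizer, so it should only be used when the alignment is actually guaranteed.

```fortran
module particles_m
   implicit none

   type :: particles_t
      ! Intel-specific: request 32-byte alignment whenever x is allocated.
      !DIR$ ATTRIBUTES ALIGN : 32 :: x
      real, allocatable :: x(:)
   end type particles_t

contains

   subroutine scale_x( this, a )
      type(particles_t), intent(inout) :: this
      real, intent(in) :: a
      integer :: i

      ! Intel-specific: assert to the optimizer that x(1) is 32-byte aligned.
      !DIR$ ASSUME_ALIGNED this%x(1) : 32
      do i = 1, size(this%x)
         this%x(i) = a * this%x(i)
      end do
   end subroutine scale_x

end module particles_m

program align_sketch
   use particles_m
   implicit none
   type(particles_t) :: p

   allocate( p%x(1000) )
   p%x = 1.0
   call scale_x( p, 2.0 )
   print *, p%x(1), p%x(1000)
end program align_sketch
```

Whether the optimisation report then lists the accesses as aligned depends on the compiler version and options, which matches the mixed experience described above.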
Confusing, I agree.
Jim Dempsey