Hello
I know this has been covered before, but I am specifically looking for answers to two questions. First, have a look at the following code:
program test
implicit none
type :: data
   integer :: ias = -7777
   integer :: is = 0
   integer :: ic = 1
   real :: x1(3) = 0.
   real :: xx(3) = 0.
   real :: u1(3) = 0. !x1
   real :: uu(3) = 0. !x2
   real :: d1 = 0.
   real :: dd = 0.
   real :: m1 = 0.
   real :: mm = 0.
   real :: q = 0.
   real :: nd = 0.
   real :: dtr = -1.
   real :: ydef = 0.
   real :: dydefdt = 0.
   real :: uflcf(3) = 0.
   real :: elapst = 0.
   real :: dist = 0.
   real :: dt = 1.e-3
   real :: ta2ud = 0.
   real :: eddyt = huge(1e99)
   real :: usssa(3) = 0.
   real :: qq2 = 0. ! -
   real :: nsda = 0. ! -
   real :: ddtqq = -1. ! -
end type data
integer(kind=8) :: i, n
type(data), allocatable :: dat(:)
real :: rmd, rand, t1, t2, omp_get_wtime
real, allocatable :: x(:,:)
n = 50*10**6
allocate(dat(n))
allocate(x(3,n))
rmd = rand(2.0)
t1 = omp_get_wtime()
do i = 1, n
   dat(i)%x1(1) = rmd**2 - exp(real(i))
   dat(i)%x1(2) = dat(i)%x1(1) - exp(real(i))
   dat(i)%x1(3) = dat(i)%x1(2) - exp(real(i))
   ! x(1,i) = rmd**2 - exp(real(i))
   ! x(2,i) = x(1,i) - exp(real(i))
   ! x(3,i) = x(2,i) - exp(real(i))
enddo
t2 = omp_get_wtime()
print *, t2 - t1
end program
Now, the difference between doing x(1,i) = ... and dat(i)%x1(1) = ... is quite large when comparing t2-t1, which comes down to the way the data are laid out in memory and accessed through the cache.
First, I don't understand how the printed t2-t1, specifically for the dat(i)%... case, can be much smaller than what a physical stopwatch shows. The reported t2-t1 sometimes seemed considerably smaller than the run felt, and when I timed it with a stopwatch I indeed observed a difference.
Secondly, and most importantly: if we are for some reason reluctant to move away from the derived-type style, is there really no way the compiler can be tricked into treating the members of the derived type as individually contiguous?
---
AT90 wrote: .. most importantly, if we are for some reason reluctant to move away from the derived-type style, is there really no way the compiler can be tricked into treating the members of the derived type as individually contiguous? ..
.. how is this achievable in Fortran without being verbose?
.. isn't there a more elegant way?
@AT90,
Would you admit that what's "elegant" is in the eye of the beholder?! With situations like this, I have often suggested that readers consider the facility introduced into standard Fortran with the 2003 revision: the parameterized derived type (PDT).
With this facility, one can adopt a design that, with some understanding and consideration, can be used both for the "Array of Structs" case you are using now and for the "Struct of Arrays" case, which is what will be recommended when better performance via faster processing of in-memory data is desired.
Take a look at this somewhat modified version of the example in your original post, and please report back whether this is "elegant" enough for you!
module kinds_m
   use, intrinsic :: iso_fortran_env, only : I8 => int64
   implicit none
   integer, parameter :: R4 = selected_real_kind( p=6 )
   integer, parameter :: R8 = selected_real_kind( p=12 )
end module

module cpu_m
   use kinds_m, only : I8, R8
   implicit none
contains
   subroutine cpu_t( time )
      !.. Argument list
      real(R8), intent(inout) :: time
      !.. Local variables
      integer(I8) :: tick
      integer(I8) :: rate
      call system_clock (tick, rate)
      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )
      return
   end subroutine cpu_t
end module

module data_m
   use kinds_m, only : R4, R8
   use cpu_m, only : cpu_t
   implicit none
   type :: data_t(K,N)
      integer, kind :: K = R4
      integer, len :: N = 1
      real(kind=K) :: x1(3,N)
   end type data_t
   real(R8), protected, public :: init_time = 0.0_r8
   interface init_data
      module procedure init_data_scalar
      module procedure init_data_rank1
   end interface
contains
   subroutine init_data_rank1( dat, val )
      ! Arguments
      type(data_t(K=R4,N=*)), intent(inout) :: dat(:)
      real(R4), intent(in) :: val
      ! Local variables
      real(kind=R8) :: start_time
      real(kind=R8) :: end_time
      integer :: i
      if (dat(1)%N /= 1 ) return !<== this specific routine is for N == 1 only
      call cpu_t( start_time )
      do i = 1, size(dat)
         dat(i)%x1(1,1) = val**2 - log(real(i))
         dat(i)%x1(2,1) = dat(i)%x1(1,1) - log(real(i))
         dat(i)%x1(3,1) = dat(i)%x1(2,1) - log(real(i))
      end do
      call cpu_t( end_time )
      init_time = end_time - start_time
      return
   end subroutine init_data_rank1
   subroutine init_data_scalar( dat, val )
      ! Arguments
      type(data_t(K=R4,N=*)), intent(inout) :: dat
      real(R4), intent(in) :: val
      ! Local variables
      real(kind=R8) :: start_time
      real(kind=R8) :: end_time
      integer :: i
      if (dat%N <= 1 ) return !<== this specific routine is for N > 1 only
      call cpu_t( start_time )
      do i = 1, dat%N
         dat%x1(1,i) = val**2 - log(real(i))
         dat%x1(2,i) = dat%x1(1,i) - log(real(i))
         dat%x1(3,i) = dat%x1(2,i) - log(real(i))
      end do
      call cpu_t( end_time )
      init_time = end_time - start_time
      return
   end subroutine init_data_scalar
end module data_m
program test
   use kinds_m, only : R4, I8
   use data_m, only : data_t, init_data, init_time
   implicit none
   ! Local variables
   real(R4) :: rnd
   character(len=*), parameter :: fmtv = "(g0,t12,g0)"
   character(len=*), parameter :: fmtt = "(g0,f10.3)"
   call random_number( rnd )
   blk_AoS: block
      integer(kind=I8) :: n
      type(data_t(K=R4,N=1)), allocatable :: dat(:)
      print *, "Case: Array of Structs (AoS) "
      n = 50*10**6
      allocate( dat(n) )
      call init_data( dat, val=rnd )
      print fmtt, "Init time (seconds): ", init_time
      print *, "Output some values"
      print fmtv, "i", "x1(1,i)"
      print fmtv, 1, dat(1)%x1(1,1)
      print fmtv, n, dat(n)%x1(1,1)
   end block blk_AoS
   print *
   blk_SoA: block
      integer(kind=I8) :: n
      type(data_t(K=R4,N=:)), allocatable :: dat
      print *, "Case: Struct of Arrays (SoA) "
      n = 50*10**6
      allocate( data_t(K=R4,N=n) :: dat )
      call init_data( dat, val=rnd )
      print fmtt, "Init time (seconds): ", init_time
      print *, "Output some values"
      print fmtv, "i", "x1(1,i)"
      print fmtv, 1, dat%x1(1,1)
      print fmtv, n, dat%x1(1,n)
   end block blk_SoA
   stop
end program
Upon one execution of this example on the Windows platform, the output is:
Case: Array of Structs (AoS)
Init time (seconds):      8.538
Output some values
i          x1(1,i)
1          .1537321E-12
50000000   -17.72753

Case: Struct of Arrays (SoA)
Init time (seconds):      0.147
Output some values
i          x1(1,i)
1          .1537321E-12
50000000   -17.72753

Press any key to continue . . .
As you would expect, the CPU time for the "Struct of Arrays" case is significantly smaller than for the "Array of Structs" case. Note the results are identical in the two cases, to show you it's a valid comparison.
You can take the above example and extend it as appropriate with your derived type for 'data' and check whether this is something you can use in your actual code.
P.S.: By the way, what's up with your expression "exp(real(i))" when you are using the default real kind, which has a range of about 1E38 in Intel Fortran? What do you expect to happen when 'i' becomes big, given the large data sizes you are using? For illustration purposes, the log function has been used in place of the exponential one.
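For concreteness, a minimal sketch of that range problem (the program is mine, not part of your code): with default 4-byte reals, exp(x) overflows once x exceeds roughly 88.7, so exp(real(i)) is Infinity for virtually every iteration of a 50-million-trip loop.

program exp_range_demo
   implicit none
   real :: x
   print *, "huge(1.0) = ", huge(1.0)   ! about 3.4028235E+38
   x = 88.0
   print *, "exp(88.0) = ", exp(x)      ! about 1.65E+38, still representable
   x = 89.0
   print *, "exp(89.0) = ", exp(x)      ! exceeds huge(1.0): prints Infinity under IEEE arithmetic
end program exp_range_demo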
---
>>I really just want to understand why the compiler cannot vectorize the middle loop of the latest code I have shown
>>how is that related to my question?
Understanding what vectorization by the CPU entails may help you understand how to write code that is favorable to vectorization.
>> and why the compiler says that the stride is unknown
It is not clear to me why the compiler could not discern that the tt%...(i) references are contiguous; they clearly are (to us readers). This may be a case where, during optimization, the number of registers required exceeded the number available, and the failure to vectorize was misattributed to an unknown stride rather than to the real reason (register pressure). Breaking the one loop into three loops (one per array) would shed light on the issue.
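To make that experiment concrete, here is a hypothetical sketch (the type and component names are my assumptions, since the code under discussion is not reproduced in this post):

program split_loops
   implicit none
   type :: test1_t                        ! hypothetical stand-in for 'tt' with allocatable components
      real, allocatable :: a(:), b(:), c(:)
   end type
   type(test1_t) :: tt
   integer :: i
   integer, parameter :: n = 1000000
   allocate( tt%a(n), tt%b(n), tt%c(n) )
   ! Fused form: one loop touches every component, so more values stay live at once
   do i = 1, n
      tt%a(i) = real(i)
      tt%b(i) = 2*real(i)
      tt%c(i) = 3*real(i)
   end do
   ! Split form: one loop per component; if these vectorize while the fused loop
   ! does not, register pressure (not stride) was the likely obstacle
   do i = 1, n
      tt%a(i) = real(i)
   end do
   do i = 1, n
      tt%b(i) = 2*real(i)
   end do
   do i = 1, n
      tt%c(i) = 3*real(i)
   end do
end program split_loops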
Jim Dempsey
---
But shouldn't the logic be: if the loop over tt2(i)%... can be vectorized (according to the report), then why can't the loop over tt%...? Both have the same sizes; they just happen to be stored in different data structures, right?
---
I must admit I have got the impression that the implementation of parameterized derived types is a bit dodgy. Do you recognize this from your experience?
I actually spent an appreciable amount of time moving from derived types to parameterized derived types to increase efficiency, and I am getting a bit frustrated because it seems that I should just have stayed where I was.
---
I am a bit surprised by this last statement: I would have expected that going from plain derived types to parameterized types might make code a lot simpler and better structured, but I am not sure I'd have expected it to become more efficient.
---
This is what I thought: going to a parameterized type essentially makes each member contiguous in memory, while with an array of a plain derived type the elements of a given member are not.
I basically learnt that from the initial answers in this thread.
---
The issue I see with the parameterized type in post #13 is that the dimensions of the arrays are all deferred, so the sizes of the individual components are not known until run time. This makes vectorization much more difficult. It isn't really a matter of contiguity.
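To illustrate the distinction (a minimal sketch based on the data_t type from the code above; the print statement is just for demonstration):

program pdt_params
   implicit none
   integer, parameter :: R4 = selected_real_kind( p=6 )
   type :: data_t(K,N)
      integer, kind :: K = R4   ! kind parameter: a compile-time constant
      integer, len  :: N = 1    ! len parameter: in general a run-time value
      real(kind=K) :: x1(3,N)   ! so the extent 3*N is not a compile-time constant
   end type data_t
   type(data_t(K=R4,N=:)), allocatable :: dat
   allocate( data_t(K=R4,N=1000) :: dat )   ! component sizes become known only here
   print *, dat%N, size(dat%x1)             ! 1000 and 3000, both run-time values
end program pdt_params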
---
@AT90,
Try the following code with the shown compile and link instructions and note down the CPU time with the PDT approach averaged over a reasonable number of executions. Then report here if you are able to do better than that CPU time with any other approach.
What you should take away from the code is this: you need to understand all the computations thoroughly, structure your code accordingly, and lend as much of a hand to the compiler as you can while improving the code's readability; don't expect the compiler to do all the optimization for you. If your code is structured well, you can make PDTs perform as well as, OR BETTER than, any other approach, including, say, one where the data are structured in ordinary array containers or with ordinary derived types. But the onus is mainly on the coder.
module kinds_m
   use, intrinsic :: iso_fortran_env, only : I8 => int64
   implicit none
   integer, parameter :: R4 = selected_real_kind( p=6 )
   integer, parameter :: R8 = selected_real_kind( p=12 )
end module

module cpu_m
   use kinds_m, only : I8, R8
   implicit none
contains
   subroutine cpu_t( time )
      !.. Argument list
      real(R8), intent(inout) :: time
      !.. Local variables
      integer(I8) :: tick
      integer(I8) :: rate
      call system_clock (tick, rate)
      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )
      return
   end subroutine cpu_t
end module

module aux_m
   use kinds_m, only : R4, R8
   use cpu_m, only : cpu_t
   implicit none
   interface calc_aux
      module procedure calc_aux_R4
      module procedure calc_aux_R8
   end interface
   real(R8), protected, public :: calc_time = 0.0_r8
contains
   subroutine calc_aux_R8( x, y, z, xx, yy, zz, xxx, yyy, zzz, aux )
      ! Argument list
      real(kind=R8), contiguous, intent(inout) :: x(:)
      real(kind=R8), contiguous, intent(inout) :: y(:)
      real(kind=R8), contiguous, intent(inout) :: z(:)
      real(kind=R8), contiguous, intent(inout) :: xx(:)
      real(kind=R8), contiguous, intent(inout) :: yy(:)
      real(kind=R8), contiguous, intent(inout) :: zz(:)
      real(kind=R8), contiguous, intent(inout) :: xxx(:)
      real(kind=R8), contiguous, intent(inout) :: yyy(:)
      real(kind=R8), contiguous, intent(inout) :: zzz(:)
      real(kind=R8), intent(out) :: aux
      include 'calc_aux.f90'
      return
   end subroutine
   subroutine calc_aux_R4( x, y, z, xx, yy, zz, xxx, yyy, zzz, aux )
      ! Argument list
      real(kind=R4), contiguous, intent(inout) :: x(:)
      real(kind=R4), contiguous, intent(inout) :: y(:)
      real(kind=R4), contiguous, intent(inout) :: z(:)
      real(kind=R4), contiguous, intent(inout) :: xx(:)
      real(kind=R4), contiguous, intent(inout) :: yy(:)
      real(kind=R4), contiguous, intent(inout) :: zz(:)
      real(kind=R4), contiguous, intent(inout) :: xxx(:)
      real(kind=R4), contiguous, intent(inout) :: yyy(:)
      real(kind=R4), contiguous, intent(inout) :: zzz(:)
      real(kind=R4), intent(out) :: aux
      include 'calc_aux.f90'
      return
   end subroutine
end module aux_m

module pdt_m
   use kinds_m, only : R4, R8
   use cpu_m, only : cpu_t
   use aux_m, only : caux => calc_aux
   implicit none
   type :: pdt_t(K,N)
      integer, kind :: K = R4
      integer, len :: N = 1
      real(kind=K) :: x(N)
      real(kind=K) :: y(N)
      real(kind=K) :: z(N)
      real(kind=K) :: xx(N)
      real(kind=K) :: yy(N)
      real(kind=K) :: zz(N)
      real(kind=K) :: xxx(N)
      real(kind=K) :: yyy(N)
      real(kind=K) :: zzz(N)
   contains
      private
      procedure, pass(this) :: calc_aux_R4
      procedure, pass(this) :: calc_aux_R8
      generic, public :: calc_aux => calc_aux_R4, calc_aux_R8
   end type pdt_t
   real(R8), protected, public :: calc_time = 0.0_r8
contains
   subroutine calc_aux_R4( this, aux )
      ! Argument list
      class(pdt_t(K=R4,N=*)), intent(inout) :: this
      real(kind=R4), intent(out) :: aux
      ! Local variables
      real(R8) :: t1, t2
      call cpu_t( time=t1 )
      aux = 0.0_R4
      call caux( this%x, this%y, this%z, this%xx, this%yy, this%zz, this%xxx, this%yyy, this%zzz, aux )
      call cpu_t( time=t2 )
      calc_time = t2 - t1
      return
   end subroutine
   subroutine calc_aux_R8( this, aux )
      ! Argument list
      class(pdt_t(K=R8,N=*)), intent(inout) :: this
      real(kind=R8), intent(out) :: aux
      ! Local variables
      real(R8) :: t1, t2
      call cpu_t( time=t1 )
      aux = 0.0_R8
      call caux( this%x, this%y, this%z, this%xx, this%yy, this%zz, this%xxx, this%yyy, this%zzz, aux )
      call cpu_t( time=t2 )
      calc_time = t2 - t1
      return
   end subroutine
end module pdt_m

module dt_m
   use kinds_m, only : R8
   use cpu_m, only : cpu_t
   implicit none
   type :: dt_t
      real(kind=R8) :: x
      real(kind=R8) :: y
      real(kind=R8) :: z
      real(kind=R8) :: xx
      real(kind=R8) :: yy
      real(kind=R8) :: zz
      real(kind=R8) :: xxx
      real(kind=R8) :: yyy
      real(kind=R8) :: zzz
   end type dt_t
   real(R8), protected, public :: calc_time = 0.0_r8
contains
   subroutine calc_aux( this, aux )
      ! Argument list
      type(dt_t), intent(inout) :: this(:)
      real(kind=R8), intent(out) :: aux
      ! Local variables
      integer :: i
      real(R8) :: t1, t2
      call cpu_t( time=t1 )
      aux = 0.0_R8
      do i = 1, size(this)
         this(i)%x = 56*real(i)-1000
         aux = aux + this(i)%x
         this(i)%y = 56*real(i)-1000*log(real(i))
         aux = aux - this(i)%y
         this(i)%z = 56*real(i)-1000*log(real(i+1))
         aux = aux + this(i)%z
         this(i)%xx = 56*real(i)-1000*log(real(i+10))
         aux = aux - this(i)%xx
         this(i)%yy = 56*real(i)-1000*log(real(i+3))
         aux = aux + this(i)%yy
         this(i)%zz = 56*real(i)-1000*log(real(i+2))
         aux = aux - this(i)%zz
         this(i)%xxx = 56*real(i)-1000*log(real(i+5))
         aux = aux + this(i)%xxx
         this(i)%yyy = 56*real(i)-1000*log(real(i+4))
         aux = aux - this(i)%yyy
         this(i)%zzz = 56*real(i)-1000*log(real(i+9))
         aux = aux - this(i)%zzz
      end do
      call cpu_t( time=t2 )
      calc_time = t2 - t1
      return
   end subroutine
end module dt_m

program p
!dir$ if defined (rbytes)
!dir$ else
!dir$ define rbytes = 64
!dir$ end if
!dir$ if (rbytes == 32)
   use kinds_m, only : WP => R4
!dir$ else
   use kinds_m, only : WP => R8
!dir$ end if
   implicit none
   integer, parameter :: N = 50000000
   blk1: block
      use pdt_m, only : pdt_t, calc_time
      type(pdt_t(K=WP,N=:)), allocatable :: pdt
      real(WP) :: aux
      allocate( pdt_t(K=WP,N=N) :: pdt )
      call pdt%calc_aux( aux )
      print *, "Block 1: PDT"
      print *, "aux = ", aux
      print "(g0,g10.3,g0)", "Calc time = ", calc_time, " seconds"
      print *
   end block blk1
   blk2: block
      use dt_m, only : dt_t, calc_aux, calc_time
      type(dt_t), allocatable :: dt(:)
      real(WP) :: aux
      allocate( dt(N) )
      call calc_aux( dt, aux )
      print *, "Block 2: Derived Type"
      print *, "aux = ", aux
      print "(g0,g10.3,g0)", "Calc time = ", calc_time, " seconds"
      print *
   end block blk2
   blk3: block
      use aux_m, only : calc_aux, calc_time
      real(WP), allocatable :: x(:), y(:), z(:), xx(:), yy(:), zz(:), xxx(:), yyy(:), zzz(:)
      real(WP) :: aux
      allocate( x(N), y(N), z(N), xx(N), yy(N), zz(N), xxx(N), yyy(N), zzz(N) )
      call calc_aux( x, y, z, xx, yy, zz, xxx, yyy, zzz, aux )
      print *, "Block 3: Arrays"
      print *, "aux = ", aux
      print "(g0,g10.3,g0)", "Calc time = ", calc_time, " seconds"
   end block blk3
   stop
end program
Include file 'calc_aux.f90':
! Local variables
integer :: i
real(R8) :: t1, t2
call cpu_t( time=t1 )
do concurrent ( i = 1:size(x) )
   x(i) = 56*real(i)-1000
   y(i) = 56*real(i)-1000*log(real(i))
   z(i) = 56*real(i)-1000*log(real(i+1))
   xx(i) = 56*real(i)-1000*log(real(i+10))
   yy(i) = 56*real(i)-1000*log(real(i+3))
   zz(i) = 56*real(i)-1000*log(real(i+2))
   xxx(i) = 56*real(i)-1000*log(real(i+5))
   yyy(i) = 56*real(i)-1000*log(real(i+4))
   zzz(i) = 56*real(i)-1000*log(real(i+9))
end do
aux = sum(x) - sum(y) + sum(z) - sum(xx) + sum(yy) - sum(zz) + sum(xxx) - sum(yyy) - sum(zzz)
call cpu_t( time=t2 )
calc_time = t2 - t1
Compilation and Linking:
C:\temp>ifort /standard-semantics /warn:all /stand /O3 /Qparallel /Qdiag-disable:7025 p.f90
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.1.144 Build 20181018
Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.16.27026.1
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\temp>
Execution result:
C:\temp>p.exe
Block 1: PDT
aux = -6.999837968078546E+016
Calc time =      0.499 seconds

Block 2: Derived Type
aux = -6.999837968078546E+016
Calc time =      0.852 seconds

Block 3: Arrays
aux = -6.999837968078546E+016
Calc time =      0.499 seconds
---
@FortranFan Thanks so much. I will need a bit of time to study the code and think about how to adapt this to a much larger and more complex code, but that is definitely a way to start. I have one question on this: would you expect variations in the performance if, say, I did not use the members in consecutive order? E.g., let's say I don't necessarily begin with x, y, z, xx, etc. I guess that would not matter with your example, due to the way you write out the calc_aux routine, which takes arbitrary arrays.
@Steve Lionel,
thanks for the reply. This is new information to me; I did not know that these sizes are deferred. I thought they just worked like allocatables, so based on your comments I guess they don't. Can I then say, roughly speaking, that if parameterized types are not used carefully, I would end up making the program slower rather than faster compared to a regular derived type? And here I mean being a bit more explicit to the compiler that these arrays are contiguous and vectorizable. I would really like to hear your take on this.
---
Your tt and tt2 are VERY different. tt2 is an array of items of type test2. tt is a scalar of type test1 that contains multiple arrays of run-time-determined size. The memory layouts will be very different, and the compiler has to look at the size of each component array at run time.
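Since the declarations of tt and tt2 are not reproduced in this part of the thread, here is a hypothetical pair in the spirit of that description (names and components are assumptions):

program layouts
   implicit none
   type :: test2                          ! AoS element: scalar components
      real :: x, y, z
   end type
   type :: test1                          ! SoA container: allocatable array components
      real, allocatable :: x(:), y(:), z(:)
   end type
   type(test2), allocatable :: tt2(:)     ! tt2(i)%x strides over the whole element type
   type(test1) :: tt                      ! tt%x(i) is unit stride, but its size is a run-time value
   integer, parameter :: n = 1000
   allocate( tt2(n) )
   allocate( tt%x(n), tt%y(n), tt%z(n) )
   print *, size(tt2), size(tt%x)
end program layouts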
---
@Steve Lionel
How is that different from allocatable arrays? Wouldn't the compiler also have to check their array sizes? Yes, these types are guaranteed to be contiguous, but so should the components of test1 be, right?
Also, if that is not correct, can you then explain to me the purpose of having this feature? I had understood it as a way to structure your code while also overcoming the issue you have with cases like test2, where each member of the type is laid out with a non-unit stride in memory, so accesses are not cache friendly, while the parameterized case is friendly in that regard.
---
@FortranFan, I have tested the code and I indeed get similar results to yours on my Mac with a Core i5 @ 2.6 GHz. However (and this is actually my motivation for reopening this thread), when running it on my Linux station with an Intel Xeon @ 3.6 GHz, I get all three cases running at the same speed.
With the latest code I showed, where I compared the three different cases, I did, despite the issue with vectorization, get the PDT to run much faster than the derived type on my Mac. But on Linux it was the reverse scenario: the PDT ran MUCH slower. Your code just made it run as fast as the derived type on Linux.
PS: Don't know if this is going to help, but the third block/loop with no types ran, with both my code and yours, at the same speed as the derived type on Linux, whereas on the Mac both codes showed the derived type being much slower than the other two.
This is a bit discouraging, I must admit, since I will ultimately be running the real cases on big Linux machines.
@jim mentioned the registers, which could be the cause; how can I compare the registers on the two machines? I would think the Linux machine has more, just because the processor is stronger, has a larger cache size, etc., and if my Mac's registers are not exhausted, how could they be on the Linux machine? I could also be completely wrong in assuming that my Linux machine has more registers.
---
AT90 wrote: @FortranFan Thanks so much. I will need a bit of time to study the code and think about how to adapt this to a much larger and more complex code, but that is definitely a way to start. I have one question on this: would you expect variations in the performance if, say, I did not use the members in consecutive order? E.g., let's say I don't necessarily begin with x, y, z, xx, etc. I guess that would not matter with your example, due to the way you write out the calc_aux routine, which takes arbitrary arrays.
..
@AT90,
Once you take a close look at the lengthy code I posted in Quote #28, you will realize most of it is just a certain coding style/practice with "modern Fortran" (see these references) applied to a rearrangement of what you posted in Quote #13. Basically it's something I find easier to read and follow, and I hope other readers will too.
The main aspect you will have noticed, though, is the focus on the calculations: the 'dummy' ones I presume you utilized in the 3 DO loops with '56*real(i)-1000*log..', etc., over 9 arrays. Given how you presented them, these calculations are effectively independent of array element order and follow the basic scatter/gather pattern you will come across in any introductory text on parallel programming.
In your first DO loop, with array object 'tt2' of type 'test2' (i.e., a derived type with scalar components), you are required to follow the 'classic' sequential programming paradigm. That works OK; compilers such as Intel's have several decades of experience, starting with Fortran 90, optimizing such loops, as you notice in your results. Note that with this approach, the aspect of your interest, CONTIGUOUS data, does not come into play.
However, with the other 2 loops you still follow, needlessly in my opinion, the same sequential approach, which does not really take advantage of the CONTIGUOUS data arrangement relative to the calculations you are attempting. With the 3rd DO loop, involving ordinary arrays, a compiler such as Intel's again has more than several decades of experience with optimization. Things are not so straightforward with the 2nd loop, whether you use PDTs or derived types with ALLOCATABLE components, both of which present a CONTIGUOUS data arrangement. You will find the compiler(s) need extra help in terms of how you write the code, especially if you want them to give you the further benefits of optimizing code involving such data structures. Here it's best if you investigate further by yourself, but the chances are you will find that both PDTs and derived types with ALLOCATABLE components suffer about the same if you write your code the way you did in Quote #13, or that they both work well if you refactor your code taking into account the specific characteristics of your computations, as shown in the include file in Quote #28 as part of the calc_aux subroutine.
My view, then, on PDTs vs derived types with ALLOCATABLE components is that the choice really boils down to 1) the quality of implementation of PDTs (e.g., PDTs run into too many bugs in gfortran) and 2) the writability and readability of coding certain data structures; PDTs can be great in this respect, as shown with the flexible SoA and AoS pattern in Quote #9.
Nonetheless, you need to study your options well and analyze your code and computing requirements thoroughly, rather than investing too much time in an option all too quickly, only to realize later that it was inadequate for your needs.
---
AT90 wrote: @FortranFan, I have tested the code and I indeed get similar results to yours on my Mac with a Core i5 @ 2.6 GHz. However (and this is actually my motivation for reopening this thread), when running it on my Linux station with an Intel Xeon @ 3.6 GHz, I get all three cases running at the same speed. ..
You may want to request support for your Linux issues from Intel at their Online Service Center: https://supporttickets.intel.com/servicecenter?lang=en-US
---
AT90,
Comments on FortranFan's excellent post #28:
blk1, using pdt_t, produces a single pdt with contiguous arrays x(N), ... zzz(N).
Thus X(1), X(2), X(3), ... are contiguous and favor vectorization.
call pdt%calc_aux calls caux with the individual arrays within this%.
Thus, except for the overhead of this call, it executes the same code as blk3 (direct arrays).
blk2, using dt_t, produces a (contiguous) array of dt_t's, with discontiguous dt(1)%x, dt(2)%x, ...
Thus dt(1)%x, dt(2)%x, ... is incompatible with vectorization (other than via scatter/gather).
Note, vectorization of a member (e.g. X, or Y, ...) with scatter/gather can degrade performance to that of scalar code.
A redeeming feature, though, is that this(i)%x is contiguous with this(i)%y, which is contiguous with this(i)%z, ..., thus presenting a high probability of an L1 cache line hit.
The specific computation chosen by the test code could (for this test program) potentially be optimized by the compiler such that, for AVX/AVX2 using R8 types with a vector width of 4 doubles, one vector contains x, y, z, xx; one vector contains yy, zz, xxx, yyy; and a scalar contains zzz.
In practice, however, the computations may not be as favorable to this optimization as this test program's, and the compiler might not consider this optimization anyway.
blk3, using allocatable arrays directly, produces contiguous arrays X, Y, ... zzz.
Thus X(1), X(2), X(3), ... are contiguous and favor vectorization.
call calc_aux passes the individual arrays X, Y, ... zzz directly,
thus without the overhead of the wrapper call used by blk1.
Additionally:
calc_aux.f90 uses DO CONCURRENT (i = 1, size(x)).
Recall from my post #16 that register pressure can be reduced by using multiple DO loops. DO CONCURRENT instructs the compiler that it is permissible to perform this conversion for you automatically. However, a potential disadvantage is that the output arrays have to be reprocessed after the DO CONCURRENT to produce the sum value in aux. Benchmarking will show you whether using DO CONCURRENT or manually breaking the loop into multiple loops (sketched below) is more effective.
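As an illustration of the manual alternative (a sketch only, reusing the names from FortranFan's include file and showing just two of the nine arrays): each split loop can carry its own running sum, so the arrays are not re-read by sum() afterwards.

subroutine calc_aux_split( x, y, aux )
   use kinds_m, only : R8
   implicit none
   real(R8), contiguous, intent(inout) :: x(:), y(:)
   real(R8), intent(out) :: aux
   integer :: i
   aux = 0.0_R8
   do i = 1, size(x)                  ! one stream per loop: low register pressure
      x(i) = 56*real(i) - 1000
      aux = aux + x(i)                ! reduction fused into the loop, no second pass
   end do
   do i = 1, size(y)
      y(i) = 56*real(i) - 1000*log(real(i))
      aux = aux - y(i)
   end do
   ! ... the remaining seven arrays follow the same pattern
end subroutine calc_aux_split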
Jim Dempsey
---
@Jim Thanks for the comment. Do exploitation of the L1 cache and vectorization not go hand in hand? Also, is the code in Block 1 not more cache friendly than the one in Block 2?
And is that not generally the case for PDTs compared to plain derived types?
Again, thanks to you all, really fruitful!
I have been in contact with the support team regarding the discrepancy I mentioned between Linux and Mac, and now that you have clarified that the compiler can freely split the loop, a problem related to the registers seems less likely.
---
If your perspective is from the viewpoint of both blocks being executed using scalar (single) floating-point instructions, then the two blocks will have similar L1 cache hit performance.
However, if you consider that block 1 may be vectorized while block 2 is not, then consider this:
per vector loop iteration, block 1 will have a lower L1 cache hit ratio, yet its throughput will be higher, due to the vector width processing 4 (R8) elements at a time for AVX/AVX2 or 8 (R8) elements at a time for AVX512. For your sample code, using AVX/AVX2 vector code (256 bits), the L1 cache hit ratio would be approximately 50% (two vector loads per cache line, of which only the first misses). Using scalar code, assuming R8 types, you would attain approximately an 87.5% (7/8) L1 cache hit ratio (eight loads per cache line, of which only the first misses). Although the hit ratio is better with scalar code, its throughput is lower, because scalar code issues 4x more data fetches and 4x more instruction fetches than vector code (double that for AVX512).
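To spell out the arithmetic behind those two percentages (a sketch assuming 64-byte cache lines and 8-byte R8 elements):

$$E = \frac{64\ \text{B per line}}{8\ \text{B per element}} = 8, \qquad \text{L1 hit ratio} \approx 1 - \frac{w}{E}$$

where $w$ is the number of elements consumed per load: scalar ($w = 1$) gives $1 - 1/8 = 87.5\%$, AVX/AVX2 ($w = 4$) gives $1 - 4/8 = 50\%$, and AVX512 ($w = 8$) gives $0\%$, since every vector load then starts a new cache line.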
If performance is a high priority, then you may need to consider trading away some elegance in the source code to gain elegance in the machine code generated by the compiler.
If your coding project is just starting, and if code performance is paramount, then it will behoove you to spend some time with data layout of your anticipated heavy computational requirements. IOW construct representative test code from the bottom-up. Then using that information, design your data layout such that it favors performance.
If you are re-coding an existing set of code, then you might be able to look at, and if necessary, reorganize data layout to yield better performance. This article I wrote in 2016 may provide some insight to you.
Jim Dempsey
---
Thanks @jim,
jimdempseyatthecove wrote: If your perspective is from the viewpoint of both blocks being executed using scalar (single) floating-point instructions, then the two blocks will have similar L1 cache hit performance.
Will block 1 not have better L1 cache behavior because its memory is contiguous (e.g., provided we do something similar to what @FortranFan showed us)? I remember you once describing very nicely the L1 cache issues with derived types versus arrays. Here, with block 1, we are almost imitating the scenario of having plain arrays.
jimdempseyatthecove wrote: per vector loop iteration, block 1 will have a lower L1 cache hit ratio, yet its throughput will be higher, due to the vector width processing 4 (R8) elements at a time for AVX/AVX2 or 8 (R8) elements at a time for AVX512. .. Although the hit ratio is better with scalar code, its throughput is lower, because scalar code issues 4x more data fetches and 4x more instruction fetches than vector code (double that for AVX512).
So here is my understanding: when we send, say, 4 elements of array xx(:) at a time to the vector unit, the L1 cache might only be able to store 6 at a time; hence the last two elements are wasted, and in this case we have a hit ratio of around 66%. Is that how we should understand it?
Based on your experience, is it possible to tell beforehand which direction a programmer should go with regard to intense computations involving up to and above 1k cores and multiple nodes?
I guess it is all a question of whether to go for a better L1 hit ratio or for vectorization?
Finally, if the PDT (block 1) is more L1 cache friendly (my first question), one could go with that and, once the computation is turned on, decide whether to go mostly for vectorization or mostly for L1 friendliness; whereas if I went for a plain derived type, the option regarding the L1 cache is no longer as viable.
Thanks a lot for referring me to that paper. It will be my first thing to do tomorrow.
---
>>Will block 1 not have better L1 cache behavior because its memory is contiguous...
Blocks 1 and 3 (in scalar execution) will be cache friendly to the X(1), X(2), ..., ZZZ(N) access patterns. Meaning you can loop on X, then Y, then Z, ... .OR. loop on X, Y, Z, then loop on XX, YY, ZZ, then loop on XXX, YYY, ZZZ.
Block 2 (in scalar execution) will be cache friendly for dt(1)%x, dt(1)%y, dt(1)%z, ... dt(1)%zzz, through dt(N)%zzz. Meaning one loop with all members accessed in each iteration.
Blocks 1 and 3 are favorable to vectorization, whereas Block 2 is not, other than via scatter/gather.
There is no/negligible performance difference between block 1 and block 3, seeing that the compute loop is the same; the only difference is in the argument preparation for the CALL to the compute function. The CALL for block 1 may require each argument reference to include an integer add instruction (address of pdt + offset to X, address of pdt + offset to Y, ...), a trivial amount of work. Note, since the arrays for both the pdt and the standard arrays are not directed to be aligned, the alignment of each array is a "luck of the draw". Alignment differences can affect performance by a few percentage points.
>>So here is my understanding: when we send, say, 4 elements of array xx(:) at a time to the vector unit, the L1 cache might only be able to store 6 at a time.
Your understanding of cache is incorrect. It might be beneficial to look at the Wikipedia article Cache Hierarchy for general information, and at How Memory Is Accessed from the IDZ articles for a more in-depth description as it pertains to Intel CPUs.
A quick description (not applicable to all CPUs):
CPUs generally have 2 or 3 cache levels; some have 4.
L1 (~4 or so clock ticks to access) is faster than L2 (~12 or so clock ticks), which is faster than L3 (if present, ~100 or so clock ticks), which is faster than RAM access.
L1 is typically 32 KB; L2 varies but is often 256 KB; L3 varies, often in the range of 4 MB to 32 MB.
Bytes are not cached individually; rather, collections of contiguous bytes are cached, typically in 64-byte, 64-byte-aligned lines.
The cache holds not only the data but also the address and a tag (and status bits).
The tag is constructed from the address and is typically a reduced (hashed) number of bits, to facilitate faster search time.
A cache miss at a given level requires a fetch/store from/to a higher-level cache or main memory.
Advanced cache designs will detect patterns of access, such as streaming loads or fixed-stride loads, and perform a hardware prefetch of the data from a higher-level cache and/or RAM.
Consider what this (may) mean for FortranFan's post #28:
N = 50 million and the number of properties (arrays) is 9, thus:
For R4 data, the total data is 9 * 4 * 50E6 = 1.8 GB
For R8 data, the total data is 9 * 8 * 50E6 = 3.6 GB
Both scenarios easily exceed the capacity of the last-level cache; thus, should the major loop (DO I=1,N) be executed within an outer loop, the data will not be found in cache.
Let's now look at an example loop (not from your sample code)
do i = 1, N   ! N = 50 million
   x(i) = x(i) + SomeValue
end do
Assume X is a real of type R4, therefore a 64-byte cache line can contain 16 4-byte R4's
In scalar mode the loop will have:
N/16 memory fetches
N - N/16 L1 cache hits
N adds
N/16 memory stores (assuming write combining by hardware)
In vector mode, using AVX or AVX2
N/16 memory fetches
N/8 - N/16 L1 cache hits
N/8 adds
N/16 memory stores (assuming write combining by hardware)
In vector mode using AVX512 (Skylake...)
N/16 memory fetches
0 L1 cache hits
N/16 adds
N/16 memory stores (assuming write combining by hardware)
Note (strongly) that the wider the vector, the lower the hit ratio... but also the lower the number of hits required. Too often a noob programmer is taught in CS to "increase the cache hit ratio" and "attain perfect scaling". Unfortunately, these can at times go against "attain best performance". In real-world applications, you are not graded on the first two goals; you are graded on the last one.
I will leave it as an exercise for you to extend the operations by their latencies.
Additional information: recall that most Intel CPU cache architectures have hardware prefetchers. The implementations vary with processor design, and there are a limited number of prefetchers; the number varies depending on the CPU, but it may be a good assumption that none of them have 9. Therefore, it may behoove you to experiment with the structure of your code to determine the most effective number of member variables (arrays) to process in each loop, as sketched below.
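A sketch of that experiment (reusing the nine array names from the earlier example; the group size of three is an arbitrary starting point to benchmark against):

subroutine calc_grouped( x, y, z, xx, yy, zz, xxx, yyy, zzz )
   use kinds_m, only : R8
   implicit none
   real(R8), contiguous, intent(inout) :: x(:), y(:), z(:), xx(:), yy(:), zz(:), xxx(:), yyy(:), zzz(:)
   integer :: i
   ! Three streams per loop, so the number of concurrent access streams stays
   ! within the hardware prefetchers' capacity; try group sizes of 2, 3, 4, ...
   do i = 1, size(x)
      x(i) = 56*real(i) - 1000
      y(i) = 56*real(i) - 1000*log(real(i))
      z(i) = 56*real(i) - 1000*log(real(i+1))
   end do
   do i = 1, size(x)
      xx(i) = 56*real(i) - 1000*log(real(i+10))
      yy(i) = 56*real(i) - 1000*log(real(i+3))
      zz(i) = 56*real(i) - 1000*log(real(i+2))
   end do
   do i = 1, size(x)
      xxx(i) = 56*real(i) - 1000*log(real(i+5))
      yyy(i) = 56*real(i) - 1000*log(real(i+4))
      zzz(i) = 56*real(i) - 1000*log(real(i+9))
   end do
end subroutine calc_grouped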
>>Based on your experience, is it possible to tell beforehand which direction a programmer should go with regard to intense computations involving up to and above 1k cores and multiple nodes
Now you present completely different runtime requirements, beyond the scope of this thread. For the example code stated in this thread, a single node would be faster than any multi-node solution.
>>Finally, if the PDT (block 1) is more L1 cache friendly (my first question), one could go with that and, once the computation is turned on, decide whether to go mostly for vectorization or mostly for L1 friendliness; whereas if I went for a plain derived type, the option regarding the L1 cache is no longer as viable.
The computations used in block 1 and block 3 are the same. It should be relatively easy to experiment with a pdt and an analog of pdt, dt, where dt declares the arrays x, y, ... zzz as ALLOCATABLE together with an attribute requesting that the allocations be aligned to a cache line (64 bytes). This would require an init procedure to perform the allocations and a finish procedure to deallocate the arrays.
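A minimal sketch of that analog (the names are my assumptions; whether the ALIGN directive is honored on allocatable components varies with compiler version, as discussed later in this thread, so verify it in the optimization report):

module soa_m
   use kinds_m, only : R8
   implicit none
   type :: soa_t
!DIR$ ATTRIBUTES ALIGN : 64 :: x
      real(R8), allocatable :: x(:)
!DIR$ ATTRIBUTES ALIGN : 64 :: y
      real(R8), allocatable :: y(:)
      ! ... z through zzz in the same style
   end type soa_t
contains
   subroutine init( this, n )       ! perform all allocations in one place
      type(soa_t), intent(out) :: this
      integer, intent(in) :: n
      allocate( this%x(n), this%y(n) )
   end subroutine init
   subroutine finish( this )        ! release them symmetrically
      type(soa_t), intent(inout) :: this
      if ( allocated(this%x) ) deallocate( this%x )
      if ( allocated(this%y) ) deallocate( this%y )
   end subroutine finish
end module soa_m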
As you saw in the cache comments, the goal isn't necessarily to maximize L1 cache friendliness; rather, what matters is the sum of the memory fetches together with the sum of the cache hits.
Jim Dempsey
---
Hello again
@jimdempseyatthecove: I have run more tests with @FortranFan's code. I have two questions.
1)
First, if you look at the optimisation report, it says that the arrays in 'calc_aux.f90' have unaligned access. That does not make sense to me, since we are declaring them as contiguous. I tried compiling with -align array32byte, but that did not change it. I also tried to follow some of the guidelines in this article, https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization, but it did not help.
I then tried the alternative (I have attached the two files of the complete code; it is just an extension of FortranFan's code):
type :: soa
!DIR$ ATTRIBUTES ALIGN:32::x   ! That does not help in the optimisation report when looping
   real, allocatable :: x(:)
!DIR$ ATTRIBUTES ALIGN:32::y
   real, allocatable :: y(:)
!DIR$ ATTRIBUTES ALIGN:32::z
   real, allocatable :: z(:)
   real, allocatable :: xx(:)
   real, allocatable :: yy(:)
   real, allocatable :: zz(:)
   real, allocatable :: xxx(:)
   real, allocatable :: yyy(:)
   real, allocatable :: zzz(:)
contains
   private
   procedure, pass(this) :: calc_aux_R8_soa
   generic, public :: calc_aux => calc_aux_R8_soa
end type soa
subroutine calc_aux_R8_soa( this, aux )
   implicit none
   class(soa), intent(inout) :: this
   real(kind=R8), intent(out) :: aux
   real(R8) :: t1, t2
   integer :: i
   call cpu_t( time=t1 )
   aux = 0.0_R8
!DIR$ ASSUME_ALIGNED this%x(1):32   ! - that helped!!!!
!DIR$ ASSUME_ALIGNED this%y(1):32
!DIR$ ASSUME_ALIGNED this%z(1):32
!DIR$ ASSUME_ALIGNED this%xx(1):32
!DIR$ ASSUME_ALIGNED this%yy(1):32
!DIR$ ASSUME_ALIGNED this%zz(1):32
!DIR$ ASSUME_ALIGNED this%xxx(1):32
!DIR$ ASSUME_ALIGNED this%yyy(1):32
!DIR$ ASSUME_ALIGNED this%zzz(1):32
   do concurrent ( i = 1:size(this%x) )
      this%x(i) = 56*real(i)-1000
      aux = aux + this%x(i)
      this%y(i) = 56*real(i)-1000*log(real(i))
      aux = aux - this%y(i)
      this%z(i) = 56*real(i)-1000*log(real(i+1))
      aux = aux + this%z(i)
      this%xx(i) = 56*real(i)-1000*log(real(i+10))
      aux = aux - this%xx(i)
      this%yy(i) = 56*real(i)-1000*log(real(i+3))
      aux = aux + this%yy(i)
      this%zz(i) = 56*real(i)-1000*log(real(i+2))
      aux = aux - this%zz(i)
      this%xxx(i) = 56*real(i)-1000*log(real(i+5))
      aux = aux + this%xxx(i)
      this%yyy(i) = 56*real(i)-1000*log(real(i+4))
      aux = aux - this%yyy(i)
      this%zzz(i) = 56*real(i)-1000*log(real(i+9))
      aux = aux - this%zzz(i)
   end do
   call cpu_t( time=t2 )
   calc_time = t2 - t1
   return
end subroutine calc_aux_R8_soa
The latter did not show unaligned access when I had the DIR$ directives in the subroutine; when I declared them inside the type itself, it did not make any change.
Also, can you explain why the optimisation report says that the loop (shown above) is not parallelized because of a dependency, yet a few lines below says the loop is VECTORIZED?
---
My second question:
How do I check which vector instruction set my machine supports? The closest thing I have found online is running 'cat /proc/cpuinfo' on Linux. When I look at my flags I see many things, among others avx, avx2, avx512f, and avx512dq. I initially compiled with '-xcore-avx512', but I realized that made it slow; when I used '-xcore-avx2' it became much faster. Hence, I also had to change my ASSUME_ALIGNED value from 64 to 32.
Is there a way to make my assumed-aligned byte value defined at compile time, so I do not need to change it whenever I change machines? And how do we best read these flags to make sure we compile with the correct/appropriate options?
Thanks a lot!
---
>>Is there a way to make my assumed-aligned byte value defined at compile time, so I do not need to change it whenever I change machines
Define it as an environment variable and use that in your make file... or simply align at 64 (the cache-line size, e.g. via -align array64byte); that won't make any performance difference, and costs a very small amount of memory, if any.
A (potential) problem with vectorization of the code in #40 is that DO CONCURRENT does not have a REDUCTION clause; as such, what you are doing with aux may be problematic. Try using OpenMP for the concurrency:
... ! you setup OpenMP
!$omp parallel do simd reduction(+:aux)
do i = 1, size(this%x)
   this%x(i) = 56*real(i)-1000
   aux = aux + this%x(i)
   this%y(i) = 56*real(i)-1000*log(real(i))
   aux = aux - this%y(i)
   this%z(i) = 56*real(i)-1000*log(real(i+1))
   aux = aux + this%z(i)
   this%xx(i) = 56*real(i)-1000*log(real(i+10))
   aux = aux - this%xx(i)
   this%yy(i) = 56*real(i)-1000*log(real(i+3))
   aux = aux + this%yy(i)
   this%zz(i) = 56*real(i)-1000*log(real(i+2))
   aux = aux - this%zz(i)
   this%xxx(i) = 56*real(i)-1000*log(real(i+5))
   aux = aux + this%xxx(i)
   this%yyy(i) = 56*real(i)-1000*log(real(i+4))
   aux = aux - this%yyy(i)
   this%zzz(i) = 56*real(i)-1000*log(real(i+9))
   aux = aux - this%zzz(i)
end do
!$omp end parallel do simd
...
Jim Dempsey