- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello
I know this a has been covered before, but I am specifically looking for two answers to my question. First Have a look at the following code
program test implicit none type :: data integer :: ias = -7777 integer :: is = 0 integer :: ic = 1 real :: x1(3) = 0. real :: xx(3) = 0. real :: u1(3) = 0. !x1 real :: uu(3) = 0. !x2 real :: d1 = 0. real :: dd = 0. real :: m1 = 0. real :: mm = 0. real :: q = 0. real :: nd = 0. real :: dtr = -1. real :: ydef = 0. real :: dydefdt = 0. real :: uflcf(3) = 0. real :: elapst = 0. real :: dist = 0. real :: dt = 1.e-3 real :: ta2ud = 0. real :: eddyt = huge(1e99) real :: usssa(3) = 0. real :: qq2 = 0. ! - real :: nsda = 0. ! - real :: ddtqq = -1. ! - end type data integer(kind=8) :: i, n type(data), allocatable :: dat(:) real :: rmd, rand,t1,t2, omp_get_wtime real, allocatable::x(:,:) n = 50*10**6 allocate(dat(n)) allocate(x(3,n)) rmd = rand(2.0) t1 = omp_get_wtime() do i = 1, n dat(i)%x1(1) = rmd**2 - exp(real(i)) dat(i)%x1(2) = dat(i)%x1(1) - exp(real(i)) dat(i)%x1(3) = dat(i)%x1(2) - exp(real(i)) ! x(1,i) = rmd**2 - exp(real(i)) ! x(2,i)= x(1,i) - exp(real(i)) ! x(3,i)=x(2,i)- exp(real(i)) enddo t2 = omp_get_wtime() print*, t2-t1 end program
Now the difference between doing x(1,i) = ** and dat(i)% x1(1:3) is quite a lot when comparing t2-t1, which is because of the way memory is stored and cache access.
I firstly don't understand that printed t2-t1 exclusively for the case with dat(i)%.. can be way smaller then when compared to a physical stop watch. I felt sometime that the t2-t1 was quite smaller than what it felt, so when trying to use my stop watch I observed a a difference.
Secondly, and most importantly, if we for some reason are reluctant to go away from the declared type style, is there really no way the compiler can be tricked to consider those members of the declared type as continguous individually ?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
AT90 wrote:.. most importantly, if we for some reason are reluctant to go away from the declared type style, is there really no way the compiler can be tricked to consider those members of the declared type as continguous individually ? ..
.. how is this achievable is this in Fortran without being verbose.
.. isnt there a more elegant way?
@AT90,
Would you admit what's "elegant" is in the eyes of the beholder?! With situations like this, I've often suggested to readers to consider the facility introduced in standard Fortran with the 2003 revision which is parameterized derived type (PDT):
With this facility, one can adopt a design which can, with some understanding and consideration, be used with both the "Array of Structs" case that you are using now as well as with the "Struct of Arrays" case which is what will be recommended when better performance via faster processing of in-memory data is desired.
Take a look at this somewhat modified example of the code you show in the original post and please report here if this is "elegant" enough for you!
module kinds_m use, intrinsic :: iso_fortran_env, only : I8 => int64 implicit none integer, parameter :: R4 = selected_real_kind( p=6 ) integer, parameter :: R8 = selected_real_kind( p=12 ) end module module cpu_m use kinds_m, only : I8, R8 implicit none contains subroutine cpu_t( time ) !.. Argument list real(R8), intent(inout) :: time !.. Local variables integer(I8) :: tick integer(I8) :: rate call system_clock (tick, rate) time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) ) return end subroutine cpu_t end module module data_m use kinds_m, only : R4, R8 use cpu_m, only : cpu_t implicit none type :: data_t(K,N) integer, kind :: K = R4 integer, len :: N = 1 real(kind=K) :: x1(3,N) end type data_t real(R8), protected, public :: init_time = 0.0_r8 interface init_data module procedure init_data_scalar module procedure init_data_rank1 end interface contains subroutine init_data_rank1( dat, val ) ! Arguments type(data_t(K=R4,N=*)), intent(inout) :: dat(:) real(R4), intent(in) :: val ! Local variables real(kind=R8) :: start_time real(kind=R8) :: end_time integer :: i if (dat(1)%N /= 1 ) return !<== this specific routine can be for N == 1 only call cpu_t( start_time ) do i = 1, size(dat) dat(i)%x1(1,1) = val**2 - log(real(i)) dat(i)%x1(2,1) = dat(i)%x1(1,1) - log(real(i)) dat(i)%x1(3,1) = dat(i)%x1(2,1) - log(real(i)) end do call cpu_t( end_time ) init_time = end_time - start_time return end subroutine init_data_rank1 subroutine init_data_scalar( dat, val ) ! Arguments type(data_t(K=R4,N=*)), intent(inout) :: dat real(R4), intent(in) :: val ! Local variables real(kind=R8) :: start_time real(kind=R8) :: end_time integer :: i if (dat%N <= 1 ) return !<== this specific routine can be for N > 1 only call cpu_t( start_time ) do i = 1, dat%N dat%x1(1,i) = val**2 - log(real(i)) dat%x1(2,i) = dat%x1(1,i) - log(real(i)) dat%x1(3,i) = dat%x1(2,i) - log(real(i)) end do call cpu_t( end_time ) init_time = end_time - start_time return end subroutine init_data_scalar end module data_m
program test use kinds_m, only : R4, I8 use data_m, only : data_t, init_data, init_time implicit none ! Local variables real(R4) :: rnd character(len=*), parameter :: fmtv = "(g0,t12,g0)" character(len=*), parameter :: fmtt = "(g0,f10.3)" call random_number( rnd ) blk_AoS: block integer(kind=I8) :: n type(data_t(K=R4,N=1)), allocatable :: dat(:) print *, "Case: Array of Structs (AoS) " n = 50*10**6 allocate( dat(n) ) call init_data( dat, val=rnd ) print fmtt, "Init time (seconds): ", init_time print *, "Output some values" print fmtv, "i", "x1(1,i)" print fmtv, 1, dat(1)%x1(1,1) print fmtv, n, dat(n)%x1(1,1) end block blk_AoS print * blk_SoA: block integer(kind=I8) :: n type(data_t(K=R4,N=:)), allocatable :: dat print *, "Case: Struct of Arrays (SoA) " n = 50*10**6 allocate( data_t(K=R4,N=n) :: dat ) call init_data( dat, val=rnd ) print fmtt, "Init time (seconds): ", init_time print *, "Output some values" print fmtv, "i", "x1(1,i)" print fmtv, 1, dat%x1(1,1) print fmtv, n, dat%x1(1,n) end block blk_SoA stop end program
Upon one execution of this example on Windows platform, the output is:
Case: Array of Structs (AoS) Init time (seconds): 8.538 Output some values i x1(1,i) 1 .1537321E-12 50000000 -17.72753 Case: Struct of Arrays (SoA) Init time (seconds): 0.147 Output some values i x1(1,i) 1 .1537321E-12 50000000 -17.72753 Press any key to continue . . .
As you will expect, the CPU time with the "Struct of Arrays" case is significantly faster than the "Array of Structs" case. Note the results are identical in the 2 cases to show you it's a valid comparison.
You can take the above example and extend it as appropriate with your derived type for 'data' and check whether this is something you can use in your actual code.
P.S.> By the way, what's up with your instruction "exp(real(i))" when you are using default real kind with a range of about 1E38 in Intel Fortran? What do you expect to happen 'i' becomes big with the large data sizes you are using? For illustration purposes, the log function has been used in place of the exponential one.
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Try this:
t1 = omp_get_wtime() do i = 1, n dat(i)%x1(1) = rmd**2 - sqrt(real(i)) dat(i)%x1(2) = dat(i)%x1(1) - sqrt(real(i)) dat(i)%x1(3) = dat(i)%x1(2) - sqrt(real(i)) enddo t2 = omp_get_wtime() print*, t2-t1 t1 = omp_get_wtime() do i = 1, n x(1,i) = rmd**2 - sqrt(real(i)) x(2,i)= x(1,i) - sqrt(real(i)) x(3,i)=x(2,i)- sqrt(real(i)) enddo t2 = omp_get_wtime() print*, t2-t1 ! code to force compiler to .NOT. remove code t1 = 0.0 do i = 1, n t1 = t1 + dat(i)%x1(1) t1 = t1 + dat(i)%x1(2) t1 = t1 + dat(i)%x1(3) t1 = t1 + x(1,i) t1 = t1 + x(2,i) t1 = t1 + x(3,i) enddo if(t1 == 0) print *,"not going to happen"
It is important to insert some use of the data produced such that the compiler does not remove the code due to results not being used.
(this is similar to dead code elimination - dead results elimination)
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Exp is not vectorizable, sqrt is vectorizable. Also sqrt is heavy weight. You might also consider removing sqrt and test just use real(i)
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the reply Jim!,
Is there away to make each member of the declared type a continuous array.
What I mean is if it is possible to make something like having that dat(:)% x(1) is contiguous occupying the first XX memory addresses and then dat(:)% x(2) occupying the next XX memory addresses etc..
As I understand it now is that dat(1) has all its members occupying an X amount of memory where X is the memory of all its members at dat(1) and then dat(2) for the next etc.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>Is there away to make each member of the declared type a continuous array.
Not that I know of. At some time in one of my earlier posts I suggested that it would be nice to introduce an aggregate type that would do just that. There would still be some issues that would have to be worked out, for example for those user desiring to allocate individual nodes. Perhaps the way to deal with this is to use an attribute on the array descriptor
type(data), aggregate, allocatable, :: dat(:) ! *** aggregate is proposed
Then:
dat(i)%x1(1) = rmd**2 -sqrt(real(i)) is converted into: dat%x1(1,i) = rmd**2 -sqrt(real(i)) ! dat as aggregate with implicit repositioned rank
I think this would be a good language extension. (Integrating with debugger may be an issue)
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
What you want is a "Structure of Arrays" instead of "Array of Structures". This is a classic refactoring for exactly the purpose you want.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Steve
yes this is in practice what I want. But, how is this achievable is this in Fortran without being verbose.
Could write
type dat
real, allocatable :: c(:)
...
end type
then just declare type as on instance and allocate all its members with the wirgt dimension.
but isnt there a more elegant way?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
There is no more elegant way, sorry. I'm not sure there is in any language.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
AT90 wrote:.. most importantly, if we for some reason are reluctant to go away from the declared type style, is there really no way the compiler can be tricked to consider those members of the declared type as continguous individually ? ..
.. how is this achievable is this in Fortran without being verbose.
.. isnt there a more elegant way?
@AT90,
Would you admit what's "elegant" is in the eyes of the beholder?! With situations like this, I've often suggested to readers to consider the facility introduced in standard Fortran with the 2003 revision which is parameterized derived type (PDT):
With this facility, one can adopt a design which can, with some understanding and consideration, be used with both the "Array of Structs" case that you are using now as well as with the "Struct of Arrays" case which is what will be recommended when better performance via faster processing of in-memory data is desired.
Take a look at this somewhat modified example of the code you show in the original post and please report here if this is "elegant" enough for you!
module kinds_m use, intrinsic :: iso_fortran_env, only : I8 => int64 implicit none integer, parameter :: R4 = selected_real_kind( p=6 ) integer, parameter :: R8 = selected_real_kind( p=12 ) end module module cpu_m use kinds_m, only : I8, R8 implicit none contains subroutine cpu_t( time ) !.. Argument list real(R8), intent(inout) :: time !.. Local variables integer(I8) :: tick integer(I8) :: rate call system_clock (tick, rate) time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) ) return end subroutine cpu_t end module module data_m use kinds_m, only : R4, R8 use cpu_m, only : cpu_t implicit none type :: data_t(K,N) integer, kind :: K = R4 integer, len :: N = 1 real(kind=K) :: x1(3,N) end type data_t real(R8), protected, public :: init_time = 0.0_r8 interface init_data module procedure init_data_scalar module procedure init_data_rank1 end interface contains subroutine init_data_rank1( dat, val ) ! Arguments type(data_t(K=R4,N=*)), intent(inout) :: dat(:) real(R4), intent(in) :: val ! Local variables real(kind=R8) :: start_time real(kind=R8) :: end_time integer :: i if (dat(1)%N /= 1 ) return !<== this specific routine can be for N == 1 only call cpu_t( start_time ) do i = 1, size(dat) dat(i)%x1(1,1) = val**2 - log(real(i)) dat(i)%x1(2,1) = dat(i)%x1(1,1) - log(real(i)) dat(i)%x1(3,1) = dat(i)%x1(2,1) - log(real(i)) end do call cpu_t( end_time ) init_time = end_time - start_time return end subroutine init_data_rank1 subroutine init_data_scalar( dat, val ) ! Arguments type(data_t(K=R4,N=*)), intent(inout) :: dat real(R4), intent(in) :: val ! Local variables real(kind=R8) :: start_time real(kind=R8) :: end_time integer :: i if (dat%N <= 1 ) return !<== this specific routine can be for N > 1 only call cpu_t( start_time ) do i = 1, dat%N dat%x1(1,i) = val**2 - log(real(i)) dat%x1(2,i) = dat%x1(1,i) - log(real(i)) dat%x1(3,i) = dat%x1(2,i) - log(real(i)) end do call cpu_t( end_time ) init_time = end_time - start_time return end subroutine init_data_scalar end module data_m
program test use kinds_m, only : R4, I8 use data_m, only : data_t, init_data, init_time implicit none ! Local variables real(R4) :: rnd character(len=*), parameter :: fmtv = "(g0,t12,g0)" character(len=*), parameter :: fmtt = "(g0,f10.3)" call random_number( rnd ) blk_AoS: block integer(kind=I8) :: n type(data_t(K=R4,N=1)), allocatable :: dat(:) print *, "Case: Array of Structs (AoS) " n = 50*10**6 allocate( dat(n) ) call init_data( dat, val=rnd ) print fmtt, "Init time (seconds): ", init_time print *, "Output some values" print fmtv, "i", "x1(1,i)" print fmtv, 1, dat(1)%x1(1,1) print fmtv, n, dat(n)%x1(1,1) end block blk_AoS print * blk_SoA: block integer(kind=I8) :: n type(data_t(K=R4,N=:)), allocatable :: dat print *, "Case: Struct of Arrays (SoA) " n = 50*10**6 allocate( data_t(K=R4,N=n) :: dat ) call init_data( dat, val=rnd ) print fmtt, "Init time (seconds): ", init_time print *, "Output some values" print fmtv, "i", "x1(1,i)" print fmtv, 1, dat%x1(1,1) print fmtv, n, dat%x1(1,n) end block blk_SoA stop end program
Upon one execution of this example on Windows platform, the output is:
Case: Array of Structs (AoS) Init time (seconds): 8.538 Output some values i x1(1,i) 1 .1537321E-12 50000000 -17.72753 Case: Struct of Arrays (SoA) Init time (seconds): 0.147 Output some values i x1(1,i) 1 .1537321E-12 50000000 -17.72753 Press any key to continue . . .
As you will expect, the CPU time with the "Struct of Arrays" case is significantly faster than the "Array of Structs" case. Note the results are identical in the 2 cases to show you it's a valid comparison.
You can take the above example and extend it as appropriate with your derived type for 'data' and check whether this is something you can use in your actual code.
P.S.> By the way, what's up with your instruction "exp(real(i))" when you are using default real kind with a range of about 1E38 in Intel Fortran? What do you expect to happen 'i' becomes big with the large data sizes you are using? For illustration purposes, the log function has been used in place of the exponential one.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@FortranFan This is great!. Thanks a lot for this suggestion, wasn't aware of it before now.
Two question:
1) I prefer to compile with desired integer and real byte, so I could avoid specifying the kind in the declared type and let the default work it out.
2) I see you use assumed shape in the subroutine call
type(data_t(K=R4,N=*)), intent(inout) :: dat(:)
Could I do
type(data_t(K=R4,N=size_data_t)), intent(inout) :: dat(:)
where size_data_t is an "integer, intent(in)" in that routine?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Never mind the above, I just tested it myself and both seem to be accepted by the compiler
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@AT60,
Glad you tested out the code with your changes too.
Re: your point above with compilation using desired integer and real 'bytes', portability can be an issue. Fortran standard is a 'friend' here, it has gone to great length to help in this aspect, and coders will generally find it beneficial to employ standard language features and to minimize compiler-option driven programs.
Re: your point, yes you can do so as you found out. The standard states, "The length type parameter values of a present actual argument shall agree with the corresponding ones of the dummy argument that are not assumed"
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@FortranFan
I realized something with parametrized declared type which is a bit unexpected to me.
It seems like the fortran compiler is unaware of the memory stride of such types. I realized that by writing out the optimization report and realized the compiler refuses to vectorize it at first hand. After when I force it by having a $omp simd it writes out that the type is non-stride (like the normal declared type) and that the compiler is unaware of its stride. At least in a regular declared type, you have the compiler knowing the jumps in memory.
An example of the code is given below. Compile first without -iopenmp to realize that the middle loop (parametrized) type is not vectorized, and after try with openmp to observe what the report is now telling.
program test implicit none type test1(nn) integer, len :: nn real :: x(nn),y(nn),z(nn), xx(nn),yy(nn), zz(nn) real :: xxx(nn),yyy(nn),zzz(nn), xxxx(nn),yyyy(nn), zzzz(nn) end type type test2 real :: x,y,z, xx,yy, zz real :: xxx,yyy,zzz, xxxx,yyyy, zzzz real :: xxx1,yyy2,zzz2, xxxx2,yyyy2, zzzz2 real :: xxx4,yyy4,zzz4, xxxx4,yyyy4, zzzz23 end type character(8) :: a integer :: i integer, parameter :: size = 5e7 type(test1(nn=:)), allocatable :: tt type(test2), allocatable ::tt2(:) !real(8) :: t1,t2, omp_get_wtime real, allocatable :: x(:),y(:),z(:), xx(:),yy(:), zz(:) real, allocatable :: xxx(:),yyy(:),zzz(:), xxxx(:),yyyy(:), zzzz(:) real :: aux integer(8) :: t1,t2,clock_rate,dum allocate(x(size),y(size),z(size), xx(size),yy(size), zz(size), & xxx(size),yyy(size),zzz(size), xxxx(size),yyyy(size), zzzz(size)) allocate(tt2(size)) allocate(test1(nn=size)::tt) aux = 0 call system_clock ( t1, clock_rate, dum ) do i = 1, size tt2(i)% x = 56*real(i)-1000 aux = aux + tt2(i)% x tt2(i)% y = 56*real(i)-1000*log(real(i)) aux = aux - tt2(i)% y tt2(i)% z = 56*real(i)-1000*log(real(i+1)) aux = aux + tt2(i)% z tt2(i)% xx = 56*real(i)-1000*log(real(i+10)) aux = aux - tt2(i)% xx tt2(i)% yy = 56*real(i)-1000*log(real(i+3)) aux = aux + tt2(i)% yy tt2(i)% zz = 56*real(i)-1000*log(real(i+2)) aux = aux - tt2(i)% zz tt2(i)% xxx = 56*real(i)-1000*log(real(i+5)) aux = aux + tt2(i)% xxx tt2(i)% yyy = 56*real(i)-1000*log(real(i+4)) aux = aux - tt2(i)% yyy tt2(i)% zzz = 56*real(i)-1000*log(real(i+9)) aux = aux - tt2(i)% zzz enddo call system_clock ( t2, clock_rate, dum ) print*, real(t2-t1)/real(clock_rate) print*, "Fakaart", aux; aux =0 t1=0 t2=0 call system_clock ( t1, clock_rate, dum ) !$omp simd do i = 1, size tt% x(i) = 56*real(i)-1000 aux = aux + tt% x(i) tt% y(i) = 56*real(i)-1000*log(real(i)) aux = aux - tt% y(i) tt% z(i) = 56*real(i)-1000*log(real(i+1)) aux = aux + tt% z(i) tt% xx(i) = 56*real(i)-1000*log(real(i+10)) aux = aux - tt% xx(i) tt% yy(i) = 56*real(i)-1000*log(real(i+3)) aux = aux + tt% yy(i) tt% zz(i) = 56*real(i)-1000*log(real(i+2)) aux = aux - tt% zz(i) tt% xxx(i) = 56*real(i)-1000*log(real(i+5)) aux = aux + tt% xxx(i) tt% yyy(i) = 56*real(i)-1000*log(real(i+4)) aux = aux - tt% yyy(i) tt% zzz(i) = 56*real(i)-1000*log(real(i+9)) aux = aux - tt% zzz(i) enddo !$omp end simd call system_clock ( t2, clock_rate, dum ) print*, real(t2-t1)/real(clock_rate) print*, "Fakaart", aux aux=0 t1=0 t2=0 call system_clock ( t1, clock_rate, dum ) do i = 1, size x(i) = 56*real(i)-1000 aux =aux + x(i) y(i) = 56*real(i)-1000*log(real(i)) aux =aux - y(i) z(i) = 56*real(i)-1000*log(real(i+1)) aux = aux+ z(i) xx(i) =56*real(i)-1000*log(real(i+10)) aux = aux - xx(i) yy(i) = 56*real(i)-1000*log(real(i+3)) aux = aux+ yy(i) zz(i) = 56*real(i)-1000*log(real(i+2)) aux = aux - zz(i) xxx(i) = 56*real(i)-1000*log(real(i+5)) aux = aux+ xxx(i) yyy(i) = 56*real(i)-1000*log(real(i+4)) aux = aux - yyy(i) zzz(i) = 56*real(i)-1000*log(real(i+9)) aux = aux - zzz(i) enddo call system_clock ( t2, clock_rate, dum ) print*, real(t2-t1)/real(clock_rate) print*, "Fakaart", aux end program
Does anybody else in the forum know why this is happening.
A serious question: If this is something expected and how things should be - Why would I even consider using parametrized type?
And would you advise me to use auxiliary contiguous pointers which points to each member of the parametrized type - I would imagine that you in that way tweak the compiler to treat the members as contiguous in memory?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The structure of the test code above is not structured to support vectorization. The reason for this is (may be) due to aux not being a candidate for reduction. You can explicitly request reduction, however you will have to choose either: +:aux or -:aux *** and write your aux expressions accordingly. e.g. with +:aux use aux=aux + (-y(i))
The compiler should be able to accept either + or - for +:reduction but it is not stated as doing so.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the comment.
Two tings:
have an aux variable for two reasons,
1) I just to confirm that all three loops are doing the same.
2) I was told that some compilers might be smart enought to see that the members, xx,yy etc., are not being used further hence it does not carry the work. So by having an aux to later print it, the compiler should do the work
Regarding your comment on reduction.
My concern is not really associated to that. My concern is more to the comparison with the different data structures I have shown. Running with a "normal" type, the optimisation report shows that it can vectorize my loop and it knows about the stride structure.
However, when dealing with parametrized types, there are two tings:
1) The report does say that it CANNOT vectorize the loop
2) When I then enable omp simd, I see that the compiler even does not know the stride structure of that type
This being said, I would have imagined that the compiler " knows " the members of the parametrized type are contiguous in memory and after all that vectorization should be done straight forwardly as like having an allocatable array (seen in the last loop)
The whole concern started when I tested these three loops on my Linux station and realized that the parametrized declared type "loop" was the least efficient loop!
This is something I found very dubious. Am I missing something here
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Additional thing to consider. If the code generation is to SSE, and possibly to AVX, the register pressure of those loops may exceed the number of registers available to provide good vectorization. See what happens when you reduce the number of statements:
do i = 1, size x(i) = 56*real(i)-1000 aux =aux + x(i) y(i) = 56*real(i)-1000*log(real(i)) aux =aux - y(i) z(i) = 56*real(i)-1000*log(real(i+1)) aux = aux+ z(i) end do do i = 1, size xx(i) =56*real(i)-1000*log(real(i+10)) aux = aux - xx(i) yy(i) = 56*real(i)-1000*log(real(i+3)) aux = aux+ yy(i) zz(i) = 56*real(i)-1000*log(real(i+2)) aux = aux - zz(i) end do do i = 1, size xxx(i) = 56*real(i)-1000*log(real(i+5)) aux = aux+ xxx(i) yyy(i) = 56*real(i)-1000*log(real(i+4)) aux = aux - yyy(i) zzz(i) = 56*real(i)-1000*log(real(i+9)) aux = aux - zzz(i) enddo
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I am afraid I am not sure what SSE and AVX stand for or are.
Also, that does not explain the issue, does it? I am running three loop over three different data structures. Two are being vectorized and one cannot be vectorized because the compiler thinks there is a dependency. But clearly the three loops are almost identical, only the difference is that one is declared type (arrays of structure), while the other, parametrized type.
I really just want to understand why the compiler cannot vectorize the middle loop of the latest I have shown, and why the compiler says that the stride is unknown. Should not it be clear that test% xx(:) is contiguous, is that not what we want to achieve with parametrized types?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>I am afraid I am not sure what SSE and AVX stand for or are
A basic understanding of MMX, SSE, AVX and AVX512 is required in order to understand what "vectorization" means in the realm of computation. This instructions are in a class of Single Instruction, Multiple Data (aka SIMD)
MMX is 64 bits wide (e.g. 8 bytes, 4 words, 2 dwords, 1 qword)
SSE is 128 bits wide (widens and adds 4 floats, and 2 doubles plus more instructions)
AVX/AVX2 are 256 bits wide (widens to 8 floats and 4 doubles plus more instructions)
AVX512(and variants) are 512 bits wide (widens to 16 floats and 8 doubles plus more instructions)
If, as an example, your system supported AVX (and the compiler generated AVX vectorized instructions), the generated loop could perform SIMD vectors of 8 floats per iteration. the loop being converted to equivalent of:
do i = 1, size, 8 x(i:i+7) = 56*real(i:i+7)-1000 aux(1:8) = aux(1:8) + x(i:i+7) ...
Or more visually informative:
temp_aux(1:8) = /0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0/ temp_reals(1:8) = /1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0/ do i = 1, size, 8 x(i:i+7) = 56*temp_real(1:8)-1000 temp_aux(1:8) = temp_aux(1:8) + x(i:i+7) ... temp_real(1:8) = temp_real(1:8) + 8.0 end do aux = sum(temp_aux)
Where single instructions operate on 8 floats per operation.
Each of the instruction sets have different numbers of SIMD registers
MMX has 8 registers (integer types only)
SSE, AVX, AVX2 have 16 registers (integer and floating point types)
AVX512 has 32 registers
Depending on the code in the loop, sometimes it can be better to split a loop into multiple loops such that the code better fits within the available registers. The compiler optimization can at times perform the loop splitting for you and sometimes it is not savvy enough to do so.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for letting me know this. I will catch up on these instructions. However, I don't see what this would have to do with my original question?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
And a little more in the abstract:
temp_aux(1:8) = /0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0/ ! register temp_reals(1:8) = /1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0/ ! register temp_56(1:8) = /56.0, 56.0, 56.0, 56.0, 56.0, 56.0, 56.0, 56.0/ ! register temp_1000(1:8) = /1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0/ ! register temp_8(1:8) = /8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0/ ! register do i = 1, size, 8 scratch(1:8) = temp_56(1:8) ! one instruction scratch(1:8) = scratch(1:8) * temp_real(1:8) ! one instruction scratch(1:8) = scratch(1:8) - temp_1000(1:8) ! one instruction x(i:i+7) = scratch(1:8) ! one instruction temp_aux(1:8) = temp_aux(1:8) + scratch(1:8) ! one instruction ... temp_real(1:8) = temp_real(1:8) + temp_8(1:8) ! one instruction end do aux = sum(temp_aux) ! possibly as few as 3 instructions
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Jim, thanks very much. It is indeed nice to know this, however, how is that related to my question?
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page