I made this simplified test code:
program openmp_test
   implicit none
   type :: t_data
      integer :: size_
      integer, allocatable :: counter(:)
   end type t_data
   type t_stat
      integer :: trc(20)
   end type t_stat
   type(t_data) :: td
   type(t_stat) :: ts
   integer :: i, ip

   allocate(td% counter(100))
   td% size_ = 100
   ts% trc = 3
   td% counter = 10
   ip = 40
   !$OMP SIMD
   do i = 1, 10   ! the loop I want to vectorize
      ip = ip + 1
      td% counter(ip) = td% counter(ip) + 1
      ts% trc(8) = ts% trc(8) + 1
   enddo
   do i = 0, 11
      print*, td% counter(i+40)
   enddo
   print*, "TRC", ts% trc(8)
end program openmp_test
Now my question is: how can I correctly/safely write this loop so that ip is incremented correctly and my ts% trc counter comes out right?
I am aware of some of the different clauses; however, some seem not to work with SIMD. For example, if I write !$OMP SIMD SHARED(ip), the compiler complains that it was not expecting such a clause, and similarly for firstprivate. Is that intentional?
Also, there seems to be an internal compiler error when using reduction(+:ts%trc(8)). Does the reduction clause normally accept derived-type components? If so, I will report this to the Intel support team.
I am using ifort 19.0.1.
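One way to make this loop SIMD-safe (a sketch, not from the original post): eliminate the ip induction variable by computing the index directly from i, and accumulate the derived-type counter through a plain local scalar that the reduction clause accepts (trc8 below is a hypothetical local integer).

```fortran
! Sketch: index computed from i, counter reduced via a scalar temporary,
! since reduction on a derived-type component such as ts%trc(8) is not
! portable across compilers.
trc8 = 0
!$OMP SIMD reduction(+:trc8)
do i = 1, 10
   td% counter(40+i) = td% counter(40+i) + 1
   trc8 = trc8 + 1
enddo
ip = 50                          ! the final value ip would have had
ts% trc(8) = ts% trc(8) + trc8
```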
noz = 60
nsl = 120
randw = 10*noz*nsl
allocate(part% duma(noz*nsl*randw),part% dumb(noz*nsl*randw),&
& part% dumc(noz*nsl*randw),part% dumd(noz*nsl*randw),&
& part% mass(randw*noz*nsl),part% xmom(randw*noz*nsl),part% ymom(randw*noz*nsl),&
& part% zmom(randw*noz*nsl),ic(randw*noz*nsl))
randw = 10
On my test system I have 16GB
Each allocation is 60*120*(10*60*120) elements, times 4 bytes for real/integer = 2,073,600,000 bytes; 9 such allocations ≈ 18.7 GB.
While the allocations succeeded against virtual memory, the Windows page file was thrashing horribly. I assume the arrays in this test program should have been allocated to randw and not to randw*noz*nsl.
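The arithmetic above can be checked with a few lines (a sketch; 64-bit integers are needed because the byte count overflows default 32-bit integers):

```fortran
program footprint_check
   implicit none
   integer, parameter :: noz = 60, nsl = 120
   integer(8) :: randw, elems, bytes
   randw = 10_8*noz*nsl            ! 72000 -- the unintended value
   elems = int(noz,8)*nsl*randw    ! 518400000 elements per array
   bytes = elems*4*9               ! 4-byte reals/integers, 9 arrays
   print *, elems*4, bytes         ! 2073600000 per array, ~18.7e9 total
end program footprint_check
```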
Jim Dempsey
In Debug build, you do have an index out of range:
ind = 0
do fn = 1, noz
   do inj = 1, nsl
      !$OMP SIMD private(ind)
      do rw = 1, randw
         ind = ind + rw   ! goes out of range: reaches 72001, i.e. size+1
Two issues with the above. private may need to be firstprivate so that the initial value of ind is copied in, *** however this may not be your intention.
*** But this is not what the serial code will do, because:
1) Assuming ind starts at 0 on the first iteration (in a Debug build it will; Release ???), ind takes the values 1, 3, 6, 10, 15, 21, 28, 36...
2) On subsequent iterations it resumes correctly only in single-threaded mode; in parallel mode the additional threads may resume with junk.
3) In any event (either event) ind will eventually index out of range of the allocation (though it may not have with the allocation as in #22 above).
4) Because ind is non-sequential, there is no way this loop can be SIMD-ized.
The following modifications to your code vectorize the inner loop; **** however, this may or may not be what you intended to do.
module mod_part
implicit none
type part1
real , allocatable :: duma(:), dumb(:), dumc(:), dumd (:)
real , allocatable :: mass(:), xmom(:), ymom(:), zmom(:)
contains
procedure, pass(this) :: ADD_FUEL_SOURCES
procedure, pass(this) :: dum_loop
end type
contains
subroutine ADD_FUEL_SOURCES(this, m1, m2, u1,v1,w1, nd, ic)
!$OMP DECLARE SIMD(ADD_FUEL_SOURCES) UNIFORM(this)
implicit none
class(part1), intent(inout) :: this
integer, intent(in) :: ic
real, intent(in) :: nd, m1,m2,w1,u1,v1
real :: amd, dmasum, dm, mc
dm = m1 - m2
amd = nd * dm
this% mass(ic) = this% mass(ic) + amd
this% xmom(ic) = this% xmom(ic) + amd*u1
this% ymom(ic) = this% ymom(ic) + amd*v1
this% zmom(ic) = this% zmom(ic) + amd*w1
! this% enthalpy(ic) = this% enthalpy(ic) + nd*hfuel
end subroutine ADD_FUEL_SOURCES
subroutine dum_loop(this,size_)
implicit none
class(part1), intent(inout) :: this
integer, intent(in) :: size_
integer :: i
!$OMP SIMD
do i = 1, size_
this% mass(i) = i + 1-size_+size_/4.3
enddo
end subroutine dum_loop
end module
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
program TEST_OMPSIMD
use mod_part
implicit none
integer :: noz, inj, rw , nrw, nsl , fn ,i, randw, nozcnt, injcnt, ind, indoffset
integer, allocatable :: ic(:)
real :: m1,m2, u1,v1,w1, nd
type(part1) :: part
noz = 60
nsl = 120
nrw = 10
randw = nrw*noz*nsl
allocate(part% duma(randw),part% dumb(randw),&
& part% dumc(randw),part% dumd(randw),&
& part% mass(randw),part% xmom(randw),part% ymom(randw),&
& part% zmom(randw),ic(randw))
randw = 10
ind = 0
do i = 1, randw
if( mod(i,3) /= 0 ) then
ind = 1
else
ind = ind + 1
endif
enddo
ic = ind
nozcnt = 0
injcnt = 0
! assuming linear array allocation dum?(nrw*noz*nsl) is equivalent to dum?(nrw, noz, nsl)
do fn = 1, noz
do inj= 1, nsl
indoffset = (inj-1)*nrw + (fn-1)*nrw*nsl
!$OMP SIMD private(ind)
do rw = 1, nrw
ind = indoffset + rw
part% duma( ind ) = REAL(fn*rw*inj-rw*inj )
part% dumb( ind ) = REAL(fn*rw*inj-1-rw*inj)
part% dumc( ind ) = REAL(fn*rw*inj-2-rw*inj)
part% dumd( ind ) = REAL(fn*rw*inj-3-rw*inj)
m1 = part% duma( ind )-part% dumb( ind )
nd = part% dumc(ind)*m1-part% dumd( ind )
m2 = m1 + nd
u1 = part% duma(ind)/part% dumb(ind)
v1 = part% dumb(ind)/100.
w1 = part% dumb(ind)
! ic = rw
!print*, m1,m2, u1,v1,w1,nd,ic
call part% ADD_FUEL_SOURCES(m1,m2, u1,v1,w1,nd,ic(ind))
enddo
enddo
enddo
print*, sum(part%duma)+sum(part%dumb)+sum(part%dumc)+sum(part%dumd)
print*, sum(part%mass ), sum(part%ymom )
call part% dum_loop(10000)
print*, sum(part% mass(1:10000))
end program TEST_OMPSIMD
Jim Dempsey
Note: private(ind) is valid now because ind is not read before it is written. Also, if you later OpenMP-parallelize the outer loop, or the outer two loops with collapse(2), the code will still work, assuming no overlapping ranges of ind amongst threads.
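A sketch (not from the post) of that later parallelization, reusing the loop structure from the program above; only one assignment of the body is shown:

```fortran
! Hedged sketch: parallelize the two outer loops with collapse(2) while
! keeping the inner SIMD loop. Each (fn,inj) pair writes a disjoint ind
! range via indoffset, so threads do not conflict.
!$OMP PARALLEL DO COLLAPSE(2) private(indoffset, ind, rw)
do fn = 1, noz
   do inj = 1, nsl
      indoffset = (inj-1)*nrw + (fn-1)*nrw*nsl
      !$OMP SIMD private(ind)
      do rw = 1, nrw
         ind = indoffset + rw
         part% duma( ind ) = REAL(fn*rw*inj-rw*inj)  ! body as in the program above
      enddo
   enddo
enddo
```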
Jim Dempsey
Hello Jim
Thanks very much indeed.
The only problem I am still seeing, or am not entirely sure about, is what happens when ic has a repeated index within one fn and nsl iteration.
Imagine that my ic is a mapping to another grid; hence its size does not depend on the dumb arrays. Likewise for the source terms, which are the same size as ic.
So imagine my rw = 2:6 all got the index ic = 5. In vectorized mode they will each independently access index 5 for the mass term and update it, but I need all contributions added up at index 5.
So I need some sort of reduction, don't I? How can I do this with my arrays mass, xmom, etc. when they are part of the derived type?
Fully vectorizing a loop with an indirect index is problematic. The ic(ind) should thwart usual vectorization: while dum?(ind) is contiguous, it is unknown whether ic(ind) is contiguous or not. This said, compiler optimizations have come a long way. I haven't looked at the generated assembly code on my test system (Core i7 2600K with AVX), which does not have scatter/gather. The vectorization report stated it vectorized two lanes of the possible 8. It could be that it generated code to loop over ind with a stride of 2 and check whether ic(ind) and ic(ind+1) are adjacent; if so, call a 2-wide vector version of ADD_FUEL_SOURCES, and if not, call a scalar version of ADD_FUEL_SOURCES twice. Don't know.
This said, if (when) you add parallelization, the compiler may not be capable of making any determination about possible conflicts amongst threads. This means you will have to smarten up your code when you go parallel. As to what to do, this greatly depends on your actual code, in particular what is actually done in your do rw loop and in your ADD_FUEL_SOURCES; IOW, which portion to favor in your optimization.
Jim Dempsey
Yes, although IC(1:end) is contiguous, the way we access it in the rw loop is completely arbitrary. Would a reduction help in this case?
For example, reduction(+:this% mass). It seems, though, that reduction cannot cope with derived types, so I would need to find a workaround, or simply use auxiliary arrays for mass etc., do the reduction on those, and then copy the aux arrays over to the derived type after the loop.
Is that usually how it is done?
PS: My rw loop is literally what I have shown, although with many more components in my derived types. But all have trivial operations, are done in the loop, and are independent of each rw iteration. It is only at the end, with ADD_FUEL_SOURCES, that I face the indirect indexing.
Cheers!
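A minimal sketch of the auxiliary-array workaround described above, assuming OpenMP 4.5 array reductions; amd_contrib is a hypothetical per-element contribution, and the same pattern would repeat for xmom etc.:

```fortran
! Reduce into a plain local array (reduction on a derived-type component
! such as this% mass is not accepted), then fold it back after the loop.
real, allocatable :: aux_mass(:)
allocate(aux_mass(size(part% mass)))
aux_mass = 0.0
!$OMP SIMD reduction(+:aux_mass)
do rw = 1, nrw
   ind = indoffset + rw
   aux_mass(ic(ind)) = aux_mass(ic(ind)) + amd_contrib(ind)  ! amd_contrib: hypothetical
enddo
part% mass = part% mass + aux_mass
```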
>>For example, reduction(+:this% mass).
While that would work, it would be extremely inefficient: not only do you have mass, but you also have xmom, ymom, zmom, ???mom, each allocated to the full extent of the number of particles. It may (may stressed) be better to determine, for each xxx(ic), a list of contributors, that is to say a list of ind's that contribute to the fuel sources. *** But the strategy greatly depends on the computation required between the caller and the callee. For example (not saying this is what you do), in the caller you might use:
...
do rw = 1, nrw
ind = indoffset + rw
part% duma( ind ) = REAL(fn*rw*inj-rw*inj )
part% dumb( ind ) = REAL(fn*rw*inj-1-rw*inj)
part% dumc( ind ) = REAL(fn*rw*inj-2-rw*inj)
part% dumd( ind ) = REAL(fn*rw*inj-3-rw*inj)
m1(ind) = part% duma( ind )-part% dumb( ind )
nd(ind) = part% dumc(ind)*m1(ind)-part% dumd( ind )
m2(ind) = m1(ind) + nd(ind)
u1(ind) = part% duma(ind)/part% dumb(ind)
v1(ind) = part% dumb(ind)/100.
w1(ind) = part% dumb(ind)
enddo
! the above loop is fully vectorizable
! but it has not performed:
! call part% ADD_FUEL_SOURCES(m1,m2, u1,v1,w1,nd,ic(ind))
!dir$ if defined(UseSimpleMethod)
do rw = 1, nrw
ind = indoffset + rw
call part% ADD_FUEL_SOURCES(m1(ind),m2(ind), u1(ind),v1(ind),w1(ind),nd(ind),ic(ind))
end do
!dir$ else
!dir$ if defined(Use2WideMethod)
do rw = 1, nrw, 2
ind = indoffset + rw
if(rw == nrw) then
call part% ADD_FUEL_SOURCES(m1(ind),m2(ind), u1(ind),v1(ind),w1(ind),nd(ind),ic(ind))
else
if(ic(ind) + 1 == ic(ind+1)) then
! adjacent cells
call part% ADD_FUEL_SOURCES_X2(m1,m2, u1,v1,w1,nd,ic(ind)) ! and ic(ind)+1, same as ic(ind+1)
else
call part% ADD_FUEL_SOURCES(m1(ind),m2(ind), u1(ind),v1(ind),w1(ind),nd(ind),ic(ind))
call part% ADD_FUEL_SOURCES(m1(ind+1),m2(ind+1), u1(ind+1),v1(ind+1),w1(ind+1),nd(ind+1),ic(ind+1))
endif
endif
end do
!dir$ else
... You expand to 4-wide and 8-wide
!dir$ endif
!dir$ endif
...
Is it more work? Yes. Is it faster? TBD; this depends on the computation recoverable by vectorizing ADD_FUEL_SOURCES.
Also not shown is what to do if (when)
ic(ind) == ic(ind+1) ! and (ind+2), ...
IOW, adjacent ic(x)'s hold the same index. In that case the question is whether the arguments to add_fuel_sources can be summed, so that a single call to the scalar variant of the subroutine suffices.
Without seeing the actual code, it is difficult to offer the correct assessment.
Do you need professional help on this?
Jim Dempsey
Thanks very much for your time Jim!
I am not sure I understand the second part of the code. Is the 'Use2WideMethod' variant meant to make the code vectorize when you have ic(ind) == ic(ind+1), or just to reduce the cost?
Also not sure what you mean here :
Also not shown is if (when)
ic(ind) == ic(ind+1) ! and (ind+2), ...
IOW adjacent ic(x)'s have the same index
as to if the arguments to add_fuel_sources can be summed, and then use a single call to the scalar variant of the subroutine.
Is that related to the add_fuel_sources_X2 call?
Regarding professional help, I am not sure that I am entitled to it (depends on what it involves), as this work is related to my PhD research.
When ic(ind) == ic(ind+1), two adjacent entries of array ic contain the same index value (duplicates held in ic(ind) and ic(ind+1)).
If and only if the components (m1(ind), nd(ind), ...) can be added to (m1(ind+1), nd(ind+1), ...) respectively, and then submitted to a single call to ADD_FUEL_SOURCES with the same result as making two individual calls, one with (m1(ind), nd(ind), ...) and a second with (m1(ind+1), nd(ind+1), ...), then you will save time, assuming the frequency of identical values in ic(ind) and ic(ind+1) is sufficient to offset the overhead of deciding whether to combine.
The add_fuel_sources_X2 call is different: it requires (and is called only when) ic(ind) + 1 == ic(ind+1),
IOW when two adjacent cells in ic hold two sequential numbers (e.g. ic(ind)==4, ic(ind+1)==5).
Jim Dempsey