- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I made this simplified test code
program openmp_test implicit none type :: t_data integer :: size_ integer, allocatable :: counter(:) end type t_data type t_stat integer :: trc(20) end type t_stat type(t_data) :: td type(t_stat) :: ts integer :: i , ip allocate(td% counter(100)) td%size_ = 100 ts% trc = 3 td% counter = 10 ip = 40 !$OMP SIMD do i = 1, 10 ! the loop I want to vectorize ip = ip + 1 td% counter(ip) = td% counter(ip) + 1 ts% trc(8) = ts% trc(8) + 1 enddo do i = 0, 11 print*, td% counter(i+40) enddo print*, "TRC" , ts% trc(8) end program openmp_test
Now my question is how can I correctly/safely make this loop so my ip is incremented corretly and that I get my ts% trc counter correct?
I am aware of some of the different clauses, however, some seem not to work with simd? For example, if I do !$OMP SIMD SHARED(ip) it complains that it was not expecting such clause. Similarly for firstprivate. Is that purposely made?
Also, seems that there is an internal error when doing reduction(+:ts%trc(8)). Does reduction clause normally accepts declared type?. If so, I will report this to intel Support team
I am using ifort 19.0.1
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Internal compiler errors are always errors and should be reported to the Support team.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The compiler probably attempts to hoist ip and ts%trc from the loop, unless you have cut back optimization. That would be clearly better; besides the loop is too short for simd. As Juergen said, internal error is always a reportable bug.
omp simd has no firstprivate clause. reduction clause takes care of that and lastprivate. Intel compilers have a habit of breaking when omp simd is set without correct clauses. It should issue a diagnostic rather than an internal error.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello Tim,
This is just an attempt to replicate my real application.
I still did not get how to instruct the compiler about the two counting variables. The problem is that the compiler complains that it CANNOT vectorise the loop because of local assignments. If I e.g. put ip to private (although that would be wrong) it does not complain any more.
Also, does reduction clauses not accept declared type variables?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
... !dir$ if(.false.) !$OMP SIMD do i = 1, 10 ! the loop I want to vectorize ip = ip + 1 td% counter(ip) = td% counter(ip) + 1 ts% trc(8) = ts% trc(8) + 1 enddo !dir$ else i = 10 !dir$ vector td% counter(ip+1:ip+i) = td% counter(ip+1:ip+i) + 1 ts% trc(8) = ts% trc(8) + i ip = ip + i !dir$ endif ...
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello Jim,
Thanks I got your point. However, I my loop is actually larger and a bit more complicated. So with the example I made was to try to show a part of it and what I am trying to do.
My question really is if there is a way to apply reduction clauses?
Would that have worked for example:
!$OMP SIMD REDUCTION(+:trc) do i = 1, 10 ! the loop I want to vectorize ip = ip + 1 trc(8) = trc(8) + 1 enddo
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The loop you have shown in #6 is operating on a scalar (iow a single element in an array). IOW vectorization is nonsensical. Also, if you intend to parallelize this loop (we will assume the iteration count will be much larger than 10, say N) then:
ip = ip + 1
is likely not what you intend it to be, or, if it is, will not do what you intend it to do.
In the case that it is what you intend it to do (e.g. a counter), then include ip in the reduction clause .OR. place !$omp atomic in front of that statement.
On the other hand, if (when) ip is used as an index, you will need to code something like this:
ip_start = ip !$OMP SIMD PRIVATE(ip) REDUCTION(+:trc) do i = 1, N ! the loop I want to vectorize ip = ip_start + i trc(8) = trc(8) + 1 ... enddo ip = ip_start + N
If possible, please show either your actual loop or a facsimile that shows the characteristics of your loop as well as exhibits failure to vectorize.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Note, if ip always starts out at 1, then use do ip=1, N and use ip instead of i in the loop.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello Jim,
I ended up manipulating the loop so it always goes from 1 to N and then doing the indexing inside the loop as you showed.
BTW: is there a tump rule when one should not vectorize since the overhead could be larger.
In one of my example I have three nested loops like
do k = 1, nk ! ~30-60 do j = 1, nj ! ~90-300 do i = 1, ni ! ~10-20 -> the loop to be vectorized
This is repeated millions of time, however, as the most nested loop is only 10-20 it is probably not worth doing anything.
Notice that it is not possible to reverse the order of loop (it is not sensical in my application)
Finally, I know that statements like BREAK/EXIT or while loops are not vectorizable as such, but what about CYCLE/CONTINUE in general.
Thanks very much in advance
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>as the most nested loop is only 10-20 it is probably not worth doing anything
Not so. if your inner loop code is vectorizable, then depending on your vector load (16 floats/8 doubles, 8 floats/4 doubles, or 4 floats/2 doubles), the inner loop trip count can be reduced by the inverse of the vector load. And it looks likely that at your high end loop counts (360,000) that you should consider parallelizing the loops either just the outer loop or the outer two loops with the collapse(2) clause.
>>CYCLE
In the above code (3 nested loops) it is OK to cycle or exit a loop that is neither vectorized nor parallelized as this makes your loop count in determinant. This said, with the loop count as short as the inner loop (10-20) note that unused lanes within a vector cost you nothing (assuming no divide by 0 or other faulting condition). Therefor you can (for this short loop) compute event the extra iterations, and then ignore the meaningless data. For longer vectorized loops, consider batching the each iteration by vector width
! padd arrays to fill vector width do i=1,niPadded,VectorWidth ! 2, 4, 8, 16 a(i:i+VectorWidth-1) = ... ...
Then afterwards use the unpadded size a(1:ni)
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Additional information:
IIF the inner most loop count (ni) is a parameter, (known at compile time) then the compiler may vectorize the code without you doing anything to the code. Aligning and padding may eliminate some unnecessary code.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
jimdempseyatthecove wrote:>>as the most nested loop is only 10-20 it is probably not worth doing anything
Not so. if your inner loop code is vectorizable, then depending on your vector load (16 floats/8 doubles, 8 floats/4 doubles, or 4 floats/2 doubles), the inner loop trip count can be reduced by the inverse of the vector load. And it looks likely that at your high end loop counts (360,000) that you should consider parallelizing the loops either just the outer loop or the outer two loops with the collapse(2) clause.
>>CYCLE
In the above code (3 nested loops) it is OK to cycle or exit a loop that is neither vectorized nor parallelized as this makes your loop count in determinant. This said, with the loop count as short as the inner loop (10-20) note that unused lanes within a vector cost you nothing (assuming no divide by 0 or other faulting condition). Therefor you can (for this short loop) compute event the extra iterations, and then ignore the meaningless data. For longer vectorized loops, consider batching the each iteration by vector width
! padd arrays to fill vector width do i=1,niPadded,VectorWidth ! 2, 4, 8, 16 a(i:i+VectorWidth-1) = ... ...Then afterwards use the unpadded size a(1:ni)
Jim Dempsey
I just came across an issue which I think is related to the lack of not having what you just suggested?
I have a loop of i = 1, novar where at one case novar is 100 (does not seem a high number), however, it is a long routine with a few other routine calls. Everything has been made vectorized (at least according to the optimisation report etc), however, when running with the optimised mode (-O3) I get the error that my array is out of bounds. It is clearly not! I have tried to run at full DEBUG mode and everything runs fine.
If I instead say do i = 1, 100 ---> IT WORKS!
So can I assume that it may be because I need to padd the arrays.
Also what do you mean by
Then afterwards use the unpadded size a(1:ni)
It just means that my entire array of `a` can now be used?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have a loop of i = 1, novar where at one case novar is 100 (does not seem a high number), however, it is a long routine with a few other routine calls. Everything has been made vectorized (at least according to the optimisation report etc), however, when running with the optimised mode (-O3) I get the error that my array is out of bounds. It is clearly not! I have tried to run at full DEBUG mode and everything runs fine.
If I instead say do i = 1, 100 ---> IT WORKS!
Are you sure that novar is properly set in your parallelized setup? Maybe it runs out of scope?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I am not running in parallel at all yet. So I am not sure how else That could run out of scope?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Without seeing your code it is hard to tell what is going on.
IIF you are using the vector scheme presented in #10, then you must padd the array(s) to vector width. With array size of 100 (not padded), and vector width of say 8. Your last reference becomes array(97:97+8-1) or (97:104).
IIF you are not using the vector scheme presented in #10, and not padding, then you have a genuine address outside of bounds.
I suggest you debug what is going on. Note, adding statements to optimized code often hides the problem.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks Jim.
Two things
how do you debug in vectorized mode? - as soon as I do print statement, the compiler refuses to vectorize and then my code will work.
Is what you mentioned how it always work? Because i have many times used omp simd without payinh attention to padding. For example when I reset certain various arrays before each time step I do
!$omp simd
do i=1,size_
are(i) = 0.
(...)
enddo
I never encountered a problem before now.
Also, when you are saying this. Does it mean I need to consider the indices from 97 to 100 in a serialized mode by having a separete piece of code to account for this?
thanks very much
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I said: IIF you are using the vector scheme presented in #10
Your code in #16 is not.
>>Also, when you are saying this. Does it mean I need to consider the indices from 97 to 100 in a serialized mode by having a separete piece of code to account for this?
Well you haven't shown the code. Generally (when coding do I=1,size) you do not have to worry about the partial vector at the end.
However, in a specific case where you pass Array(97) to a subroutine/function that declares the dummy as X(VectorWidth), and then manipulate X as if it had a full width of the vector to use, then you have a coding problem.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Okay I guess I misunderstod what you meant by vector scheme. But no I am not, I am just doing a conventional do loop
it is basically:
!$omp simd
do I=1,no_particles
enddo
Inside I am running over particles, hence, i am operating over arrays of size 3 (x,y,z)for each loop.
I will try to simplify a case for illustration, however, I am just concerned that case might just be problemfree and it is only when the things become a bit more complicated iget this problem
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
In an attempt to explain what Jim probably means, if you have a loop inside a procedure with a fixed size declared, it is likely to break if you call the procedure with a different vector length. The problem might be exposed by vectorization. When you have argument cross checking enabled or use an explicit interface, it should catch this violation of Fortran standard.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>get the error that my array is out of bounds
When you get that error... what was the value of the invalid index as compared to the last (first) valid index.
Also, in order to get this error, you must have enabled array out of bounds checking. I think that this disables vectorization, though if it does not, there could be a runtime bug where the oob indication is made on the full width of the vector as opposed to the partial width that may be in effect (vector instruction with mask).
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the answers Jim & Tim.
I made some significant progress I would say.
I found two bugs, one which is a silent bug (i.e. the results would be wrong, but the code is valid), and the one which triggers the error I mentioned.
BTW: sometime it says segmentation fault and other time I get array out of bounds. This is only happening when having the OMP SIMD loop enabled.
As I explained my routine which contains the omp simd loop is type bound, hence, it got the declaration (class). IF I change this to type it works!!
I have tried to recreate a simplified version, however, it does not seem to recreate the problem. Perhaps you can early on spot a potential issue.
Two issues:
1) The first section of the code where I have three loops and doing simd loop over `rw=1,randw`. As you observe I am calling a procedure to add some terms into my arrays. Where to put the various terms are determined by `ic`. This ic can be completely arbitrary. Basically, it is an index pointing to which grid cell a particle is located. It means that two or more particles can have THE SAME ic. I tried to recreate that by setting up an ic array that have some repeatability - and if you run a vectorized and serial mode you will realize that the print statements are not giving the same results. Clearly, because my omp simd needs somehow a reduction. But how do I enforce this for such applications where my reduction arrays are obtained from another routine and they are properties of a declared type?
2) The dum_loop routine works. However, in my a bit more lengthty application it did not work before I changed ` class(part1), intent(inout) :: this` to `type(part1), intent(inout) :: this`. Clearly, I had to change the procedure declaration from `pass(this)` to `nopass`
BTW: All calculations are completely random and might be very primitive
module mod_part implicit none type part1 real , allocatable :: duma(:), dumb(:), dumc(:), dumd (:) real , allocatable :: mass(:), xmom(:), ymom(:), zmom(:) contains procedure, pass(this) :: ADD_FUEL_SOURCES procedure, pass(this) :: dum_loop end type contains subroutine ADD_FUEL_SOURCES(this, m1, m2, u1,v1,w1, nd, ic) !$OMP DECLARE SIMD(ADD_FUEL_SOURCES) UNIFORM(this) implicit none class(part1), intent(inout) :: this integer, intent(in) :: ic real, intent(in) :: nd, m1,m2,w1,u1,v1 real :: amd, dmasum, dm, mc dm = m1 - m2 amd = nd * dm this% mass(ic) = this% mass(ic) + amd this% xmom(ic) = this% xmom(ic) + amd*u1 this% ymom(ic) = this% ymom(ic) + amd*v1 this% zmom(ic) = this% zmom(ic) + amd*w1 ! this% enthalpy(ic) = this% enthalpy(ic) + nd*hfuel end subroutine ADD_FUEL_SOURCES subroutine dum_loop(this,size_) implicit none class(part1), intent(inout) :: this integer, intent(in) :: size_ integer :: i !$OMP SIMD do i = 1, size_ this% mass(i) = i + 1-size_+size_/4.3 enddo end subroutine dum_loop end module !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! program TEST_OMPSIMD use mod_part implicit none integer :: noz, inj, rw , nsl , fn ,i, randw, nozcnt, injcnt, ind integer, allocatable :: ic(:) real :: m1,m2, u1,v1,w1, nd type(part1) :: part noz = 60 nsl = 120 randw = 10*noz*nsl allocate(part% duma(noz*nsl*randw),part% dumb(noz*nsl*randw),& & part% dumc(noz*nsl*randw),part% dumd(noz*nsl*randw),& & part% mass(randw*noz*nsl),part% xmom(randw*noz*nsl),part% ymom(randw*noz*nsl),& & part% zmom(randw*noz*nsl),ic(randw*noz*nsl)) randw = 10 ind = 0 do i = 1, randw if( mod(i,3 ) ) then ind = 1 else ind = ind + 1 endif enddo ic = ind nozcnt = 0 injcnt = 0 ind = 0 do fn = 1, noz do inj= 1, nsl !$OMP SIMD private(ind) do rw = 1, randw ind = ind + rw part% duma( ind ) = REAL(fn*rw*inj-rw*inj ) part% dumb( ind ) = REAL(fn*rw*inj-1-rw*inj) part% dumc( ind ) = REAL(fn*rw*inj-2-rw*inj) part% dumd( ind ) = REAL(fn*rw*inj-3-rw*inj) m1 = part% duma( ind )-part% dumb( ind ) nd = part% dumc(ind)*m1-part% dumd( ind ) m2 = m1 + nd u1 = part% duma(ind)/part% dumb(ind) v1 = part% dumb(ind)/100. w1 = part% dumb(ind) ! ic = rw !print*, m1,m2, u1,v1,w1,nd,ic call part% ADD_FUEL_SOURCES(m1,m2, u1,v1,w1,nd,ic(ind)) enddo enddo enddo print*, sum(part%duma)+sum(part%dumb)+sum(part%dumc)+sum(part%dumd) print*, sum(part%mass ), sum(part%ymom ) call part% dum_loop(10000) print*, sum(part% mass(1:10000)) end program TEST_OMPSIMD
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page