Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Highlighted
Beginner
27 Views

Counting within a SIMD loop in Fortran

I made this simplified test code

 

 

program openmp_test

implicit none


type :: t_data
   integer :: size_
   integer, allocatable :: counter(:)
end type t_data

type t_stat
   integer :: trc(20)
end type t_stat


type(t_data)  :: td
type(t_stat) :: ts
integer                    :: i , ip

allocate(td% counter(100))
td%size_ = 100

ts% trc = 3

td% counter = 10

ip = 40

!$OMP SIMD 
do i = 1, 10  ! the loop I want to vectorize
   ip = ip + 1
   td% counter(ip) = td% counter(ip) + 1
   ts% trc(8)      = ts% trc(8) + 1
enddo


do i = 0, 11
   print*, td% counter(i+40)
enddo

print*, "TRC" , ts% trc(8)


end program openmp_test

 

 

 

 

Now my question is how can I correctly/safely make this loop so my ip is incremented corretly and that I get my ts% trc counter correct?

I am aware of some of the different clauses, however, some seem not to work with simd? For example, if I do !$OMP SIMD SHARED(ip) it complains that it was not expecting such clause. Similarly for firstprivate. Is that purposely made? 

Also, seems that there is an internal error when doing reduction(+:ts%trc(8)). Does reduction clause normally accepts declared type?. If so, I will report this to intel Support team

I am using ifort 19.0.1

0 Kudos
29 Replies
Highlighted
Valued Contributor I
23 Views

Internal compiler errors are always errors and should be reported to the Support team.

0 Kudos
Highlighted
Black Belt
23 Views

The compiler probably attempts to hoist ip and ts%trc from the loop, unless you have cut back optimization.  That would be clearly better; besides the loop is too short for simd.  As Juergen said, internal error is always a reportable bug.

omp simd has no firstprivate clause.  reduction clause takes care of that and lastprivate.  Intel compilers have a habit of breaking when omp simd is set without correct clauses.  It should issue a diagnostic rather than an internal error.

0 Kudos
Highlighted
Beginner
23 Views

Hello Tim,

 

This is just an attempt to replicate my real application.

I still did not get how to instruct the compiler about the two counting variables. The problem is that the compiler complains that it CANNOT vectorise the loop because of local assignments. If I e.g. put ip to private (although that would be wrong) it does not complain any more.

Also, does reduction clauses not accept declared type variables?

0 Kudos
Highlighted
23 Views

...
!dir$ if(.false.)
!$OMP SIMD 
do i = 1, 10  ! the loop I want to vectorize
   ip = ip + 1
   td% counter(ip) = td% counter(ip) + 1
   ts% trc(8)      = ts% trc(8) + 1
enddo
!dir$ else
   i = 10
   !dir$ vector
   td% counter(ip+1:ip+i) = td% counter(ip+1:ip+i) + 1
   ts% trc(8)      = ts% trc(8) + i
   ip = ip + i
!dir$ endif
...

Jim Dempsey

0 Kudos
Highlighted
Beginner
23 Views

Hello Jim,

Thanks I got your point. However, I my loop is actually larger and a bit more complicated. So with the example I made was to try to show a part of it and what I am trying to do.

My question really is if there is a way to apply reduction clauses?

Would that have worked for example:

 

 

!$OMP SIMD REDUCTION(+:trc)
do i = 1, 10  ! the loop I want to vectorize
   ip = ip + 1
   trc(8)      = trc(8) + 1
enddo

 

0 Kudos
Highlighted
23 Views

The loop you have shown in #6 is operating on a scalar (iow a single element in an array). IOW vectorization is nonsensical. Also, if you intend to parallelize this loop (we will assume the iteration count will be much larger than 10, say N) then:

   ip = ip + 1

is likely not what you intend it to be, or, if it is, will not do what you intend it to do.

In the case that it is what you intend it to do (e.g. a counter), then include ip in the reduction clause .OR. place !$omp atomic in front of that statement.

On the other hand, if (when) ip is used as an index, you will need to code something like this:

ip_start = ip
!$OMP SIMD PRIVATE(ip) REDUCTION(+:trc)
do i = 1, N  ! the loop I want to vectorize
   ip = ip_start + i
   trc(8)      = trc(8) + 1
   ...
enddo
ip = ip_start + N

If possible, please show either your actual loop or a facsimile that shows the characteristics of your loop as well as exhibits failure to vectorize.

Jim Dempsey

0 Kudos
Highlighted
23 Views

Note, if ip always starts out at 1, then use do ip=1, N and use ip instead of i in the loop.

Jim Dempsey

0 Kudos
Highlighted
Beginner
23 Views

Hello Jim, 

I ended up manipulating the loop so it always goes from 1 to N and then doing the indexing inside the loop as you showed. 

BTW: is there a tump rule when one should not vectorize since the overhead could be larger. 

In one of my example I have three nested loops like

 

do k = 1, nk  ! ~30-60
do j = 1, nj    ! ~90-300
do i = 1, ni    ! ~10-20 -> the loop to be vectorized

 

This is repeated millions of time, however, as the most nested loop is only 10-20 it is probably not worth doing anything.

Notice that it is not possible to reverse the order of loop (it is not sensical in my application) 

 

Finally, I know that statements like BREAK/EXIT or while loops are not vectorizable as such, but what about CYCLE/CONTINUE in general.

 

Thanks very much in advance

0 Kudos
Highlighted
23 Views

>>as the most nested loop is only 10-20 it is probably not worth doing anything

Not so. if your inner loop code is vectorizable, then depending on your vector load (16 floats/8 doubles, 8 floats/4 doubles, or 4 floats/2 doubles), the inner loop trip count can be reduced by the inverse of the vector load. And it looks likely that at your high end loop counts (360,000) that you should consider parallelizing the loops either just the outer loop or the outer two loops with the collapse(2) clause.

>>CYCLE

In the above code (3 nested loops) it is OK to cycle or exit a loop that is neither vectorized nor parallelized as this makes your loop count in determinant. This said, with the loop count as short as the inner loop (10-20) note that unused lanes within a vector cost you nothing (assuming no divide by 0 or other faulting condition). Therefor you can (for this short loop) compute event the extra iterations, and then ignore the meaningless data. For longer vectorized loops, consider batching the each iteration by vector width 

! padd arrays to fill vector width
do i=1,niPadded,VectorWidth ! 2, 4, 8, 16
  a(i:i+VectorWidth-1) = ...
  ...

Then afterwards use the unpadded size a(1:ni)

Jim Dempsey

0 Kudos
Highlighted
23 Views

Additional information:

IIF the inner most loop count (ni) is a parameter, (known at compile time) then the compiler may vectorize the code without you doing anything to the code. Aligning and padding may eliminate some unnecessary code.

Jim Dempsey

0 Kudos
Highlighted
Beginner
23 Views

jimdempseyatthecove wrote:

>>as the most nested loop is only 10-20 it is probably not worth doing anything

Not so. if your inner loop code is vectorizable, then depending on your vector load (16 floats/8 doubles, 8 floats/4 doubles, or 4 floats/2 doubles), the inner loop trip count can be reduced by the inverse of the vector load. And it looks likely that at your high end loop counts (360,000) that you should consider parallelizing the loops either just the outer loop or the outer two loops with the collapse(2) clause.

>>CYCLE

In the above code (3 nested loops) it is OK to cycle or exit a loop that is neither vectorized nor parallelized as this makes your loop count in determinant. This said, with the loop count as short as the inner loop (10-20) note that unused lanes within a vector cost you nothing (assuming no divide by 0 or other faulting condition). Therefor you can (for this short loop) compute event the extra iterations, and then ignore the meaningless data. For longer vectorized loops, consider batching the each iteration by vector width 

! padd arrays to fill vector width
do i=1,niPadded,VectorWidth ! 2, 4, 8, 16
  a(i:i+VectorWidth-1) = ...
  ...

Then afterwards use the unpadded size a(1:ni)

Jim Dempsey

 

I just came across an issue which I think is related to the lack of not having what you just suggested?

I have a loop of i = 1, novar where at one case novar is 100 (does not seem a high number), however, it is a long routine with a few other routine calls. Everything has been made vectorized (at least according to the optimisation report etc), however, when running with the optimised mode (-O3) I get the error that my array is out of bounds. It is clearly not! I have tried to run at full DEBUG mode and everything runs fine.

If I instead say do i = 1, 100 ---> IT WORKS!

So can I assume that it may be because I need to padd the arrays.

 

Also what do you mean by

Then afterwards use the unpadded size a(1:ni)

It just means that my entire array of `a` can now be used?

0 Kudos
Highlighted
Valued Contributor I
23 Views

I have a loop of i = 1, novar where at one case novar is 100 (does not seem a high number), however, it is a long routine with a few other routine calls. Everything has been made vectorized (at least according to the optimisation report etc), however, when running with the optimised mode (-O3) I get the error that my array is out of bounds. It is clearly not! I have tried to run at full DEBUG mode and everything runs fine.

If I instead say do i = 1, 100 ---> IT WORKS!

Are you sure that novar is properly set in your parallelized setup? Maybe it runs out of scope?

 

0 Kudos
Highlighted
Beginner
23 Views

I am not running in parallel at all yet. So I am not sure how else That could run out of scope?

 

0 Kudos
Highlighted
23 Views

Without seeing your code it is hard to tell what is going on.

IIF you are using the vector scheme presented in #10, then you must padd the array(s) to vector width. With array size of 100 (not padded), and vector width of say 8. Your last reference becomes array(97:97+8-1) or (97:104).

IIF you are not using the vector scheme presented in #10, and not padding, then you have a genuine address outside of bounds.

I suggest you debug what is going on. Note, adding statements to optimized code often hides the problem.

Jim Dempsey

0 Kudos
Highlighted
Beginner
23 Views

Thanks Jim. 

Two things

how do you debug in vectorized mode? - as soon as I do print statement, the compiler refuses to vectorize and then my code will work.

 

Is what you mentioned how it always work? Because i have many times used omp simd without payinh attention to padding. For example when I reset certain various arrays before each time step I do

 

!$omp simd

do i=1,size_

are(i) = 0.

(...)

enddo

I never encountered a problem before now. 

Also, when you are saying this. Does it mean I need to consider the indices from 97 to 100 in a serialized mode by having a separete piece of code to account for this?

 

thanks very much

0 Kudos
Highlighted
23 Views

I said: IIF you are using the vector scheme presented in #10

Your code in #16 is not.

>>Also, when you are saying this. Does it mean I need to consider the indices from 97 to 100 in a serialized mode by having a separete piece of code to account for this?

Well you haven't shown the code. Generally (when coding do I=1,size) you do not have to worry about the partial vector at the end.

However, in a specific case where you pass Array(97) to a subroutine/function that declares the dummy as X(VectorWidth), and then manipulate X as if it had a full width of the vector to use, then you have a coding problem.

Jim Dempsey

 

0 Kudos
Highlighted
Beginner
23 Views

Okay I guess I misunderstod what you meant by vector scheme. But no I am not, I am just doing a conventional do loop

it is basically:

!$omp simd

do I=1,no_particles

 

enddo

 

Inside I am running over particles, hence, i am operating over arrays of size 3 (x,y,z)for each loop.

I will try to simplify a case for illustration, however, I am just concerned that case might just be problemfree and it is only when the things become a bit more complicated iget this problem

0 Kudos
Highlighted
Black Belt
23 Views

In an attempt to explain what Jim probably means, if you have a loop inside a procedure with a fixed size declared, it is likely to break if you call the procedure with a different vector length. The problem might be exposed by vectorization. When you have argument cross checking enabled or use an explicit interface, it should catch this violation of Fortran standard.

0 Kudos
Highlighted
23 Views

>>get the error that my array is out of bounds

When you get that error... what was the value of the invalid index as compared to the last (first) valid index.

Also, in order to get this error, you must have enabled array out of bounds checking. I think that this disables vectorization, though if it does not, there could be a runtime bug where the oob indication is made on the full width of the vector as opposed to the partial width that may be in effect (vector instruction with mask).

Jim Dempsey

0 Kudos