a smart(er) way to deal with zero positions in vector subscripts .... any suggestions

may_ka · ‎02-06-2019

Hi,

the code below aggregates values from vector b into vector a using vector p as a position index for a:

Program Test
  real, allocatable :: a(:), b(:)
  integer, allocatable :: p(:), f(:)
  integer :: i,j
  allocate(b(5),p(5))
  p=(/1,2,0,2,1/)
  b=(/0.5,0.1,0.5,0.3,0.1/)
  !!option 1
  allocate(a(1:maxval(p)),source=0.0)
  Do i=1,size(b,1)
    if(p(i)==0) cycle
    a(p(i))=a(p(i))+b(i)
  End Do
  write(*,*) a
  !!end option 1
  deallocate(a)
  !!option 2
  allocate(a(0:maxval(p)),source=0.0)
  Do i=1,size(b,1)
    a(p(i))=a(p(i))+b(i)
  End Do
  a(0)=0.0
  write(*,*) a(1:ubound(a,1))
  !!end option 2
  deallocate(a)
  !!option 3
  allocate(a(maxval(p)),source=0.0)
  allocate(f(count(p>0)))
  f=(/1,2,4,5/)
  Do i=1,size(f,1)
    a(p(f(i)))=a(p(f(i)))+b(f(i))
  End Do
  write(*,*) a
  !!end option 3
end Program Test

The value of zero in index vector p indicates that the related value in b should be skipped.

To my knowledge there are three options to achieve this: a) an if branch, b) starting vector "a" at position zero and c) an additional position vector "f". The first has a drawback of possibly difficult to predict branching. The second has the problem that the position zero of "a" has to be loaded into the cpu every time a zero occures and subsequent operations on that array are more tricky to optimize due to the "weird" array boundaries. The third option poses an additional overhead in terms of memory due to having to allocate memory for "f". In addition there might be some low down due to double indexing.

There are some boundary conditions to this problem:

1. arrays "a", "b" and "p" can be huge (e.g. 100,000,000)

2. all arrays can be two dimensional which would require a zero lower boundary for the first dimension if option 2 is consideredaa

3. none of the arrays can be reallocated before entering the loop

4. the number and distribution of zeros in "p" is determined by the data at run time and their order cannot be changed. There might be no zeros at all.

5. non-zero values in "p" may or may not cover the full length array "a"

Any suggestions for another option are appreciated.

Thanhs

mecej4 · ‎02-06-2019

I don't know whether it is "smart", but the WHERE statement/construct appear to be just what the doctor would have ordered. For example:

Program Test
  real, allocatable :: a(:), b(:)
  integer, allocatable :: p(:), f(:)
  allocate(b(5),p(5))
  p=(/1,2,0,2,1/)
  b=(/0.5,0.1,0.5,0.3,0.1/)
  allocate(a(1:maxval(p)),source=0.0)
  where(p /= 0) a(p)=a(p)+b
  write(*,*) a
  deallocate(a)
end Program Test

The code is certainly more readable as a result of using WHERE.

may_ka · ‎02-06-2019

Hi,

from my perspective "where" is just a short-hand for a loop with an if branch. I suspect the performance isnt any better.

jimdempseyatthecove · ‎02-06-2019

I suggest you take a look at the disassembly code of mecej4's suggestion (use larger arrays). The syntax he is using will likely enable the compiler to generate vectorized code using gather, test, masked add, masked scatter store.

If the density (frequency) of stores exceeds one per cache line, the code should run faster, if average of 1 store per cache line then same speed. If less than 1 store per cache line it still may be a tad faster.

Generate the code, run VTune, then expand the section using the Disassembly view.

Jim Dempsey

FortranFan · ‎02-06-2019

mecej4 wrote:
I don't know whether it is "smart", but the WHERE statement/construct appear to be just what the doctor would have ordered. For example:
Program Test
  real, allocatable :: a(:), b(:)
  integer, allocatable :: p(:), f(:)
  allocate(b(5),p(5))
  p=(/1,2,0,2,1/)
  b=(/0.5,0.1,0.5,0.3,0.1/)
  allocate(a(1:maxval(p)),source=0.0)
  where(p /= 0) a(p)=a(p)+b
  write(*,*) a
  deallocate(a)
end Program Test
The code is certainly more readable as a result of using WHERE.

Good suggestion but note the Fortran standard states, "In each where-assignment-stmt, the mask-expr and the variable being defined shall be arrays of the same shape" which is not met per OP's shape declarations.