Vectorization help

unrue · ‎11-12-2014

Dear Intel users, I'm using ifort 13.1.3 under Linux.

I would like to vectorize the inner loop but I'm not sure that the inner loop is a good candidate. This is the loop:

do i = 1, base_dim

       x = log( real(n + 1) / real(base(i)) )
       npow = floor(x)

       sk = n
       do j  = npow, 0, -1
          bk = base(i)**j
          nk = floor( real(sk) / real(bk) )
          sk = sk - nk * bk

          q(i) = q(i) + real(nk) / real(bk * base(i))

       end do

    end do

n, base and base_dim are function input. The loop is not vectorized due a dependency in a line sk = sk - nk*bk. I'm not sure that this is a prefix sum case, because each value depends on sk at the prevoius step. Do you have an idea?

Thanks in advance.

TimP · ‎11-12-2014

You're right, the inner loop would not be vectorizable. The current compiler suggests either switching the loops or applying simd directive on the outer loop, but the degree of mixed data types appears to be an impediment to auto-vectorization with short vector math library.

jimdempseyatthecove · ‎11-12-2014

At issue here is each q(i) potentially will have different iteration counts and thus not lend its self to vectorization as written.

The suggestion I have will require a little work, but my guess is your problem statement came about because you need to improve this routine. I will sketch what you can do, you decide if it is worth it to program.

Produce an array of npow's:

do i = 1, base_dim
x = log( real(n + 1) / real(base(i)) )
npow(i) = floor(x)
end do

Then for each unique npow within npow(:) create a gather/scatter index array. Using the scatter/gather index array, gather the base(i)'s into a new vector, say base_temp(:). I will assume the representative q(i)'s are initialized to 0, if so, initialize q_temp(:) to 0's, if not use scattery/gather index array to populate q_temp.

q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp))

Where n_temp is the number of matching npow's

At this point, base_temp(1:n_temp) and q_temp(1:n_temp) are same size, same npow'ed, compact vectors. Then using the current unique npow (which you saved in npow_selected):

do i=1,n_temp ! base_temp(1:n_temp) and q_temp(1:n_temp)
  sk = n
  do j  = npow_selected, 0, -1 ! npow the same for all in base_temp(:)
    bk = base_temp(i)**j
    nk = floor( real(sk) / real(bk) )
    sk = sk - nk * bk
    q_temp(i) = q_temp(i) + real(nk) / real(bk * base_temp(i))
  end do
end do

The above loop should vectorize. I will let you fill in the nitty-gritty of performing the gather and initializing the npow_selected.

After the above loop, the q_temp(1:n_temp) array is scattered via the gather/scatter index array back into q:

q(scattergather(1:n_temp)) = q_temp(1:n_temp)

You should see full vectorization of both loops.

Jim Dempsey

mecej4 · ‎11-12-2014

Furthermore, the lines

    nk = floor( real(sk) / real(bk) )
    sk = sk - nk * bk

could be replaced by

    sk = mod(sk,  bk)

Correction added later: if args to mod() may be negative, use modulo(sk, bk) instead; see below.

jimdempseyatthecove · ‎11-12-2014

A better (untested) rewrite of my #4

do i = 1, base_dim
  x = log( real(n + 1) / real(base(i)) )
  npows(i) = floor(x)
end do
do npow = minval(npows), maxval(npows)
  n_temp = 0
  do i = 1, base_dim
    if(npows(i) == npow) then
      n_temp = n_temp + 1
      scattergather(n_temp) = i
    end if
  end do
  if(n_temp > 0) then
    base_temp(1:n_temp) = base(scattergather(1:n_temp))
    q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp))
    do i=1,n_temp ! width of base_temp(:) and q_temp(:)
      sk = n
      do j  = npow, 0, -1 ! npow the same for all in base_temp(:)
        bk = base_temp(i)**j
        nk = floor( real(sk) / real(bk) )
        sk = sk - nk * bk
        q_temp(i) = q_temp(i) + real(nk) / real(bk * base_temp(i))
      end do
    end do
    q(scattergather(1:n_temp)) = q_temp(1:n_temp)
  end if
end do

Jim Dempsey

unrue · ‎11-13-2014

jimdempseyatthecove wrote:

The above loop should vectorize. I will let you fill in the nitty-gritty of performing the gather and initializing the npow_selected.

After the above loop, the q_temp(1:n_temp) array is scattered via the gather/scatter index array back into q:

q(scattergather(1:n_temp)) = q_temp(1:n_temp)

You should see full vectorization of both loops.

Jim Dempsey

Hi Jim, thanks for your great reply! I tried your implementation, but the compiler tell on the inner loop "do j = npow, 0, -1" remark: loop was not vectorized: existence of vector dependence.

The dependance are in the lines:

nk = floor( real(sk) / real(bk) )
sk = sk - nk * bk

I have not yet nested the results.

jimdempseyatthecove · ‎11-13-2014

Try this:

do i = 1, base_dim
  x = log( real(n + 1) / real(base(i)) )
  npows(i) = floor(x)
end do
do npow = minval(npows), maxval(npows)
  n_temp = 0
  do i = 1, base_dim
    if(npows(i) == npow) then
      n_temp = n_temp + 1
      scattergather(n_temp) = i
    end if
  end do
  if(n_temp > 0) then
    base_temp(1:n_temp) = base(scattergather(1:n_temp))
    q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp))
    do i=1,n_temp ! width of base_temp(:) and q_temp(:)
      sk = n
      q_temp_delta = 0
      do j  = npow, 0, -1 ! npow the same for all in base_temp(:)
        bk = base_temp(i)**j
        nk = floor( real(sk) / real(bk) )
        sk = sk - nk * bk
        q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp(i))
      end do
      q_temp(i) = q_temp(i) + q_temp_delta
    end do
    q(scattergather(1:n_temp)) = q_temp(1:n_temp)
  end if
end do

They are both equivalent. In both cases I do not see a vector dependency. The compiler may have been confused by something.

f this works, don't forget to try mecej4's sk= mod(sk,bk).

If the above reports vector dependency then try:

do i = 1, base_dim
  x = log( real(n + 1) / real(base(i)) )
  npows(i) = floor(x)
end do
do npow = minval(npows), maxval(npows)
  n_temp = 0
  do i = 1, base_dim
    if(npows(i) == npow) then
      n_temp = n_temp + 1
      scattergather(n_temp) = i
    end if
  end do
  if(n_temp > 0) then
    base_temp(1:n_temp) = base(scattergather(1:n_temp))
    q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp))
    do i=1,n_temp ! width of base_temp(:) and q_temp(:)
      base_temp_i = base_temp(i)
      sk = n
      q_temp_delta = 0
      do j  = npow, 0, -1 ! npow the same for all in base_temp(:)
        bk = base_temp_i**j
        nk = floor( real(sk) / real(bk) )
        sk = sk - nk * bk
        q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp_i)
      end do
      q_temp(i) = q_temp(i) + q_temp_delta
    end do
    q(scattergather(1:n_temp)) = q_temp(1:n_temp)
  end if
end do

Jim Dempsey

unrue · ‎11-13-2014

Hi Jim,

by using the first attempt (with q_temp), the compiler reports dependencies and the inner loop is not vectorized. In fact, the execution time is the same. The results are a bit different compared to original first version I reported. I suppose because there are many floating point accumulation and your version accumulates in a different order.

By using also base_temp_i, the execution time still is the same and the dependancies remains. The results are the same of first attempt.

By using sk = mod(sk, bk), the compiler vectorize the inner loop, the execution time perform very well (2x) but the results are totally wrong.

mecej4 · ‎11-13-2014

...but the results are totally wrong.

Is sk/bk going to be negative sometimes? If so, you may need to use the MODULO function instead of the MOD function. From the

Fortran documentation: "If a is of type real and p is not equal to zero, the value of MODULO(a,p) is a - FLOOR( a/ p) * p"

unrue · ‎11-14-2014

mecej4 wrote:

Quote:

...but the results are totally wrong.

Is sk/bk going to be negative sometimes? If so, you may need to use the MODULO function instead of the MOD function. From the

Fortran documentation: "If a is of type real and p is not equal to zero, the value of MODULO(a,p) is a - FLOOR( a/ p) * p"

I'm not sure, but also trying MODULO function the results are bad. The strange thing is I compared sk from the two original lines and from the MOD and MODULO function. They return the same values. I suspect the compiler optimization fails.

jimdempseyatthecove · ‎11-14-2014

>>By using also base_temp_i, the execution time still is the same and the dependancies remains. The results are the same of first attempt.

It is odd that it sees a dependency here, unless the intrinsic floor is non-vectorizable.

Can you make sk, bk and nk of type real, then remove the conversion functions and use

do i = 1, base_dim
  x = log( real(n + 1) / real(base(i)) )
  npows(i) = floor(x)
end do
do npow = minval(npows), maxval(npows)
  n_temp = 0
  do i = 1, base_dim
    if(npows(i) == npow) then
      n_temp = n_temp + 1
      scattergather(n_temp) = i
    end if
  end do
  if(n_temp > 0) then
    base_temp(1:n_temp) = base(scattergather(1:n_temp))
    q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp))
    do i=1,n_temp ! width of base_temp(:) and q_temp(:)
      base_temp_i = base_temp(i)
      sk = n
      q_temp_delta = 0
      do j  = npow, 0, -1 ! npow the same for all in base_temp(:)
        bk = aint(base_temp_i**j)
        nk = aint( sk / bk ) ! requires sk/bk is positive
        q_temp_delta = q_temp_delta + nk / (bk * base_temp_i)
      end do
      q_temp(i) = q_temp(i) + q_temp_delta
    end do
    q(scattergather(1:n_temp)) = q_temp(1:n_temp)
  end if
end do

Watch out for negative numbers

Jim

jimdempseyatthecove · ‎11-14-2014

additional comment about integers and floats doubles and vectorization.

It is important that the vectors, implicit as well as explicit, have the same number of lanes (variables within the lane). In the prior code, if your temporaries (bk, nk, sk) are integers, the default size is 4 bytes. If your array q (and q_temp) is REAL(8), then the compiler may have an issue differing vector widths between the implicit ones generated from your temporaries and the implicit ones generated from your defined variables.

It is unclear as to if this is a problem for your code, but it could if you use a compiler option to set the default real to REAL(8).

Jim Dempsey

mecej4 · ‎11-14-2014

Unrue wrote:
The strange thing is I compared sk from the two original lines and from the MOD and MODULO function. They return the same values. I suspect the compiler optimization fails.

Let us get all the evidence before we announce the verdict. You have not shown more than a small fragment of code, and we do not know the gamut of values that the variables in that fragment may take. Here is a small example to show the difference between mod() and modulo(). Consider making up a small, self-contained example built around the code fragment that you showed. As you have observed, the code with the modulo() could give you significant speed-up because of vectorization.

program whichMod
integer i,j,k1,k2

write(*,'(A,/)')'        i          N     mod(N,15)  modulo(N,15)'
do i=-5,5,2
   j=3*i-4
   k1=mod(j,15)
   k2=modulo(j,15)
   write(*,'(4I10)')i,j,k1,k2
end do
end program

unrue · ‎11-14-2014

jimdempseyatthecove wrote:

additional comment about integers and floats doubles and vectorization.

It is important that the vectors, implicit as well as explicit, have the same number of lanes (variables within the lane). In the prior code, if your temporaries (bk, nk, sk) are integers, the default size is 4 bytes. If your array q (and q_temp) is REAL(8), then the compiler may have an issue differing vector widths between the implicit ones generated from your temporaries and the implicit ones generated from your defined variables.

It is unclear as to if this is a problem for your code, but it could if you use a compiler option to set the default real to REAL(8).

Jim Dempsey

Hi Jim, thanks for your reply. I checked sk/bk and is never negative. Now all variables in the inner loop are real

I cannot substitute the two lines:

nk = floor( real(sk) / real(bk) )
sk = sk - nk * bk

with sk = MOD(sk, bk) function, because nk is used in the next line:

do j  = npow, 0, -1 ! npow the same for all in base_temp(:)
            bk = base_tmp_i**j
            nk = floor( sk / bk )
            sk = sk - nk * bk
            q_tmp_delta = q_tmp_delta + nk / (bk * base_tmp_i)
          end do

So I need to compute nk. This is the reason of wrong results of MOD and MUDULO functions. If I substitute nk with (sk/nk) in the last expression, the code is not vectorized.

In your last suggest:

do j  = npow, 0, -1 ! npow the same for all in base_temp(:)
        bk = aint(base_temp_i**j)
        nk = aint( sk / bk ) ! requires sk/bk is positive
        q_temp_delta = q_temp_delta + nk / (bk * base_temp_i)
      end do

sk is not updated, and adding the update (sk = sk - nk * bk) the code is not vectorized

unrue · ‎11-14-2014

mecej4 wrote:

Let us get all the evidence before we announce the verdict. You have not shown more than a small fragment of code, and we do not know the gamut of values that the variables in that fragment may take. Here is a small example to show the difference between mod() and modulo(). Consider making up a small, self-contained example built around the code fragment that you showed. As you have observed, the code with the modulo() could give you significant speed-up because of vectorization.
program whichMod
integer i,j,k1,k2

write(*,'(A,/)')'        i          N     mod(N,15)  modulo(N,15)'
do i=-5,5,2
   j=3*i-4
   k1=mod(j,15)
   k2=modulo(j,15)
   write(*,'(4I10)')i,j,k1,k2
end do
end program

Hi mecej4. I understand why MOD and MODULO functions gave me wrong results. Simply because variable nk was not calculated. See my post above.

jimdempseyatthecove · ‎11-14-2014

>>sk is not updated, and adding the update (sk = sk - nk * bk) the code is not vectorized

?? this is an iterative process, same number of iterations on each lane of the vector, each lane computing its own unique value. It should vectorize.

Lets help it along...

Try adding !DIR$ SIMD PRIVATE(sk, bk, nk, base_temp_i, q_temp_delta) to the do i = 1,n_temp loop

Jim Dempsey

unrue · ‎11-17-2014

jimdempseyatthecove wrote:

>>sk is not updated, and adding the update (sk = sk - nk * bk) the code is not vectorized

?? this is an iterative process, same number of iterations on each lane of the vector, each lane computing its own unique value. It should vectorize.

Lets help it along...

Try adding !DIR$ SIMD PRIVATE(sk, bk, nk, base_temp_i, q_temp_delta) to the do i = 1,n_temp loop

Jim Dempsey

Hi Jim,

By using yout suggest, the loop is now vectorized but the results are wrong. I tried:

!DIR$ SIMD PRIVATE(sk, bk, nk, base_tmp_i, q_tmp_delta)
          do j  = npow, 0, -1 ! npow the same for all in base_temp(:)
            bk = aint(base_tmp_i**j)
            nk = aint( sk / bk )
            sk = sk - nk * bk
            q_tmp_delta = q_tmp_delta + nk / (bk * base_tmp_i)
          end do

And:

!DIR$ SIMD PRIVATE(sk, bk, nk, base_tmp_i, q_tmp_delta)
          do j  = npow, 0, -1 ! npow the same for all in base_temp(:)
            bk = base_tmp_i**j
            nk = floor( sk / bk )
            sk = sk - nk * bk
            q_tmp_delta = q_tmp_delta + nk / (bk * base_tmp_i)
          end do

Disabling DIR SIMD, code is not vectorized and results are good. It seems the vectorization causes wrong results.

jimdempseyatthecove · ‎11-17-2014

Let's take another attempt at skinning the cat.

real function q_temp_delta(n, npow, base_temp_i)
!dir$ addributes vector :: q_temp_delta
  implicit none
  integer, intent(in) :: n,npow
  real, intent(in) :: base_temp_i
! local variables
  integer :: sk, bk, nk
  sk = n
  q_temp_delta = 0.0
  do j  = npow, 0, -1 ! npow the same for all in base_temp(:)
    bk = base_temp_i**j
    nk = floor( real(sk) / real(bk) )
    sk = sk - nk * bk
    q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp_i)
  end do
end function q_temp_delta
...

do i = 1, base_dim
  x = log( real(n + 1) / real(base(i)) )
  npows(i) = floor(x)
end do
do npow = minval(npows), maxval(npows)
  n_temp = 0
  do i = 1, base_dim
    if(npows(i) == npow) then
      n_temp = n_temp + 1
      scattergather(n_temp) = i
    end if
  end do
  if(n_temp > 0) then
    base_temp(1:n_temp) = base(scattergather(1:n_temp))
    q_temp(1:n_temp) = 0.0 ! or q(scattergather(1:n_temp))
    do i=1,n_temp ! width of base_temp(:) and q_temp(:)
      q_temp(i) = q_temp(i) + q_temp_delta(n, npow, base_temp(i))
    end do
    q(scattergather(1:n_temp)) = q_temp(1:n_temp)
  end if
end do

remove now unnecessary PRIVATE(...)

Jim Dempsey

unrue · ‎11-18-2014

Hi Jim,

many thanks for your great help, but the code still not working well (results are bad), and your last versione is not vectorized. The compiler still tell:

remark: loop was not vectorized: existence of vector dependence.
module_Quasirandom.f90(123): (col. 7) remark: vector dependence: assumed ANTI dependence between sk line 123 and sk line 123.
module_Quasirandom.f90(123): (col. 7) remark: vector dependence: assumed FLOW dependence between sk line 123 and sk line 123.
module_Quasirandom.f90(123): (col. 7) remark: vector dependence: assumed FLOW dependence between sk line 123 and sk line 122.
module_Quasirandom.f90(122): (col. 12) remark: vector dependence: assumed ANTI dependence between sk line 122 and sk line 123.

The involved lines are:

nk = floor( real(sk) / real(bk) )
sk = sk - nk * bk

Could it depends because I'm using OpenMP with 1 thread? The entire routine is under a parallel region, and some vector (like base, base_tmp, q_tmp, scatter_gather, npows, base_dim) are THREADPRIVATE.

jimdempseyatthecove · ‎11-18-2014

Do not use the any thread private variables inside the function I provided.

consider adding PURE attribute to function name and adding UNIFORM clause to !DIR$

PURE real function q_temp_delta(n, npow, base_temp_i)
!dir$ addributes vector:UNIFORM(n, npow) :: q_temp_delta
  implicit none
  integer, intent(in) :: n,npow
  real, intent(in) :: base_temp_i
! local variables
  integer :: sk, bk, nk
  sk = n
  q_temp_delta = 0.0
  do j  = npow, 0, -1 ! npow the same for all in base_temp(:)
    bk = base_temp_i**j
    nk = floor( real(sk) / real(bk) )
    sk = sk - nk * bk
    q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp_i)
  end do
end function q_temp_delta

Also, if floor continues to be an issue, replace it with your own floor equivalent code that does not have branch flow statements.

real_sk_over_bk = real(sk) / real(bk)
if(mod(real_sk_over_bk, 1.0) .ne. 0.0) real_sk_over_bk = real_sk_over_bk - 1.0
nk = real_sk_over_bk

Add the real variable real_sk_over_bk (change the name if you want), but do not rewrite the code.
The above code can generate and use the vector test and mask move instructions

Jim Dempsey

unrue · ‎11-18-2014

Now, I'm not using any thread variable. This is the compiler report (a part):

module_Quasirandom.f90(111): (col. 22) remark: FUNCTION WAS VECTORIZED.
module_Quasirandom.f90(111): (col. 22) remark: FUNCTION WAS VECTORIZED.
module_Quasirandom.f90(121): (col. 5) remark: loop was not vectorized: existence of vector dependence.
module_Quasirandom.f90(127): (col. 7) remark: vector dependence: assumed ANTI dependence between sk line 127 and sk line 127.
module_Quasirandom.f90(127): (col. 7) remark: vector dependence: assumed FLOW dependence between sk line 127 and sk line 127.
module_Quasirandom.f90(127): (col. 7) remark: vector dependence: assumed FLOW dependence between sk line 127 and sk line 124.
module_Quasirandom.f90(124): (col. 7) remark: vector dependence: assumed ANTI dependence between sk line 124 and sk line 127.

And results are still wrong. At line 111, there is q_tmp_delta function. At this point, I suspect a bug in my code in another section.