Intel® Fortran Compiler

Vectorization help

unrue
Beginner

Dear Intel users, I'm using ifort 13.1.3 under Linux.

I would like to vectorize the inner loop but I'm not sure that the inner loop is a good candidate. This is the loop:

do i = 1, base_dim

       x = log( real(n + 1) / real(base(i)) )
       npow = floor(x)

       sk = n
       do j  = npow, 0, -1
          bk = base(i)**j
          nk = floor( real(sk) / real(bk) )
          sk = sk - nk * bk

          q(i) = q(i) + real(nk) / real(bk * base(i))

       end do

    end do

n, base and base_dim are function inputs. The loop is not vectorized because of a dependency in the line sk = sk - nk*bk. I'm not sure this is a prefix-sum case, because each value depends on sk from the previous step. Do you have any ideas?
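Editor's sketch, not part of the original post: one way to see the dependency is to note that, with integer arithmetic, the value of sk entering step j is simply n reduced modulo base**(j+1), except at the first step where it is n itself. Under that assumption each nk can be computed from n alone. The module and function names below are hypothetical, n >= 0 and base >= 2 are assumed, and integer division stands in for floor(real(sk) / real(bk)).

```fortran
module digit_sketch
  implicit none
contains

  ! Reference: the sequential inner loop from the post.
  pure function q_sequential(n, b, npow) result(q)
    integer, intent(in) :: n, b, npow
    real :: q
    integer :: j, sk, bk, nk
    q = 0.0
    sk = n
    do j = npow, 0, -1
      bk = b**j
      nk = sk / bk              ! integer division, same as floor for sk >= 0
      sk = sk - nk * bk         ! loop-carried dependency
      q = q + real(nk) / real(bk * b)
    end do
  end function q_sequential

  ! Dependency-free variant: no value is carried between iterations
  ! except the sum into q.
  pure function q_independent(n, b, npow) result(q)
    integer, intent(in) :: n, b, npow
    real :: q
    integer :: j, bk, nk
    q = 0.0
    do j = npow, 0, -1
      bk = b**j
      if (j == npow) then
        nk = n / bk             ! first step sees sk = n
      else
        nk = mod(n / bk, b)     ! later steps see sk = mod(n, b**(j+1))
      end if
      q = q + real(nk) / real(bk * b)
    end do
  end function q_independent

end module digit_sketch
```

Both functions should agree for small inputs where base**j does not overflow a default integer; whether the compiler actually vectorizes the second form would still need to be checked in the vectorization report.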

Thanks in advance.

30 Replies
jimdempseyatthecove
Honored Contributor III

Unrue,

Most of these incorrect-value situations turn out to be programming errors, though that is clear only after you spot the error. Few of them are attributable to compiler bugs.

If this is not a section of code solving for a convergence, then you should have some idea of what benefit vectorizing this section of the program would bring. Knowing the potential benefit, you can decide whether it is worth your time to locate the problem section of code. Locating the problem may be difficult, and typically involves asserts and/or inserting trace code. Asserts cannot be used inside vector code because they force the compiler to produce non-vector code. That leaves trace logs: record values inside the loop, then assert outside the vector loop on inspection of the trace variables.

If this is a section of code solving for a convergence, be aware that any epsilon (the smallest difference in some converging variable) that was hand-tuned for the scalar code may be entirely improper for the vectorized code: the iteration may not converge, or may diverge wildly.

Jim Dempsey
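Editor's sketch, not part of the original reply: a small illustration of why vectorized code can return slightly different values, which is what makes a hand-tuned epsilon fragile. A vector reduction typically keeps several partial sums and combines them at the end, so the additions happen in a different order than in scalar code. The module and function names are hypothetical, and the four-lane layout is an assumption chosen only for illustration.

```fortran
module sum_order_sketch
  use, intrinsic :: iso_fortran_env, only: real32
  implicit none
contains

  ! One running total, as scalar code would compute it.
  pure function sum_scalar(a) result(s)
    real(real32), intent(in) :: a(:)
    real(real32) :: s
    integer :: i
    s = 0.0_real32
    do i = 1, size(a)
      s = s + a(i)
    end do
  end function sum_scalar

  ! Four partial totals combined at the end, mimicking a 4-lane
  ! vector reduction. Assumes size(a) is a multiple of 4.
  pure function sum_four_lanes(a) result(s)
    real(real32), intent(in) :: a(:)
    real(real32) :: s, lane(4)
    integer :: i
    lane = 0.0_real32
    do i = 1, size(a) - 3, 4
      lane = lane + a(i:i+3)
    end do
    s = (lane(1) + lane(2)) + (lane(3) + lane(4))
  end function sum_four_lanes

end module sum_order_sketch
```

For well-scaled data such as [1.0, 2.0, 3.0, 4.0] the two functions agree. For data mixing very large and very small magnitudes, such as [1.0e8, 1.0, -1.0e8, 1.0], they may round differently, and neither is guaranteed to match the exact sum of 2.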

unrue
Beginner

Hi Jim,

thanks for the reply. My code is probably very sensitive to small variations, and I should be careful about making any change to that piece of code, but that does not explain why the compiler isn't able to vectorize that loop. I also tried ifort 14, but nothing changed.

jimdempseyatthecove
Honored Contributor III

unrue,

As a diagnostic, even if the results are wrong, convert the do loop to forward iteration

do j  = 0, npow ! npow the same for all in base_temp(:)

The test is to see if it vectorizes. If it does, then try:

PURE real function q_temp_delta(n, npow, base_temp_i)
!dir$ attributes vector:UNIFORM(n, npow) :: q_temp_delta
  implicit none
  integer, intent(in) :: n,npow
  real, intent(in) :: base_temp_i
! local variables
  integer :: sk, bk, nk
  sk = n
  q_temp_delta = 0.0
  goto (100,101,102,103,...) n+1
  write(*,*) "not enough statements"
  return
1...bk = base_temp_i**j
    ...
103 bk = base_temp_i**j
    nk = floor( real(sk) / real(bk) )
    sk = sk - nk * bk
    q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp_i)
102 bk = base_temp_i**j
    nk = floor( real(sk) / real(bk) )
    sk = sk - nk * bk
    q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp_i)
101 bk = base_temp_i**j
    nk = floor( real(sk) / real(bk) )
    sk = sk - nk * bk
    q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp_i)
100 bk = base_temp_i**j
    nk = floor( real(sk) / real(bk) )
    sk = sk - nk * bk
    q_temp_delta = q_temp_delta + real(nk) / real(bk * base_temp_i)
end function q_temp_delta


Jim Dempsey

jimdempseyatthecove
Honored Contributor III

On the other hand, maybe you can first try:

PURE real function q_temp_delta(n, npow, base_temp_i)
!dir$ attributes vector:UNIFORM(n, npow) :: q_temp_delta
  implicit none
  integer, intent(in) :: n,npow
  real, intent(in) :: base_temp_i
! local variables
  integer :: j   ! loop index, required under implicit none
! **************** REAL **************
  real :: sk, bk, nk
  sk = n
  q_temp_delta = 0.0
  do j  = npow, 0, -1 ! npow the same for all in base_temp(:)
    bk = int(base_temp_i**j)
    nk = floor( sk / bk )
    sk = sk - nk * bk
    q_temp_delta = q_temp_delta + nk / (bk * base_temp_i)
  end do
end function q_temp_delta

Jim Dempsey

unrue
Beginner

Hi Jim,

I tried your new version of the q_temp_delta function and, separately, a forward loop. The compiler vectorizes neither. At this point, even supposing the code were vectorized, the results are wrong, so I can't use it. Maybe the best way is to rethink the entire initial subroutine from the beginning.

TimP
Honored Contributor III

A partial vector speedup may be achieved in prefix sum by unrolling the loop and making the new results dependent e.g. on every 4th previous result rather than each depending on the immediately previous one, e.g.

#ifndef __MIC__
          do i = 1, n-3, 4
            b(i)   = sumr + a(i)
            b(i+1) = sumr + sum(a(i:i+1))
            b(i+2) = sumr + sum(a(i:i+2))
            sumr   = sumr + sum(a(i:i+3))
            b(i+3) = sumr
          enddo
#endif

This takes advantage of the pipelining in a single non-vector thread.  It might be considered more efficient than a parallel method which requires many cores, even if the latter does achieve a speedup.

Perhaps you could adapt this strategy.  It may be less useful for your case, since division doesn't pipeline well.

I didn't see in this thread whether you are looking for performance improvement or for a report of full vectorization at possible expense of performance.
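Editor's sketch, not part of the original reply: the unrolled prefix sum above only covers complete groups of four elements. A self-contained version might add a remainder loop and a plain sequential routine for comparison. The module and subroutine names are hypothetical, and sumr is assumed to start at zero.

```fortran
module prefix_sketch
  implicit none
contains

  ! Plain running sum: every b(i) depends on b(i-1).
  pure subroutine prefix_sequential(a, b)
    real, intent(in)  :: a(:)
    real, intent(out) :: b(size(a))
    real :: sumr
    integer :: i
    sumr = 0.0
    do i = 1, size(a)
      sumr = sumr + a(i)
      b(i) = sumr
    end do
  end subroutine prefix_sequential

  ! Unrolled by 4: within a group, each result depends only on the
  ! running total carried in from the previous group.
  pure subroutine prefix_blocked(a, b)
    real, intent(in)  :: a(:)
    real, intent(out) :: b(size(a))
    real :: sumr
    integer :: i, n
    n = size(a)
    sumr = 0.0
    do i = 1, n - 3, 4
      b(i)   = sumr + a(i)
      b(i+1) = sumr + sum(a(i:i+1))
      b(i+2) = sumr + sum(a(i:i+2))
      sumr   = sumr + sum(a(i:i+3))
      b(i+3) = sumr
    end do
    ! Elements left over when n is not a multiple of 4.
    do i = n - mod(n, 4) + 1, n
      sumr = sumr + a(i)
      b(i) = sumr
    end do
  end subroutine prefix_blocked

end module prefix_sketch
```

Because the blocked form adds the elements in a different grouping, its results can differ from the sequential ones in the last bits for general floating-point data.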

jimdempseyatthecove
Honored Contributor III

This might be worth trying

do i = 1, base_dim
  x = log( real(n + 1) / real(base(i)) )
  npow = floor(x)

  do j=0,npow
    bk(j) = base(i)**j
  end do

  sk = n
  do j  = npow, 0, -1
    nk(j) = floor( real(sk) / real(bk(j)) )
    sk = sk - nk(j) * bk(j)
  end do

  do j = 0, npow
    q(i) = q(i) + real(nk(j)) / real(bk(j) * base(i))
  end do
end do

Jim Dempsey
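Editor's sketch, not part of the original reply: the split loop above needs work arrays bk and nk indexed from 0 to npow, which the post does not declare. A self-contained version might look like the following, where the module name, subroutine name and the choice to size the arrays once from the smallest base are assumptions. It also assumes n >= 0 and every base(i) >= 2.

```fortran
module fission_sketch
  implicit none
contains

  ! Adds the contribution of n to q(i) for every base(i), using the
  ! three-loop split from the post.
  subroutine accumulate_q(n, base, q)
    integer, intent(in)    :: n
    integer, intent(in)    :: base(:)
    real,    intent(inout) :: q(size(base))
    integer, allocatable   :: bk(:), nk(:)
    integer :: i, j, npow, npow_max, sk

    if (size(base) == 0) return
    ! The smallest base gives the largest npow; size the arrays once.
    npow_max = max(0, floor(log(real(n + 1) / real(minval(base)))))
    allocate(bk(0:npow_max), nk(0:npow_max))

    do i = 1, size(base)
      npow = floor(log(real(n + 1) / real(base(i))))

      ! Loop 1: powers of the base, no loop-carried dependency.
      do j = 0, npow
        bk(j) = base(i)**j
      end do

      ! Loop 2: the only loop carrying the sk recurrence.
      sk = n
      do j = npow, 0, -1
        nk(j) = floor(real(sk) / real(bk(j)))
        sk = sk - nk(j) * bk(j)
      end do

      ! Loop 3: a sum reduction into q(i).
      do j = 0, npow
        q(i) = q(i) + real(nk(j)) / real(bk(j) * base(i))
      end do
    end do
  end subroutine accumulate_q

end module fission_sketch
```

The idea is that loops 1 and 3 are candidates for vectorization while loop 2 stays scalar; whether that pays off depends on how large npow is in practice.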

TimP
Honored Contributor III

That loop on lines 5-7 of Jim's code (do j=0,npow computing bk(j) = base(i)**j) is a favorite of vectorizing compiler users from 25 years ago.  If it vectorizes, it is effectively performing so many redundant calculations, not taking into account the simple recursive relationship bk(j) = bk(j-1) * base(i), that it will take far longer than a simple non-vector equivalent.

Jim's strategy to split the divisions into vectorizable loops may be worth trying.  If the 2 division loops can be combined, using the no-prec-div reciprocation trick, it could save some time. 

If the compiler doesn't recognize the calculation of sk as a vector sum reduction, it might do so with the loop reversed, possibly using the omp simd reduction directive, or, if worst comes to worst, splitting out sk=n-dot_product(nk,bk).
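Editor's sketch, not part of the original reply: the two fallbacks mentioned above could be written roughly as follows. The first function marks the accumulation as a sum reduction with the OpenMP simd directive, which is treated as a comment unless OpenMP SIMD support is enabled at compile time. The second uses the identity that the final sk equals n minus the dot product of nk and bk. Module and function names are hypothetical; nk and bk are assumed to have the same size.

```fortran
module reduction_sketch
  implicit none
contains

  ! Sum of nk(j) / (bk(j) * b), flagged as a SIMD sum reduction.
  function q_increment(nk, bk, b) result(acc)
    integer, intent(in) :: nk(:), bk(:)
    integer, intent(in) :: b
    real :: acc
    integer :: j
    acc = 0.0
    !$omp simd reduction(+:acc)
    do j = 1, size(nk)
      acc = acc + real(nk(j)) / real(bk(j) * b)
    end do
  end function q_increment

  ! Remainder left in sk after all steps, without the recurrence.
  pure function remainder_after(n, nk, bk) result(sk)
    integer, intent(in) :: n, nk(:), bk(:)
    integer :: sk
    sk = n - dot_product(nk, bk)
  end function remainder_after

end module reduction_sketch
```

For example, with n = 10, b = 2, nk = [0, 1, 0, 1] and bk = [1, 2, 4, 8], the increment is 1/4 + 1/16 and the remainder is zero.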

jimdempseyatthecove
Honored Contributor III

To borrow from Shakespeare's play "Much Ado About Nothing"...

It was never stated what base_dim was, nor what the range of npow would be. If these numbers are always small, vectorizing the code will be counterproductive. We just hope that all our effort was not wasted.

Jim Dempsey

unrue
Beginner

Dear Intel developers, thanks for your help. If I remember correctly, npow was about 1000.

Unfortunately I no longer work on that code (I posted one year ago), so I can't verify your suggestions.
