Help for vectorization

unrue · ‎06-17-2014

Dear Intel developers,

I need to vectorize the following code using Intel CS 2013:

subroutine mysubroutine(n, q)
    integer(long),                      intent(IN)  :: n
    real(stnd),    dimension(base_dim), intent(OUT) :: q

    integer(long)                                   :: nk
    integer(long)                                   :: sk, bk
    integer(long)                                   :: npow

    real(stnd)                                         :: x
    integer(long)                                   :: i, j

    q  = 0.0
    nk = 0
    do i = 1, base_dim

       x = logarithm( real(n + 1), real(base(i)) )
       npow = floor(x)

       sk = n
       do j  = npow, 0, -1
          bk = base(i)**j
          nk = floor( real(sk) / real(bk) )
          sk = sk - nk * bk

          q(i) = q(i) + real(nk) / real(bk * base(i))

       end do

    end do

end subroutine mysubroutine

The compiler reports ANTI and FLOW dependences on sk, and ANTI and FLOW dependences on q.

Could you help me vectorize the inner loop? In particular, I have no idea how to resolve the sk dependence. Thanks in advance.

TimP · ‎06-17-2014

sk is written with sequential dependence (the value for the next iteration depends on the current one).

The same applies to q(i), which accumulates a sum across the iterations of the inner loop.
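One possible way to break the sk recurrence, offered as a sketch rather than a tested fix for the original code: for non-negative integers the value nk at power j is simply the base-b digit of n, which can be computed directly as mod(n / b**j, b) without carrying sk from one iteration to the next. The program below compares both forms. The values of n and b, the use of integer division in place of floor(real/real), and the integer search for npow are assumptions made for illustration.

```fortran
program digit_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1234, b = 3      ! hypothetical test values
  integer  :: j, npow, sk, nk, bk
  real(dp) :: q_seq, q_ind

  ! Largest power with b**npow <= n, found with integer arithmetic.
  npow = 0
  do while (b**(npow + 1) <= n)
     npow = npow + 1
  end do

  ! Original form: sk is carried from one iteration to the next.
  sk = n
  q_seq = 0.0_dp
  do j = npow, 0, -1
     bk = b**j
     nk = sk / bk
     sk = sk - nk * bk
     q_seq = q_seq + real(nk, dp) / real(bk * b, dp)
  end do

  ! Recurrence-free form: each digit depends only on n and j.
  ! What remains is a plain sum reduction into q_ind.
  q_ind = 0.0_dp
  do j = npow, 0, -1
     bk = b**j
     nk = mod(n / bk, b)
     q_ind = q_ind + real(nk, dp) / real(bk * b, dp)
  end do

  print *, 'sequential :', q_seq
  print *, 'independent:', q_ind
  if (abs(q_seq - q_ind) > 1.0e-12_dp) error stop 'results differ'
end program digit_demo
```

With sk gone, the only remaining dependence is the accumulation, which compilers can usually treat as a reduction.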

Depending on the parts you have removed, the compiler may prefer to optimize the outer loop. If base_dim were large enough, and you used consistently typed reals and integers, it might vectorize portions of the inner loop by interchanging the loops so that a group of i values is processed in parallel SIMD lanes.

Your logarithm function would apparently need to be in a form that can be written inline in terms of standard math intrinsics.
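As an illustration of the inline form, here is a minimal sketch that assumes logarithm(x, b) returns the base-b logarithm of x (the thread does not show its definition). Written with the log intrinsic, npow can be computed for all bases in one elemental array expression, outside the digit loop. The array contents and the value of n are hypothetical.

```fortran
program inline_log_demo
  implicit none
  integer, parameter :: base_dim = 4                    ! hypothetical size
  integer, parameter :: base(base_dim) = [2, 3, 5, 7]   ! hypothetical bases
  integer :: npow(base_dim)
  integer :: n

  n = 1000

  ! log_b(x) = log(x) / log(b), using only the standard intrinsic.
  ! The expression is elemental over base, so it needs no call to a
  ! user-defined function.
  npow = floor(log(real(n + 1)) / log(real(base)))

  print *, npow
end program inline_log_demo
```

Note that floor of a floating-point logarithm can be off by one when n + 1 is close to an exact power of the base, so an integer check may be worth adding.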

jimdempseyatthecove · ‎06-17-2014

Is base_dim large enough for you to use a parallel loop?

Jim Dempsey

TimP · ‎06-17-2014

400 is large enough to justify vectorization, although it may be marginal on the MIC.

Post an example that can be compiled, and possibly try a current compiler. The compiler I was trying appears to distribute the outer loop inside the inner one so as to attempt that sort of vectorization.

I'd hope you were familiar enough with your algorithm to have your own ideas about how to interchange loops explicitly.

jimdempseyatthecove · ‎06-17-2014

Tim,

I do not see how he can get vectorization of the inner loop due to each lane of the vector potentially (likely) having different trip counts.

That said, if he convoluted the inner loop (or added additional loop nesting), he could potentially run all lanes of the vector, provided that j can run into negative values and that, when j is negative, the convolution presents a 0.0 to the summation (and does so with no flow changes in the code).

This may be too hard to figure out, and it will rely on the compiler to make sense of the source code. This may be one of the cases where you hand write the code using intrinsics in C++.
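A rough Fortran sketch of the padded-loop idea, under several assumptions: the loops are interchanged so that i is innermost, every i runs the same number of steps (kmax + 1), j is allowed to go negative, and merge supplies 0.0 for those steps instead of a branch. The digit is computed as mod(n / bk, base(i)) rather than through sk, which is a substitution made here to keep the lanes independent. All values are hypothetical, and whether a given compiler vectorizes the inner loop is not guaranteed.

```fortran
program padded_loop_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: base_dim = 4                    ! hypothetical size
  integer, parameter :: base(base_dim) = [2, 3, 5, 7]   ! hypothetical bases
  integer, parameter :: n = 1000
  integer  :: npow(base_dim), i, k, j, jj, bk, nk, kmax
  real(dp) :: q(base_dim), q_ref(base_dim)

  ! Largest power with base(i)**npow(i) <= n, using integers only.
  do i = 1, base_dim
     npow(i) = 0
     do while (base(i)**(npow(i) + 1) <= n)
        npow(i) = npow(i) + 1
     end do
  end do
  kmax = maxval(npow)

  ! Padded form: same trip count for every i, no branch in the body.
  q = 0.0_dp
  do k = 0, kmax
     do i = 1, base_dim               ! candidate SIMD loop
        j  = npow(i) - k              ! may become negative
        jj = max(j, 0)                ! keep the exponent valid
        bk = base(i)**jj
        nk = mod(n / bk, base(i))
        q(i) = q(i) + merge(real(nk, dp) / real(bk * base(i), dp), &
                            0.0_dp, j >= 0)
     end do
  end do

  ! Reference: variable trip count per i, as in the original nest.
  q_ref = 0.0_dp
  do i = 1, base_dim
     do j = npow(i), 0, -1
        bk = base(i)**j
        nk = mod(n / bk, base(i))
        q_ref(i) = q_ref(i) + real(nk, dp) / real(bk * base(i), dp)
     end do
  end do

  print *, q
  if (any(abs(q - q_ref) > 1.0e-12_dp)) error stop 'results differ'
end program padded_loop_demo
```

The cost of this layout is wasted work in lanes whose npow is smaller than kmax, so it pays off only when the trip counts are similar across i.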

Jim Dempsey

jimdempseyatthecove · ‎06-17-2014

What is the average value of npow?

IOW, what is the average trip count of the DO J loop?

If this is large enough, then making DO I= parallel might be worthwhile.

If npow is statistically small, you might be able to pre-compute the results and store into a multi-dimensioned array. Then replace the computation with an index calculation and just fetch the correct result.
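One possible reading of the table approach, sketched under assumptions: since the powers base(i)**j and their reciprocals depend only on base and j, they can be stored once in two-dimensional arrays and reused for every n. The names pw, rinv and maxpow are hypothetical, and the sketch requires n < base(i)**(maxpow + 1) for every base so that all digits fit in the table. Digits above the highest power are zero, so the lookup loop has a fixed trip count.

```fortran
program table_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: base_dim = 4, maxpow = 10       ! hypothetical sizes
  integer, parameter :: base(base_dim) = [2, 3, 5, 7]   ! hypothetical bases
  integer, parameter :: n = 1000
  integer  :: pw(0:maxpow, base_dim)
  real(dp) :: rinv(0:maxpow, base_dim)
  real(dp) :: q(base_dim), q_ref(base_dim)
  integer  :: i, j, sk, nk

  ! Build the tables once: pw = base**j, rinv = 1 / base**(j+1).
  do i = 1, base_dim
     do j = 0, maxpow
        pw(j, i)   = base(i)**j
        rinv(j, i) = 1.0_dp / (real(pw(j, i), dp) * real(base(i), dp))
     end do
  end do

  ! Lookup form: no powers or divisions by real values in the loop.
  q = 0.0_dp
  do i = 1, base_dim
     do j = 0, maxpow
        q(i) = q(i) + real(mod(n / pw(j, i), base(i)), dp) * rinv(j, i)
     end do
  end do

  ! Reference: the sequential recurrence on sk.
  q_ref = 0.0_dp
  do i = 1, base_dim
     sk = n
     do j = maxpow, 0, -1
        nk = sk / pw(j, i)
        sk = sk - nk * pw(j, i)
        q_ref(i) = q_ref(i) + real(nk, dp) / real(base(i), dp)**(j + 1)
     end do
  end do

  print *, q
  if (any(abs(q - q_ref) > 1.0e-12_dp)) error stop 'results differ'
end program table_demo
```

The results are compared with a tolerance because multiplying by a stored reciprocal and summing in a different order can change the last bits.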

Jim Dempsey

unrue · ‎06-18-2014

The average value of npow is 8.