Intel® Fortran Compiler

Help for vectorization

unrue
Beginner

Dear Intel developers,

I need to vectorize the following code using Intel CS 2013:

subroutine mysubroutine(n, q)
    integer(long),                      intent(IN)  :: n
    real(stnd),    dimension(base_dim), intent(OUT) :: q

    integer(long)                                   :: nk
    integer(long)                                   :: sk, bk
    integer(long)                                   :: npow

    real(stnd)                                         :: x
    integer(long)                                   :: i, j

    q  = 0.0
    nk = 0
    do i = 1, base_dim

       x = logarithm( real(n + 1), real(base(i)) )
       npow = floor(x)

       sk = n
       do j  = npow, 0, -1
          bk = base(i)**j
          nk = floor( real(sk) / real(bk) )
          sk = sk - nk * bk

          q(i) = q(i) + real(nk) / real(bk * base(i))

       end do

    end do

end subroutine mysubroutine

The compiler reports ANTI and FLOW dependences on sk, and ANTI and FLOW dependences on q.

Could you help me vectorize the inner loop? In particular, I have no idea how to resolve the sk dependence. Thanks in advance.

 

 

6 Replies
TimP
Honored Contributor III

sk is written with a sequential dependence: its value in the next iteration depends on its value in the current one.

The same applies to q(i).
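
As a minimal sketch of one way around the sk dependence: the value nk computed at step j is simply digit j of n written in base b, so it can be obtained directly as mod(n / b**j, b) with no running remainder. The kinds long and stnd and both function names below are assumptions made for illustration, since the original module is not shown, and the sketch assumes npow is large enough that b**(npow+1) > n. The sum into q is still a reduction, which a compiler may vectorize only if it is allowed to reassociate floating-point additions.

```fortran
module digit_sketch
  implicit none
  ! Assumed kinds; the original definitions of long and stnd are not shown.
  integer, parameter :: long = selected_int_kind(18)
  integer, parameter :: stnd = kind(1.0d0)
contains

  ! Formulation as posted: sk carries the running remainder between iterations.
  function q_carried(n, b, npow) result(q)
    integer(long), intent(in) :: n, b
    integer,       intent(in) :: npow
    real(stnd)    :: q
    integer(long) :: sk, bk, nk
    integer       :: j
    q  = 0.0_stnd
    sk = n
    do j = npow, 0, -1
      bk = b**j
      nk = sk / bk                 ! integer division already truncates
      sk = sk - nk * bk            ! next iteration needs this value
      q  = q + real(nk, stnd) / real(bk * b, stnd)
    end do
  end function q_carried

  ! Same sum, but each digit is computed from n alone: sk is gone.
  function q_independent(n, b, npow) result(q)
    integer(long), intent(in) :: n, b
    integer,       intent(in) :: npow
    real(stnd)    :: q
    integer(long) :: bk, nk
    integer       :: j
    q = 0.0_stnd
    do j = npow, 0, -1
      bk = b**j
      nk = mod(n / bk, b)          ! digit j of n in base b
      q  = q + real(nk, stnd) / real(bk * b, stnd)
    end do
  end function q_independent

end module digit_sketch
```

For example, n = 10 in base 2 with npow = 3 has digits 1, 0, 1, 0 from high to low, and both functions sum 1/16 + 1/4.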
 

Depending on the parts you have removed, the compiler may prefer to optimize the outer loop. If base_dim were large enough, and you used consistently typed reals and integers, it might vectorize portions of the inner loop by interchanging the loops, so that a group of i values is processed in parallel SIMD lanes.

Your logarithm function would apparently need to be in a form that can be inlined in terms of standard math intrinsics.
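
As a sketch of what "in terms of standard math intrinsics" could look like: a base-b logarithm is log(x) / log(b), and an elemental function built only from the log intrinsic is something a compiler can inline and apply across a whole array of bases. The name log_base and the kind stnd are assumptions here, because the poster's actual logarithm routine is not shown. Taking floor of this ratio can be off by one near exact powers of the base because of rounding, so real code would need to guard against that.

```fortran
module log_sketch
  implicit none
  integer, parameter :: stnd = kind(1.0d0)   ! assumed kind
contains

  ! Hypothetical stand-in for logarithm(x, b): logarithm of x in base b,
  ! written only with the standard log intrinsic so it can be inlined.
  elemental function log_base(x, b) result(y)
    real(stnd), intent(in) :: x, b
    real(stnd) :: y
    y = log(x) / log(b)
  end function log_base

end module log_sketch
```

Because the function is elemental, an expression such as floor(log_base(real(n + 1, stnd), real(base, stnd))) evaluates npow for every entry of base in one array statement.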

jimdempseyatthecove
Honored Contributor III

Is base_dim large enough for you to use a parallel loop?

Jim Dempsey

TimP
Honored Contributor III

400 is large enough to justify vectorization, although it may be marginal on the MIC. 

Post an example that can be compiled, and possibly try a current compiler. The compiler I was trying appears to be distributing the outer loop inside the inner one so as to attempt that sort of vectorization.

I'd hope you were familiar enough with your algorithm to have your own ideas about how to interchange loops explicitly. 

jimdempseyatthecove
Honored Contributor III

Tim,

I do not see how he can get vectorization of the inner loop due to each lane of the vector potentially (likely) having different trip counts.

That said, if he restructured the inner loop (or added another level of loop nesting), he could potentially keep all lanes of the vector running, provided j is allowed to run into negative values and, whenever j is negative, the restructured loop contributes 0.0 to the summation (with no change to the control flow of the code).

This may be too hard to figure out, and it relies on the compiler making sense of the source code. This may be one of those cases where you hand-write the code using intrinsics in C++.
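
A rough Fortran sketch of that idea, under stated assumptions: swap the loops so that the digit loop is outermost and every i performs the same number of steps, peeling digits from the low end. Once the remainder for a given i reaches zero, its digit is 0 and it adds 0.0 to q(i), so no branch is needed. The subroutine name, the explicit base argument, and the kinds are illustrative assumptions, and every base is assumed to be at least 2. The low-to-high summation order can change the last bits of q relative to the posted loop, and whether the compiler actually vectorizes the integer divide and mod should be checked in the vectorization report.

```fortran
module lanes_sketch
  implicit none
  integer, parameter :: long = selected_int_kind(18)   ! assumed kind
  integer, parameter :: stnd = kind(1.0d0)             ! assumed kind
contains

  subroutine radical_inverse_all(n, base, q)
    integer(long), intent(in)  :: n
    integer(long), intent(in)  :: base(:)               ! every base >= 2
    real(stnd),    intent(out) :: q(size(base))
    integer(long) :: left(size(base)), digit, t, bmin
    real(stnd)    :: weight(size(base))
    integer       :: i, j, nsteps

    ! The smallest base needs the most digits, so it bounds the trip count.
    bmin   = minval(base)
    t      = n
    nsteps = 0
    do while (t > 0)
      t      = t / bmin
      nsteps = nsteps + 1
    end do

    q      = 0.0_stnd
    left   = n
    weight = 1.0_stnd
    do j = 1, nsteps
      do i = 1, size(base)          ! no dependence between different i
        digit     = mod(left(i), base(i))
        left(i)   = left(i) / base(i)
        weight(i) = weight(i) / real(base(i), stnd)
        q(i)      = q(i) + real(digit, stnd) * weight(i)   ! adds 0.0 once left(i) is 0
      end do
    end do
  end subroutine radical_inverse_all

end module lanes_sketch
```

The per-base state (left and weight) is carried across the outer loop only, so each pass of the inner loop touches independent array elements.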

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

What is the average value of npow?

IOW, what is the average trip count of the DO J loop?

If this is large enough, then making DO I= parallel might be worthwhile.
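
If the outer loop were made parallel, a sketch with OpenMP might look like the following. It relies on each q(i) depending only on n and base(i), which holds for the loop as posted, and it needs the compiler's OpenMP option to take effect; without that option the directives are treated as comments. The routine name, the explicit base argument, and the kinds are illustrative assumptions. With a short inner loop, the threading overhead may outweigh the gain, so this would need to be measured.

```fortran
module par_sketch
  implicit none
  integer, parameter :: long = selected_int_kind(18)   ! assumed kind
  integer, parameter :: stnd = kind(1.0d0)             ! assumed kind
contains

  subroutine radical_inverse_par(n, base, q)
    integer(long), intent(in)  :: n
    integer(long), intent(in)  :: base(:)               ! every base >= 2
    real(stnd),    intent(out) :: q(size(base))
    integer(long) :: left, digit
    real(stnd)    :: weight, acc
    integer       :: i

    ! Each iteration owns its scalars and writes a distinct q(i).
    !$omp parallel do private(left, digit, weight, acc)
    do i = 1, size(base)
      left   = n
      weight = 1.0_stnd
      acc    = 0.0_stnd
      do while (left > 0)
        digit  = mod(left, base(i))     ! lowest remaining digit
        left   = left / base(i)
        weight = weight / real(base(i), stnd)
        acc    = acc + real(digit, stnd) * weight
      end do
      q(i) = acc
    end do
    !$omp end parallel do
  end subroutine radical_inverse_par

end module par_sketch
```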

If npow is statistically small, you might be able to pre-compute the results and store them in a multidimensional array. Then replace the computation with an index calculation and just fetch the correct result.
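
One possible reading of the pre-computation idea, as a sketch only: since the bases do not change between calls, the powers base(i)**j and the weights 1/base(i)**(j+1) can be tabulated once, leaving an integer divide, a mod, and a multiply-add per digit. The table layout, the bound max_pow, and all names are assumptions for illustration. max_pow must cover every digit of the largest n used while keeping base(i)**max_pow within the integer range.

```fortran
module table_sketch
  implicit none
  integer, parameter :: long = selected_int_kind(18)   ! assumed kind
  integer, parameter :: stnd = kind(1.0d0)             ! assumed kind
contains

  ! Fill pw(j,i) = base(i)**j and wt(j,i) = 1 / base(i)**(j+1) for j = 0..max_pow.
  subroutine build_tables(base, max_pow, pw, wt)
    integer(long), intent(in)  :: base(:)
    integer,       intent(in)  :: max_pow
    integer(long), intent(out) :: pw(0:max_pow, size(base))
    real(stnd),    intent(out) :: wt(0:max_pow, size(base))
    integer :: i, j
    do i = 1, size(base)
      do j = 0, max_pow
        pw(j, i) = base(i)**j
        wt(j, i) = 1.0_stnd / (real(base(i), stnd) * real(pw(j, i), stnd))
      end do
    end do
  end subroutine build_tables

  ! Per call: look up the power and weight instead of recomputing them.
  subroutine radical_inverse_tab(n, base, pw, wt, q)
    integer(long), intent(in)  :: n
    integer(long), intent(in)  :: base(:)
    integer(long), intent(in)  :: pw(0:, :)
    real(stnd),    intent(in)  :: wt(0:, :)
    real(stnd),    intent(out) :: q(size(base))
    integer :: i, j
    q = 0.0_stnd
    do i = 1, size(base)
      do j = 0, ubound(pw, 1)       ! fixed trip count; high digits are 0
        q(i) = q(i) + real(mod(n / pw(j, i), base(i)), stnd) * wt(j, i)
      end do
    end do
  end subroutine radical_inverse_tab

end module table_sketch
```

build_tables would be called once at start-up, and radical_inverse_tab for every n afterwards.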

Jim Dempsey

unrue
Beginner

The average value of npow is 8. 
