Intel® Fortran Compiler

ifort 14.0 not vectorizing IVDEP loop

okkebas
Beginner

I have the following loop:

!DIR$ IVDEP
DO 20 J = JBGN, JEND
X(IROW) = X(IROW) - L(J)*X(JL(J))
20 CONTINUE

When I compile it with ifort 13.1, the vectorization report shows the loop is vectorized; with ifort 14.0, however, the report says the loop cannot be vectorized. Both builds used the options -xHost -O2 -vec-report6.

What is the reason ifort 14.0 doesn't vectorize this loop?

Steven_L_Intel1
Employee

You'd need to show us enough code to run it through the compiler and show the problem. What are the complete messages from the vectorization report? Did you try a build with -guide to see if it had suggestions?

okkebas
Beginner

I haven't tried the -guide option yet. Still, it seems strange that ifort 13.1 vectorizes the loop while ifort 14.0 doesn't, everything else being the same.

The complete fragment:

SUBROUTINE DSLUI2(N, B, X, IL, JL, L, DINV, IU, JU, U)
!
! Apply the inverse of an incomplete LU factorization: forward solve
! with the unit lower triangle L, scale by the inverse diagonal DINV,
! then back solve with the unit upper triangle U.
!
IMPLICIT DOUBLE PRECISION(A-H,O-Z)
INTEGER N, IL(*), JL(*), IU(*), JU(*)
DOUBLE PRECISION B(N), X(N), L(*), DINV(N), U(*)
!
DO 10 I = 1, N
   X(I) = B(I)
10 CONTINUE
DO 30 IROW = 2, N
   JBGN = IL(IROW)
   JEND = IL(IROW+1) - 1
!DIR$ IVDEP
   DO 20 J = JBGN, JEND
      X(IROW) = X(IROW) - L(J)*X(JL(J))
20 CONTINUE
30 CONTINUE
!
DO 40 I = 1, N
   X(I) = X(I)*DINV(I)
40 CONTINUE
!
DO 60 ICOL = N, 2, -1
   JBGN = JU(ICOL)
   JEND = JU(ICOL+1) - 1
!DIR$ IVDEP
   DO 50 J = JBGN, JEND
      X(IU(J)) = X(IU(J)) - U(J)*X(ICOL)
50 CONTINUE
60 CONTINUE
!
RETURN
END SUBROUTINE DSLUI2

dslui2.f90(18): (col. 16) remark: loop was not vectorized: existence of vector dependence
dslui2.f90(19): (col. 16) remark: vector dependence: assumed ANTI dependence between x line 19 and x line 19
dslui2.f90(19): (col. 16) remark: vector dependence: assumed FLOW dependence between x line 19 and x line 19
dslui2.f90(19): (col. 16) remark: vector dependence: assumed FLOW dependence between x line 19 and x line 19
dslui2.f90(19): (col. 16) remark: vector dependence: assumed ANTI dependence between x line 19 and x line 19


TimP
Honored Contributor III

This seems more straightforward:

X(IROW) = X(IROW) - DOT_PRODUCT(L(JBGN:JEND), X(JL(JBGN:JEND)))

There's also the OpenMP SIMD reduction option, which is more verbose and not always as effective.
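A minimal sketch of that variant, reusing the JBGN/JEND bounds from the original loop (the scalar accumulator TMP is introduced here just for the example):

TMP = 0.0D0
!$OMP SIMD REDUCTION(+:TMP)
DO J = JBGN, JEND
   TMP = TMP + L(J)*X(JL(J))   ! X is only read inside the loop
END DO
X(IROW) = X(IROW) - TMP        ! like the DOT_PRODUCT form, this assumes
                               ! JL(JBGN:JEND) never contains IROW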

jimdempseyatthecove
Honored Contributor III

I suppose the compiler could not determine whether any value in JL(J) {i.e., in JL(JBGN:JEND)} equals IROW. If one does, then the loop has real dependencies.
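A hypothetical illustration (index values invented for the example): suppose IROW = 5 and the slice JL(JBGN:JEND) happens to contain 5. Then one trip of the loop does

! the J with JL(J) = 5 both reads and writes X(5):
X(5) = X(5) - L(J)*X(5)

so each iteration may read a value that another iteration has or has not yet written, which is exactly the assumed FLOW/ANTI dependence in the 14.0 report.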

Jim Dempsey

TimP
Honored Contributor III

jimdempseyatthecove wrote:

I suppose the compiler could not determine whether any value in JL(J) {i.e., in JL(JBGN:JEND)} equals IROW. If one does, then the loop has real dependencies.

Jim Dempsey

The OP might reasonably expect IVDEP to keep telling the compiler to ignore the assumed dependence, but it's easy enough to rewrite in standard Fortran.

okkebas
Beginner

Thanks for all the fast replies.

Actually, I didn't know about the intrinsic DOT_PRODUCT function. Besides cleaner code, does using DOT_PRODUCT improve performance? How about some of the other intrinsics, like TRANSPOSE, MATMUL, and PRODUCT; do they improve performance? I'm asking because I read that using some of the array operations in Fortran comes with a performance penalty.

Regarding omp simd: I thought about it, but I'm also looking into using coarrays (this subroutine is just part of a much larger, long-running program), and I read that I cannot combine OpenMP with coarrays.

I'm still wondering why ifort 13.1 vectorizes the code with the IVDEP directive while 14.0 doesn't.

Thanks again.

TimP
Honored Contributor III

dot_product nearly always performs as well as, or better than, an equivalent DO loop.

transpose should perform the same as an equivalent DO loop (storing with stride 1).  It might not be the best strategy in very large cases.  The compiler optimizes some cases where matmul and transpose are used together, so as to avoid actually moving data during transpose.

matmul is more complicated, since ifort for Xeon has the -opt-matmul option, which replaces MATMUL with an MKL function call. That's good when you benefit from MKL adding additional threads. Full single-threaded in-line optimization of MATMUL comes with -O3 -no-opt-matmul. A possible problem with MATMUL, particularly with -opt-matmul, is that in some situations a temporary array must be allocated to hold the result (for example, when MATMUL appears inside a larger expression), so there are cases where calling the MKL ?gemm routine directly will be better; a sketch of the direct call is below.
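A minimal sketch of that direct call, assuming double precision arrays with shapes A(M,K), B(K,N), and C(M,N) (these names are invented for the example). DGEMM computes C = ALPHA*A*B + BETA*C:

! Equivalent of C = MATMUL(A, B) as a direct BLAS call;
! ALPHA = 1 and BETA = 0, so the prior contents of C are ignored.
CALL DGEMM('N', 'N', M, N, K, 1.0D0, A, M, B, K, 0.0D0, C, M)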

PRODUCT again should perform the same as an equivalent DO loop, but (like DOT_PRODUCT) it avoids some pitfalls.

omp simd, without parallel, needn't invoke the OpenMP run-time library, so it should not conflict with coarrays. Needless to say, you may construct cases which haven't been tested.
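A minimal, self-contained sketch of the combination (program and variable names are invented for the example; the exact build flags depend on your compiler version, e.g. -coarray plus a SIMD-only OpenMP option such as -qopenmp-simd in later ifort releases):

PROGRAM CA_SIMD
   IMPLICIT NONE
   INTEGER, PARAMETER :: N = 1000
   DOUBLE PRECISION :: X(N)[*]   ! coarray: one copy per image
   DOUBLE PRECISION :: S
   INTEGER :: J
   X = DBLE(THIS_IMAGE())
   S = 0.0D0
!$OMP SIMD REDUCTION(+:S)        ! SIMD-only directive: no OpenMP threads
   DO J = 1, N
      S = S + X(J)
   END DO
   PRINT *, 'image', THIS_IMAGE(), 'sum =', S
END PROGRAM CA_SIMD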
