Re: Vectorization too easy to break and vectorization report is confusing

cfspc · ‎08-14-2009

Greetings,

I have been trying to use ICC for vectorization but have been having some serious problems.
Even in very simple situations the compiler does not vectorize code that, in principle, seems
straightforward to vectorize. In most cases the vectorization report produces mysterious and
(to me useless) remarks. Please see the test case below:

[cpp]#include <stdlib.h>

void initialize(float * A, float  * b, size_t Size) {
    for (size_t row_index = 0 ; row_index < Size ; ++row_index) {
        float * row = A + row_index * Size ;
        for (size_t col = 0 ; col < Size ; ++col) {
            row[col] = 1.0f / float(col + 1) ;
        }
        b[row_index] = row_index ;
    }
}

void matVec(float const * A, float const * x, float * b, size_t Size) {
    for (size_t row_index = 0 ; row_index < Size ; ++row_index) {
        float const * row = A + row_index * Size ;

        float row_accumulator = 0 ; 

// Vectorizes        
        for (size_t col = 0 ; col < Size ; ++col) {
            row_accumulator += row[col] * x[col] ;
        }

        b[row_index] = row_accumulator ;
    }
}

// The only difference from matVec is that row_index is an int.
void matVec2(float const * A, float const * x, float * b, size_t Size) {
    
    for (int row_index = 0 ; row_index < Size ; ++row_index) { 
        float const * row = A + row_index * Size ;

        float row_accumulator = 0 ; 

// Does not vectorize:
// remark: loop was not vectorized: dereference too complex.
// If I compile with -vec-report=3 I get a bunch of weird remarks
// regarding flow dependence between row_accumulator and itself.
// Does this have to do with some unrolling of the outer loop? 
        for (unsigned col = 0 ; col < Size ; ++col) {
            row_accumulator += row[col] * x[col] ;
        }

        b[row_index] = row_accumulator ;
    }
}

int main() {
    size_t const Size = 256 ;

    float * A, * b, * x ;

    // I wanted memory aligned to 16 bytes but I am not really
    // even getting to that.
    posix_memalign((void**)&A, 16, Size * Size * sizeof(float)) ;
    posix_memalign((void**)&b, 16, Size * Size * sizeof(float)) ;
    posix_memalign((void**)&x, 16, Size * Size * sizeof(float)) ;

    initialize(A, b, Size) ;
    matVec(A, x, b, Size) ;

    return 0 ;
}

[/cpp]

matVec vectorizes well. However matVec2 does not. The only difference between matVec and
matVec2 is the type used for the row_index. Looking at the vectorization report I get very cryptic
messages. I even get messages saying that the matVec loop was not vectorized ... then that it was
vectorized:

simple.cpp(60): (col. 5) remark: loop was not vectorized: not inner loop.
simple.cpp(60): (col. 5) remark: loop was not vectorized: unsupported data type.
simple.cpp(61): (col. 5) remark: loop was not vectorized: not inner loop.
simple.cpp(61): (col. 5) remark: loop was not vectorized: existence of vector dependence.
simple.cpp(61): (col. 5) remark: vector dependence: assumed ANTI dependence between (unknown) line 61 and (unknown) line 61.
simple.cpp(61): (col. 5) remark: vector dependence: assumed FLOW dependence between (unknown) line 61 and (unknown) line 61.
simple.cpp(61): (col. 5) remark: vector dependence: assumed FLOW dependence between (unknown) line 61 and (unknown) line 61.
simple.cpp(61): (col. 5) remark: vector dependence: assumed ANTI dependence between (unknown) line 61 and (unknown) line 61.
simple.cpp(61): (col. 5) remark: vector dependence: assumed ANTI dependence between (unknown) line 61 and (unknown) line 61.
simple.cpp(61): (col. 5) remark: vector dependence: assumed FLOW dependence between (unknown) line 61 and (unknown) line 61.
simple.cpp(61): (col. 5) remark: vector dependence: assumed FLOW dependence between (unknown) line 61 and (unknown) line 61.
simple.cpp(61): (col. 5) remark: vector dependence: assumed ANTI dependence between (unknown) line 61 and (unknown) line 61.
simple.cpp(61): (col. 5) remark: loop was not vectorized: not inner loop.
simple.cpp(61): (col. 5) remark: LOOP WAS VECTORIZED.
simple.cpp(14): (col. 5) remark: loop was not vectorized: not inner loop.
simple.cpp(20): (col. 9) remark: loop was not vectorized: existence of vector dependence.
simple.cpp(21): (col. 13) remark: vector dependence: assumed ANTI dependence between row_accumulator line 21 and row_accumulator line 21.
simple.cpp(21): (col. 13) remark: vector dependence: assumed FLOW dependence between row_accumulator line 21 and row_accumulator line 21.
simple.cpp(21): (col. 13) remark: vector dependence: assumed FLOW dependence between row_accumulator line 21 and row_accumulator line 21.
simple.cpp(21): (col. 13) remark: vector dependence: assumed ANTI dependence between row_accumulator line 21 and row_accumulator line 21.
simple.cpp(21): (col. 13) remark: vector dependence: assumed ANTI dependence between row_accumulator line 21 and row_accumulator line 21.
simple.cpp(21): (col. 13) remark: vector dependence: assumed FLOW dependence between row_accumulator line 21 and row_accumulator line 21.
simple.cpp(21): (col. 13) remark: vector dependence: assumed FLOW dependence between row_accumulator line 21 and row_accumulator line 21.
simple.cpp(21): (col. 13) remark: vector dependence: assumed ANTI dependence between row_accumulator line 21 and row_accumulator line 21.
simple.cpp(14): (col. 5) remark: loop was not vectorized: not inner loop.
simple.cpp(20): (col. 9) remark: LOOP WAS VECTORIZED.
simple.cpp(31): (col. 5) remark: loop was not vectorized: not inner loop.
simple.cpp(41): (col. 9) remark: loop was not vectorized: existence of vector dependence.
simple.cpp(42): (col. 13) remark: vector dependence: assumed ANTI dependence between row_accumulator line 42 and row_accumulator line 42.
simple.cpp(42): (col. 13) remark: vector dependence: assumed FLOW dependence between row_accumulator line 42 and row_accumulator line 42.
simple.cpp(42): (col. 13) remark: vector dependence: assumed FLOW dependence between row_accumulator line 42 and row_accumulator line 42.
simple.cpp(42): (col. 13) remark: vector dependence: assumed ANTI dependence between row_accumulator line 42 and row_accumulator line 42.
simple.cpp(42): (col. 13) remark: vector dependence: assumed ANTI dependence between row_accumulator line 42 and row_accumulator line 42.
simple.cpp(42): (col. 13) remark: vector dependence: assumed FLOW dependence between row_accumulator line 42 and row_accumulator line 42.
simple.cpp(42): (col. 13) remark: vector dependence: assumed FLOW dependence between row_accumulator line 42 and row_accumulator line 42.
simple.cpp(42): (col. 13) remark: vector dependence: assumed ANTI dependence between row_accumulator line 42 and row_accumulator line 42.
simple.cpp(31): (col. 5) remark: loop was not vectorized: not inner loop.
simple.cpp(42): (col. 32) remark: loop was not vectorized: dereference too complex.
simple.cpp(4): (col. 5) remark: loop was not vectorized: not inner loop.
simple.cpp(6): (col. 9) remark: loop was not vectorized: unsupported data type.

These messages don't tell me anything about the "row_index int".

I am compiling this code with:
icpc (ICC) 11.1 20090630
Copyright (C) 1985-2009 Intel Corporation. All rights reserved.

icpc simple.cpp -O3 -xW -fp-model fast -o simple -vec-report=3
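
A note on the allocation calls in main(): POSIX declares posix_memalign(void **memptr, size_t alignment, size_t size), so the alignment comes before the byte count. Below is a minimal sketch of a 16-byte-aligned allocation. The helper name allocAlignedFloats is hypothetical and not part of the test case.

```cpp
#include <stdlib.h>   // posix_memalign, free (POSIX)
#include <stddef.h>   // size_t

// Hypothetical helper: returns storage for `count` floats aligned to
// 16 bytes, or NULL if the allocation fails. Release it with free().
float * allocAlignedFloats(size_t count) {
    void * p = NULL ;
    // Argument order is (memptr, alignment, size in bytes).
    if (posix_memalign(&p, 16, count * sizeof(float)) != 0) {
        return NULL ;
    }
    return static_cast<float *>(p) ;
}
```

For the test case this would be called as allocAlignedFloats(Size * Size) for A and allocAlignedFloats(Size) for b and x.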

TimP · ‎08-14-2009

For the loop with an index of type size_t, it's possible there are 2 versions generated, depending on what is seen at run time. It would be better to make the loop count comparison with a local variable which isn't subject to being changed through aliases, and also better to use a loop count variable of data type int, where that is possible.
My personal preference would be to compile with -fno-inline-functions so as to clear up as much vectorization as possible before dealing with in-lining.
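
One way to read the two suggestions above, as a sketch only: copy the trip count into a local int and use int induction variables. The function name matVecLocalBound is hypothetical, the code assumes Size fits in an int, and whether a given compiler version vectorizes this form is not guaranteed.

```cpp
#include <cstddef>   // size_t

// Hypothetical variant of matVec, assuming Size fits in an int.
void matVecLocalBound(float const * A, float const * x, float * b, size_t Size) {
    // Local copy of the loop bound: it cannot be modified through a pointer,
    // and it has the same type as the loop counters below.
    int const n = static_cast<int>(Size) ;

    for (int row_index = 0 ; row_index < n ; ++row_index) {
        float const * row = A + static_cast<size_t>(row_index) * Size ;

        float row_accumulator = 0 ;

        for (int col = 0 ; col < n ; ++col) {
            row_accumulator += row[col] * x[col] ;
        }

        b[row_index] = row_accumulator ;
    }
}
```

Size is already passed by value in the test case, so here the local copy mainly gives the bound the same int type as the counters. The aliasing concern matters more when the bound is read through a pointer, a reference, or a global.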

Hideki_I_Intel · ‎08-27-2009

1) The unexpected, confusing dependence message on the sum-reduction variable is the result of an optimizer bug. A bug report has been submitted. In the meantime, you can use
#pragma unroll_and_jam(0)
for (int row_index = 0; ....)
to get round that bug.

2) In general, address computation that involves 64-bit integers/pointers and 32-bit unsigned integers is rather difficult for the compiler to deal with. With respect to integral conversions, the language definition is more relaxed for the signed types (by saying implementation-defined when the value exceeds the range), and that difference can decide whether the code gets optimized or not. If you write the inner loop of matVec2() as in
for (int col = 0; ....)
row_accumulator += A[row_index * Size + col] * x[col]
and change the type of Size to "int", the compiler should be able to auto-vectorize it.
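
Put together, the two suggestions above might look like the sketch below. The function name matVec2Int is hypothetical, the pragma is guarded so that only the Intel compiler sees it, and whether the inner loop vectorizes still depends on the compiler version.

```cpp
// Hypothetical rewrite of matVec2 with Size and both loop counters as int.
void matVec2Int(float const * A, float const * x, float * b, int Size) {
    // Workaround from item 1, applied to the outer loop (Intel compiler only).
#if defined(__INTEL_COMPILER)
#pragma unroll_and_jam(0)
#endif
    for (int row_index = 0 ; row_index < Size ; ++row_index) {
        float row_accumulator = 0 ;

        // Index A directly instead of going through a separate row pointer.
        for (int col = 0 ; col < Size ; ++col) {
            row_accumulator += A[row_index * Size + col] * x[col] ;
        }

        b[row_index] = row_accumulator ;
    }
}
```

With int arithmetic, row_index * Size + col must stay below INT_MAX, so this form only suits matrices of modest size.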

We are continuously improving our analysis so that we can capture as much as the language definition allows.

srimks · ‎08-27-2009

Try using explicit pragmas (such as distribution, unroll_and_jam, etc.) rather than trusting completely what -O3 does. No doubt -O3 does some minimal auto-vectorization, but check the compilation process by keeping a log file, and afterwards check the objdump output to see how the SSE instructions behave. Sometimes the compiler generates structural dependencies that are not very convincing to a programmer, so it is best to check the objdump.

Out of curiosity, why are you using "-fp-model fast"? Could you try other "-fp-model" settings, especially "-fp-model precise", and see whether performance improves?

Is -xW also helping you in any way?

~BR

TimP · ‎08-27-2009

It was clear that the fp-model fast default was wanted, as vector sum reduction was expected. -fp-model precise explicitly disables that optimization. Also clearly, -xW or such had already been selected, perhaps by default, otherwise, the compiler would not have attempted vectorization. The examples posed don't require anything but default compiler flags, nor would pragmas be relevant. A problem, as already mentioned in 2 responses, was with off-beat selection of data types for the loop induction variable.
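
The reason a value-safe model rules out a vectorized sum is that floating-point addition is not associative, so splitting one accumulator into several lanes can change the result. The sketch below imitates a 4-lane reduction in scalar code. The function names and the input values are made up for illustration, and it assumes float arithmetic is evaluated in IEEE single precision.

```cpp
#include <cstddef>   // size_t

// Sum in source order, as a value-safe floating-point model requires.
float sumInOrder(float const * v, size_t n) {
    float acc = 0 ;
    for (size_t i = 0 ; i < n ; ++i) {
        acc += v[i] ;
    }
    return acc ;
}

// Sum with four partial accumulators, the way a 4-wide SSE reduction
// would, then combine the lanes. Assumes n is a multiple of 4.
float sumFourLanes(float const * v, size_t n) {
    float lane[4] = { 0, 0, 0, 0 } ;
    for (size_t i = 0 ; i < n ; i += 4) {
        for (size_t k = 0 ; k < 4 ; ++k) {
            lane[k] += v[i + k] ;
        }
    }
    return ((lane[0] + lane[1]) + lane[2]) + lane[3] ;
}
```

For the input {1e8f, 1, 1, 1, -1e8f, 1, 1, 1}, the in-order sum loses the first three ones when they are added to 1e8f and returns 3, while the four-lane sum returns 6.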

Om_S_Intel · ‎02-19-2016

The issue is resolved in Intel Composer XE 16.0.
