loop with indirect addressing does not vectorize

andrew_corrigan · ‎10-29-2010

I am using icpc version 11.1.074 on the x86-64 architecture in Linux, as well as icpc version 11.1.058 on the Mac. Compiling the loop below with "icpc -xHost -O3 -parallel -vec-report3 -c test.cpp" results in "test.cpp(8): (col. 7) remark: loop was not vectorized: dereference too complex." This loop is described as being vectorizable by the Intel Parallel Composer documentation for #pragma ivdep. Is there any way to get this loop to vectorize with icpc in Linux/Mac?

[cpp]void vectorization_test(int* a, int* b, int n)
{
    int j;
#pragma ivdep
    for (j=0; j<n; j++)
    {
        a[b[j]] = a[b[j]] + 1;
    }
}
[/cpp]

TimP · ‎10-29-2010

If -xHost implies an SSE4 architecture, you should be able to get vectorization (with only the arithmetic done by parallel instructions) without much performance loss by adding #pragma vector always (without it, my compiler says ".... seems inefficient.."). If a more practical case required #pragma ivdep as well as #pragma vector always, rather than accepting C99 restrict, I would be annoyed too. The dependency report when ivdep is omitted, regardless of restrict, is bogus.
If you have the job of getting meaningless vectorization reports on all loops regardless of value, this is a reasonable way. #pragma vector always says vectorize regardless of efficiency or possible degradation of exception handling.

andrew_corrigan · ‎11-01-2010

The code is now as below, compiled with "icpc -xsse4.2 -O3 -parallel -vec-report3 -c test.cpp". But I still get "test.cpp(9): (col. 7) remark: loop was not vectorized: dereference too complex." Any other ideas?

[cpp]void indirect_increment(int* a, int* b, int n)
{
    int j;
#pragma ivdep
#pragma vector always
    for (j=0; j<n; j++)
    {
        a[b[j]] = a[b[j]] + 1;
    }
}
[/cpp]

TimP · ‎11-01-2010

When I downloaded your example, the indentation of your pragmas didn't match what you show in the posts. Of course, that definitely shouldn't matter when -std=c99 is set, as I did when I was checking the effect of restrict (which didn't help). Just wondering how those pragmas might be ignored in your case. If you don't get a clue from
icc -E test.cpp > test.i
you might submit the test.i file on your premier.intel.com account.
Also, of course, it's generally risky using names which conflict with shell built-ins, although I don't see how it could matter in this context.

andrew_corrigan · ‎11-01-2010

When I add -E it just prints out the function, just as I posted it above, I'm not sure how that is supposed to help. I'll try out premier support, if I have access (not sure). Tim, you said in your first post that this should work, would you be willing to share a working example (code + compilation command + compiler version)? It's still not clear to me that this sort of vectorization is even supported, i.e., perhaps the architecture is incapable of vectorization in the presence of indirect addressing.

andrew_corrigan · ‎11-01-2010

I've also tried an equivalent Fortran subroutine. Same problem; the only difference is that the reason given for not vectorizing is "subscript too complex".

[fortran]subroutine vectorization_test(na,nb,a,b)
        integer :: na, nb
        real(kind=8) :: a(na), b(nb)

!dir$ ivdep
!dir$ vector always
        do i=1,nb
                a(b(i)) = a(b(i)) + 1
        end do

end subroutine[/fortran]

TimP · ‎11-01-2010

"subscript too complex" with ifort 11.1 appears to be the consequence of your declaration of b as a real type, where you may have meant integer. 32-bit ifort is using SSE2 instructions to "vectorize," ignoring my specification of sse4.1, which ought to save 1 instruction per loop count. Possibly, the compiler expects the extra instruction to be immaterial on account of micro-op fusion.

andrew_corrigan · ‎11-01-2010

thanks for catching the goofy typo! ok, that worked! But, going back to the original code, any idea how to get the C++ loop vectorized?

Andrew

jimdempseyatthecove · ‎11-02-2010

Andrew,

When I was experimenting with CEAN features of Composer XE I found that a for loop containing a single CEAN statement would not "port" the pragmas into (onto) the CEAN statement. However, by placing the #pragma's on the CEAN statement the vectorization was produced. With this in mind, try:

void indirect_increment(int* a, int* b, int n)
{
    int j;
    for (j=0; j<n; j++) {
#pragma ivdep
#pragma vector always
        a[b[j]] = a[b[j]] + 1;
    }
}

Jim Dempsey

andrew_corrigan · ‎11-02-2010

thanks for the tip, but now it just complains about vector dependencies

test.cpp(6): (col. 4) remark: loop was not vectorized: existence of vector dependence.
test.cpp(10): (col. 7) remark: vector dependence: assumed ANTI dependence between a line 10 and a line 10.
test.cpp(10): (col. 7) remark: vector dependence: assumed FLOW dependence between a line 10 and a line 10.
test.cpp(10): (col. 7) remark: vector dependence: assumed FLOW dependence between a line 10 and a line 10.
test.cpp(10): (col. 7) remark: vector dependence: assumed ANTI dependence between a line 10 and a line 10.

Mark_S_Intel1 · ‎11-02-2010

Andrew,

As the compiler does not know whether b is linear or not, i.e. whether a[b[j]], a[b[j+1]], a[b[j+2]], a[b[j+3]] are unit-stride memory accesses, the vectorizer has to generate strided load/store or gather/scatter. It does not seem that much can be done to improve this in the 11.1 compiler, but the next major release of the compiler (to be released very soon) does vectorize your original code. However, the following variation of your code does vectorize with the 11.1 compiler.

--mark

$ cat t2.cpp
const int N = 128;
int a[N];
int b[N];

void indirect_increment(int n)
{
    int j;
#pragma ivdep
#pragma vector always
    for (j=0; j<n; j++) {
        a[b[j]] = a[b[j]] + 1;
    }
}

$ icpc -V -c -vec-report2 t2.cpp
Intel C++ Intel 64 Compiler Professional for applications running on Intel 64, Version 11.1 Build 20100806 Package ID: l_cproc_p_11.1.073
Copyright (C) 1985-2010 Intel Corporation. All rights reserved.

t2.cpp(10): (col. 5) remark: LOOP WAS VECTORIZED.
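
For anyone wondering why the compiler is so cautious about a[b[j]]: if b contains repeated indices, a gather/add/scatter over several lanes does not give the same result as the scalar loop. The sketch below models this in plain C++. The names `increment_scalar` and `increment_simd_model` and the `width` parameter are made up for illustration; this is only a rough model of a vectorized loop, not what icpc actually generates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference: a[b[j]] = a[b[j]] + 1, one element at a time.
std::vector<int> increment_scalar(std::vector<int> a, const std::vector<int>& b)
{
    for (std::size_t j = 0; j < b.size(); ++j) {
        assert(b[j] >= 0 && static_cast<std::size_t>(b[j]) < a.size());
        a[b[j]] = a[b[j]] + 1;
    }
    return a;
}

// Hypothetical model of a `width`-lane vectorized loop:
// gather all lanes, add 1 to each, then scatter all lanes back.
std::vector<int> increment_simd_model(std::vector<int> a,
                                      const std::vector<int>& b,
                                      std::size_t width = 4)
{
    for (std::size_t j0 = 0; j0 < b.size(); j0 += width) {
        const std::size_t end = std::min(j0 + width, b.size());
        std::vector<int> lanes;
        for (std::size_t j = j0; j < end; ++j) {
            lanes.push_back(a[b[j]] + 1);   // gather + add
        }
        for (std::size_t j = j0; j < end; ++j) {
            a[b[j]] = lanes[j - j0];        // scatter
        }
    }
    return a;
}
```

With a = {0, 0, 0, 0} and b = {0, 0, 1, 1}, the scalar loop gives {2, 2, 0, 0} while the 4-lane model gives {1, 1, 0, 0}, because both lanes that hit the same element read the old value. With a permutation such as b = {3, 1, 0, 2} the two agree. Using #pragma ivdep amounts to telling the compiler that the first case will not happen.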

andrew_corrigan · ‎11-02-2010

Hi Mark, Thank you for the helpful information. I am very much looking forward to this future release! The original code that I posted is actually a simplified version of the types of loops I am looking to vectorize, but which exhibited the same problem. The exact version of the loop I am looking to vectorize is below. Will the next major release be able to vectorize it?

[cpp]void gather_scatter(int face0, int face1, double* cell_values, double* face_values, int* owner, int* nghbr)
{
    int i_face, i_owner, i_nghbr;
    double fvl;

#pragma ivdep
#pragma vector always 
    for(i_face = face0; i_face < face1; ++i_face)
    {
        i_owner = owner[i_face];
        i_nghbr = nghbr[i_face];
        fvl = face_values[i_face];
        cell_values[i_owner]  = cell_values[i_owner] + fvl;
        cell_values[i_nghbr]  = cell_values[i_nghbr] - fvl;
    }
}[/cpp]
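
To make the access pattern of this loop concrete, here is a minimal sketch that runs the same loop body on a tiny, made-up 1D mesh (three cells in a row joined by two faces). The function names `gather_scatter_demo` and `run_three_cell_example` and the mesh data are assumptions for illustration only; the pragmas are deliberately left out so that it compiles with any C++11 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same loop body as gather_scatter above, written over std::vector.
void gather_scatter_demo(const std::vector<double>& face_values,
                         const std::vector<int>& owner,
                         const std::vector<int>& nghbr,
                         std::vector<double>& cell_values)
{
    assert(owner.size() == face_values.size());
    assert(nghbr.size() == face_values.size());
    for (std::size_t i_face = 0; i_face < face_values.size(); ++i_face) {
        const int i_owner = owner[i_face];
        const int i_nghbr = nghbr[i_face];
        const double fvl = face_values[i_face];
        cell_values[i_owner] = cell_values[i_owner] + fvl;  // owner gains
        cell_values[i_nghbr] = cell_values[i_nghbr] - fvl;  // neighbour loses
    }
}

// Hypothetical mesh: cells 0-1-2, face 0 joins cells 0 and 1,
// face 1 joins cells 1 and 2.
std::vector<double> run_three_cell_example()
{
    std::vector<double> cells(3, 0.0);
    gather_scatter_demo({1.0, 2.0}, {0, 1}, {1, 2}, cells);
    return cells;
}
```

Here cell 1 is the neighbour of face 0 and the owner of face 1, so it is updated in two different iterations (ending at -1.0 + 2.0 = 1.0). That is the kind of cross-iteration overlap through owner and nghbr that the compiler cannot rule out on its own.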