
## How to overcome 'vector dependence' while Loop Vectorizing on ATOM processor using icc compiler

Hi,

I'm trying to optimize code to run on an ATOM processor. I came across a loop that is not actually vector dependent (after analysis), but the compiler still reports "vector dependence". The following is the code snippet of the loop; all variables in it are of type 'short'.

```c
ll_band   = in_buf + band_size*band_size*3;
hl_band   = in_buf;
low_coeff = out_buf;

lh_band    = in_buf + band_size*band_size;
hh_band    = in_buf + band_size*band_size*2;
high_coeff = out_buf + band_size*band_size*2;

for (i = 0; i < band_size; i++)
{
    low_coeff[0]  = ll_band[0] - ((hl_band[0] + hl_band[0] + 1) >> 1);
    high_coeff[0] = lh_band[0] - ((hh_band[0] + hh_band[0] + 1) >> 1);

    /* even coefficients computation */
    for (j = 1; j < band_size; j++)       /* line 671 is this line */
    {
        low_coeff[2*j]  = ll_band[j] - ((hl_band[j-1] + hl_band[j] + 1) >> 1);
        high_coeff[2*j] = lh_band[j] - ((hh_band[j-1] + hh_band[j] + 1) >> 1);
    }

    /* odd coefficients computation */
    for (j = 0; j < band_size-1; j++)     /* line 679 is this line */
    {
        low_coeff[2*j+1]  = 2*hl_band[j] + ((low_coeff[2*j]  + low_coeff[2*j+2])  >> 1);
        high_coeff[2*j+1] = 2*hh_band[j] + ((high_coeff[2*j] + high_coeff[2*j+2]) >> 1);
    }

    low_coeff[2*j+1]  = (hl_band[j] << 1) + low_coeff[2*j];
    high_coeff[2*j+1] = (hh_band[j] << 1) + high_coeff[2*j];

    ll_band   += band_size;
    hl_band   += band_size;
    low_coeff += t_band_size;

    lh_band    += band_size;
    hh_band    += band_size;
    high_coeff += t_band_size;
}
```

When I generate the vector report with the following command:

```
icc -c -O3 -Wall -march=core2 -vec-report3 xxxxxx.c
```

I get the following report for the loops above.

xxxxxx.c(671): (col. 5) remark: loop was not vectorized: existence of vector dependence.
xxxxxx.c(674): (col. 7) remark: vector dependence: assumed FLOW dependence between high_coeff line 674 and hl_band line 673.
xxxxxx.c(673): (col. 7) remark: vector dependence: assumed ANTI dependence between hl_band line 673 and high_coeff line 674.

xxxxxx.c(679): (col. 5) remark: loop was not vectorized: existence of vector dependence.
xxxxxx.c(682): (col. 7) remark: vector dependence: assumed FLOW dependence between high_coeff line 682 and low_coeff line 681.
xxxxxx.c(681): (col. 7) remark: vector dependence: assumed ANTI dependence between low_coeff line 681 and high_coeff line 682.

In the first loop, the compiler complains about a dependence between hl_band and high_coeff, which point to two independent memory regions: in_buf (the input buffer) and out_buf (the output buffer).

In the second loop, it complains about a dependence between low_coeff and high_coeff, which are two independent memory regions within out_buf (the first half of out_buf is for low_coeff, the second half is for high_coeff). So these two variables are also independent.

Since the statements in these loops are independent, I tried forcing vectorization with #pragma ivdep and #pragma vector always. Both loops then vectorized, but the execution time of the two loops actually increased (by a few milliseconds; it certainly did not decrease).

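The forcing attempt can be sketched as follows (a minimal sketch, not the original build: the function name and `int` parameter are mine, and the indices follow the loop above; `#pragma ivdep` is the icc directive, silently ignored by compilers that don't know it):

```c
/* Sketch: forcing vectorization of the even-coefficient update with
 * #pragma ivdep, which tells icc to discount assumed (unproven)
 * dependences between the pointed-to buffers. */
void even_coeffs_forced(short *low_coeff, const short *ll_band,
                        const short *hl_band, int band_size)
{
#pragma ivdep
    for (int j = 1; j < band_size; j++)
        low_coeff[2*j] = (short)(ll_band[j] - ((hl_band[j-1] + hl_band[j] + 1) >> 1));
}
```

Note that `ivdep` only waives *assumed* dependences; proven dependences still block vectorization.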
So, in order to get the compiler to vectorize the loops on its own (not forcibly), I modified the code to separate the two statements that the compiler had flagged as vector dependent. The modified code is as follows.

```c
for (i = 0; i < band_size; i++)
{
    low_coeff[0]  = ll_band[0] - ((hl_band[0] + hl_band[0] + 1) >> 1);
    high_coeff[0] = lh_band[0] - ((hh_band[0] + hh_band[0] + 1) >> 1);

    for (j = 1; j < band_size; j++)       /* line 671 is this line */
    {
        low_coeff[2*j] = ll_band[j] - ((hl_band[j-1] + hl_band[j] + 1) >> 1);
    }

    for (j = 1; j < band_size; j++)       /* line 676 is this line */
    {
        high_coeff[2*j] = lh_band[j] - ((hh_band[j-1] + hh_band[j] + 1) >> 1);
    }

    for (j = 0; j < band_size-1; j++)     /* line 684 is this line */
    {
        low_coeff[2*j+1] = 2*hl_band[j] + ((low_coeff[2*j] + low_coeff[2*j+2]) >> 1);
    }

    for (j = 0; j < band_size-1; j++)     /* line 689 is this line */
    {
        high_coeff[2*j+1] = 2*hh_band[j] + ((high_coeff[2*j] + high_coeff[2*j+2]) >> 1);
    }

    low_coeff[2*j+1]  = (hl_band[j] << 1) + low_coeff[2*j];
    high_coeff[2*j+1] = (hh_band[j] << 1) + high_coeff[2*j];
}
```

Now the compiler says, for the first two inner loops:

xxxxxx.c(671): (col. 5) remark: LOOP WAS VECTORIZED.
xxxxxx.c(671): (col. 5) remark: REMAINDER LOOP WAS VECTORIZED.
xxxxxx.c(671): (col. 5) remark: loop skipped: multiversioned.
xxxxxx.c(676): (col. 5) remark: LOOP WAS VECTORIZED.
xxxxxx.c(676): (col. 5) remark: REMAINDER LOOP WAS VECTORIZED.
xxxxxx.c(676): (col. 5) remark: loop skipped: multiversioned.

and for the next two inner loops:

xxxxxx.c(684): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.
xxxxxx.c(684): (col. 5) remark: loop skipped: multiversioned.
xxxxxx.c(689): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.
xxxxxx.c(689): (col. 5) remark: loop skipped: multiversioned.

Then I used #pragma vector always on the last two inner loops to get them vectorized.

With these changes, the loops vectorized, but again there was no substantial reduction in execution time, perhaps because the work is now spread over four inner loops per iteration instead of two.

Can anyone please suggest how to tell the compiler that these variables are not vector dependent, so that the loops vectorize and actually yield a worthwhile reduction in execution time?

Thanks.

28 Replies

---

**Black Belt:**

Does -xSSSE3 give the same result as -march=core2? I don't know the history of icc's treatment of -march options; besides, you don't say which compiler version you are using. I guess you didn't need -ansi-alias or restrict.

Does it make a difference whether you use all short data types rather than mixing short and int? For 2* you could try replacing it with an addition, in case the compiler didn't do that itself. You don't seem to have a consistent style anyway.

If you don't need j as a global short variable, try making it a local int inside each loop, using C99 or C++ int j=1; and then of course leaving the loop ending conditional as an int expression.  Check whether the loop ending condition is treated as loop invariant.  Without a working example, there's no way for us to know, except that any report of vectorization would imply it's OK.
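The suggestion above can be illustrated like this (a sketch with names of my own; both functions compute the same sum, but the second gives the vectorizer a simple local int induction variable):

```c
/* Sketch: the same reduction written with a short counter (as in the
 * original code) and with a C99 local int counter. */
int sum_short_counter(const short *a, short n)
{
    short j;                       /* narrow, possibly global-style counter */
    int s = 0;
    for (j = 0; j < n; j++)
        s += a[j];
    return s;
}

int sum_int_counter(const short *a, int n)
{
    int s = 0;
    for (int j = 0; j < n; j++)    /* local int counter, int loop bound */
        s += a[j];
    return s;
}
```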

When the compiler says "seems inefficient" it should hardly be a surprise when your results bear that out.

The compiler should fuse your loops if it finds that is better than splitting them in two.

Vectorization with non-contiguous array accesses is always problematic. gcc sometimes has better non-vector optimization than icc (although of course the required options are more complicated).

---

**Beginner:**

Yes, -xSSSE3 gives me the same result as -march=core2.

I'm using version 13.0.1 of the icc compiler.

I didn't find any difference between mixing short and int (short for all data variables and int for loop iterations), and using short for all variables.

Also, I didn't find any difference between 2* and addition (j+j).

Using local int inside each loop didn't help me either.

Your statement "Vectorization with non-contiguous array items is always problematical." may well be right, because I was able to vectorize the loop that operates on contiguous memory locations.

My basic question here concerns the "vector dependence" message. As initialized before the loop, 'high_coeff' and 'hl_band' point into out_buf and in_buf, which are two independent buffers. I'm puzzled: why does the compiler say these two variables are vector dependent? Are there any options, such as compiler directives, to tell the compiler that these two are not vector dependent? How can I resolve this?

---

**Black Belt:**

The #pragma simd directives or the CEAN notation overrule all considerations of vector dependence, but you said originally that you were able to get all loops vectorized (except possibly the original fused version, where restrict might have been sufficient).

In principle, restrict might be useful to work on multiple arrays per loop even if you choose a scalar optimized unrolled compilation with gcc.
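Applied to one of the loops in question, restrict qualification might look like this (a sketch only; the function name and `int` parameter are mine, and the qualifiers promise the compiler the buffers don't overlap):

```c
/* Sketch: restrict-qualified pointers assert non-overlap, which removes
 * the assumed FLOW/ANTI dependences between input and output buffers. */
void even_coeffs_restrict(short *restrict low_coeff,
                          const short *restrict ll_band,
                          const short *restrict hl_band,
                          int band_size)
{
    for (int j = 1; j < band_size; j++)
        low_coeff[2*j] = (short)(ll_band[j] - ((hl_band[j-1] + hl_band[j] + 1) >> 1));
}
```

The promise is the programmer's responsibility: if the buffers actually overlap, behavior is undefined.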

ATOM doesn't have as full a set of blend instructions as other current CPUs, which could otherwise support these cases where you modify some but not all elements of a vector.

---

**Black Belt:**

Out of curiosity, what happens when you replace the shifts with *2 and /2?

Jim Dempsey

---

**Black Belt:**

Back in the Pentium 4 days, when compilers didn't optimize the choice between add, shift, and multiply, the add was recommended. Right shift is usually recommended over signed /2 where applicable. Vectorizing compilers would be expected to make such choices automatically.

I was wondering about inconsistencies of style, which do give us more to consider, and whether an int shift count, multiplier, or divisor would alter behavior by forcing promotion from short int.
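The promotion in question is a plain C rule and can be seen directly (a minimal illustration; function names are mine):

```c
/* Sketch: C's integer promotions widen short operands to int before
 * arithmetic, so intermediate results are computed in int range even
 * when all the declared variables are short. */
int promoted_sum(short a, short b)
{
    return a + b;              /* computed in int; no short overflow */
}

int promoted_shift(short x)
{
    return (x + x + 1) >> 1;   /* the thread's (a + a + 1) >> 1 pattern,
                                  also evaluated in int */
}
```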
---

**Valued Contributor II:**
Do you think it is always possible to rely on the C++ compiler instead of restructuring your own code? We recently had a case where the most aggressive optimization option of the Intel C++ compiler, /O3, did not work well and produced the slowest code compared to the /O1 and /O2 options.

>>...Is there any options like compiler directives to tell the compiler that these 2 are not vector dependent?

The blocks of lines in your initial post (for example, 671-674) look like a kind of software pipelining; take a look at the #pragma swp directive in the Intel C++ Compiler User and Reference Guide. What about #pragma optimize("", off) before your function and #pragma optimize("", on) after it, with manual unrolling of these loops and prefetching?

---

**Beginner:**

The #pragma directives do vectorize the loop, but since this is an overruling process, I'm not getting much gain in terms of reduced loop execution time.

My understanding from these experiments is that because we are writing values to non-contiguous memory locations, the vectorization is not very efficient: I compute the array indices (addresses) separately to place the odd and even components at odd and even positions in the output buffer. Am I right in concluding this?

I did another experiment where I removed all the array-index calculations and made the statements in the loop very simple; its result didn't support my conclusion above. The code snippet is as follows:

```c
/* even coefficients computation */
for (j = 1; j < band_size; j++)       /* line 671 is this line */
{
    low_coeff[j]  = ll_band[j];
    high_coeff[j] = lh_band[j];
}

/* odd coefficients computation */
for (j = 0; j < band_size-1; j++)     /* line 679 is this line */
{
    low_coeff[j+1]  = 2*hl_band[j];
    high_coeff[j+1] = 2*hh_band[j];
}
```

xxxxxx.c(671): (col. 5) remark: loop was not vectorized: existence of vector dependence.
xxxxxx.c(674): (col. 7) remark: vector dependence: assumed FLOW dependence between high_coeff line 674 and ll_band line 673.
xxxxxx.c(673): (col. 7) remark: vector dependence: assumed ANTI dependence between ll_band line 673 and high_coeff line 674.

xxxxxx.c(679): (col. 5) remark: loop was not vectorized: existence of vector dependence.
xxxxxx.c(682): (col. 7) remark: vector dependence: assumed FLOW dependence between high_coeff line 682 and hl_band line 681.
xxxxxx.c(681): (col. 7) remark: vector dependence: assumed ANTI dependence between hl_band line 681 and high_coeff line 682.

Here also the compiler says these two variables are vector dependent, even though they point to two different buffers (in_buf and out_buf).

#pragma basically overrules these messages and vectorizes, but perhaps not efficiently. Is there any other way to tell the compiler that these two variables are not dependent, so that it vectorizes efficiently?

I tried another option, declaring the variables initialized from in_buf as 'const', but it was not helpful.

Now I feel I'm left with the options of restructuring my code for contiguous memory writes, or switching to intrinsics.
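One such restructuring can be sketched as follows (a hypothetical sketch, not taken from the thread: compute into a contiguous temporary with unit-stride stores, then interleave into the strided output in a second pass; names and the `tmp` buffer are mine):

```c
/* Sketch: two-pass even-coefficient update.  Pass 1 is unit-stride and
 * easy to vectorize; pass 2 does the strided interleave separately. */
void even_coeffs_two_pass(short *low_coeff, const short *ll_band,
                          const short *hl_band, short *tmp, int band_size)
{
    for (int j = 1; j < band_size; j++)   /* unit-stride stores */
        tmp[j] = (short)(ll_band[j] - ((hl_band[j-1] + hl_band[j] + 1) >> 1));

    for (int j = 1; j < band_size; j++)   /* strided interleave */
        low_coeff[2*j] = tmp[j];
}
```

Whether this wins depends on whether the vectorized first pass saves more than the extra pass over `tmp` costs.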
---

**Valued Contributor II:**
By the way, how big are all these arrays? Could you post a complete, simplified test case with declarations, including a value for the band_size variable?

---

**Black Belt:**

You haven't shown us how you informed the compiler that hh_band, low_coeff, and high_coeff point to non-overlapping data regions. However you did it, the message didn't get through.

Once again, frequently used alternatives include:

- buffer definitions local to the compilation unit
- `short int *restrict hh_band, ...`
- `#pragma ivdep`
- `#pragma simd vectorlength(16)` // or any other number which may be accepted
- compiler options asking for assumption of standard compliance: -std=c99 -ansi-alias
- noalias options more aggressive than the standard
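Several of these alternatives can be combined on one loop, e.g. (a sketch; the function and the `int` parameter are illustrative, and the pragma is icc-specific and ignored elsewhere):

```c
/* Sketch: restrict-qualified pointers plus #pragma ivdep on the
 * odd-coefficient update from the thread. */
void odd_coeffs_hinted(short *restrict low_coeff,
                       const short *restrict hl_band, int band_size)
{
#pragma ivdep                   /* icc: discount assumed dependences */
    for (int j = 0; j < band_size - 1; j++)
        low_coeff[2*j+1] = (short)(2*hl_band[j]
                         + ((low_coeff[2*j] + low_coeff[2*j+2]) >> 1));
}
```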
---

**Beginner:**

These arrays have 4096 elements. Here is the function I'm trying to vectorize:

```c
void func(short *in_buf, short *out_buf, short band_size)
{
    short *low_coeff, *high_coeff;
    short *ll_band, *hl_band, *lh_band, *hh_band;
    short t_band_size, j, i;

    t_band_size = band_size*2;

    ll_band   = in_buf + band_size*band_size*3;
    hl_band   = in_buf;
    low_coeff = out_buf;

    lh_band    = in_buf + band_size*band_size;
    hh_band    = in_buf + band_size*band_size*2;
    high_coeff = out_buf + band_size*band_size*2;

    for (i = 0; i < band_size; i++)
    {
        low_coeff[0]  = ll_band[0] - ((hl_band[0] + hl_band[0] + 1) >> 1);
        high_coeff[0] = lh_band[0] - ((hh_band[0] + hh_band[0] + 1) >> 1);

        /* even coefficients computation */
        for (j = 1; j < band_size; j++)       /* line 671 is this line */
        {
            low_coeff[2*j]  = ll_band[j] - ((hl_band[j-1] + hl_band[j] + 1) >> 1);
            high_coeff[2*j] = lh_band[j] - ((hh_band[j-1] + hh_band[j] + 1) >> 1);
        }

        /* odd coefficients computation */
        for (j = 0; j < band_size-1; j++)     /* line 679 is this line */
        {
            low_coeff[2*j+1]  = 2*hl_band[j] + ((low_coeff[2*j]  + low_coeff[2*j+2])  >> 1);
            high_coeff[2*j+1] = 2*hh_band[j] + ((high_coeff[2*j] + high_coeff[2*j+2]) >> 1);
        }

        low_coeff[2*j+1]  = (hl_band[j] << 1) + low_coeff[2*j];
        high_coeff[2*j+1] = (hh_band[j] << 1) + high_coeff[2*j];

        ll_band   += band_size;
        hl_band   += band_size;
        low_coeff += t_band_size;

        lh_band    += band_size;
        hh_band    += band_size;
        high_coeff += t_band_size;
    }
}
```

It is called with these parameters:

```c
func(in_buf, out_buf, 32); /* band_size = 32 */
```

---

**Beginner:**

I didn't explicitly inform the compiler that hl_band and high_coeff are non-overlapping memory regions; I initialized the two variables from different buffers. Isn't the compiler going to treat these two variables as pointing to non-overlapping memory regions? How do I explicitly inform the compiler about it?

I have used restrict (short *restrict hh_band...), #pragma ivdep, and #pragma simd. #pragma ivdep and #pragma simd vectorize the loop, but I'm not getting a good reduction in the execution time of these loops (there is no change in timing at all), so I'm assuming the vectorization with #pragma is not efficient. restrict didn't help me.

---

**Valued Contributor II:**
>>...Isn't the compiler going to consider these 2 variables in non-overlapping memory regions? How do i explicitly inform compiler about it?

There are several Intel C++ compiler options for rearranging memory, and you could try them. Since your arrays have 4096 elements (short type, 8192 bytes per array), you may have issues related to L2 cache lines, and VTune could provide additional technical detail.

---

**Black Belt:**

You may get some advantage from declspec(align(32)) at the point of definition of the buffers, if you also specify __assume_aligned(.... for those just before the loops to be optimized. #pragma vector aligned is like vector always, with the additional implied assertion that all operands are 16-byte aligned.

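An aligned-allocation setup along these lines might look as follows (a sketch under my own names; `posix_memalign` is POSIX, and the icc-only `__assume_aligned` hint is guarded so the sketch also builds with other compilers):

```c
#include <stdlib.h>
#include <stdint.h>

/* Sketch: allocate a 16-byte-aligned short buffer, then (with icc)
 * assert that alignment to the optimizer before the hot loops. */
short *alloc_aligned_shorts(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(short)) != 0)
        return NULL;           /* allocation failed */
#ifdef __INTEL_COMPILER
    __assume_aligned(p, 16);   /* icc-specific alignment assertion */
#endif
    return (short *)p;
}
```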
The advantage of 32-byte over 16-byte alignment varies with CPU model, and I haven't seen it documented. It's not directly associated with the chosen instruction-set architecture, so I don't know about ATOM.

---

**Beginner:**

Thanks for your valuable inputs.

The byte-alignment attribute __attribute__(align(16)) (as I'm on Linux) didn't help vectorize the loop; I'm still getting the 'vector dependence' between those two variables. Are there any other options to make this loop vectorize?

I'm now trying the option of using intrinsics to vectorize this loop.

---

**Valued Contributor II:**
Karthik, could you post the complete list of command-line options and #pragma directives you currently use with the Intel C++ compiler?

---

**Beginner:**

I use this command to generate the vector report:

```
icc -c -Wall -xSSSE3 -vec-report3 xxxxxx.c
```

I declare the variables as follows to align memory (the same way for the other pointer variables in the loop):

```c
short *ll_band __attribute__(align(16));
```

I used the following #pragma directives, with these results:

- #pragma simd : vectorized the loop
- #pragma vector aligned : didn't vectorize; says vector dependence between 2 variables
- #pragma ivdep : vectorized the loop
- #pragma vector always : didn't vectorize; says vector dependence between 2 variables

Currently I'm using #pragma simd.

Whenever the compiler reported LOOP WAS VECTORIZED under the #pragma directives, I didn't get a substantial (reasonable) reduction in execution time when I ran the code, so I concluded that the vectorization was not efficient.