Autovectorization when using TBB parallel_for and _Cilk_for

Nachiket_Bhave · ‎12-27-2011

I am working on a piece of code that operates on 3-dimensional matrices using 3 nested 'for' loops. There are no dependencies across loop iterations, so the work can be parallelized trivially by executing the outermost 'for' loop in parallel. In addition, I am trying to speed up the innermost loop using auto-vectorization. The compiler behaves differently when auto-vectorizing the innermost loop if the outermost loop is parallelized using TBB parallel_for. I am using icc 12.0.2. The following is a snippet of the serial code.

int i;
int start;
int end;
int ip1, ip2;
int j, k;
for (j = 2; j < (nzlp4 - 2); j++) {
    for (k = 2; k < (nylp4 - 2); k++) {
        start = (j * nxlp4bnylp4) + (k * nxlp4);
        end = start + nxlp4 - 2;
        ip1 = start + 1;
        ip2 = start + 2;
        #pragma ivdep
        for (i = start; i < end - 2; i++)
        {
            Predictor1[i] = 7.0f*Velocity[i] - 8.0f*Velocity[ip1] + Velocity[ip2];
            Predictor2[i] = 7.0f*Pressure[i] - 8.0f*Pressure[ip1] + Pressure[ip2];
            Predictor1[i] = Pressure[i] - factor * Rigidity[i] * Predictor1[i];
            Predictor2[i] = Velocity[i] - factor * InverseDensity[i] * Predictor2[i];
            ip1++;
            ip2++;
        }
    }
}

I parallelized this code using Cilk Plus and TBB by replacing the outermost for loop with _Cilk_for and parallel_for, respectively.
The following is the Cilk Plus code snippet:

int j;
_Cilk_for(j = 2; j < (nzlp4 - 2); j++) {
    int i;
    int start;
    int end;
    int ip1, ip2;
    int k;
    for (k = 2; k < (nylp4 - 2); k++) {
        start = (j * nxlp4bnylp4) + (k * nxlp4);
        end = start + nxlp4 - 2;
        ip1 = start + 1;
        ip2 = start + 2;
        #pragma ivdep
        for (i = start; i < end - 2; i++)
        {
            Predictor1[i] = 7.0f*Velocity[i] - 8.0f*Velocity[ip1] + Velocity[ip2];
            Predictor2[i] = 7.0f*Pressure[i] - 8.0f*Pressure[ip1] + Pressure[ip2];
            Predictor1[i] = Pressure[i] - factor * Rigidity[i] * Predictor1[i];
            Predictor2[i] = Velocity[i] - factor * InverseDensity[i] * Predictor2[i];
            ip1++;
            ip2++;
        }
    }
}

The following is the TBB code snippet:

parallel_for(blocked_range<int>(2, nzlp4 - 2), [=] (const blocked_range<int> &r) {
    int start;
    int end;
    int i, j, k;
    int ip1, ip2;
    for (j = r.begin(); j < r.end(); j++) {
        for (k = 2; k < (nylp4 - 2); k++) {
            start = (j * nxlp4bnylp4) + (k * nxlp4);
            end = start + nxlp4 - 2;
            ip1 = start + 1;
            ip2 = start + 2;
            #pragma ivdep
            for (i = start; i < end - 2; i++)
            {
                Predictor1[i] = 7.0f*Velocity[i] - 8.0f*Velocity[ip1] + Velocity[ip2];
                Predictor2[i] = 7.0f*Pressure[i] - 8.0f*Pressure[ip1] + Pressure[ip2];
                Predictor1[i] = Pressure[i] - factor * Rigidity[i] * Predictor1[i];
                Predictor2[i] = Velocity[i] - factor * InverseDensity[i] * Predictor2[i];
                ip1++;
                ip2++;
            }
        }
    }
});

Please note that I have used the pragma 'ivdep' to ensure that the innermost loops are auto-vectorized. However, auto-vectorization happens only in the serial version and the Cilk Plus version of the code. In the TBB version, auto-vectorization does not happen. The TBB version gets vectorized only after I add the pragma 'vector always', and the resulting code then takes more than twice as long to execute as the serial and Cilk Plus versions. Note that to make the comparison fair I have restricted Cilk Plus and TBB to a single processor core.
After some searching I found that if I pass the flag -ansi-alias to the compiler, the TBB version is vectorized as well (without the pragma 'vector always'). The resulting code is as fast as the serial and the Cilk Plus code.
So even though I have succeeded in making the TBB version as fast as the Cilk Plus and serial versions, I do not understand the behaviour of the compiler. Why does it require additional flags to auto-vectorize the same loop in the TBB version but not in the serial or the Cilk Plus version? In other words, what additional dependencies does the compiler assume in the TBB version that make the code difficult to vectorize in the absence of ANSI aliasing rules? Can anyone explain this?

Michael_K_Intel2 · ‎12-27-2011

Hi,

I cannot say anything about the dependencies in your code, but have you tried -vec-report5 to obtain the compiler's vectorization report? It will give you some insight into why the compiler did not vectorize the loop in question.

I can imagine that the TBB version of your code is harder to analyze for the compiler, since it involves a C++ lambda expression that might hide some of the data dependencies.

You could also try SIMD pragmas for your TBB code to give even more information to the compiler about how to vectorize the code.
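As a rough illustration of that last suggestion, here is a minimal sketch assuming the Intel-specific `#pragma simd` and a simplified one-array version of the stencil. The function name `stencil_row` and its parameters are made up for this example and do not come from the original code. Unlike `ivdep`, `#pragma simd` asks the compiler to vectorize without proving the loop safe, so it should only be used when the iterations really are independent.

```cpp
#include <cassert>

// Hypothetical helper, not from the original post: applies the
// 7/-8/+1 stencil to one row of n floats.
void stencil_row(float* out, const float* vel, int n) {
    assert(n >= 2);
    // The pragma is guarded so that other compilers simply skip it.
#if defined(__INTEL_COMPILER)
#pragma simd
#endif
    for (int i = 0; i < n - 2; ++i) {
        out[i] = 7.0f * vel[i] - 8.0f * vel[i + 1] + vel[i + 2];
    }
}
```

The same loop body could sit inside the lambda passed to parallel_for; whether the pragma changes the outcome there would still need to be checked in the vectorization report.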

Cheers,
-michael

TimP · ‎12-27-2011

The vectorizer may have an easier time if you write
ip1 = i+1;
ip2 = i+2;
so as to reduce the degree of analysis required to vectorize.

jimdempseyatthecove · ‎12-29-2011

To expand on TimP's suggestion (assuming it does not get you the vectorization you want):

Remove ip1 and ip2 and use [i+1] and [i+2] instead.

The Intel instruction set performs those additions as part of the memory operand of the instruction, as opposed to performing the add as a separate step.

IA32 and Intel64 have a Scale, Index and Base addressing format together with an (optional) immediate offset. The +1 and +2 will (should) get generated as an immediate offset (scaled appropriately by the compiler).

An additional advantage of using i+1 and i+2 is that it relieves register pressure (assuming the compiler optimizes correctly).
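A minimal sketch of what that rewrite might look like for one row, written as a free function so it can be compiled on its own. The name `predict_row`, the parameter names, and the assumption that rigidity and inverse density are per-element arrays are illustrative only. The two temporaries replace the intermediate stores to the predictor arrays in the posted code.

```cpp
#include <cassert>

// Hypothetical helper, not from the original post: updates one row of
// n elements, indexing with i+1 and i+2 instead of ip1 and ip2.
void predict_row(float* predictor1, float* predictor2,
                 const float* velocity, const float* pressure,
                 const float* rigidity, const float* inverse_density,
                 float factor, int n) {
    assert(n >= 2);
    for (int i = 0; i < n - 2; ++i) {
        // The +1 and +2 can be folded into the address of each load.
        const float dv = 7.0f * velocity[i] - 8.0f * velocity[i + 1] + velocity[i + 2];
        const float dp = 7.0f * pressure[i] - 8.0f * pressure[i + 1] + pressure[i + 2];
        predictor1[i] = pressure[i] - factor * rigidity[i] * dv;
        predictor2[i] = velocity[i] - factor * inverse_density[i] * dp;
    }
}
```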

Jim Dempsey

Nachiket_Bhave · ‎12-30-2011

Replacing ip1 with i+1 and ip2 with i+2 inhibits auto-vectorization for some reason. Also, using function objects instead of lambda expressions has no effect on the vectorization of the code. However, as I already mentioned in the question, the code gets vectorized properly if the flag '-ansi-alias' is used. What I am hoping for is some insight into what additional dependencies icc assumes in the TBB version of the code that make vectorization difficult.

TimP · ‎12-30-2011

If the compiler expands the parallel loop into an internal function call, you introduce an additional dependence on interprocedural analysis to recognize non-overlap of your data. -ansi-alias allows the compiler to assume your code complies with the aliasing rules of the C standard, which facilitates that analysis. For example, if your Predictor data types are distinct from pointers, assigning to elements of them could be assumed not to clobber your pointers.
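To make the clobbering concern concrete, here is a minimal sketch of a function object whose arrays are reached through pointer members, which is roughly how a [=] lambda holds its captures. The struct `ScaleBody`, its members, and the (begin, end) call signature are hypothetical stand-ins, not TBB's API and not the original code. Without the aliasing guarantee, a compiler may have to assume that a store through `out` could overwrite the members themselves. Copying the members into locals is one common workaround that might help when -ansi-alias is not an option.

```cpp
#include <cassert>

// Hypothetical stand-in for a parallel_for body.
struct ScaleBody {
    float* out;
    const float* in;
    float factor;

    void operator()(int begin, int end) const {
        // Local copies whose addresses are never taken: a store to
        // dst[i] cannot modify dst, src, or f.
        float* dst = out;
        const float* src = in;
        const float f = factor;
        for (int i = begin; i < end; ++i) {
            dst[i] = f * src[i];
        }
    }
};
```

Whether this is what icc 12 trips over in the TBB version is an assumption; the vectorization report would be the way to confirm it.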

jimdempseyatthecove · ‎12-30-2011

>>The compiler is behaving differently while auto-vectorizing innermost loop if the outermost loop is parallelized using TBB parallel_for.

If Predictor, Velocity, Pressure and InverseDensity are pointers declared outside the scope of the parallel_for, then capture them by value [=] as opposed to by reference [&].

Jim Dempsey

Nachiket_Bhave · ‎01-02-2012

So does this mean that Cilk Plus does not expand the parallel loop into an internal function call? Otherwise Cilk Plus should face the same problem as TBB when vectorizing the code.