I am working on a piece of code in which I perform operations on 3-dimensional matrices using 3 nested 'for' loops. There are no dependencies across loop iterations, so the operations can be parallelized trivially by executing the outermost 'for' loop in parallel. In addition, I am trying to speed up the innermost loop using auto-vectorization. The compiler behaves differently when auto-vectorizing the innermost loop if the outermost loop is parallelized using TBB parallel_for. I am using icc 12.0.2. Following is a snippet of the serial code.
int i;
int start;
int end;
int ip1, ip2;
int j, k;
for (j = 2; j < (nzlp4 - 2); j++) {
    for (k = 2; k < (nylp4 - 2); k++) {
        start = (j * nxlp4bnylp4) + (k * nxlp4);
        end = start + nxlp4 - 2;
        ip1 = start + 1;
        ip2 = start + 2;
        #pragma ivdep
        for (i = start; i < end - 2; i++)
        {
            Predictor1[i] = 7.0f*Velocity[i] - 8.0f*Velocity[ip1] + Velocity[ip2];
            Predictor2[i] = 7.0f*Pressure[i] - 8.0f*Pressure[ip1] + Pressure[ip2];
            Predictor1[i] = Pressure[i] - factor * Rigidity * Predictor1[i];
            Predictor2[i] = Velocity[i] - factor * InverseDensity * Predictor2[i];
            ip1++;
            ip2++;
        }
    }
}
I parallelized this code using Cilk Plus and TBB by replacing the outermost for loop with _Cilk_for and parallel_for respectively.
Following is the Cilk Plus code snippet:
int j;
_Cilk_for (j = 2; j < (nzlp4 - 2); j++) {
    int i;
    int start;
    int end;
    int ip1, ip2;
    int k;
    for (k = 2; k < (nylp4 - 2); k++) {
        start = (j * nxlp4bnylp4) + (k * nxlp4);
        end = start + nxlp4 - 2;
        ip1 = start + 1;
        ip2 = start + 2;
        #pragma ivdep
        for (i = start; i < end - 2; i++)
        {
            Predictor1[i] = 7.0f*Velocity[i] - 8.0f*Velocity[ip1] + Velocity[ip2];
            Predictor2[i] = 7.0f*Pressure[i] - 8.0f*Pressure[ip1] + Pressure[ip2];
            Predictor1[i] = Pressure[i] - factor * Rigidity * Predictor1[i];
            Predictor2[i] = Velocity[i] - factor * InverseDensity * Predictor2[i];
            ip1++;
            ip2++;
        }
    }
}
Following is the TBB code snippet:
parallel_for(blocked_range<int>(2, nzlp4 - 2), [=] (const blocked_range<int> &r) {
    int start;
    int end;
    int i, j, k;
    int ip1, ip2;
    for (j = r.begin(); j < r.end(); j++) {
        for (k = 2; k < (nylp4 - 2); k++) {
            start = (j * nxlp4bnylp4) + (k * nxlp4);
            end = start + nxlp4 - 2;
            ip1 = start + 1;
            ip2 = start + 2;
            #pragma ivdep
            for (i = start; i < end - 2; i++)
            {
                Predictor1[i] = 7.0f*Velocity[i] - 8.0f*Velocity[ip1] + Velocity[ip2];
                Predictor2[i] = 7.0f*Pressure[i] - 8.0f*Pressure[ip1] + Pressure[ip2];
                Predictor1[i] = Pressure[i] - factor * Rigidity * Predictor1[i];
                Predictor2[i] = Velocity[i] - factor * InverseDensity * Predictor2[i];
                ip1++;
                ip2++;
            }
        }
    }
});
Please note that I have used the 'ivdep' pragma to ensure the innermost loops are auto-vectorized. However, auto-vectorization happens only for the serial version and the Cilk Plus version of the code. In the TBB version, auto-vectorization does not happen; only after adding the additional pragma 'vector always' does the TBB version get vectorized, and the resulting code then takes more than twice the time of the serial and Cilk Plus versions. Note that, to make the comparison fair, I restricted Cilk Plus and TBB to a single processor core.
After some searching I found that if I pass the -ansi-alias flag to the compiler, vectorization happens even for the TBB version (without the 'vector always' pragma), and the resulting code is as fast as the serial and Cilk Plus code.
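For reference, the difference shows up with a compile line roughly like the following (the source file name here is just a placeholder; the only change between the two runs is the -ansi-alias flag):
icc -O3 -vec-report2 predictor.cpp -ltbb               # TBB inner loop reported as not vectorized
icc -O3 -ansi-alias -vec-report2 predictor.cpp -ltbb   # TBB inner loop vectorized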
So even though I have succeeded in making the TBB version as fast as the Cilk Plus and serial versions, I am unable to understand the compiler's behaviour. Why does it need additional flags to properly auto-vectorize the same loop in the TBB version but not in the serial or Cilk Plus versions? In other words, what additional dependencies does the compiler assume in the TBB version that make the code difficult to vectorize in the absence of the ANSI aliasing rules? Can anyone explain this?
7 Replies
Hi,
I cannot tell about the dependencies in your code. But have you tried -vec-report5 to obtain the compiler's vectorization report? It will give you some insight into why the compiler did not vectorize the loop in question.
I can imagine that the TBB version of your code is harder for the compiler to analyze, since it involves a C++ lambda expression that might hide some of the data dependencies.
You could also try SIMD pragmas for your TBB code to give even more information to the compiler about how to vectorize the code.
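For example, something along these lines on the innermost loop of the TBB version (just a sketch; note that #pragma simd, unlike ivdep, is an assertion that the loop is safe to vectorize, so the loop really must be free of dependencies):
#pragma simd
for (i = start; i < end - 2; i++)
{
    /* ... loop body unchanged ... */
}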
Cheers,
-michael
The vectorizer may have an easier time if you write
ip1 = i+1;
ip2 = i+2;
so as to reduce the degree of analysis required to vectorize.
To expand on TimP's suggestion (assuming it does not get you the vectorization you want):
remove ip1 and ip2 and use [i+1] and [i+2] instead.
The Intel instruction set performs those additions automatically within the instruction's addressing, as opposed to performing the add as a separate step.
IA-32 and Intel 64 have a scale, index and base addressing format together with an (optional) immediate offset. The +1 and +2 will (should) be generated as an immediate offset (scaled appropriately by the compiler).
An additional advantage of using i+1 and i+2 is that it relieves register pressure (assuming the compiler optimizes correctly).
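For illustration, a sketch of the innermost loop written that way, reusing the names from the original post (this assumes, as the ivdep pragma already asserts, that the arrays do not overlap):
#pragma ivdep
for (i = start; i < end - 2; i++)
{
    Predictor1[i] = 7.0f*Velocity[i] - 8.0f*Velocity[i+1] + Velocity[i+2];
    Predictor2[i] = 7.0f*Pressure[i] - 8.0f*Pressure[i+1] + Pressure[i+2];
    Predictor1[i] = Pressure[i] - factor * Rigidity * Predictor1[i];
    Predictor2[i] = Velocity[i] - factor * InverseDensity * Predictor2[i];
}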
Jim Dempsey
Replacing ip1 with i+1 and ip2 with i+2 inhibits auto-vectorization for some reason. Also, using function objects instead of lambda expressions does not have any effect on the vectorization of the code. However, as I mentioned in the question, the code gets vectorized properly if the '-ansi-alias' flag is used. What I am hoping for is some insight into what additional dependencies icc assumes in the TBB version of the code that make vectorization difficult.
If the compiler expands the parallel loop into an internal function call, you introduce an additional dependence on interprocedural analysis to recognize the non-overlap of your data. -ansi-alias allows the compiler to assume your code complies with the C standard, facilitating that analysis. For example, if your Predictor data types are distinct from pointers, assigning to elements of them could be assumed not to clobber your pointers.
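A tiny hypothetical illustration of the kind of reasoning involved (the struct and function names here are made up, not taken from your code):
// The captured pointers live in a closure object that the generated
// body function reads through on each iteration.
struct Closure {
    float *Pressure;
    float *Predictor1;
};
void body(Closure *c, int i) {
    // Without the ANSI aliasing rules the compiler must allow for the
    // possibility that this float store overwrites c->Pressure itself,
    // so it cannot hoist the pointer loads out of the loop. With
    // -ansi-alias, a store to a float cannot modify a float* object.
    c->Predictor1[i] = c->Pressure[i];
}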
>>The compiler behaves differently when auto-vectorizing the innermost loop if the outermost loop is parallelized using TBB parallel_for.
If Predictor, Velocity, Pressure, and InverseDensity are pointers declared outside the scope of the parallel_for, then capture them by value [=] as opposed to by reference [&].
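For example, just the capture part (a sketch, with the loop nest elided):
parallel_for(blocked_range<int>(2, nzlp4 - 2),
             [=] (const blocked_range<int> &r) {  // [=] copies the pointers into the closure
                 /* ... loop nest as above ... */
             });
// as opposed to [&], which would capture them by reference.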
Jim Dempsey
So does this mean that Cilk Plus does not expand the parallel loop into an internal function call? Otherwise Cilk Plus should face the same vectorization problem as TBB.