Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Intel TBB and Auto Vectorization

D_M_
Beginner
Is there any way to use TBB and autovectorization?

I have this, where kIterSize is about 512K.
parallel_for(blocked_range<size_t>(0, kIterSize), parallel_task(), auto_partitioner());

I then do this inside my thread function
for(size_t i = r.begin(); i < r.end(); i++)
    BufB[i] += BufA[i] / 2.0f;

This is not autovectorized, so I am taking a huge performance hit.

I want SSE4 and TBB running on all processors.
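
For context, a minimal sketch of the body class implied above (parallel_task, BufA, and BufB are the names from this post; the array declarations and the operator() signature are assumptions):

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Assumed: BufA and BufB are float arrays of kIterSize elements, defined elsewhere.
extern float BufA[], BufB[];

struct parallel_task {
    void operator()(const tbb::blocked_range<size_t>& r) const {
        // Plain per-element loop over the sub-range handed to this task.
        for(size_t i = r.begin(); i < r.end(); i++)
            BufB[i] += BufA[i] / 2.0f;
    }
};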
Alexey-Kukanov
Employee
Try assigning r.end() to a local variable and using that in the loop condition; that should help the vectorizer recognize the loop as suitable.
You might look at this post of mine for a somewhat related investigation.
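
For example, applied to the loop from the first post (just a sketch; the index type is assumed):

size_t end = r.end();
for(size_t i = r.begin(); i < end; i++)
    BufB[i] += BufA[i] / 2.0f;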
D_M_
Beginner
Quoting - Alexey-Kukanov
Try assigning r.end() to a local variable and using that in the loop condition; that should help the vectorizer recognize the loop as suitable.
You might look at this post of mine for a somewhat related investigation.

Thank you for the response. Your blog is very helpful. However, the local variable assignment did not help.

The compiler is telling me it did not autovectorize because
"parallel_for.h(89): (col. 20) remark: loop was not vectorized: unsupported loop structure.".

This line contains:
if( !my_range.is_divisible() || my_partition.should_execute_range(*this) ) {
    my_body( my_range );
    return my_partition.continue_after_execute_range(*this);
}


When I replaced r.end() with a local variable, it gave these reasons:

parallel_for.h(89): (col. 20) remark: loop was not vectorized: existence of vector dependence.
parallel_for.h(89): (col. 20) remark: vector dependence: assumed FLOW dependence between (unknown) line 89 and this line 89.
parallel_for.h(89): (col. 20) remark: vector dependence: assumed ANTI dependence between this line 89 and (unknown) line 89.
Alexey-Kukanov
Employee
As you may have noticed, line 89 is where the call to operator() on the body object is made. I have no idea why the compiler pointed to this line; my best guess is that the remarks really apply to the actual loop over the blocked_range in your code.

If that loop consists of exactly one line, as you wrote in the first post, then the compiler probably assumes conservatively that the arrays you operate on could overlap (i.e. it cannot prove that they do not). Assuming you use the Intel Compiler, I suggest you look at its documentation on vectorization. Let me quote just one sentence that might be relevant:

"For example, a common problem with global pointers is that they often prevent the compiler from being able to prove that two memory references refer to distinct locations. Consequently, this prevents certain reordering transformations."

In particular, #pragma ivdep can be used to tell the compiler to discard assumed dependencies if you know for sure they are imaginary.
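
For example, a sketch against the loop from the first post (the pragma applies only to the loop that immediately follows it):

void operator()(const tbb::blocked_range<size_t>& r) const {
    size_t end = r.end();
    #pragma ivdep  // ignore assumed (unproven) dependencies in the next loop
    for(size_t i = r.begin(); i < end; i++)
        BufB[i] += BufA[i] / 2.0f;
}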
Dmitry_Vyukov
Valued Contributor I
Quoting - Poiuyt
parallel_for.h(89): (col. 20) remark: loop was not vectorized: existence of vector dependence.
parallel_for.h(89): (col. 20) remark: vector dependence: assumed FLOW dependence between (unknown) line 89 and this line 89.
parallel_for.h(89): (col. 20) remark: vector dependence: assumed ANTI dependence between this line 89 and (unknown) line 89.

You may try to use the __restrict/restrict keyword if supported by the compiler, e.g.:

float* __restrict B = BufB;
float* __restrict A = BufA;
int end = r.end();
for(int i = r.begin(); i < end; i++)
    B[i] += A[i] / 2.0f;

This will communicate to the compiler that BufA and BufB do not overlap, so there are no dependencies.
D_M_
Beginner
Quoting - Dmitriy Vyukov

You may try to use the __restrict/restrict keyword if supported by the compiler, e.g.:

float* __restrict B = BufB;
float* __restrict A = BufA;
int end = r.end();
for(int i = r.begin(); i < end; i++)
    B[i] += A[i] / 2.0f;

This will communicate to the compiler that BufA and BufB do not overlap, so there are no dependencies.

I am using the latest version of the Intel Compiler.

The __restrict keyword worked and I got my performance back. Thanks!