Is there any way to use TBB and autovectorization?
I have this, where kIterSize is about 512K:
parallel_for(blocked_range<int>(0, kIterSize), parallel_task(), auto_partitioner());
I then do this inside my thread function:
for(i = r.begin(); i < r.end(); i++)
    BufB[i] += BufA[i] / 2.0f;
This is not autovectorized, so I am taking a huge performance hit.
I want SSE4 and TBB running on all processors.
5 Replies
Try assigning r.end() to a local variable and use that in the loop condition; that should help the vectorizer recognize the loop as suitable.
You might look at this post of mine for a somewhat related investigation.
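The hoisting suggestion above can be sketched as a plain loop kernel, independent of TBB (the function name and signature here are illustrative, not from the thread): copying r.end() into a local makes the bound visibly loop-invariant, which removes one obstacle to vectorization.

```cpp
#include <cstddef>

// Sketch of the suggested fix: hoist the loop bound into a local variable
// so the vectorizer can see it does not change during the loop.
void halve_into(const float* a, float* b, std::size_t begin, std::size_t end_) {
    const std::size_t n = end_;  // local copy of r.end()
    for (std::size_t i = begin; i < n; ++i)
        b[i] += a[i] / 2.0f;
}
```

Without the local copy, the compiler may assume the body's end() call (or the memory it reads) could be modified by the stores inside the loop.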
Quoting - Alexey Kukanov (Intel)
Try assigning r.end() to a local variable and use that in the loop condition; that should help the vectorizer recognize the loop as suitable.
You might look at this post of mine for a somewhat related investigation.
Thank you for the response. Your blog is very helpful. However, the local variable assignment did not help.
The compiler is telling me it did not autovectorize because
"parallel_for.h(89): (col. 20) remark: loop was not vectorized: unsupported loop structure.".
This line contains:
if( !my_range.is_divisible() || my_partition.should_execute_range(*this) ) {
my_body( my_range );
return my_partition.continue_after_execute_range(*this);
}
When I replace r.end() with a local variable, it gives these reasons:
parallel_for.h(89): (col. 20) remark: loop was not vectorized: existence of vector dependence.
parallel_for.h(89): (col. 20) remark: vector dependence: assumed FLOW dependence between (unknown) line 89 and this line 89.
parallel_for.h(89): (col. 20) remark: vector dependence: assumed ANTI dependence between this line 89 and (unknown) line 89.
As you may have noticed, line 89 is where the call to operator() on the body object is made. I have no idea why the compiler pointed to this line; my best guess is that the remarks really apply to the actual loop over the blocked_range in your code.
If that loop consists of exactly one line, as you wrote in the first post, then the compiler probably assumes conservatively that the arrays you operate on could overlap (i.e., it cannot prove that they do not). Assuming you use the Intel compiler, I suggest you look at its documentation on vectorization. Let me quote just one sentence that might be relevant:
"For example, a common problem with global pointers is that they often prevent the compiler from being able to prove that two memory references refer to distinct locations. Consequently, this prevents certain reordering transformations."
In particular, #pragma ivdep can be used to tell the compiler to disregard assumed dependencies if you know for sure they are imaginary.
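A minimal sketch of the #pragma ivdep suggestion (the function name is illustrative; the pragma is honored by the Intel compiler and ignored with a warning by most others): the pragma applies to the loop immediately following it and instructs the vectorizer to drop assumed, unproven dependencies.

```cpp
// Sketch, assuming the Intel compiler: #pragma ivdep tells the vectorizer
// to ignore *assumed* (unproven) dependencies in the loop that follows.
// Only use it if you know the two buffers can never overlap; proven
// dependencies are still respected.
void halve_ivdep(const float* a, float* b, int begin, int end_) {
    #pragma ivdep
    for (int i = begin; i < end_; ++i)
        b[i] += a[i] / 2.0f;
}
```

Note the pragma removes only *assumed* dependencies; it is weaker (and safer) than restrict-qualifying the pointers, which is an outright promise of no aliasing.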
Quoting - Poiuyt
parallel_for.h(89): (col. 20) remark: loop was not vectorized: existence of vector dependence.
parallel_for.h(89): (col. 20) remark: vector dependence: assumed FLOW dependence between (unknown) line 89 and this line 89.
parallel_for.h(89): (col. 20) remark: vector dependence: assumed ANTI dependence between this line 89 and (unknown) line 89.
You may try the __restrict/restrict keyword, if your compiler supports it, e.g.:
float* __restrict B = BufB;
float* __restrict A = BufA;
int end = r.end();
for(i = r.begin(); i < end; i++)
    B[i] += A[i] / 2.0f;
This tells the compiler that BufA and BufB do not overlap, so there are no dependencies.
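The restrict fix above can be expanded into a self-contained kernel (the function name and the offset-by-begin layout are illustrative, not from the thread): __restrict promises the compiler the two buffers never alias, which removes the assumed FLOW/ANTI dependencies reported earlier.

```cpp
#include <cstddef>

// Sketch of the __restrict fix: the qualifier asserts that, for the lifetime
// of these pointers, the memory they reach is accessed only through them,
// so the stores to b cannot affect the loads from a.
void halve_range(float* BufB, const float* BufA,
                 std::size_t begin, std::size_t end_) {
    float* __restrict b = BufB + begin;
    const float* __restrict a = BufA + begin;
    const std::size_t n = end_ - begin;  // hoisted, loop-invariant bound
    for (std::size_t i = 0; i < n; ++i)
        b[i] += a[i] / 2.0f;
}
```

__restrict is a compiler extension spelled the same way in the Intel, GCC, Clang, and MSVC compilers; C99 spells it restrict. If the buffers can in fact overlap, this promise makes the program's behavior undefined, so it belongs only where non-aliasing is guaranteed by construction.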
Quoting - Dmitriy Vyukov
You may try the __restrict/restrict keyword, if your compiler supports it, e.g.:
float* __restrict B = BufB;
float* __restrict A = BufA;
int end = r.end();
for(i = r.begin(); i < end; i++)
    B[i] += A[i] / 2.0f;
This tells the compiler that BufA and BufB do not overlap, so there are no dependencies.
I am using the latest version of the Intel Compiler.
The __restrict keyword worked and I got my performance back. Thanks!