Hi,
I am running a simple parallel for over an array of floats, implemented both as a range-based parallel for and as a compact one that loops over a consecutive range of integers, i.e.:
[bash]
tbb::parallel_for(
    tbb::blocked_range<size_t>(0, n),
    [=](const tbb::blocked_range<size_t>& r) {
        for ( size_t i = r.begin(); i != r.end(); ++i )
            Foo(a[i]);
    }
);
[/bash]
[bash]
tbb::parallel_for( size_t(0), n, [=](size_t i) { Foo(a[i]); } );
[/bash]
The second version always performs worse than the first, usually by a factor of 2x or more. Any idea why that is happening?
5 Replies
What are these?
...
[=](consttbb::blocked_rangelt;size_tgt;&r){
...
and
...
[=](size_ti){
...
Is that a result of some incorrect "Copy-and-Paste" operation?
I guess it's just a botched conversion of the less-than/greater-than symbols inside the code block to the corresponding HTML entities.
Most likely, the compiler is able to vectorize the inner (serial) loop in the first case, but cannot vectorize its equivalent in TBB internals in the second case. Something for us TBB developers to look at.
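To make the point above concrete, here is a minimal stand-alone sketch (no TBB required) of the two loop shapes the compiler ends up optimizing. `SimpleRange` and `IndexAdapter` are hypothetical stand-ins, not TBB's actual classes; the adapter only approximates how an index-based overload might forward each index to the user's functor, and whether a given compiler vectorizes either shape is something to verify on your own build.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a half-open index range [begin, end).
struct SimpleRange {
    std::size_t b, e;
    std::size_t begin() const { return b; }
    std::size_t end() const { return e; }
};

// Shape 1: the user writes the inner loop, so the compiler sees a plain
// counted loop over contiguous array elements.
inline void scale_range(const SimpleRange& r, std::vector<float>& a) {
    for (std::size_t i = r.begin(); i != r.end(); ++i)
        a[i] *= 2.0f;
}

// Shape 2 (assumed shape of an index-based adapter): the library owns the
// loop, calls r.end() and reads `step` on every iteration, and invokes the
// user's functor once per index.
template <typename F>
struct IndexAdapter {
    const F& func;
    std::size_t first, step;
    void operator()(const SimpleRange& r) const {
        for (std::size_t i = r.begin(); i < r.end(); ++i)
            func(first + i * step);
    }
};

std::vector<float> run_range_form(std::vector<float> a) {
    scale_range(SimpleRange{0, a.size()}, a);
    return a;
}

std::vector<float> run_index_form(std::vector<float> a) {
    auto f = [&a](std::size_t i) { a[i] *= 2.0f; };
    IndexAdapter<decltype(f)> body{f, 0, 1};
    body(SimpleRange{0, a.size()});
    return a;
}
```

Both functions compute the same result; the difference is only in how much the optimizer has to see through before it can treat the second loop like the first.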
Yes, probably. It turns out that the performance difference only shows up when compiling with -O3. With -O2, both loop versions perform the same (in fact, the "compact" version does not improve at all when going from -O2 to -O3). Good to know that you are looking at it.
You could confirm Alexey's very plausible analysis, and get interim relief, by going into include/tbb/parallel_for.h and hoisting r.end() out of the loop in parallel_for_body::operator(), i.e., assign its value to a variable i_end and compare i against i_end in the loop condition.
(Added) Hmm, better also make a local copy of my_step...
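A rough sketch of what that change could look like, written as a stand-alone class so it compiles without TBB. `IndexRange` and `HoistedIndexBody` are hypothetical names, and the members `my_func` and `my_begin` are assumptions modeled on the `my_step` mentioned above; this is not the actual code in parallel_for.h.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a half-open index range [begin, end).
struct IndexRange {
    std::size_t b, e;
    std::size_t begin() const { return b; }
    std::size_t end() const { return e; }
};

// Assumed shape of an index-forwarding body, with the loop bound and the
// step copied into locals before the loop starts.
template <typename F>
class HoistedIndexBody {
    const F& my_func;
    const std::size_t my_begin;
    const std::size_t my_step;
public:
    HoistedIndexBody(const F& f, std::size_t begin, std::size_t step)
        : my_func(f), my_begin(begin), my_step(step) {}

    void operator()(const IndexRange& r) const {
        const std::size_t i_end = r.end();  // hoisted: evaluated once
        const std::size_t step = my_step;   // local copy of the member
        std::size_t k = my_begin + r.begin() * step;
        for (std::size_t i = r.begin(); i < i_end; ++i, k += step)
            my_func(k);
    }
};

// Usage: sum a[first], a[first + step], ... through the hoisted body.
double sum_with_hoisted_body(const std::vector<float>& a,
                             std::size_t first, std::size_t step) {
    double total = 0.0;
    auto add = [&](std::size_t k) { total += a[k]; };
    // Number of indices first, first + step, ... that stay below a.size().
    const std::size_t count =
        first < a.size() ? (a.size() - first + step - 1) / step : 0;
    HoistedIndexBody<decltype(add)> body(add, first, step);
    body(IndexRange{0, count});
    return total;
}
```

With the bound and the step held in locals, the compiler no longer has to prove that the functor call leaves them unchanged, which may be what it needs in order to optimize the loop more aggressively.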