Performance issue with compact form of parallel_for

Hi,
I am running a simple parallel for over an array of floats, implemented both as a range-based parallel for and as a compact one that loops over a consecutive range of integers, i.e.:
[bash]   tbb::parallel_for(
            tbb::blocked_range<size_t>(0,n),
            [=](const tbb::blocked_range<size_t>& r) {
                for ( size_t i = r.begin(); i != r.end(); ++i )
                    Foo(a[i]);
            }
            );
[/bash]
and
[bash]   tbb::parallel_for( size_t(0), n,
            [=](size_t i) {
                Foo(a[i]);
            } );
[/bash]
The second version always performs worse than the first, usually by a factor of 2 or more. Any idea why that is happening?
5 Replies
SergeyKostrov
Valued Contributor II
36 Views

What are these?

...
[=](const tbb::blocked_range&lt;size_t&gt;& r) {
...

and

...
[=](size_ti){
...

Is that a result of some incorrect "Copy-and-Paste" operation?
36 Views

I guess it's just some wrong conversion of less-than/greater-than symbols from within the code block to the corresponding HTML keywords.
Alexey_K_Intel3
Employee
36 Views

Most likely, the compiler is able to vectorize the inner (serial) loop in the first case, but cannot vectorize its equivalent in TBB internals in the second case. Something for us TBB developers to look at.
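To make the two loop shapes concrete, here is a minimal sketch that does not use TBB at all. `Range`, `IndexAdapter`, `range_form`, `compact_form`, `forms_agree`, and the body of `Foo` are hypothetical stand-ins, not TBB's actual types or code. The only point is where the hot loop lives: in the range form it is written directly in user code, while in the compact form it sits inside a library-side adapter that calls the user's functor once per index. The sketch does not by itself reproduce the slowdown; whether either loop gets vectorized depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for tbb::blocked_range<size_t>.
struct Range {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Hypothetical per-element work; the thread does not show Foo's body.
inline void Foo(float& x) { x = x * 2.0f + 1.0f; }

// Range form: the serial loop over the sub-range is in user code.
inline void range_form(std::vector<float>& a, const Range& r) {
    for (std::size_t i = r.begin(); i != r.end(); ++i)
        Foo(a[i]);
}

// Compact form: the loop is assumed to live in an adapter like this one,
// which reads its bound and step through members on every iteration.
template <typename Function>
struct IndexAdapter {
    const Function& func;
    std::size_t step;
    void operator()(const Range& r) const {
        assert(step > 0);  // a zero step would never terminate
        for (std::size_t i = r.begin(); i < r.end(); i += step)
            func(i);
    }
};

template <typename Function>
void compact_form(const Range& r, const Function& f) {
    IndexAdapter<Function> body{f, 1};
    body(r);
}

// Both forms compute the same result; only the loop's location differs.
inline bool forms_agree(std::size_t n) {
    std::vector<float> a(n, 1.0f), b(n, 1.0f);
    range_form(a, Range{0, n});
    compact_form(Range{0, n}, [&b](std::size_t i) { Foo(b[i]); });
    return a == b;
}
```

Comparing the compiler's vectorization report for the two loops (for example at -O2 versus -O3) would be one way to check the explanation above.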
36 Views

Yes, probably. It turns out that the performance difference is evident when compiling with -O3. When using -O2, the performance is the same for both loop versions (actually, the performance of the "compact" version does not improve at all when going from -O2 to -O3). Good to know that you are looking at it.
RafSchietekat
Black Belt
36 Views

You could confirm Alexey's very plausible analysis, and get intermediate relief, by going into include/tbb/parallel_for.h and hoisting r.end() out of the loop in parallel_for_body::operator(), i.e., assign its value to a variable i_end and evaluate i

(Added) Hmm, better also make a local copy of my_step...
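As a rough sketch of what that hoisting could look like, assuming the body class has approximately the shape below. `Range`, `IndexBody`, `run_plain`, `run_hoisted`, and `hoisting_preserves_result` are illustrative stand-ins and not a copy of include/tbb/parallel_for.h; only the names `my_step` and `i_end` come from this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for tbb::blocked_range<size_t>.
struct Range {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Hypothetical simplification of the body class behind the compact form.
template <typename Function>
class IndexBody {
    const Function& my_func;
    const std::size_t my_step;
public:
    IndexBody(const Function& f, std::size_t step) : my_func(f), my_step(step) {
        assert(step > 0);  // a zero step would never terminate
    }

    // Assumed original shape: r.end() and my_step are re-read each iteration.
    void run_plain(const Range& r) const {
        for (std::size_t i = r.begin(); i < r.end(); i += my_step)
            my_func(i);
    }

    // Hoisted shape: bound and step are copied into locals first, so the
    // compiler can see that they do not change inside the loop.
    void run_hoisted(const Range& r) const {
        const std::size_t i_end = r.end();
        const std::size_t step = my_step;
        for (std::size_t i = r.begin(); i < i_end; i += step)
            my_func(i);
    }
};

// Sanity check: hoisting must not change which indices are visited.
inline bool hoisting_preserves_result(std::size_t n, std::size_t step) {
    std::vector<float> a(n, 1.0f), b(n, 1.0f);
    auto fa = [&a](std::size_t i) { a[i] = a[i] * 2.0f + 1.0f; };
    auto fb = [&b](std::size_t i) { b[i] = b[i] * 2.0f + 1.0f; };
    IndexBody<decltype(fa)> plain(fa, step);
    IndexBody<decltype(fb)> hoisted(fb, step);
    plain.run_plain(Range{0, n});
    hoisted.run_hoisted(Range{0, n});
    return a == b;
}
```

The two loops are equivalent in behaviour; any speedup would come only from the compiler optimizing the hoisted form better, which would need to be measured with the same flags as in the original report.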