Intel® oneAPI Threading Building Blocks

Performance issue with compact form of parallel_for

Anastopoulos__Nikos
Hi,
I am running a simple parallel for over an array of floats, implemented both as a range-based parallel for and as a compact one that loops over a consecutive range of integers, i.e.:
[bash]   tbb::parallel_for( 
            tbb::blocked_range<size_t>(0,n), 
            [=](const tbb::blocked_range<size_t>& r) {
                for ( size_t i = r.begin(); i != r.end(); ++i )
                    Foo(a[i]);
            } 
            );
[/bash]
and
[bash]   tbb::parallel_for( size_t(0), n, 
            [=](size_t i) { 
            Foo(a[i]); 
         } ); 
[/bash]
The second version always performs worse than the first, usually by a factor of 2x or more. Any idea why that is happening?
5 Replies
SergeyKostrov
Valued Contributor II
What are these?

...
[=](consttbb::blocked_rangelt;size_tgt;&r){
...

and

...
[=](size_ti){
...

Is that a result of some incorrect "Copy-and-Paste" operation?
Anastopoulos__Nikos
I guess it's just some wrong conversion of less-than/greater-than symbols from within the code block to the corresponding HTML keywords.
Alexey-Kukanov
Employee
Most likely, the compiler is able to vectorize the inner (serial) loop in the first case, but cannot vectorize its equivalent in TBB internals in the second case. Something for us TBB developers to look at.
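To make the point above concrete, here is a minimal sketch of how a compact-form call can be layered on top of a range-based body. It is not the TBB source: `RangeSketch`, `CompactBodySketch`, and `compact_for_sketch` are hypothetical stand-ins, the member names `my_func` and `my_begin` are assumptions, and the driver runs serially. The idea it illustrates is only that, in the compact form, the serial inner loop lives inside the library adapter and calls the user functor once per index, so the compiler has to see through that call before it can vectorize.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for tbb::blocked_range<Index>: the half-open interval [begin, end).
template <typename Index>
class RangeSketch {
    Index my_first, my_last;
public:
    RangeSketch(Index first, Index last) : my_first(first), my_last(last) {}
    Index begin() const { return my_first; }
    Index end() const { return my_last; }
};

// Hypothetical adapter: turns a per-index functor into a range body.
// Member names other than my_step are assumptions made for this sketch.
template <typename Function, typename Index>
class CompactBodySketch {
    const Function& my_func;
    const Index my_begin;
    const Index my_step;
public:
    CompactBodySketch(const Function& func, Index begin, Index step)
        : my_func(func), my_begin(begin), my_step(step) {
        assert(step > 0);
    }

    // The serial inner loop now sits here, not in user code:
    // r.end() and my_step are re-read on every iteration.
    void operator()(const RangeSketch<Index>& r) const {
        for (Index i = r.begin(), k = my_begin + i * my_step; i < r.end(); ++i, k = k + my_step)
            my_func(k);
    }
};

// Serial driver for illustration only; a real parallel_for would split the range across threads.
template <typename Function>
void compact_for_sketch(std::size_t first, std::size_t last, const Function& f) {
    if (last <= first) return;
    CompactBodySketch<Function, std::size_t> body(f, first, 1);
    body(RangeSketch<std::size_t>(0, last - first));
}
```

In the range-based version the same loop is written directly in the lambda, so the loop bounds and the call to `Foo` are all visible in one place.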
Anastopoulos__Nikos
Yes, probably. It turns out that the performance difference is evident when compiling with -O3. When using -O2, the performance is the same for both loop versions (actually, the performance of the "compact" version does not improve at all when going from O2 to O3). Good to know that you are looking at it.
RafSchietekat
Valued Contributor III
You could confirm Alexey's very plausible analysis, and get intermediate relief, by going into include/tbb/parallel_for.h and hoisting r.end() out of the loop in parallel_for_body::operator(), i.e., assign its value to a variable i_end and evaluate i

(Added) Hmm, better also make a local copy of my_step...
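A rough sketch of what that hoisting might look like, written against hypothetical stand-in types rather than the real header (`RangeSketch`, `HoistedBodySketch`, and `strided_for_sketch` are invented for illustration; `my_func` and `my_begin` are assumed member names, and the iteration-count formula is this sketch's own). The loop bound and the step are copied into locals before the loop, so the loop condition and increment no longer depend on member or accessor reads. Whether this actually lets a given compiler vectorize the loop is something to measure, not assume.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for tbb::blocked_range<Index>: the half-open interval [begin, end).
template <typename Index>
class RangeSketch {
    Index my_first, my_last;
public:
    RangeSketch(Index first, Index last) : my_first(first), my_last(last) {}
    Index begin() const { return my_first; }
    Index end() const { return my_last; }
};

// Hypothetical range body with the suggested hoisting applied.
template <typename Function, typename Index>
class HoistedBodySketch {
    const Function& my_func;
    const Index my_begin;
    const Index my_step;
public:
    HoistedBodySketch(const Function& func, Index begin, Index step)
        : my_func(func), my_begin(begin), my_step(step) {
        assert(step > 0);
    }

    void operator()(const RangeSketch<Index>& r) const {
        const Index i_end = r.end();  // loop bound read once, before the loop
        const Index step = my_step;   // local copy of the step
        Index k = my_begin + r.begin() * step;
        for (Index i = r.begin(); i < i_end; ++i, k += step)
            my_func(k);
    }
};

// Serial driver for illustration only: visits first, first+step, ... while below last.
template <typename Function>
void strided_for_sketch(std::size_t first, std::size_t last, std::size_t step, const Function& f) {
    if (last <= first) return;
    const std::size_t count = (last - first + step - 1) / step;  // number of indices visited
    HoistedBodySketch<Function, std::size_t> body(f, first, step);
    body(RangeSketch<std::size_t>(0, count));
}
```

For example, `strided_for_sketch(2, 11, 3, f)` calls `f` with 2, 5, and 8.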