Intel® oneAPI Threading Building Blocks

Performance issue with compact form of parallel_for

Anastopoulos__Nikos
166 Views
Hi,
I am running a simple parallel for over an array of floats, implemented both as a range-based parallel for and as a compact one that loops over a consecutive range of integers, i.e.:
[bash]   tbb::parallel_for( 
            tbb::blocked_range<size_t>(0,n), 
            [=](const tbb::blocked_range<size_t>& r) {
                for ( size_t i = r.begin(); i != r.end(); ++i )
                    Foo(a[i]);
            } 
            );
[/bash]
and
[bash]   tbb::parallel_for( size_t(0), n, 
            [=](size_t i) { 
            Foo(a[i]); 
         } ); 
[/bash]
The second version always performs worse than the first, usually by a factor of 2x or more. Any idea why that is happening?
5 Replies
SergeyKostrov
Valued Contributor II
What are these?

...
[=](consttbb::blocked_rangelt;size_tgt;&r){
...

and

...
[=](size_ti){
...

Is that a result of some incorrect "Copy-and-Paste" operation?
Anastopoulos__Nikos
I guess it's just a bad conversion of the less-than/greater-than symbols inside the code block to the corresponding HTML entities.
Alexey-Kukanov
Employee
Most likely, the compiler is able to vectorize the inner (serial) loop in the first case, but cannot vectorize its equivalent in TBB internals in the second case. Something for us TBB developers to look at.
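To make the difference concrete, here is a minimal plain-C++ sketch of the two shapes being compared. It does not use TBB and is not TBB's source: IndexRange, PerIndexAdapter, scale_range and the run_* helpers are hypothetical names made up for illustration, and the doubling of each element stands in for Foo. The point is only that in the range form the serial loop is written by the user around the work itself, while in the compact form the loop sits in an adapter and reaches the work through a functor reference.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a blocked range of indices (not the TBB class).
struct IndexRange {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Hypothetical model of the adapter a compact, per-index loop needs:
// it turns a per-index functor into a per-range body.
template <typename Function>
struct PerIndexAdapter {
    const Function& func;
    void operator()(const IndexRange& r) const {
        // The serial loop lives here, not in user code, and the work
        // is reached through a reference to the user's functor.
        for (std::size_t i = r.begin(); i < r.end(); ++i)
            func(i);
    }
};

// Range form: the user writes the serial loop directly over the data.
inline void scale_range(float* a, const IndexRange& r) {
    for (std::size_t i = r.begin(); i != r.end(); ++i)
        a[i] *= 2.0f;
}

inline std::vector<float> run_range_form(std::size_t n) {
    std::vector<float> a(n, 1.0f);
    scale_range(a.data(), IndexRange{0, n});
    return a;
}

// Compact form: the user supplies only the per-index work.
inline std::vector<float> run_compact_form(std::size_t n) {
    std::vector<float> a(n, 1.0f);
    float* p = a.data();
    auto per_index = [=](std::size_t i) { p[i] *= 2.0f; };
    PerIndexAdapter<decltype(per_index)> body{per_index};
    body(IndexRange{0, n});
    return a;
}
```

Both helpers compute the same result; whether a given compiler vectorizes one loop and not the other depends on the compiler and flags, so this only shows where the loops live, not how fast they run.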
Anastopoulos__Nikos
Yes, probably. It turns out that the performance difference is evident when compiling with -O3. When using -O2, the performance is the same for both loop versions (actually, the performance of the "compact" version does not improve at all when going from -O2 to -O3). Good to know that you are looking at it.
RafSchietekat
Black Belt
You could confirm Alexey's very plausible analysis, and get intermediate relief, by going into include/tbb/parallel_for.h and hoisting r.end() out of the loop in parallel_for_body::operator(), i.e., assign its value to a variable i_end and evaluate i

(Added) Hmm, better also make a local copy of my_step...
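A rough sketch of what that kind of hoisting could look like, in plain C++ without TBB. This is an assumed shape, not the contents of parallel_for.h: SteppedAdapter, IndexRange and fill_every_other are hypothetical, and only the names my_step and i_end are taken from the post above. The loop bound and the step are copied into locals before the loop, so the loop condition and increment no longer go through member accesses.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a blocked range of iteration numbers.
struct IndexRange {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Hypothetical per-index adapter with a step (assumed layout, for illustration).
template <typename Function>
class SteppedAdapter {
    const Function& my_func;
    const std::size_t my_begin;
    const std::size_t my_step;
public:
    SteppedAdapter(const Function& f, std::size_t begin, std::size_t step)
        : my_func(f), my_begin(begin), my_step(step) {}

    void operator()(const IndexRange& r) const {
        const std::size_t i_end = r.end();  // hoisted loop bound
        const std::size_t step = my_step;   // local copy of the step
        std::size_t k = my_begin + r.begin() * step;
        for (std::size_t i = r.begin(); i < i_end; ++i, k += step)
            my_func(k);
    }
};

// Usage: mark every other element of an n-element array, starting at 0.
inline std::vector<float> fill_every_other(std::size_t n) {
    std::vector<float> a(n, 0.0f);
    float* p = a.data();
    auto mark = [=](std::size_t k) { p[k] = 1.0f; };
    const std::size_t iterations = (n + 1) / 2;  // stepped indices in [0, n)
    SteppedAdapter<decltype(mark)> body(mark, 0, 2);
    body(IndexRange{0, iterations});
    return a;
}
```

Whether the locals actually let the compiler vectorize the loop would need to be checked with the compiler's vectorization report at -O3.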