Intel® oneAPI Threading Building Blocks

What are the differences between parallel_for and parallel_reduce in Intel TBB?

vanhouten777
Beginner
I think parallel_reduce is better when we have to pass data between different grains or threads. Can someone explain how parallel_reduce works? I have pasted example code below.
A loop can do reduction, as in this summation:
float SerialSumFoo( float a[], size_t n ) {
    float sum = 0;
    for( size_t i=0; i!=n; ++i )
        sum += Foo(a[i]);
    return sum;
}
If the iterations are independent, you can parallelize this loop using the template class
parallel_reduce as follows:
float ParallelSumFoo( const float a[], size_t n ) {
    SumFoo sf(a);
    parallel_reduce( blocked_range<size_t>(0,n), sf );
    return sf.my_sum;
}
The class SumFoo specifies details of the reduction, such as how to accumulate
subsums and combine them. Here is the definition of class SumFoo:
class SumFoo {
    const float* my_a;
public:
    float my_sum;
    void operator()( const blocked_range<size_t>& r ) {
        const float* a = my_a;
        float sum = my_sum;
        size_t end = r.end();
        for( size_t i=r.begin(); i!=end; ++i )
            sum += Foo(a[i]);
        my_sum = sum;
    }
    SumFoo( SumFoo& x, split ) : my_a(x.my_a), my_sum(0) {}
    void join( const SumFoo& y ) { my_sum += y.my_sum; }
    SumFoo( const float a[] ) : my_a(a), my_sum(0) {}
};
Note the differences with class ApplyFoo from Section 3.2. First, operator() is not
const. This is because it must update SumFoo::my_sum. Second, SumFoo has a
splitting constructor and a method join that must be present for parallel_reduce to
work. The splitting constructor takes as arguments a reference to the original object,
and a dummy argument of type split, which is defined by the library. The dummy
argument distinguishes the splitting constructor from a copy constructor.
TIP: In the example, the definition of operator() uses local temporary variables (a, sum,
end) for scalar values accessed inside the loop. This technique can improve
performance by making it obvious to the compiler that the values can be held in
registers instead of memory. If the values are too large to fit in registers, or have
their address taken in a way the compiler cannot track, the technique might not help.
With a typical optimizing compiler, using local temporaries for only written variables
(such as sum in the example) can suffice, because then the compiler can deduce that
the loop does not write to any of the other locations, and hoist the other reads to
outside the loop.
When a worker thread is available, as decided by the task scheduler,
parallel_reduce invokes the splitting constructor to create a subtask for the worker.
When the subtask completes, parallel_reduce uses method join to accumulate the
result of the subtask. The graph at the top of Figure 5 shows the split-join sequence
that happens when a worker is available:
Figure 5: Graph of the Split-join Sequence
An arc in Figure 5 indicates order in time. The splitting constructor might run
concurrently while object x is being used for the first half of the reduction. Therefore,
all actions of the splitting constructor that creates y must be made thread safe with
respect to x. So if the splitting constructor needs to increment a reference count
shared with other objects, it should use an atomic increment.
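For illustration only (this is not from the tutorial; SharedTable and CountedBody are hypothetical names), such a splitting constructor might look like this:

#include <atomic>
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
using namespace tbb;

// Hypothetical resource shared by all body objects of one reduction.
struct SharedTable {
    std::atomic<int> refs;
    // ... read-only data used by the loop body ...
};

class CountedBody {
    SharedTable* my_table;
public:
    float my_sum;
    // Splitting constructor: x may be executing its operator() concurrently,
    // so the shared reference count must be updated atomically.
    CountedBody( CountedBody& x, split ) : my_table(x.my_table), my_sum(0) {
        my_table->refs.fetch_add( 1, std::memory_order_relaxed );
    }
    void operator()( const blocked_range<size_t>& ) { /* ... accumulate into my_sum ... */ }
    void join( const CountedBody& y ) { my_sum += y.my_sum; }
    CountedBody( SharedTable* t ) : my_table(t), my_sum(0) {}
};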
If a worker is not available, the second half of the iteration is reduced using the same
body object that reduced the first half. That is, the reduction of the second half starts
where the reduction of the first half finished.
CAUTION: Because split/join are not used if workers are unavailable, parallel_reduce does not
necessarily do recursive splitting.
CAUTION: Because the same body might be used to accumulate multiple subranges, it is critical
that operator() not discard earlier accumulations. The code below shows an incorrect
definition of SumFoo::operator().
class SumFoo {
    ...
public:
    float my_sum;
    void operator()( const blocked_range<size_t>& r ) {
        ...
        float sum = 0;    // WRONG: should be "sum = my_sum".
        ...
        for( ... )
            sum += Foo(a[i]);
        my_sum = sum;
    }
    ...
};
With this mistake, the body returns the partial sum for only the last subrange, instead
of the sum over all subranges to which parallel_reduce applies it.
The rules for partitioners and grain sizes for parallel_reduce are the same as for
parallel_for. (Prior to Intel TBB 2.2, the default partitioner was simple_partitioner;
compile with TBB_DEPRECATED=1 to get the old default.)
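For example, reusing the SumFoo body sf and range size n from above (the grain size of 1000 is only an illustration):

// Explicit grain size via the blocked_range, with simple_partitioner:
parallel_reduce( blocked_range<size_t>(0,n,1000), sf, simple_partitioner() );

// Or request auto_partitioner explicitly (the default since TBB 2.2):
parallel_reduce( blocked_range<size_t>(0,n), sf, auto_partitioner() );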
parallel_reduce generalizes to any associative operation. In general, the splitting
constructor does two things:
Copy read-only information necessary to run the loop body.
Initialize the reduction variable(s) to the identity element of the operation(s).
The join method should do the corresponding merge(s). You can do more than one
reduction at the same time: you can gather the min and max with a single
parallel_reduce.
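For instance, a body that tracks both bounds might look like the following sketch (MinMaxFoo is not from the tutorial; it uses <limits> for the identity values of min and max):

#include <limits>
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
using namespace tbb;

class MinMaxFoo {
    const float* my_a;
public:
    float my_min, my_max;
    void operator()( const blocked_range<size_t>& r ) {
        for( size_t i=r.begin(); i!=r.end(); ++i ) {
            float v = my_a[i];
            if( v < my_min ) my_min = v;
            if( v > my_max ) my_max = v;
        }
    }
    // Splitting constructor: each half starts from the identity of min and max.
    MinMaxFoo( MinMaxFoo& x, split ) : my_a(x.my_a),
        my_min( std::numeric_limits<float>::max()),
        my_max(-std::numeric_limits<float>::max()) {}
    // Merge the partial results of the two halves.
    void join( const MinMaxFoo& y ) {
        if( y.my_min < my_min ) my_min = y.my_min;
        if( y.my_max > my_max ) my_max = y.my_max;
    }
    MinMaxFoo( const float a[] ) : my_a(a),
        my_min( std::numeric_limits<float>::max()),
        my_max(-std::numeric_limits<float>::max()) {}
};

Usage is the same as with SumFoo: construct a MinMaxFoo over the array, pass it to parallel_reduce with a blocked_range, and read my_min and my_max afterwards.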
NOTE: The reduction operation can be non-commutative. The example still works if
floating-point addition is replaced by string concatenation.
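For reference, the snippets above can be assembled into a small complete program roughly like this (a sketch; the Foo definition and the data in main are just stand-ins so it compiles and runs):

#include <cstddef>
#include <cstdio>
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
using namespace tbb;

// Stand-in for the Foo used in the tutorial text; any pure function of one float works.
inline float Foo( float x ) { return x * x; }

class SumFoo {
    const float* my_a;
public:
    float my_sum;
    void operator()( const blocked_range<size_t>& r ) {
        const float* a = my_a;
        float sum = my_sum;
        size_t end = r.end();
        for( size_t i=r.begin(); i!=end; ++i )
            sum += Foo(a[i]);
        my_sum = sum;
    }
    SumFoo( SumFoo& x, split ) : my_a(x.my_a), my_sum(0) {}
    void join( const SumFoo& y ) { my_sum += y.my_sum; }
    SumFoo( const float a[] ) : my_a(a), my_sum(0) {}
};

float ParallelSumFoo( const float a[], size_t n ) {
    SumFoo sf(a);
    parallel_reduce( blocked_range<size_t>(0,n), sf );
    return sf.my_sum;
}

int main() {
    float data[1000];
    for( int i=0; i<1000; ++i ) data[i] = float(i);
    std::printf( "sum = %g\n", ParallelSumFoo( data, 1000 ) );
    return 0;
}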
22 Replies
RafSchietekat
Valued Contributor III
Please see the documentation and answers already given.
robert-reed
Valued Contributor II
I agree with Raf's recommendation that you take another look at the documentation, and particularly the tutorial manual. The questions you pose seem to suggest a lack of basic understanding of the process. Recursive subdivision is exactly what takes place in both parallel_for and parallel_reduce: the range is divided up into enough pieces that all the available HW threads (be they hyper-threads or multiple cores) can get a piece of the action. The first thread in starts the division process and continues dividing down one branch, using one of several partitioning strategies to decide when to stop, and leaving behind chunks of subrange of varying size as it goes. Those chunks are up for being stolen by the other HW threads, which continue the subdivision process on their own branches. So check out the manual and get a good understanding of the basic parallelism strategies; that will hopefully provide some answers.
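For contrast with SumFoo above, here is a hedged sketch of a parallel_for body along the lines of the tutorial's ApplyFoo (names are illustrative, and Foo is assumed to be some function of one float defined elsewhere). Since nothing has to be merged afterwards, operator() is const and there is no splitting constructor or join:

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
using namespace tbb;

float Foo( float x );   // assumed to be defined elsewhere

class ApplyFoo {
    float* my_a;
public:
    // Each subrange is processed independently; nothing is merged afterwards,
    // so operator() can be const and no splitting constructor or join is needed.
    void operator()( const blocked_range<size_t>& r ) const {
        for( size_t i=r.begin(); i!=r.end(); ++i )
            my_a[i] = Foo(my_a[i]);
    }
    ApplyFoo( float a[] ) : my_a(a) {}
};

void ParallelApplyFoo( float a[], size_t n ) {
    parallel_for( blocked_range<size_t>(0,n), ApplyFoo(a) );
}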