vanhouten777
Beginner
227 Views

What are the differences between parallel_for and parallel_reduce in Intel TBB?

I think parallel_reduce is better when we have to pass data between different grains or threads. Can someone explain how parallel_reduce works? I have pasted example code.
A loop can do reduction, as in this summation:

float SerialSumFoo( float a[], size_t n ) {
    float sum = 0;
    for( size_t i=0; i!=n; ++i )
        sum += Foo(a[i]);
    return sum;
}

If the iterations are independent, you can parallelize this loop using the template class parallel_reduce as follows:

float ParallelSumFoo( const float a[], size_t n ) {
    SumFoo sf(a);
    parallel_reduce( blocked_range<size_t>(0,n), sf );
    return sf.my_sum;
}

(Footnote: prior to Intel TBB 2.2, the default was simple_partitioner. Compile with TBB_DEPRECATED=1 to get the old default.)
The class SumFoo specifies details of the reduction, such as how to accumulate subsums and combine them. Here is the definition of class SumFoo:

class SumFoo {
    float* my_a;
public:
    float my_sum;
    void operator()( const blocked_range<size_t>& r ) {
        float *a = my_a;
        float sum = my_sum;
        size_t end = r.end();
        for( size_t i=r.begin(); i!=end; ++i )
            sum += Foo(a[i]);
        my_sum = sum;
    }
    SumFoo( SumFoo& x, split ) : my_a(x.my_a), my_sum(0) {}
    void join( const SumFoo& y ) { my_sum += y.my_sum; }
    SumFoo( float a[] ) : my_a(a), my_sum(0) {}
};
Note the differences from class ApplyFoo in Section 3.2. First, operator() is not
const. This is because it must update SumFoo::my_sum. Second, SumFoo has a
splitting constructor and a method join that must be present for parallel_reduce to
work. The splitting constructor takes as arguments a reference to the original object,
and a dummy argument of type split, which is defined by the library. The dummy
argument distinguishes the splitting constructor from a copy constructor.
TIP: In the example, the definition of operator() uses local temporary variables (a, sum,
end) for scalar values accessed inside the loop. This technique can improve
performance by making it obvious to the compiler that the values can be held in
registers instead of memory. If the values are too large to fit in registers, or have
their address taken in a way the compiler cannot track, the technique might not help.
With a typical optimizing compiler, using local temporaries only for variables that are
written (such as sum in the example) can suffice, because then the compiler can deduce that
the loop does not write to any of the other locations, and hoist the other reads to
outside the loop.
When a worker thread is available, as decided by the task scheduler,
parallel_reduce invokes the splitting constructor to create a subtask for the worker.
When the subtask completes, parallel_reduce uses method join to accumulate the
result of the subtask. The graph at the top of Figure 5 shows the split-join sequence
that happens when a worker is available:
Figure 5: Graph of the Split-join Sequence
An arc in Figure 5 indicates order in time. The splitting constructor might run
concurrently while object x is being used for the first half of the reduction. Therefore,
all actions of the splitting constructor that creates y must be made thread-safe with
respect to x. So if the splitting constructor needs to increment a reference count
shared with other objects, it should use an atomic increment.
If a worker is not available, the second half of the iteration space is reduced using the same
body object that reduced the first half. That is, the reduction of the second half starts
where the reduction of the first half finished.
CAUTION: Because split/join are not used if workers are unavailable, parallel_reduce does not
necessarily do recursive splitting.
CAUTION: Because the same body might be used to accumulate multiple subranges, it is critical
that operator() not discard earlier accumulations. The code below shows an incorrect
definition of SumFoo::operator().
class SumFoo {
    ...
public:
    float my_sum;
    void operator()( const blocked_range<size_t>& r ) {
        ...
        float sum = 0; // WRONG: should be "sum = my_sum".
        ...
        for( ... )
            sum += Foo(a[i]);
        my_sum = sum;
    }
    ...
};
With the mistake, the body returns the partial sum for only the last subrange, instead of
the sum over all subranges to which parallel_reduce applies it.
The rules for partitioners and grain sizes for parallel_reduce are the same as for
parallel_for.
parallel_reduce generalizes to any associative operation. In general, the splitting
constructor does two things:
1. Copy read-only information necessary to run the loop body.
2. Initialize the reduction variable(s) to the identity element of the operation(s).
The join method should do the corresponding merge(s). You can do more than one
reduction at the same time: you can gather the min and max with a single
parallel_reduce.
NOTE: The reduction operation can be non-commutative. The example still works if floating-point
addition is replaced by string concatenation.
22 Replies
RafSchietekat
Black Belt

"Can someone explain how parallel_reduce work?"
Please study the Tutorial and Reference Manual, and then ask any remaining specific questions.
vanhouten777
Beginner

Where is the grain loop?

Where is the within-grain loop?
RafSchietekat
Black Belt

The body's operator(), as defined by you, processes a "chunk" (not "grain"), typically using a normal for loop based on the subrange's begin() and end() values.

Does that address your question, or what else did you intend?
vanhouten777
Beginner

void operator()( const blocked_range<size_t>& r ) {
    float *a = my_a;
    float sum = my_sum;
    size_t end = r.end();
    for( size_t i=r.begin(); i!=end; ++i )
        sum += Foo(a[i]);
    my_sum = sum;
}
The above code is executed by each thread, isn't it?
How do we differentiate between the actual loop iterations and the grainsize?
What do r.begin() and r.end() hold?
Regards,
Van Houten.
RafSchietekat
Black Belt

Each thread will typically process multiple chunks, as a strategy for workload distribution. See Tutorial "Automatic Chunking" and "Controlling Chunking" for background about how partitioner and grainsize both affect chunk size (for parallel_reduce as well as parallel_for). r.begin() and r.end() delineate the subrange describing the chunk.
vanhouten777
Beginner

Sir,
Can you give an example of delineation?
Regards,
Van Houten.
RafSchietekat
Black Belt

At the risk of getting into a circular definition for a generic English word that doesn't even appear in the documentation: I just meant the begin/end values (for a one-dimensional range anyway), e.g., 3 and 7 delineate the range of integers i such that 3<=i<7 (inclusive/exclusive considered implied). Please have a look at the Reference Manual, which documents the "Range Concept" in the "Algorithms" section.

You could also get a copy of the "Intel Threading Building Blocks" book and read it at your leisure.
vanhouten777
Beginner

Sir,
How do we differentiate between the number of grains and the number of loop iterations?
What is n? Is n = grainsize, or n = number of grains * grainsize?
Regards,
Van Houten.
RafSchietekat
Black Belt

It's "chunks", not "grains". I'm not too fond of using the word "iterations" for parallel execution, but it's the chunks that are eventually processed. n seems to be the end of the original range in this example, that happens to begin at 0. If the range begins at m, n-m would be the range length, so in this case it's equal to n. Chunks can have widely different lengths (especially with auto_partitioner), of at least grainsize/2.0, unless the original range length is already less than that, of course. Obviously no simple multiplicative relationship applies, even with simple_partitioner.
vanhouten777
Beginner

Sir,
I think chunksize and grainsize is the same, only difference is grainsize is a value we might give and chunksize is a value which the Intel TBB partitioner chooses between grainsize/2 and grainsize.
Regards,
Van Houten
vanhouten777
Beginner

I think n is the total number of iterations and r.begin() and r.end() are the subranges for each chunk.
Regards,
Van Houten.
RafSchietekat
Black Belt

"I think chunksize and grainsize is the same, only difference is grainsize is a value we might give and chunksize is a value which the Intel TBB partitioner chooses between grainsize/2 and grainsize."
Chunk size may exceed grainsize for other partitioners than simple_partitioner.

"I think n is the total number of iterations and r.begin() and r.end() are the subranges for each chunk."
n is the end of the input range, and only equal to the number of iterations because in this example the input range happens to begin at 0. r is the subrange.
vanhouten777
Beginner

Sir,
I checked the documentation of parallel_reduce; in it, the iteration space is divided into two halves,
which are handled by two threads. Is each resulting iteration space then divided into two halves again? Can someone explain this procedure?
regards,
Van Houten.
RafSchietekat
Black Belt

I'm afraid I don't fully understand the question. The input range is dynamically divided into chunks based on partitioner, grainsize and an element of chance based on stealing. This also affects the body splits and merges. Adjacent chunks are typically executed on the same thread using the same body anyway because they were often created just to provide enough "parallel slack" for workload balancing. Just look at the documented guarantees and respect the requirements, and let TBB take care of the rest.
vanhouten777
Beginner

Sir,
Is it like this: does Intel TBB divide each chunk into two iteration spaces for two threads for hyper-threading,
one for the main thread and one for a worker thread, since each chunk is work for one hardware thread or core?
Regards,
Van Houten.
vanhouten777
Beginner

Sir,
So you are saying chunks 1 and 2 will be given to thread 1,
chunks 3 and 4 will be given to thread 2,
etc.
Is chunk same as task?
Regards,
Van Houten.
vanhouten777
Beginner

Sir,
Does dividing the iteration space into two halves mean giving two adjoining chunks to the same thread?
Regards,
Van Houten.
RafSchietekat
Black Belt

A chunk is a final subrange processed in Body::operator(). TBB is currently agnostic about where threads run, so there's nothing special about hyperthreads. Chunks may be given (affinity_partitioner), taken (stealing) or processed locally. Tasks may be recycled to process more than one chunk, and I suspect that some algorithms do that (I would have to check the code). Adjacent ranges are likely to be executed on the same thread, but obviously not always or there would be no concurrency.
RafSchietekat
Black Belt

"Tasks may be recycled to process more than one chunk, and I suspect that some algorithms do that (I would have to check the code)."
Oh yes, task reuse to process more than one chunk seems to tend to occur through the scheduler bypass mechanism, which is even more efficient than recycling, as decided by the partitioner. Tasks also get recycled, often together with use of a continuation, but after reminding myself of the main mechanism I didn't stick around to analyse those uses at this time.

However, you don't need to know any of that to use the packaged algorithms, and often you can write your programs without exposing yourself to tasks... Just look at the documented requirements and guarantees, and let TBB take care of the rest.
vanhouten777
Beginner

Sir,
I think each chunk is divided into two equal subranges, but these subranges are not subdivided again, as that would defeat the purpose or cause repetition.
Regards,
Van Houten.