/*
 * This is a sample implementation of a parallel version of a for_each
 * template.
 */
#ifndef __TBB_for_each_H
#define __TBB_for_each_H

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

namespace tbb {

// Body object: applies the function to every element of a subrange.
template<typename InIter, typename Func>
class ApplyToIterator {
public:
    ApplyToIterator(Func& f) : _function(f) {}

    void operator()(const blocked_range<InIter>& r) const {
        InIter end = r.end();
        for (InIter i = r.begin(); i != end; ++i) {
            _function(*i);
        }
    }

private:
    Func& _function;
};

// Parallel analogue of std::for_each.  Note that blocked_range<InIter>
// requires the iterator type to support <, -, and +, i.e. random access.
template<typename InIter, typename Func>
Func for_each(InIter first, InIter last, Func f) {
    parallel_for(blocked_range<InIter>(first, last),
                 ApplyToIterator<InIter, Func>(f),
                 auto_partitioner());
    return f;
}

} // namespace tbb

#endif
The code looks right. I would call it tbb::parallel_for_each instead of tbb::for_each, because overloading on namespaces invites trouble, both because of argument dependent lookup and misunderstanding by casual readers.
Some optimizers, Intel's and likely others, will appreciate seeing a "#pragma ivdep" on the inner loop. That tells the compiler that the iterations are independent.
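For instance, the body's operator() could carry the hint like this (a sketch; the pragma goes immediately before the loop it describes, and compilers that do not recognize it will simply ignore it):

    void operator()(const blocked_range<InIter>& r) const {
        InIter end = r.end();
        // Assert to the compiler that the iterations are independent (Intel hint).
        #pragma ivdep
        for (InIter i = r.begin(); i != end; ++i) {
            _function(*i);
        }
    }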
A case can be made for using affinity_partitioner instead of auto_partitioner, because some recent timing data seems to indicate that affinity_partitioner sometimes does better, even when cache affinity is not an issue. But not always, and adding a partitioner argument would break the syntactic match with STL.
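For what it's worth, a hypothetical overload taking the partitioner explicitly might look like the sketch below. Note that an affinity_partitioner is stateful and should live across repeated calls over the same data so its replay information can be reused, which is another way it departs from the STL signature.

    template<typename InIter, typename Func>
    Func for_each(InIter first, InIter last, Func f, affinity_partitioner& ap) {
        // The partitioner is passed by reference so its affinity record persists
        // between invocations; callers would keep one per data set.
        parallel_for(blocked_range<InIter>(first, last),
                     ApplyToIterator<InIter, Func>(f),
                     ap);
        return f;
    }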
On the general interface and semantics, I have mixed feelings about including such a function. On the plus side, it syntactically works just like a std::for_each. Indeed the GNU library has a "parallel mode" that has parallel versions of many STL algorithms.
On the minus side, the development version of TBB has a "parallel_do", which has semantics similar to a parallel std::for_each. However, the current "parallel_do" is not as efficient, because it has no grainsize control. The TBB "parallel_do" is already a bit of a Swiss army knife (in both good and bad ways), so stretching it to cover parallel_for_each might be stretching it too far. E.g., the current "parallel_do" acts like it has a grainsize of 1, which is good or bad depending upon the context. I'm leaning towards keeping "parallel_do" as a replacement for "parallel_while", with the grainsize of 1. (And I look forward to having future parallel languages with clever compilers and hardware that eliminate the grainsize hassle. :-)
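To make the comparison concrete, here is roughly how the same apply-to-every-element operation looks with parallel_do as it stands in the development sources (a sketch; note the absence of any range or grainsize parameter, since items are handed to the body one at a time):

    #include "tbb/parallel_do.h"
    #include <vector>

    // Stand-in for the user's function f applied to each element.
    struct ApplyF {
        void operator()(const int& x) const {
            // ... whatever f would do with the element ...
            (void)x;
        }
    };

    void apply_to_all(const std::vector<int>& v) {
        // Same effect as the for_each above, but with an implicit grainsize of 1:
        // each element becomes its own unit of work.
        tbb::parallel_do(v.begin(), v.end(), ApplyF());
    }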
The other minus of a parallel version of std::for_each is that it is not very useful in my experience. Unlike with the sequential std::for_each, the body of the parallel one cannot do much, because the iterations run concurrently. The body of a sequential one is free to bump output iterators, perform output actions, apply side effects to shared objects, etc., and do so deterministically. The body of a parallel version cannot do those things deterministically, and so cannot do much other than operate on the elements of the sequence defined by the iterators. For example, a parallel version of std::for_each cannot do something as simple as copy a sequence element-wise, something the sequential version can easily do via a bumped output iterator.
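A minimal illustration of the kind of body that is fine serially but not in parallel (CopyOut is just an illustrative functor, not anything from TBB):

    #include <algorithm>
    #include <vector>

    // Copies each element through a shared output iterator as a side effect.
    struct CopyOut {
        std::vector<int>::iterator& out;
        CopyOut(std::vector<int>::iterator& o) : out(o) {}
        void operator()(int x) const { *out++ = x; }   // bumps shared state
    };

    void copy_via_for_each(const std::vector<int>& in, std::vector<int>& result) {
        std::vector<int>::iterator o = result.begin();
        std::for_each(in.begin(), in.end(), CopyOut(o));   // deterministic serially
        // Run by a parallel for_each, the increments of 'o' would race and the
        // element order in 'result' would no longer be deterministic.
    }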
My general complaint with GNU parallel mode is that to get any efficiency with common hardware, the sequences for single operations have to be large enough to amortize the parallel overhead. That probably means sequences with hundreds or thousands of elements. But the sequences can't be too large, or the "f o g" problem hits: if you run operation g over a large data set and then operation f over it, the data spills out of cache in between, slowing things down considerably compared to the alternative of applying "f o g" to each element separately, i.e., in one pass instead of two.
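A serial sketch of the two alternatives, with f and g standing in for arbitrary element-wise operations:

    #include <vector>
    #include <cstddef>

    double g(double x) { return x + 1.0; }   // placeholder operations
    double f(double x) { return x * 2.0; }

    // Two passes: every element is written by g, then read again by f.
    // If v is much larger than cache, the second pass re-fetches it from memory.
    void two_pass(std::vector<double>& v) {
        for (std::size_t i = 0; i < v.size(); ++i) v[i] = g(v[i]);
        for (std::size_t i = 0; i < v.size(); ++i) v[i] = f(v[i]);
    }

    // One pass: "f o g" applied per element, so each element is touched once
    // while it is still in cache.
    void one_pass(std::vector<double>& v) {
        for (std::size_t i = 0; i < v.size(); ++i) v[i] = f(g(v[i]));
    }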
That said, I could see adding a convenience layer (like parallel_for_each and parallel_accumulate) to TBB for programmers who might find them useful, or who are in the initial stages of porting a serial STL code.
I think the name "parallel_for_each" sufficiently conveys the departure from the usual STL semantics. STL algorithm semantics are intensively serial. Indeed it's a fascinating research question how to construct something as lovely as STL, but for real parallel programming.
If the pattern is terse and useful, by all means use it. I'm just on the fence on whether it should be included in TBB. But there was a similar request last year, so at least it meets one of my criteria for including a feature: two people have to ask for it.
Doing it as a secondary package sounds like the right way to go for starters. We're always looking for examples of TBB use, so if you write one, we could add a link to it from our examples page. If it proves popular, we could add it into TBB.
One thought crossed my mind, inspired by Backus' FP language. Here's a way to possibly deal with my objection that std::for_each only handles one sequence, and thus dealing with two or more sequences is awkward, particularly in the parallel case, e.g., writing the algorithm for the inner product of two sequences. FP functions always take a single tupled argument. We could do the same for iterators. For example, given two iterators i and j, the paired iterator (i,j) could have the iterator operations defined on it by algebraic relations like this:
*(i,j) <==> (*i,*j)
(i,j)++ <==> (i++,j++)
(i1,j1)==(i2,j2) <==> i1==i2 && j1==j2
etc.
So given two sequences [xbegin,xend) and [ybegin,yend), the two endpoints of the half-open interval described by [(xbegin,ybegin),(xend,yend)) could be passed to parallel_for_each, as in the sketch below.
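A minimal sketch of such a paired iterator, assuming random-access iterators underneath so that the pair also satisfies the blocked_range requirements. The names PairedIterator and make_paired are illustrative, not part of TBB, and dereferencing returns the element values by copy, which is enough for read-only uses like an inner product.

    #include <iterator>
    #include <utility>
    #include <cstddef>

    template<typename Iter1, typename Iter2>
    class PairedIterator {
    public:
        PairedIterator(Iter1 i, Iter2 j) : _i(i), _j(j) {}

        // *(i,j) <==> (*i,*j)
        std::pair<typename std::iterator_traits<Iter1>::value_type,
                  typename std::iterator_traits<Iter2>::value_type>
        operator*() const { return std::make_pair(*_i, *_j); }

        // (i,j)++ <==> (i++,j++)
        PairedIterator& operator++() { ++_i; ++_j; return *this; }

        // (i1,j1)==(i2,j2) <==> i1==i2 && j1==j2
        bool operator==(const PairedIterator& other) const {
            return _i == other._i && _j == other._j;
        }
        bool operator!=(const PairedIterator& other) const { return !(*this == other); }

        // blocked_range additionally needs <, -, and + on its Value type.
        bool operator<(const PairedIterator& other) const { return _i < other._i; }
        std::ptrdiff_t operator-(const PairedIterator& other) const { return _i - other._i; }
        PairedIterator operator+(std::ptrdiff_t n) const {
            return PairedIterator(_i + n, _j + n);
        }

    private:
        Iter1 _i;
        Iter2 _j;
    };

    template<typename Iter1, typename Iter2>
    PairedIterator<Iter1, Iter2> make_paired(Iter1 i, Iter2 j) {
        return PairedIterator<Iter1, Iter2>(i, j);
    }

A call would then look something like parallel_for_each(make_paired(x.begin(), y.begin()), make_paired(x.end(), y.end()), body), where the body receives a pair of elements per iteration.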
Take this to the extreme, and it becomes "Backus' FP in C++".
Backus' Turing Award Lecture is the best introduction to FP.