Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

for_each TBB-style

AJ13
New Contributor I
Hello all,

I have been using TBB quite a bit lately, and have written something that might be of use to others, or could be contributed to official TBB. This is a parallel version of the STL for_each template algorithm. This is my first attempt at a contribution to TBB, so I'd really appreciate feedback and help to polish it. I tried to stick with the TBB coding style as much as possible, although I think ApplyToIterator should be in namespace tbb::internal.

It has been suggested that I add a partitioner argument with a default value of auto_partitioner.

Thanks for any feedback, here is the code.

/*
* This is a sample implementation of a parallel version of a for_each
* template.
*/

#ifndef __TBB_for_each_H
#define __TBB_for_each_H

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

namespace tbb {

template <typename InIter, typename Func>
class ApplyToIterator {
public:
    ApplyToIterator(Func& f) : _function(f) {}

    void operator()(const blocked_range<InIter>& r) const {
        InIter end = r.end();
        for (InIter i = r.begin(); i != end; ++i) {
            _function(*i);
        }
    }

private:
    Func& _function;
};

template <typename InIter, typename Func>
Func for_each(InIter first, InIter last, Func f) {
    parallel_for(blocked_range<InIter>(first, last),
                 ApplyToIterator<InIter, Func>(f), auto_partitioner());
    return f;
}





}

#endif
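[Editor's note: for readers without TBB installed, the splitting idea behind the template above can be sketched in plain C++11 threads. `simple_parallel_for_each` and its `nchunks` parameter are invented for illustration; unlike the real `tbb::parallel_for` there is no grainsize control, work stealing, or partitioner, and `f` must be safe to call concurrently.]

```cpp
// A TBB-free sketch of the same splitting idea, using plain C++11 threads.
// simple_parallel_for_each and nchunks are invented names for illustration;
// f must be safe to call concurrently on distinct elements.
#include <algorithm>
#include <cassert>
#include <iterator>
#include <thread>
#include <vector>

template <typename RanIter, typename Func>
Func simple_parallel_for_each(RanIter first, RanIter last, Func f,
                              unsigned nchunks = 4) {
    auto n = std::distance(first, last);
    if (n == 0) return f;
    auto chunk = (n + nchunks - 1) / nchunks;  // ceiling division
    std::vector<std::thread> workers;
    for (RanIter it = first; it != last; ) {
        RanIter stop = it;
        std::advance(stop, std::min<decltype(n)>(chunk, std::distance(it, last)));
        // Each worker applies f to one contiguous, disjoint chunk.
        workers.emplace_back([it, stop, &f] {
            for (RanIter i = it; i != stop; ++i) f(*i);
        });
        it = stop;
    }
    for (auto& w : workers) w.join();
    return f;
}
```

The real implementation delegates exactly this chunking to `blocked_range` and lets the partitioner pick chunk sizes dynamically, which is what the fixed `nchunks` here cannot do.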
ARCH_R_Intel
Employee

The code looks right. I would call it tbb::parallel_for_each instead of tbb::for_each, because overloading on namespaces invites trouble, both because of argument dependent lookup and misunderstanding by casual readers.

Some optimizers, Intel's and likely others, will appreciate seeing a "#pragma ivdep" inside the loop. That tells the compiler that the iterations are independent.
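[Editor's note: roughly where the suggested pragma would sit; `apply_range` is a made-up stand-in for `ApplyToIterator::operator()`. The `#pragma ivdep` spelling is Intel-compiler syntax (GCC spells it `#pragma GCC ivdep`); compilers that do not recognize it simply ignore it.]

```cpp
// Made-up stand-in for ApplyToIterator::operator() showing pragma placement.
#include <cassert>

template <typename InIter, typename Func>
void apply_range(InIter first, InIter last, Func& f) {
#pragma ivdep  // promise the compiler: iterations carry no dependences
    for (InIter i = first; i != last; ++i) {
        f(*i);
    }
}
```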

A case can be made for using affinity_partitioner instead of auto_partitioner, because some recent timing data seems to indicate that affinity_partitioner sometimes does better, even when cache affinity is not an issue. But not always, and adding a partitioner argument would break the syntactic match with STL.

On the general interface and semantics, I have mixed feelings about including such a function. On the plus side, it syntactically works just like a std::for_each. Indeed the GNU library has a "parallel mode" that has parallel versions of many STL algorithms.

On the minus side, the development version of TBB has a "parallel_do", which has semantics similar to a parallel std::for_each. However, the current "parallel_do" is not as efficient, because it has no grainsize control. The TBB "parallel_do" is already a bit of a Swiss army knife (in both good and bad ways), so stretching it to cover parallel_for_each might be stretching it too far. E.g., the current "parallel_do" acts as if it has a grainsize of 1, which is good or bad depending upon the context. I'm leaning towards keeping "parallel_do" as a replacement for "parallel_while", with a grainsize of 1. (And I look forward to future parallel languages with clever compilers and hardware that eliminate the grainsize hassle. :-)

The other minus of a parallel version of std::for_each is that it is not very useful in my experience. Unlike the sequential std::for_each, the body of the parallel one cannot do much, because the iterations are concurrent. The body of a sequential one is free to bump output iterators, perform output actions, do side effects on shared objects, etc., and do so deterministically. The body of a parallel version cannot do those things deterministically, and so cannot do much other than operate on the elements of the sequence defined by the iterators. For example, a parallel version of std::for_each cannot do something as simple as copy a sequence element-wise, something the sequential version can easily do via a bumped output iterator.

My general complaint with GNU parallel mode is that to get any efficiency on common hardware, the sequences for single operations have to be large enough to amortize the parallel overhead. That probably means sequences with hundreds or thousands of elements. But the sequences can't be too large, or the "f o g" problem hits: if you run operation g on a large data set followed by operation f on that data set, it will spill out of cache in the meantime, slowing things down considerably compared to the alternative of doing "f o g" on each element separately, i.e., in one pass instead of two.

That said, I could see adding a convenience layer (like parallel_for_each and parallel_accumulate) to TBB for programmers who might find them useful, or who are in the initial stages of porting serial STL code.

AJ13
New Contributor I
Thank you for your detailed response. I am still learning a lot about TBB, concurrent programming, and C++. I hope to eventually become a regular contributor to TBB, and I appreciate your responses to my questions.

I agree that the name parallel_for_each would be better, the first version I wrote had this name. If the function has semantics close to that of the STL version of for_each, then it could be a simple matter for programmers to "try out" TBB in critical portions of code that already use for_each. Another reason is the expressive nature of the algorithm call.

In my YetiSim project, I frequently have loops that just call functions on containers full of pointers. In this case, a for_each loop can boost performance and express my intentions, but a parallel_for_each loop can do even better. This is a construct that I would find useful in my own code, and it would be pretty easy to implement.
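[Editor's note: the pattern described above looks like this in plain STL. `Entity` and `update()` are hypothetical names, not from YetiSim; swapping `std::for_each` for a `parallel_for_each` only works if `update()` is safe to run concurrently on distinct objects.]

```cpp
// Calling a member function on every pointer in a container.
// Entity and update() are hypothetical names used for illustration.
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

struct Entity {
    int ticks = 0;
    void update() { ++ticks; }
};

void update_all(std::vector<Entity*>& entities) {
    // mem_fn adapts the member function so it can be applied to each pointer.
    std::for_each(entities.begin(), entities.end(),
                  std::mem_fn(&Entity::update));
}
```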

I had not heard of the parallel mode of the GNU compiler. In my opinion, it would be beneficial to implement STL (or STL-like) algorithms with TBB where possible. These algorithms could of course be packaged separately, this would avoid polluting TBB with helper libraries. Perhaps a TBB-extras package? This could form an area where the open source community may also contribute useful tidbits which do not have to be part of TBB itself.
robert-reed
Valued Contributor II
I'm also still learning about some of these topics, having never written any significant STL-based code, so I'm forced to look up details on the web, such as this reference on for_each at the SGI STL site. That is not a standard (guess I'll have to buy myself a copy?), but its description states, for for_each: "Applications are performed in forward order, i.e., from first to last", and the InputIterator requirements don't allow random access, making it difficult to implement a generic splitting constructor. Moreover, related descriptions suggest an InputIterator doesn't even need to be multi-pass. Given these constraints, concern about matching the syntax of std::for_each in any parallel implementation seems premature and possibly misleading. Am I missing something?
AJ13
New Contributor I
Oops. I'm eager to contribute things to TBB which I find useful, but you have found something in the specs that shows this can't be done in an STL-conformant way.

So I suppose there is now a single question, can this template be salvaged into something generally useful? Could it be that the very name parallel_for_each conveys random access to the user? For my own uses, this is a pattern that is generally useful and more terse than writing a parallel_for.

It may be problematic then to express general STL algorithms using TBB for parallel versions. Perhaps a better approach would be to create a separate algorithm header file which contains things like this which are "influenced" by the STL standard, but have different implementations / limitations?
ARCH_R_Intel
Employee

I think the name "parallel_for_each" sufficiently conveys the departure from the usual STL semantics. STL algorithm semantics are intensely serial. Indeed, it's a fascinating research question how to construct something as lovely as STL, but for real parallel programming.

If the pattern is terse and useful, by all means use it. I'm just on the fence on whether it should be included in TBB. But there was a similar request last year, so at least it meets one of my criteria for including a feature: two people have to ask for it.

AJ13
New Contributor I
I think it really belongs in the STL equivalent for TBB. In my opinion, TBB should really be the "building blocks" and things like this which only use TBB constructs should be packaged somewhere else... but available to TBB users.

Would you be supportive of a secondary TBB package which contains these non-core extras? Basically a supporting library with useful tidbits and utilities.
ARCH_R_Intel
Employee

Doing it as a secondary package sounds like the right way to go for starters. We're always looking for examples of TBB use, so if you write one, we could add a link to it from our examples page. If it proves popular, we could fold it into TBB.

One thought crossed my mind, inspired by Backus' FP language. Here's a way to possibly deal with my objection that std::for_each only handles one sequence, and thus dealing with two or more sequences is awkward, particularly in the parallel case. E.g., writing the algorithm for the inner product of two sequences. FP functions always take a single tupled argument. We could do the same for iterators. For example, given two iterators i and j, the paired iterator (i,j) could have the iterator operations defined on it by algebraic relations like this:

*(i,j) <==> (*i,*j)

(i,j)++ <==> (i++,j++)

(i1,j1)==(i2,j2) <==> i1==i2 && j1==j2

etc.

So given two sequences [xbegin,xend) and [ybegin,yend), the two endpoints of the half-open interval described by [(xbegin,ybegin),(xend,yend)) could be passed to parallel_for_each.

Take this to the extreme, and it becomes "Backus' FP in C++".
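[Editor's note: a minimal sketch of the paired-iterator algebra above. Names are invented; a real version would also supply iterator traits, and the pairwise `==` assumes the two sequences have equal length, matching the relations given.]

```cpp
// Minimal "zipped" iterator implementing the relations above:
//   *(i,j) <==> (*i,*j);  ++(i,j) <==> (++i,++j);  pairwise equality.
#include <cassert>
#include <utility>

template <typename It1, typename It2>
struct zip_iterator {
    It1 i;
    It2 j;

    // *(i,j) <==> (*i,*j): a pair of references to the two elements
    auto operator*() const {
        return std::pair<decltype(*i), decltype(*j)>(*i, *j);
    }
    // advancing the pair advances both components
    zip_iterator& operator++() { ++i; ++j; return *this; }
    // (i1,j1)==(i2,j2) <==> i1==i2 && j1==j2
    bool operator==(const zip_iterator& o) const { return i == o.i && j == o.j; }
    bool operator!=(const zip_iterator& o) const { return !(*this == o); }
};

template <typename It1, typename It2>
zip_iterator<It1, It2> make_zip(It1 i, It2 j) { return {i, j}; }
```

With this, the inner-product example reduces to one loop over the zipped range, which is exactly the shape a parallel_for_each could split.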

AJ13
New Contributor I
I'm familiar with functional programming, but not the FP language. I will find a paper or book on it to read this weekend. Perhaps you have one that you could recommend?

It's true that iterators are going to be rather one dimensional. In my case, I am simply executing a member function on each pointer in a rather large collection. The goal of the construct was a concise expression of this.

What you're suggesting seems to be iteration along a vector within a multidimensional space. Perhaps this could be a useful concept in general: pass in the ranges of the space to be considered, and a function that provides a series of vectors that iterate through that space, along with the function object itself.

In this way, a parallel_for_each could be easily designed to perform matrix multiplication, and automatically continue in parallel. So basically you provide a potentially multidimensional space of sequences, and the vector function is capable of iterating through that space however it sees fit. Now, how to split up that vector function into parts is something that I have to think on a bit... but it seems like a good idea.

What do you think?
ARCH_R_Intel
Employee

Backus' Turing Award Lecture is the best introduction to FP.
