Simpler TBB 'Hello World" via a parallel_foreach construct

milom · ‎08-26-2008

I've been playing around with TBB, and I'm quite impressed. It is likely I'll have my advanced architecture class use it this coming semester for their parallel programming assignment.

One downside to TBB is that the simplest parallel TBB program (a parallel "hello world", if you will) isn't that simple. It introduces a functor class, requires parameters to be passed into the functor class via a constructor, introduces the blocked_range class, and then using the blocked range begin() and end() methods to iterator over the range. To me, this seems like too much object glue for doing something simple. It could very well scare off novices, too.

So, what could be done to reduce the amount of glue code?

To get a discussion going, I mocked up an implementation of a simple "parallel_foreach" construct that is a wrapper around parallel_for. It allows the use of a function (rather than only an object-based functor), allows parameter passing to that function, and hides the blocked_range class using internal iteration.

The syntax is:

parallel_foreach(function, begin, end, arg1, arg2, arg3, arg4, ...);

For example,

parallel_foreach(foo, 0, 3, a, b);

would be equivilent to doing the following in parallel:

foo(0, a, b)
foo(1, a, b)
foo(2, a, b)

So, consider the following simple smoothing/averaging kernel (much like Example 3-6 from the TBB book). With the above syntax, the parallel version is actually a line shorter, because it performs the iteration implicitly:

void smooth_kernel(int i, float* output, float* input) {
output = (input[i-1]+input+input[i+1]) / 3.0;
}

void serial_smooth(float* output, float* input, int n) {
for (int i=1; i < n-1; i++) {
smooth_kernel(i, output, input);
}
}

void parallel_smooth(float* output, float* input, int n) {
parallel_foreach(smooth_kernel, 1, n-1, output, input);
}

Analogous constructs could be created for parallel_foreach2d, parallel_reduce, etc.

My mockup implementation uses a template function with N versions, one for each number of arguments supported (which is a hack, much like the parameter limitation on the tbb_thread class). It uses an internal functor worker object to do the range iteration.

The trickiest part seems to be figuring out if parameters should be passed by value, reference, or constant reference. Passing everything by value would be inefficient and surprise the programmer. Passing by reference is probably better. Yet, passing everything by reference might not always be the right thing either (and could be inefficient by preventing certain compiler optimizations).

Reference parameters then bring up the issue of const vs non-const. Ultimately, you might need all cross-combinations of constant or non-constant references for each to make it work nicely... but that leads to combinational explosion. 2^N different functions/worker class to support N-input functions, etc.

Thoughts?

robert-reed · ‎08-29-2008

While you can proceed with any wrappers around TBB functions you can conceive to simplify the interface for your architecture class, I think the combinatorial explosion you encounter at the end of your explication gives some good reasons why not to proceed with this as a general interface change for TBB, but there are other simplifications in your proposal that hide important details. For example, what is the size of the array elements dealt with in your parallel_foreach? By hiding the blocked_range class, you also hide the template parameterization that identifies the array element class (and its size).

But you're right; there's a lot of "glue" that needs to be provided in order to replace a for-loop with the TBB parallel equivalent. I've spent way too much time migrating loop kernels into functor objects in order to implement TBB parallelism. That's why I'm looking forward to implementations of the C++0x proposal for lambdas, which will greatly simplify the interface for implementing parallel_for in particular. Using it won't hide template parameters for blocked_region objects, but it will eliminate the helper class for parallel_for, allowing developers to keep the loop kernel in-line, adding only a little bit of extra syntax to contextualize the kernel. There are implementations proceeding on several compilers including Intel's. They may not come soon enough to ease the burden of your class but should offer some of the simplification that you seek.

milom · ‎08-29-2008

Although I may use TBB for my class at Penn, that really isn't what this post was about. I was hoping to start a dialog about possible simplifications for all new users of TBB.

The proposed syntactic sugar wrappers actually combine at least three separable ideas. (1) using a function rather than a functor class for each parallel loop, (2) implicitly declaring the iteration range without the blocked_2d, and (3) hiding the explicit iteration over the range in the body code.

Of these, for #1, the wrapper has the "forwarding problem" and variable arguments problem. This could be partly mitigated by Rvalue references and varadic template arguments in C++0x, respectively. As you mentioned, using lambda could even eliminate the need to make a separate function (or functor) at all. Until then, supporting this for a limited number of arguments (just like tbb_thread) would be a reasonable stopgap solution.

For #2, using "blocked_range(0, SIZE)" rather than just "0, SIZE" probably isn't a big deal. Yet, I didn't fully parse your comment: "By hiding the blocked_range class, you also hide the template parameterization that identifies the array element class (and its size)." But perhaps this issues is related to this issue. Of the aspect of my suggestion, #2 is probably the least important.

For #3, this could be done without any of the above problems. It just moves the "for (size_t i = r.begin(); i != r.end(); ++i)" loop header into a wrapper class. Is there any downside to such a change? Perhaps that one line isn't a big deal for blocked_range, but for blocked_range2d and 3d, I found that these same lines kept being repeated over and over again in every functor. (And this would apply to for lambda expressions, too.)

I was actually hoping to make contributions to TBB, for example, perhaps some adaptive lock implementations or maybe lighter-weight fair locks based around tickets. We need some of these things for some of the research experiments that we're doing, and I was considering trying to contribute them. But, I'm not sure if I should take the time to do so.

robert-reed · ‎09-02-2008

I understand the dialog you were trying to start and appreciate your interest in the Threading Building Blocks and the desire to create a universally applicable simplification of its syntax. Having taken the effort to convert some solvers that required a lot of blocked_range and blocked_range2d constructs, I certainly am cognizant of the repeated boilerplate code required to provide the glue between the invocation and the definition of the functor-body; I've done plenty of it, and made plenty of errors trying to get it right.

One of the things I don't like about the wrapper (#1) is that it is a so-called "partial feature," being only half a solution to the semantic problem of splitting the parallel kernel from its point of action: typically you take a loop to parallelize and split the loop body out to a separate functor (or function); in either case that code needs to reside somewhere not in line with its point of invocation. Saving some syntax glue while still having to migrate that code to a remote source location (even if it is just up the page) seems like insufficient results for the effort. Whereas C++0x lambdas solve both problems and are not that far away from availability, in multiple compilers.

Regarding #2, if you assume that every blocked_range you need to deal with is a blocked_range, there's a lot of algorithms/arrays you can roll up into a parallel_for. But what if my array element is a struct vector3f { float x,y,z } ? Now each array element is on the order of 12 bytes rather than sizeof (size_t). It would no longer be sufficient to specify only start and end of loop range as suggested in your example. You could write parallel_foreach(function, begin, end, sizeof_element, arg1, arg2...) but we're just compounding the parameter problem. And blocked_range2d and blocked_range3d offer the ability to have separate typename for the different axis elements (whether anyone will actually use it is a separate question), so add a couple more parameters to the list.

One of the beauties of the lambda solution is context capture, which eliminates the artificial parameterization of local objects to bridge the gap from invocation to implementation. Considering the full utility of blocked_range constructs, you could modify the proposal to accommodate these problems but at the risk of increasing the glue syntax, one of the issues this proposal was trying to address. This harkens back to the partial feature concern raised previously. And I dont think the blocked_range issue is peripheral--hiding it in the parallel_foreach construct is central to the syntax economization that is the goal of this proposal.

Finally, regarding #3, hiding the explicit iteration over range has a number of effects. In particular, it places the loop body in a separate function which now needs to be called for each iteration of the loop. In doing so, the compiler is deprived of any optimizations between the loop bounds control and body that might be done with an inline declaration, like moving the induction variable to a register. This cuts off some pretty substantial compiler optimizations.

We really appreciate your interest in promoting TBB and possibly contributing to it. I'm sure our architecture and development teams would be happy to look at almost anything you propose. And certainly the syntax of TBB in its current incarnation leaves much to be desired. Use of lambdas or other proposals to reduce the glue are certainly in the main stream of interest for TBB moving forward. And we think this forum is a good place to hash out and hatch such new ideas. Let the dialog continue.

milom · ‎09-11-2008

mad reed:
One of the things I dont like about the wrapper (#1) is that it is a so-called partial feature, being only half a solution to the semantic problem of splitting the parallel kernel from its point of action... Whereas C++0x lambdas solve both problems and are not that far away from availability, in multiple compilers.

I agree the lambda solution sounds attractive. I guess it all depends how long we'll have to wait for such extensions to widely enough implemented and dependable that portable C++ programs can use them. Perhaps it will be soon, but it certainly isn't today.

Now each array element is on the order of 12 bytes rather than sizeof (size_t). It would no longer be sufficient to specify only start and end of loop range as suggested in your example. You could write parallel_foreach(function, begin, end, sizeof_element, arg1, ) but were just compounding the parameter problem.

The mockup that I worked on used a template function, so it allowed the same types blocked_range2d. It is just a template function wrapper around blocked_range2d, so it should act the same. Unless I'm missing what you're saying.

As I said above, requiring blocked_range2d in the call wouldn't be a big deal, and is probably cleaner as the type is explicitly declared that way (especially in light of C++'s sometimes-wacky automatic type conversion).

And I dont think the blocked_range issue is peripheralhiding it in the parallel_foreach construct is central to the syntax economization that is the goal of this proposal.

I think getting rid of the functor class is the key advantage of the proposal, and one that might be eclipsed by lambda.

Finally, regarding #3, hiding the explicit iteration over range has a number of effects. In particular, it places the loop body in a separate function which now needs to be called for each iteration of the loop. In doing so, the compiler is deprived of any optimizations between the loop bounds control and body that might be done with an inline declaration, like moving the induction variable to a register. This cuts off some pretty substantial compiler optimizations.

Any compiler worth its salt would inline the function call (assuming the function is in the same compilation scope), allowing the same optimization opportunities. Even without full inlining of the function, I don't see why the compiler couldn't move the loop induction variable to a register.

This, in fact, brings up a question about closures. Wouldn't a closure have the same or worse problems? I don't know the lambda/closure specification well, but it seems like it would change the way stack frames and such work. It seems like closures would require extra alias analysis and such to ensure registers can be register allocated and such. Although compilers might support closures soon, I wonder how efficiently they might be supported.

We really appreciate your interest in promoting TBB and possibly contributing to it. 60; Im sure our architecture and development teams would be happy to look at almost anything you propose. And certainly the syntax of TBB in its current incarnation leaves much to be desired. Use of lambdas or other proposals to reduce the glue are certainly in the main stream of interest for TBB moving forward. And we think this forum is a good place to hash out and hatch such new ideas. Let the dialog continue.

Aside from syntax, I was curious if there were any "wish list" or general thoughts on TBB's lock-based sycronization primitives? Are they being used widely? Have they been highly tuned at this point? Perhaps it is appropriate to start to new thread on the topic, but I'd be curious to know the TBB implementors view on the maturity of that part of the library.

robert-reed · ‎09-17-2008

milom:

The mockup that I worked on used a template function, so it allowed the same types blocked_range2d. It is just a template function wrapper around blocked_range2d, so it should act the same.

I'm sorry. I didn't see any template parameters specified in the sample code you provided above. I'm curious about your reference to the under-qualified blocked_range2d, which fully qualified requires RowValue and ColumnValue types to fix the iterator increments. You specifically talk about hiding the blocked_range class at the beginning of this thread. If you're willing relax that restriction, we have no argument.

Any compiler worth its salt would inline the function call (assuming the function is in the same compilation scope), allowing the same optimization opportunities. Even without full inlining of the function, I don't see why the compiler couldn't move the loop induction variable to a register.

I wish more compilers were worth their salt in sense you expect. However, practical examples like this one from Alexey Kukanov point out that compilers are a lot more restricted in their decisions about safety versus performance than we'd like them to be. And in this case we're not even expecting something like an automatic function inline, but just an array dereference that spooks the optimizer.

Wouldn't a closure have the same or worse problems? I don't know the lambda/closure specification well, but it seems like it would change the way stack frames and such work.

As far as I understand it, a lambda closure is equivalent to a real functor, the only practical difference being that the latter has a (class)name into which it is bound as an operator. It should be no worse than using the functors that are the current TBB practice. Any optimizations the compiler can do for regular functors should apply to lambdas.