Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Commutative operations with parallel_reduce

e4lam
Beginner
Hi,

From the documentation it appears that parallel_reduce() by itself has an ordering guarantee. But what if you have an operation where ordering does not matter? In that case, will the scheduling be less than optimal? Say, for example, computing certain parts of the range takes much longer than other parts, and the order in which the results are joined does not matter.

Second question. Suppose we have a map-reduce operation that uses a lot of memory per Body object to accumulate into. It seems that, in general, parallel_reduce() can produce an unbounded number of splitting operations, thus creating many Body objects and using up a lot of memory. Is there a way to limit the number of Body objects that get created (e.g. to the number of cores)? If not, what is the best way to accomplish something like this?

Thanks,
-Edward
Alexey-Kukanov
Employee

parallel_reduce does not give a strong ordering guarantee; it does, however, ensure that reduction happens only between "neighbour" subtotals, and those are never commuted. Maybe I'd better explain it with an example.

Imagine we have to concatenate strings A, B, C, and D. With serial execution, the sequence of concatenations would be ((A+B)+C)+D, i.e. first B is appended to A, then C is appended to the result, then D. The execution by parallel_reduce would be (A+B)+(C+D), i.e. two pairs of strings are concatenated in parallel, and then the result for the pair on the right is appended to the result for the pair on the left, but not vice versa. So associativity of the reduction operation is enough to get the same result as in serial execution.
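
For concreteness, here is roughly how that example looks with the functional (lambda) form of parallel_reduce. This is only a sketch; the concatenate wrapper and its vector of strings are illustrative, not from the original post:

#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <string>
#include <vector>

std::string concatenate(const std::vector<std::string>& parts) {
    return tbb::parallel_reduce(
        tbb::blocked_range<size_t>(0, parts.size()),
        std::string(),  // identity element
        [&](const tbb::blocked_range<size_t>& r, std::string sub) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                sub += parts[i];  // accumulate within a subrange
            return sub;
        },
        [](const std::string& left, const std::string& right) {
            return left + right;  // the right subtotal is appended to the left
        });
}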

Thus there is little to no impact on scheduling due to the guarantee. The whole iteration space is recursively split into subranges, and this process naturally forms a logical tree of tasks (space-wise, the tree does not exist in full at any given time). Computation is performed by the leaves of the tree, with subtotals accumulated in Body objects. If two neighbour leaves are executed by the same worker thread, their results are accumulated in the same Body object. Only if two neighbour leaves (or branches) are processed in parallel by different workers is the Body object split, in order to accumulate into separate subtotals. As soon as two subsequent subranges (neighbour branches of the tree) are both processed, the method join() is called on the Body that should append the result on the right to the result on the left. No worker thread waits for another; whichever completes its half first goes for a new job, and the second one merges the results.
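
To map this onto code, here is a sketch of the same reduction written with an explicit Body class, showing where the splitting constructor and join() fit in; StringConcat and its members are illustrative names, not part of TBB:

#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <string>
#include <vector>

struct StringConcat {
    const std::vector<std::string>& parts;  // input strings
    std::string result;                     // this Body's subtotal

    StringConcat(const std::vector<std::string>& p) : parts(p) {}

    // Splitting constructor: called only when another worker steals a
    // subrange; the thief starts with an empty subtotal of its own.
    StringConcat(StringConcat& other, tbb::split) : parts(other.parts) {}

    // Leaf computation: accumulate one subrange into the subtotal.
    void operator()(const tbb::blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i != r.end(); ++i)
            result += parts[i];
    }

    // rhs holds the subtotal for the subrange on the right, so it is
    // appended, never prepended.
    void join(StringConcat& rhs) { result += rhs.result; }
};

// Usage:
//   StringConcat body(parts);
//   tbb::parallel_reduce(tbb::blocked_range<size_t>(0, parts.size()), body);
//   // body.result now holds the concatenation in the original order.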

As for memory usage, since bodies are only split lazily when work stealing happens, the total number of bodies is not huge; it should be of the same order of magnitude as the number of threads, not the number of iterations.

However, since ordering does not matter for you but memory does, another solution may be suitable: a combination of parallel_for and enumerable_thread_specific (ETS). You do the computations over subranges with parallel_for and accumulate the results into thread-local "views" of the ETS (or first accumulate into local variables and then append the result to the ETS). After parallel_for completes, you can iterate over the ETS and merge the results accumulated by each worker. The number of local ETS views is equal to the number of workers taking part in the computation, which by default is the same as the number of cores.
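
A sketch of that combination, reusing the string example (the function name is illustrative; note that the order in which the views are merged, and hence of the pieces in the final result, is unspecified, which is exactly the case where ordering does not matter):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/enumerable_thread_specific.h>
#include <string>
#include <vector>

std::string concatenate_unordered(const std::vector<std::string>& parts) {
    // One std::string "view" is lazily created per worker thread.
    tbb::enumerable_thread_specific<std::string> views;

    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, parts.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            std::string& local = views.local();  // this thread's view
            for (size_t i = r.begin(); i != r.end(); ++i)
                local += parts[i];
        });

    // Serial merge over the thread-local views after parallel_for returns.
    std::string total;
    for (const std::string& v : views)
        total += v;
    return total;
}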

Now try and see which approach works better for you.

RafSchietekat
Valued Contributor III
If work is distributed truly unevenly, you will of course get to the finish earlier if you can arrange to tackle the biggest lumps first, and parallel_reduce doesn't help you achieve that, but neither does parallel_for, even with its independent iterations.

Maybe you didn't expect the amount of Body reuse that does occur, and maybe Alexey's clarification has therefore already significantly alleviated your concerns. But while the number of Body splits may be related to the number of worker threads, it can still significantly exceed that number, because scheduling is not fully optimised for the least number of thefts; this may be a concern if a Body split/join is very costly. Unless joining occurs more aggressively than I presume, dealing with the number of simultaneous Body instances (both active and phantom, if I may reuse a term from process scheduling), which is the exact concern that you formulated, has the same solution: the one suggested by Alexey (using TLS, but not necessarily with parallel_for in case of excessive lumpiness).

On the other hand, unless you can measure or confidently predict such problems, don't get paralysed with premature optimisation.
e4lam
Beginner
Thanks for the info, guys! As for the memory usage, we're already using thread-local storage in the parallel code that I'm looking to convert to use TBB. It just seemed that parallel_reduce() already had the perfect interface for doing this so that the use of thread-local storage could be completely hidden for such use cases.
RafSchietekat
Valued Contributor III
There's always the option of adapting parallel_reduce to trade non-commutativity support, and possibly some raw performance, for fewer Body instances; if applied only where handling Body instances is in fact expensive, any loss in raw performance would be partially, completely, or quite possibly more than completely compensated by improved global efficiency.
RafSchietekat
Valued Contributor III
So does #3 mean you're going ahead with Alexey's suggestion (parallel_for + TLS)? How does parallel_reduce measure up, if you've tried that as well? If you see a significant difference, would you be interested in a parallel_reduce alternative as described in #4?

Don't overlook the most important paragraph in #2 just because it's the smallest and last one...
e4lam
Beginner
Raf, essentially, for now, as a first cut I'm going to just go with parallel_for() + TLS. I haven't gotten far enough to see whether we're going to attempt some re-implementation of parallel_reduce() yet. But yes, definitely interested. :)