From the documentation it appears that parallel_reduce() by itself has an ordering guarantee. But what if you have an operation where ordering does not matter? In that case, will the scheduling be less than optimal? Say, for example, computing certain parts of the range takes much longer than other parts, and the order in which the results are joined does not matter.
Second question: suppose we have a map-reduce operation that uses a lot of memory per Body object to accumulate into. It seems that, in general, parallel_reduce() can perform an unbounded number of splitting operations, thus creating many Body objects and using up a lot of memory. Is there a way to limit the number of Body objects that get created (e.g. to the number of cores)? If not, what is the best way to accomplish something like this?
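For concreteness, here is a hypothetical sketch of the kind of reduction I mean (HistogramBody, NUM_BINS, and the data are invented for illustration): each Body owns a large accumulator, so every split of the Body allocates another one, and the order of joins is irrelevant.

```cpp
#include <vector>
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"

const size_t NUM_BINS = 1 << 20;   // large per-Body footprint (~4 MB of bins)

struct HistogramBody {
    const std::vector<unsigned>& samples;
    std::vector<unsigned> bins;    // the memory-hungry accumulator

    HistogramBody(const std::vector<unsigned>& s)
        : samples(s), bins(NUM_BINS, 0) {}
    HistogramBody(HistogramBody& other, tbb::split)   // every split allocates again
        : samples(other.samples), bins(NUM_BINS, 0) {}

    void operator()(const tbb::blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i != r.end(); ++i)
            ++bins[samples[i] % NUM_BINS];
    }
    void join(HistogramBody& rhs) {                   // order of joins is irrelevant here
        for (size_t b = 0; b != NUM_BINS; ++b)
            bins[b] += rhs.bins[b];
    }
};
```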
Thanks,
-Edward
parallel_reduce does not give the strong ordering guarantee; it does, however, ensure that reduction happens only between "neighbour" subtotals, and that those are never commuted. Maybe I'd better explain it with an example.
Imagine we have to concatenate strings A, B, C, and D. With serial execution, the sequence of concatenations would be ((A+B)+C)+D, i.e. first B is appended to A, then C is appended to the result, then D. The execution by parallel_reduce could be (A+B)+(C+D), i.e. two pairs of strings are concatenated in parallel, and then the result for the pair at right is appended to the result for the pair at left, but never vice versa. So associativity of the reduction operation is enough to get the same result as in serial execution.
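In code, the example might look like this, using the lambda-based (functional) form of parallel_reduce; note that the final reduction argument only ever sees left and right in their original order:

```cpp
#include <string>
#include <vector>
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"

// Parallel concatenation of a vector of strings. Since subtotals are joined
// only left-to-right, associativity alone guarantees the serial result.
std::string concat(const std::vector<std::string>& v) {
    return tbb::parallel_reduce(
        tbb::blocked_range<size_t>(0, v.size()),
        std::string(),                               // identity element
        [&](const tbb::blocked_range<size_t>& r, std::string acc) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                acc += v[i];                         // accumulate one subrange
            return acc;
        },
        [](const std::string& left, const std::string& right) {
            return left + right;                     // right appended to left
        });
}
```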
Thus there is little to no impact on scheduling due to the guarantee. The whole iteration space is recursively split into subranges, and this process naturally forms a logical tree of tasks (space-wise, the tree does not exist in full at any given time). Computation is performed at the leaves of the tree, with subtotals accumulated in Body objects. If two neighbour leaves are executed by the same worker thread, their results are accumulated in the same Body object. Only if two neighbour leaves (or branches) are processed in parallel by different workers is the Body object split, in order to accumulate into separate subtotals. As soon as two subsequent subranges (neighbour branches of the tree) have both been processed, method join() is called for the body, to append the result at right to the result at left. No worker thread waits for another; the first to complete its half goes off for new work, and the second one merges the results.
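As a minimal sketch of this protocol (a simple sum; SumBody is an invented name): the splitting constructor runs only when a subrange is actually stolen by another worker, and join() always merges the subtotal at right into the one at left.

```cpp
#include <vector>
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"

struct SumBody {
    const std::vector<float>& data;
    float subtotal;

    SumBody(const std::vector<float>& d) : data(d), subtotal(0) {}
    SumBody(SumBody& other, tbb::split)        // called only on a steal
        : data(other.data), subtotal(0) {}

    void operator()(const tbb::blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i != r.end(); ++i)
            subtotal += data[i];               // leaf computation
    }
    void join(SumBody& rhs) {                  // right merged into left
        subtotal += rhs.subtotal;
    }
};

float sum(const std::vector<float>& v) {
    SumBody body(v);
    tbb::parallel_reduce(tbb::blocked_range<size_t>(0, v.size()), body);
    return body.subtotal;
}
```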
As for memory usage, since bodies are only split lazily when work stealing happens, the total number of bodies is not huge; it should be of the same order of magnitude as the number of threads, not the number of iterations.
However, since ordering does not matter for you but memory does, another solution may be suitable: a combination of parallel_for and enumerable_thread_specific (ETS), sketched below. You do the computations over subranges with parallel_for and accumulate the results into thread-local "views" of the ETS (or first accumulate into local variables and then append the result to the ETS). After parallel_for completes, you can iterate over the ETS and merge the results accumulated by each worker. The number of local ETS views is equal to the number of workers taking part in the computation, which by default is the same as the number of cores.
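A minimal sketch of that combination, again with a simple sum (names are invented):

```cpp
#include <vector>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/enumerable_thread_specific.h"

float sum_with_ets(const std::vector<float>& v) {
    // One accumulator ("view") per worker thread, not per Body split.
    tbb::enumerable_thread_specific<float> views(0.0f);

    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, v.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            float local = 0;                   // accumulate locally first...
            for (size_t i = r.begin(); i != r.end(); ++i)
                local += v[i];
            views.local() += local;            // ...then append to this thread's view
        });

    float total = 0;                           // merge the views sequentially
    for (tbb::enumerable_thread_specific<float>::iterator
             it = views.begin(); it != views.end(); ++it)
        total += *it;
    return total;
}
```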
Now try and see which approach works better for you.
Maybe you didn't expect the amount of Body reuse that does occur, and maybe Alexey's clarification has therefore already significantly alleviated your concerns. Still, while the number of Body splits is related to the number of worker threads, it can significantly exceed that number, because scheduling is not fully optimised for the least number of thefts; this may be a concern if a Body split/join is very costly. Unless joining occurs more aggressively than I presume, limiting the number of simultaneous Body instances (both active and phantom, if I may reuse a term from process scheduling), which is the exact concern you formulated, has the same solution: the one suggested by Alexey (using TLS, but not necessarily with parallel_for if the work is excessively lumpy).
On the other hand, unless you can measure or confidently predict such problems, don't get paralysed by premature optimisation.
Don't overlook the most important paragraph in #2 just because it's the smallest and last one...