parallel_reduce does not give the strong ordering guarantee; however, it ensures that reduction happens only between "neighbour" subtotals, and those are never commuted. Let me explain it with an example.
Imagine we have to concatenate strings A, B, C, and D. With serial execution, the sequence of concatenations would be ((A+B)+C)+D, i.e. first B is appended to A, then C is appended to the result, then D. Execution by parallel_reduce might instead be (A+B)+(C+D), i.e. two pairs of strings are concatenated in parallel, and then the result for the pair at right is appended to the result for the pair at left, but never vice versa. So associativity of the reduction operation is enough to get the same result as in serial execution.
Thus the guarantee has little to no impact on scheduling. The whole iteration space is recursively split into subranges, and this process naturally forms a logical tree of tasks (space-wise, the tree does not exist in full at any given time). Computation is performed by the leaves of the tree, with subtotals accumulated in Body objects. If two neighbouring leaves are executed by the same worker thread, their results are accumulated into the same Body object. Only when two neighbouring leaves (or branches) are processed in parallel by different workers is the Body object split, so that they accumulate into separate subtotals. As soon as adjacent subranges (neighbouring branches of the tree) are both processed, the join() method is called on the body that should append the result at right to the result at left. No worker thread waits for another: the first to complete its half goes off for new work, and the second one merges the results.
As for memory usage, since bodies are only split lazily when work stealing happens, the total number of bodies is not huge; it should be of the same order of magnitude as the number of threads, not the number of iterations.
However, since ordering does not matter for you but memory does, another solution may be suitable: a combination of parallel_for and enumerable_thread_specific (ETS). You do the computations over subranges with parallel_for and accumulate the results into thread-local "views" of the ETS (or first accumulate in local variables and then append the result to the ETS). After parallel_for completes, you can iterate over the ETS and merge the results accumulated by each worker. The number of local ETS views is equal to the number of workers taking part in the computation, which by default is the same as the number of cores.
Now try and see which approach works better for you.