typename SeparatorType::separation_type parallel_for_group(const CompositeType& composite, const SeparatorType& separator, const Body& body, const Partitioner& partitioner)

The separator type passed must support the following operations:

separation separate(const CompositeType&)
void operator()(SeparatorType::grouped_range_type& x)
void operator()(SeparatorType::ungrouped_range_type& x)
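To make sure I'm reading those requirements correctly, here's a minimal serial sketch of types that would seem to satisfy them. Every name and behavior below (the CompositeType as a flat vector, the even/odd grouping criterion, the run_serial driver) is my own guess at the intended concepts, not your actual implementation:

```cpp
#include <cassert>
#include <vector>

// Hypothetical composite: a flat collection of values (my assumption).
using CompositeType = std::vector<int>;

// Distinct range types so the body's two overloads can be told apart.
struct grouped_range   { std::vector<int> items; };
struct ungrouped_range { std::vector<int> items; };

struct Separator {
    using grouped_range_type   = grouped_range;
    using ungrouped_range_type = ungrouped_range;
    struct separation_type {
        grouped_range_type   grouped;
        ungrouped_range_type ungrouped;
    };
    // Toy grouping criterion: even values are "grouped", odd ones are not.
    separation_type separate(const CompositeType& c) const {
        separation_type s;
        for (int v : c)
            (v % 2 == 0 ? s.grouped.items : s.ungrouped.items).push_back(v);
        return s;
    }
};

struct Body {
    long grouped_sum = 0, ungrouped_sum = 0;
    void operator()(Separator::grouped_range_type& r) {
        for (int v : r.items) grouped_sum += v;
    }
    void operator()(Separator::ungrouped_range_type& r) {
        for (int v : r.items) ungrouped_sum += v;
    }
};

// Serial stand-in for what parallel_for_group presumably does:
// separate, then apply the body to each resulting range.
inline Separator::separation_type
run_serial(const CompositeType& c, const Separator& sep, Body& body) {
    auto s = sep.separate(c);
    body(s.grouped);
    body(s.ungrouped);
    return s;
}
```

If that sketch misrepresents the intended contract (for instance, if the operator() overloads belong to the Body rather than the Separator), that's exactly the kind of thing the documentation should spell out.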
I haven't downloaded your implementation, so my questions come from examining what you've posted. I do have a few.
Are there only two "groups," the grouped and ungrouped collections? Or is CompositeType a hierarchical container, able to be unpeeled recursively? How well do you expect this to scale with growth in available HW threads?
If you're reordering elements to coalesce them into groups and the ungrouped group, does that require copying the elements around in memory, a potentially high-cost serial operation?
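One alternative, assuming the elements live in a stable array, is to permute a side array of indices rather than the elements themselves, so the per-generation regrouping cost is proportional to the index array, not the payloads. This is just an illustration of that alternative with a made-up grouping predicate, not a claim about how your implementation works:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Group without moving payloads: elements stay put, and only the small
// index array is partitioned into grouped-first order.
std::vector<std::size_t> group_indices(const std::vector<double>& elems) {
    std::vector<std::size_t> idx(elems.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Hypothetical grouping predicate: "grouped" means non-negative.
    std::stable_partition(idx.begin(), idx.end(),
                          [&](std::size_t i) { return elems[i] >= 0.0; });
    return idx;
}
```

The trade-off, of course, is that iterating through an index array gives up the contiguous access that copying would buy you, which feeds directly into my cache questions below.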
What's a grouped_range_type and what does the Partitioner do in this environment? Are we talking linked lists here, or is the grouped_range a potentially splittable range?
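For reference, TBB's existing Range concept requires an empty() test, an is_divisible() test, and a splitting constructor, which is what lets a partitioner recursively divide work. Here's a minimal toy range in that style (my own simplified split_tag stands in for tbb::split); my question is whether grouped_range_type is meant to model something like this:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for tbb::split, used to select the splitting constructor.
struct split_tag {};

// Minimal splittable range in the style of tbb::blocked_range.
class toy_range {
    std::size_t begin_, end_, grain_;
public:
    toy_range(std::size_t b, std::size_t e, std::size_t grain = 1)
        : begin_(b), end_(e), grain_(grain) {}
    // Splitting constructor: steals the upper half of r, leaving r
    // with the lower half.
    toy_range(toy_range& r, split_tag)
        : begin_((r.begin_ + r.end_) / 2), end_(r.end_), grain_(r.grain_) {
        r.end_ = begin_;
    }
    bool empty() const { return begin_ == end_; }
    bool is_divisible() const { return end_ - begin_ > grain_; }
    std::size_t begin() const { return begin_; }
    std::size_t end() const { return end_; }
};
```

If the grouped range is a linked list instead, it can't be split this way, and it's unclear what the Partitioner argument would have left to do.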
I took a quick look at the blog you referred to. I'm not sure I fully grasp the nature of the dependencies described there, but I fear the constantly changing dependency groups the diagrams illustrate will be difficult to implement without constantly thrashing caches. Are the nodes described in this design preserved at some constant vaddr? Then presumably the group associations are described in some other data structure, possibly an array of node pointers. That data structure must have to be rebuilt with each generation? So a processing element, given a group or the un-group, may be dealing with an arbitrary set of underlying nodes from generation to generation? It's hard to imagine how caches could be preserved/reused in such an organization. Are the nodes themselves a regularly varying population, or do they have some stability, generation to generation?

To use a basketball analogy, this sounds like employing a zone defense, taking on any node that lands in my zone (dependency equivalence class); perhaps a man-to-man defense might be better for cache preservation?
I'm not convinced by the usage example cited above, either. If you group precise and imprecise calculations so that one set can be done more quickly, wouldn't that tend to exacerbate load imbalances? I think a better scheme would be to mix big and small computational tasks so that each processing element gets about the same amount of work.
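To make that point concrete, here's a toy greedy longest-processing-time-first assignment with made-up task costs. It's only a static sketch of the mixing idea (a real TBB scheduler would balance dynamically via work stealing), but it shows how handing each worker a mix of big and small tasks evens out the loads:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Greedy LPT scheduling: sort task costs in descending order, then
// always hand the next task to the currently least-loaded worker.
std::vector<long> assign_lpt(std::vector<long> costs, std::size_t workers) {
    std::sort(costs.begin(), costs.end(), std::greater<long>());
    std::vector<long> load(workers, 0);
    for (long c : costs)
        *std::min_element(load.begin(), load.end()) += c;
    return load;
}
```

Contrast that with the grouped scheme, where one worker might get all the cheap imprecise tasks and finish early while the other grinds through the expensive precise ones.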