Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Parallel Stable Sort

Riyad_Parvez
Beginner

According to the documentation at http://threadingbuildingblocks.org/docs/help/reference/algorithms/parallel_sort_func.htm:

Performs an unstable sort of sequence [begin1, end1). An unstable sort might not preserve the relative ordering of elements with equal keys. The sort is deterministic; sorting the same sequence will produce the same result each time.

So I need a parallel stable sort. How can I get that?
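
For reference, a minimal illustration of the behaviour in question (the Record type and the values are made up for the example):

```cpp
#include <tbb/parallel_sort.h>
#include <vector>

struct Record { int key; int payload; };  // hypothetical record type

int main() {
    std::vector<Record> v = { {1, 0}, {1, 1}, {0, 2} };
    // Sorts by key only. Because tbb::parallel_sort is unstable, the two
    // records with key == 1 may come out in either relative order.
    tbb::parallel_sort(v.begin(), v.end(),
                       [](const Record& a, const Record& b) { return a.key < b.key; });
}
```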

24 Replies
RafSchietekat
Valued Contributor III

For your education and entertainment, here's my contribution for parallel_merge() and parallel_stable_sort(). The only thing left for somebody to do is take a weed whacker to it and eliminate the dead code, which is currently left in for the benefit of whoever wants to run some comparisons, although it may still turn out that one of the several variations should be chosen based on some cut-off values.
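
The attachment itself isn't reproduced here, but for readers who just want the general shape of such a parallel_stable_sort, here is a minimal sketch (not Raf's code): recursive splitting with tbb::parallel_invoke, std::stable_sort at the leaves, and a serial std::inplace_merge, which is the step a parallel_merge() like the one mentioned above would replace.

```cpp
#include <tbb/parallel_invoke.h>
#include <algorithm>
#include <iterator>

// Sketch only: divide-and-conquer stable sort; the merge remains serial.
template <typename RandomIt, typename Compare>
void parallel_stable_sort_sketch(RandomIt first, RandomIt last, Compare comp) {
    const auto n = std::distance(first, last);
    if (n < 10000) {                              // arbitrary serial cut-off
        std::stable_sort(first, last, comp);
        return;
    }
    RandomIt mid = first + n / 2;
    tbb::parallel_invoke(
        [=] { parallel_stable_sort_sketch(first, mid, comp); },
        [=] { parallel_stable_sort_sketch(mid, last, comp); });
    std::inplace_merge(first, mid, last, comp);   // stable, but not parallel
}
```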

RafSchietekat
Valued Contributor III

jimdempseyatthecove wrote:

Suggestion: First pass to select multiple pivot points such that at the end of the first pass you have N partitions (N = number of threads). At the start of the second pass, each of the threads will know the number of entries in all partitions. Using this information, they then each select an optimal number of sub-sub-partitions. At some point each sub-...-sub-partition is optimally partitioned by 2.

Note, the first pass is still run with a single thread (unless an effective leader-follower technique can be found). This pass should be slightly slower since you are comparing against N keys; however, the N-1 keys of the next cycle will be in L1 cache.

Do you, or does anyone else, have good experience with such a technique? I started to experiment with this, also based on "Chapter 14: Sample Sort" from the book "Structured Parallel Programming", using flexible-size concurrent_vector "buckets" instead of multiple passes to tally and scatter, but I rarely see an improvement over parallel_quick_sort (mostly looking at 8 threads and 5,000,000 elements). I verified that the distribution is OK (except when most elements are equal). Could it be the redundant storage, or the concurrent_vector (used only with push_back() and iterators, with capacity incremented by what it needs on average, so fragmentation shouldn't be an issue, I would think)? Do I really have to do the multiple passes to get histogram information and perform a scatter into just-enough storage?

Perhaps you were suggesting something else, because the only sequential initial pass I'm doing is gathering samples (oversample, sort, sample at equal intervals)?
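
To make the sample-selection step concrete, a sketch of "oversample, sort, sample at equal intervals" might look like this (names and the oversampling scheme are illustrative, not taken from the attached code; it assumes a non-empty input):

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Pick num_buckets - 1 splitter keys from a random oversample of the input.
template <typename T, typename Compare>
std::vector<T> choose_splitters(const std::vector<T>& input,
                                std::size_t num_buckets,
                                std::size_t oversample_factor,
                                Compare comp) {
    std::mt19937 rng(42);
    std::uniform_int_distribution<std::size_t> pick(0, input.size() - 1);
    std::vector<T> sample;
    sample.reserve(num_buckets * oversample_factor);
    for (std::size_t i = 0; i < num_buckets * oversample_factor; ++i)
        sample.push_back(input[pick(rng)]);
    std::sort(sample.begin(), sample.end(), comp);        // sort the oversample
    std::vector<T> splitters;
    for (std::size_t b = 1; b < num_buckets; ++b)          // equal intervals
        splitters.push_back(sample[b * oversample_factor]);
    return splitters;
}
// An element x then belongs to bucket
//   std::upper_bound(splitters.begin(), splitters.end(), x, comp) - splitters.begin()
```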

(Correction) After correcting a copy-paste error in one of the overloads (the one that's supposed to call the other with a std::less)... there is more upside visible, but whenever it's really significant it's trumped by parallel_stable_merge()! And it still runs slower than parallel quicksort sometimes. This is now also with a vector instead of a concurrent_vector, though.
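
For readers wondering which overloads are meant, it's just the usual pattern (with a stand-in body here; the real one is in the attachment):

```cpp
#include <algorithm>
#include <functional>
#include <iterator>

// Full overload; std::stable_sort stands in for the real parallel body.
template <typename RandomIt, typename Compare>
void parallel_stable_sort(RandomIt first, RandomIt last, Compare comp) {
    std::stable_sort(first, last, comp);
}

// Convenience overload: this is the one that must forward with a std::less.
template <typename RandomIt>
void parallel_stable_sort(RandomIt first, RandomIt last) {
    parallel_stable_sort(first, last,
        std::less<typename std::iterator_traits<RandomIt>::value_type>());
}
```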

RafSchietekat
Valued Contributor III

For your education and entertainment, here's my contribution for parallel_merge(), parallel_stable_sort(), and now also parallel_sample_sort(), well, sort of.

Maybe somebody would like to try this, perhaps run some benchmarks (with pretty graphs), and provide some feedback (things I may have overlooked, ...)?

(Added 2013-07-14) I thought I'd post this early in the weekend and forget about it, but of course I should have known it wouldn't work out that way. I've already modified the code to be able to use a non-square buckets array (why should it be square at all, when more columns mean a better partial sort?). And now I'm thinking about modifying the pack-and-sort step to do some splitting of its own, because it seems wasteful not to have at least a binary split at this point (each column packed into two subranges). I've already seen meaningful speedups over parallel quicksort in places (though not across the board), and there's some further progress in store. That's close contact with the evolutionary landscape of algorithms (with a touch of intelligent design)...
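
For anyone trying to picture the buckets array: one plausible layout (a guess at the structure being described, not the attached code) is a matrix with one row per input chunk and one column per bucket; "non-square" then simply means the number of columns need not equal the number of rows. The pack-and-sort step works column by column:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// buckets[r][b] holds the elements of chunk r that fall into bucket b.
template <typename T>
using BucketMatrix = std::vector<std::vector<std::vector<T>>>;

// "Pack and sort" one bucket: concatenate column b across all rows, then
// sort it; each column can be handled by a different worker.
template <typename T, typename Compare>
std::vector<T> pack_and_sort_bucket(const BucketMatrix<T>& buckets,
                                    std::size_t b, Compare comp) {
    std::vector<T> out;
    for (const auto& row : buckets)
        out.insert(out.end(), row[b].begin(), row[b].end());
    std::sort(out.begin(), out.end(), comp);  // std::stable_sort if stability is wanted
    return out;
}
```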

RafSchietekat
Valued Contributor III

A question: what are the chances that an optimising build will be able to translate std::copy() or std::move() into memcpy()-like territory on its own, for various kinds of iterators (pointer, vector iterator, ...) and various trivially copyable types (fundamental types, structs, ...)? At some intermediate level it should become something recognisable to an optimiser, but can you actually trust that it will generate optimal code?
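
One way to take the question out of the optimiser's hands is to dispatch explicitly, e.g. (a sketch, not anything in the attached code):

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Copy a range of trivially copyable T through raw pointers with an explicit
// memcpy, instead of relying on the optimiser to lower std::copy.
template <typename T>
void copy_range(const T* first, const T* last, T* out) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    std::memcpy(out, first, static_cast<std::size_t>(last - first) * sizeof(T));
}
```

In practice, mainstream standard libraries do typically lower std::copy over pointers to trivially copyable types to memmove themselves; the open question is whether that dispatch survives every iterator wrapper and user-defined type, which is presumably why one might want the explicit version.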
