Parallel Stable Sort

Riyad_Parvez · ‎05-28-2013

According to http://threadingbuildingblocks.org/docs/help/reference/algorithms/parallel_sort_func.htm.

Performs an unstable sort of sequence [begin1, end1). An unstable sort might not preserve the relative ordering of elements with equal keys. The sort is deterministic; sorting the same sequence will produce the same result each time.

So I need a parallel stable sort. How can I get that?

robert-reed · ‎05-28-2013

Good question! My flip answer was going to be "implement quicksort using Intel^® Cilk™ Plus" but then I remembered that quicksort itself is not a stable sort (and in fact faces worst case performance in sorting presorted lists). One of the attributes of Intel Cilk Plus parallelism is that it automatically preserves order of operation, an important attribute in performing a stable sort. And quicksort's nature as a divide-and-conquer algorithm--subdividing activity in ever smaller regions--makes it good for parallelization. But its tendency to rearrange items even on the small end makes it a problematic choice for ensuring stability.

jimdempseyatthecove · ‎05-28-2013

An easy way to perform a stable sort would be to modify the comparison function. When content of keys are ==, use address of key to specify key of lower address as less than other key. IOW the external results of the comparison will never show ==.

Jim Dempsey

RafSchietekat · ‎05-28-2013

It seems counterintuitive that it would be as simple as modifying the comparison function, because then why would we even be talking about it? I think that quicksort will happily disturb order in equivalence class A while partitioning around an element of equivalence class B even if in-class disturbance is avoided... but such transient preservation of order is later trashed anyway by subsequent partitioning around pivots from different classes. A solution might be to sort an array of indexes/pointers (so the original location is being preserved), and then pick the original values into a result collection. Alternatively to the picking phase, one could sort the original collection alongside a collection of original indexes, simultaneously performing equivalent operations on both. Or transcribe to pairs, sort, and transcribe to elements of the original type. Maybe it even depends on cache effects which of these will be a winner: your choice between trying each of these yourself or doing a bit more research first (I didn't).

Robert, could you expand on the thing with order of operation and its relevance for sorting?

(Correction 2013-06-01) during subdivision->while partitioning, replaced doubt about not disturbing order in same equivalence class as pivot with explicit remark about its ultimate irrelevance

jimdempseyatthecove · ‎05-29-2013

Raf,

In lieu of manufacturing a pointer and carrying it about (assuming result is vector of T as opposed to vector of T*) the == condition could perform a byte by byte comparison of the entirity of each object being sorted (memcmp), the one with the lesser value can be chosen, or in the event both records are identical, it will not make a difference as to which is used (as the result vector is returning a copy of T and you could not tell the difference).

Jim Dempsey

RafSchietekat · ‎05-29-2013

Not only will that not make sort stable, it will make stable sort non-stable, e.g., for "bool f(int x, int y) { return (x%2)<(y%2); }" and input {2,1,0,3} you always get {0,2,1,3} instead of {2,0,1,3}.

(Correction) Hopefully nobody saw the original version...

(Added 2013-05-30) For clarity: I recognise that your suggestion makes sort "deterministic" (result is predictable), but not in the specific way that is meant by "stable" (order of equivalent elements is preserved).

(Correction 2013-06-01) Removed remark about "idempotent" from addition (strike-through didn't work). Note to self: don't add unnecessary comments, especially not without thinking. :-)

jimdempseyatthecove · ‎05-30-2013

Raf, "bool f(int x, int y) { return (x%2)<(y%2); }"

The order returned depends on how you structure the merge phase of the sort paying attention to left-leg or right-leg.

Jim Dempsey

RafSchietekat · ‎05-30-2013

Now I think I've lost you, Jim: there's no merge in quicksort. (The predicate "even before odd" was just to have short example sequences.)

Robert, I don't think there's any indeterminacy related to parallelism at all, only related to pivot choice and partitioning algorithm, and the latter is annoyingly sequential. So you can plug the same strategy into a sequential quicksort, or ones implemented in Cilk Plus or TBB, and expect to get the same outcome for all three.

(Added 2013-06-01) I think that TBB deserves a parallel_stable_sort() implemented either on top of parallel_sort as indicated in an earlier comment or as a parallel merge sort (inherently stable). The former is easy to implement, the latter may well be faster.

(Corrected 2013-06-04) In English, it's "even" and "odd", not "pair" and "unpair" (Dutch: "paar" and "onpaar"). :-)

robert-reed · ‎06-03-2013

Raf, you might be right. When I first thought about the issue, the possibility occurred to me that parallel splitting of the work might be another source for instability, much in the same way that parallelization of reductions can sometimes induce instabilities in floating-point results because of varying orderings of the binary ops involved. I don't have a specific case in mind but I wasn't discounting the possibility. And I don't have a specific case in mind, so there may be no "there" there.

RafSchietekat · ‎06-03-2013

Unless you have something that explicitly depends on order of evaluation, like a pseudo-random choice of pivot element, it's always the same subdivision tree. But that's all academic because quicksort isn't stable even in a serial implementation.

(Added 19:40Z) I just completed successful testing of a parallel merge sort implementation based on "Chapter 13: Merge Sort" in "Structured Parallel Programming", the book by Michael McCool, Arch D. Robison and James Reinders featured on the TBB website. I have a feeling that it could be further improved by using multi-way merge (beyond binary merge), because a shallower merge tree means less data motion/movement (?).

(Added 2013-06-06) I just completed successful testing of such a parallel multi-way merge sort implementation. There are still some corner cases that may warrant special attention (empty and mutually disjunct ranges), and there are some values to be tuned, but then I think it'll be ready to be declared a winner (of hearts). :-)

Riyad_Parvez · ‎06-06-2013

Hi, can I use std::stable_sort on tbb::concurrent_vector without any problem if the concurrent_vector isn't being modified by other threads?

RafSchietekat · ‎06-06-2013

Yes. You might also want to take the opportunity to first compact() the container.

Riyad_Parvez · ‎06-06-2013

If I don't compact the concurrent_vector, should there be any problem? Just for curiosity.

RafSchietekat · ‎06-06-2013

concurrent_vector trades constant-time access away for the guarantee that elements stay in place (necessary for concurrency, good for cache locality). If you're not absolutely certain that most of the data is hot in cache, you might want to try to gain back some access speed, especially if you are going to be navigating around quite a bit. But by all means do your own tests (and let us know the outcome).

(Added 2013-06-09) More explicitly, you don't have to first compact() for std::stable_sort() to work correctly.

Riyad_Parvez · ‎06-09-2013

I've found it GCC parallel has a method __gnu_parallel::stable_sort. Problem is it's only on GCC, not portable across paltforms

RafSchietekat · ‎06-09-2013

Hold on, I've just submitted an implementation to TBB, so you can develop using __gnu_parallel::stable_sort() and substitute tbb::parallel_stable_sort()... at an undefined time in the future. :-)

(Added 2013-06-10) Hmm, the timings so far are all over the place. If I look only at 50000 elements and 4 threads (the maximum in test_parallel_sort.exe, running on a machine with 4 hyperthreaded cores), I see speedups over parallel_sort() anywhere between about 0.25 and 5. random->forward->reverse shows decreasing speedup. Best performance is for array of float, possibly because of memmove() optimisation. Food for thought...

(Added) There should also be tests with "almost sorted" sequences, because now the speedup results for "pre-sorted" are narrowly distributed around 1, because the same preliminary test is run each time. It's not even an exceptional situation, so it's not as if I just want to see quicksort fall hopelessly behind...

(Added) At least parallel multi-way merge sort (my hunch) shows better results than binary (from a book): over 5x vs. less than 2x speedup over quicksort. I'm not entirely sure they've each been optimised to the same level, but I'm going to spend at least an entire day not worrying about that possibility. :-)

RafSchietekat · ‎06-10-2013

For 500000 elements in the test, a silly bug reared its head (only then!) and was squashed, but also binary merge stagnated at speedups below 2 and multi-way merge (narrowly) exceeded 10, prior to any tuning and adjustments (especially about choice of number of parts in the division). Another thing to consider is to not eagerly swap over subranges that are already sorted but to leave them in place unless they need to be merged, just to take another head start on a race that has not been run yet: sorting a mostly sorted sequence. Still, the results were very uneven again, down to 0.2 for some inputs, mostly concurrent_vector with reverse-sorted input, so if that were realistic you might even want to copy that over to a vector, sort there, and copy back afterwards.

(Added 2013-06-11) Another addition to my mini-blog... The upside of all-over-the-place timings is that if you divided times the wrong way around, you still just get all-over-the-place timings after correcting that mistake, so you don't really have to go hide from the world right away. Still, that glorious factor 10 has now become a terrible slowdown, so I'm not very happy about that. Lazy swapping wasn't what I hoped for, but I still don't have the mostly sorted input data where it might be most useful. Next thing would be to improve the multi-way merge itself, which now still does a full sweep each time, which gets costly as the merge gets wider. But that's for another evening.

(Added 2013-06-14) I've implemented the multi-way merge with a priority_queue, but not (yet?) with the outcome I wanted.

RafSchietekat · ‎06-15-2013

Time to face the truth: even after pulling out all the stops, with type traits, trivial copy, multi-way merge and whatnot, merge sort is not a clear winner on performance alone when stability is not a requirement, or at least I couldn't make it be one. Quicksort does take a few partitioning rounds before getting up to full speed, but only one is sequential, then it fills 2 threads, then 4, 8, etc., and apparently it's not that big of a handicap. Then, as long as swapping is cheap enough and there's no existing order to exploit, it wipes the floor with merge sort. In the results below for 8-way parallelism (not necessarily the width of the merge, which was arbitrarily set to 64 here) and 5000000 elements as input, disregarding the rather expensive concurrent_vector, the only consistent advantage to speak of is reverse-sorted input (but if you know that property in advance you are probably a lot better off with std::reverse). Still, it could have been worse, I guess.

Minimal[] (comp) sin p=8 n=5000000 :0.057377 seconds 0.214162 seconds relative speed 0.267914
Minimal[] (comp) pre-sorted p=8 n=5000000 :0.001647 seconds 0.001834 seconds relative speed 0.898037
Minimal[] (comp) reverse-sorted p=8 n=5000000 :0.073003 seconds 0.042997 seconds relative speed 1.697863
float[] (no comp) sin p=8 n=5000000 :0.088172 seconds 0.394136 seconds relative speed 0.223710
float[] (no comp) pre-sorted p=8 n=5000000 :0.001702 seconds 0.001755 seconds relative speed 0.969801
float[] (no comp) reverse-sorted p=8 n=5000000 :0.089717 seconds 0.050909 seconds relative speed 1.762301
float[] (no comp) mostly sorted separate p=8 n=5000000 :0.080557 seconds 0.053944 seconds relative speed 1.493345
float[] (no comp) mostly sorted mixed p=8 n=5000000 :0.058281 seconds 0.076188 seconds relative speed 0.764963
float[] (comp) sin p=8 n=5000000 :0.088116 seconds 0.377433 seconds relative speed 0.233461
float[] (comp) pre-sorted p=8 n=5000000 :0.001624 seconds 0.001624 seconds relative speed 1.000000
float[] (comp) reverse-sorted p=8 n=5000000 :0.090205 seconds 0.049636 seconds relative speed 1.817330
float[] (comp) mostly sorted separate p=8 n=5000000 :0.078963 seconds 0.050833 seconds relative speed 1.553381
float[] (comp) mostly sorted mixed p=8 n=5000000 :0.057468 seconds 0.076679 seconds relative speed 0.749462
concurrent_vector<float> (no comp) sin p=8 n=5000000 :0.174679 seconds 0.465071 seconds relative speed 0.375596
concurrent_vector<float> (no comp) pre-sorted p=8 n=5000000 :0.004398 seconds 0.004435 seconds relative speed 0.991657
concurrent_vector<float> (no comp) reverse-sorted p=8 n=5000000 :0.266924 seconds 0.081756 seconds relative speed 3.264886
concurrent_vector<float> (no comp) mostly sorted separate p=8 n=5000000 :0.220463 seconds 0.090405 seconds relative speed 2.438615
concurrent_vector<float> (no comp) mostly sorted mixed p=8 n=5000000 :0.147378 seconds 0.114316 seconds relative speed 1.289216
concurrent_vector<float> (comp) sin p=8 n=5000000 :0.177702 seconds 0.465342 seconds relative speed 0.381874
concurrent_vector<float> (comp) pre-sorted p=8 n=5000000 :0.004786 seconds 0.004368 seconds relative speed 1.095696
concurrent_vector<float> (comp) reverse-sorted p=8 n=5000000 :0.259108 seconds 0.082077 seconds relative speed 3.156889
concurrent_vector<float> (comp) mostly sorted separate p=8 n=5000000 :0.216838 seconds 0.078982 seconds relative speed 2.745410
concurrent_vector<float> (comp) mostly sorted mixed p=8 n=5000000 :0.141579 seconds 0.114651 seconds relative speed 1.234869
string[] (no comp) sin p=8 n=5000000 :1.072494 seconds 1.620175 seconds relative speed 0.661962
string[] (comp) sin p=8 n=5000000 :1.435451 seconds 1.717984 seconds relative speed 0.835544
concurrent_vector<Minimal> (comp) sin p=8 n=5000000 :0.135827 seconds 0.274697 seconds relative speed 0.494461
concurrent_vector<Minimal> (comp) pre-sorted p=8 n=5000000 :0.004724 seconds 0.005422 seconds relative speed 0.871265
concurrent_vector<Minimal> (comp) reverse-sorted p=8 n=5000000 :0.247340 seconds 0.078512 seconds relative speed 3.150346

jimdempseyatthecove · ‎06-15-2013

Raf,

I have an observation and a suggestion (if you are up for the task).

Observaton: First pass of Quick Sort is made by making a best guess at median key and using that as pivot point. As you stated, this restricts the first pass to single thread (unless you use leader-follower where leader selects the to-be-swapped elements and follower performs the swap). For second pass you have a better guess at the two pivot points, and two threads can run, then third pass with 4 threads, etc...

Suggestion: First pass to select multiple pivot points such that at the end of the first pass you have N partitons (N= number of threads). At the start of the second pass, each of the threads will know the number of entries in all partitions. Using this information, they then each select an optimal number of sub-sub-partitions. At some point each sub-...-sub-partition is optimally partitioned by 2.

Note, the first pass is still run with single thread (unless effective leader-follower technique can be found). This pass should be slighty slower since you are comparing N keys however the N-1 keys of the next cycle will be in L1 cache.

Jim Dempsey

RafSchietekat · ‎06-16-2013

How would you partition into more than 2 parts? Use lots of external storage? An extra (M-1)*N spaces would allow direct partitioning into 2*M parts, I guess, as a first approach anyway. Otherwise, unless I missed something, the partitioning thread first has to count how many elements will be in each partition before it actually goes about partitioning, and that doesn't seem very efficient, because it creates more total work, unless there's nothing else for the machine to do in the meantime and you don't mind the extra proverbial CO2. Would you really think that two-phase fully in-place multi-way partitioning would already be able to prevail over the current implementation, though? Maybe there's a minimally required degree of parallelism (2 wouldn't work)?

Meanwhile, I'm still with the merge sort, because I may have found a cause why lazy swapping didn't work yet, and a possible workaround.

(Added) I had made a copy&paste mistake in the test, so the supposedly mostly sorted data was actually mostly reverse-sorted, which caused the algorithm to always detect the need to merge, which always required swapping. The only benefit for merge sort was that merging disjunct ranges was optimised, especially for trivially copyable types and pointer iterators. However, when I corrected the mistake... quick sort benefited more than merge sort! So I had to do some more work to allow better lazy swapping and merging (by detecting an already sorted prefix in the current subrange). I'm not unhappy with the outcome, and you can see for yourself:

Minimal[] (comp) sin p=8 n=5000000 :0.049921 s 0.185660 s relative speed 0.268884
Minimal[] (comp) pre-sorted p=8 n=5000000 :0.001697 s 0.001600 s relative speed 1.060625
Minimal[] (comp) reverse-sorted p=8 n=5000000 :0.067058 s 0.040170 s relative speed 1.669355
float[] (no comp) sin p=8 n=5000000 :0.096416 s 0.376344 s relative speed 0.256191
float[] (no comp) pre-sorted p=8 n=5000000 :0.001785 s 0.001505 s relative speed 1.186047
float[] (no comp) reverse-sorted p=8 n=5000000 :0.087632 s 0.045927 s relative speed 1.908072
float[] (no comp) mostly sorted separate p=8 n=5000000 :0.024296 s 0.019774 s relative speed 1.228684
float[] (no comp) mostly sorted mixed p=8 n=5000000 :0.047737 s 0.050908 s relative speed 0.937711
float[] (comp) sin p=8 n=5000000 :0.084850 s 0.335659 s relative speed 0.252786
float[] (comp) pre-sorted p=8 n=5000000 :0.001589 s 0.001535 s relative speed 1.035179
float[] (comp) reverse-sorted p=8 n=5000000 :0.083609 s 0.052413 s relative speed 1.595196
float[] (comp) mostly sorted separate p=8 n=5000000 :0.023327 s 0.019870 s relative speed 1.173981
float[] (comp) mostly sorted mixed p=8 n=5000000 :0.054237 s 0.048402 s relative speed 1.120553
concurrent_vector<float> (no comp) sin p=8 n=5000000 :0.158414 s 0.430017 s relative speed 0.368390
concurrent_vector<float> (no comp) pre-sorted p=8 n=5000000 :0.004316 s 0.004287 s relative speed 1.006765
concurrent_vector<float> (no comp) reverse-sorted p=8 n=5000000 :0.240558 s 0.081098 s relative speed 2.966263
concurrent_vector<float> (no comp) mostly sorted separate p=8 n=5000000 :0.063967 s 0.026681 s relative speed 2.397474
concurrent_vector<float> (no comp) mostly sorted mixed p=8 n=5000000 :0.121318 s 0.064965 s relative speed 1.867436
concurrent_vector<float> (comp) sin p=8 n=5000000 :0.168943 s 0.439503 s relative speed 0.384396
concurrent_vector<float> (comp) pre-sorted p=8 n=5000000 :0.004989 s 0.004517 s relative speed 1.104494
concurrent_vector<float> (comp) reverse-sorted p=8 n=5000000 :0.238984 s 0.080120 s relative speed 2.982826
concurrent_vector<float> (comp) mostly sorted separate p=8 n=5000000 :0.065125 s 0.027061 s relative speed 2.406600
concurrent_vector<float> (comp) mostly sorted mixed p=8 n=5000000 :0.121747 s 0.065032 s relative speed 1.872109
string[] (no comp) sin p=8 n=5000000 :1.060586 s 1.513491 s relative speed 0.700755
string[] (comp) sin p=8 n=5000000 :1.213007 s 1.704814 s relative speed 0.711519
concurrent_vector<Minimal> (comp) sin p=8 n=5000000 :0.125093 s 0.254090 s relative speed 0.492318
concurrent_vector<Minimal> (comp) pre-sorted p=8 n=5000000 :0.004860 s 0.004798 s relative speed 1.012922
concurrent_vector<Minimal> (comp) reverse-sorted p=8 n=5000000 :0.238240 s 0.076140 s relative speed 3.128973

Remarkably, binary merge seems better on random input than 64-way merge, but not on mostly sorted input:

Minimal[] (comp) sin p=8 n=5000000 :0.049962 s 0.057742 s relative speed 0.865263
Minimal[] (comp) pre-sorted p=8 n=5000000 :0.001503 s 0.001596 s relative speed 0.941729
Minimal[] (comp) reverse-sorted p=8 n=5000000 :0.068662 s 0.040166 s relative speed 1.709456
float[] (no comp) sin p=8 n=5000000 :0.093805 s 0.086748 s relative speed 1.081351
float[] (no comp) pre-sorted p=8 n=5000000 :0.001773 s 0.001875 s relative speed 0.945600
float[] (no comp) reverse-sorted p=8 n=5000000 :0.083610 s 0.052436 s relative speed 1.594515
float[] (no comp) mostly sorted separate p=8 n=5000000 :0.023288 s 0.045520 s relative speed 0.511599
float[] (no comp) mostly sorted mixed p=8 n=5000000 :0.047748 s 0.043098 s relative speed 1.107894
float[] (comp) sin p=8 n=5000000 :0.084759 s 0.062855 s relative speed 1.348485
float[] (comp) pre-sorted p=8 n=5000000 :0.001603 s 0.001500 s relative speed 1.068667
float[] (comp) reverse-sorted p=8 n=5000000 :0.084579 s 0.048307 s relative speed 1.750864
float[] (comp) mostly sorted separate p=8 n=5000000 :0.023402 s 0.044682 s relative speed 0.523746
float[] (comp) mostly sorted mixed p=8 n=5000000 :0.046864 s 0.046264 s relative speed 1.012969
concurrent_vector<float> (no comp) sin p=8 n=5000000 :0.156299 s 0.120326 s relative speed 1.298963
concurrent_vector<float> (no comp) pre-sorted p=8 n=5000000 :0.004358 s 0.005489 s relative speed 0.793952
concurrent_vector<float> (no comp) reverse-sorted p=8 n=5000000 :0.230710 s 0.080521 s relative speed 2.865215
concurrent_vector<float> (no comp) mostly sorted separate p=8 n=5000000 :0.065413 s 0.074384 s relative speed 0.879396
concurrent_vector<float> (no comp) mostly sorted mixed p=8 n=5000000 :0.121653 s 0.087962 s relative speed 1.383018
concurrent_vector<float> (comp) sin p=8 n=5000000 :0.155550 s 0.117563 s relative speed 1.323120
concurrent_vector<float> (comp) pre-sorted p=8 n=5000000 :0.004350 s 0.004372 s relative speed 0.994968
concurrent_vector<float> (comp) reverse-sorted p=8 n=5000000 :0.233149 s 0.080685 s relative speed 2.889620
concurrent_vector<float> (comp) mostly sorted separate p=8 n=5000000 :0.063447 s 0.075664 s relative speed 0.838536
concurrent_vector<float> (comp) mostly sorted mixed p=8 n=5000000 :0.117638 s 0.077263 s relative speed 1.522566
string[] (no comp) sin p=8 n=5000000 :1.012791 s 1.638284 s relative speed 0.618202
string[] (comp) sin p=8 n=5000000 :1.220548 s 1.987904 s relative speed 0.613987
concurrent_vector<Minimal> (comp) sin p=8 n=5000000 :0.125427 s 0.115113 s relative speed 1.089599
concurrent_vector<Minimal> (comp) pre-sorted p=8 n=5000000 :0.004465 s 0.004210 s relative speed 1.060570
concurrent_vector<Minimal> (comp) reverse-sorted p=8 n=5000000 :0.232239 s 0.081898 s relative speed 2.835710

RafSchietekat · ‎06-23-2013

Not too shabby:

string[] (no comp) essentially random p=8 n=5000000 :1.214567 s 1.188623 s relative speed 1.021827
string[] (no comp) pre-sorted p=8 n=5000000 :0.126007 s 0.117482 s relative speed 1.072564
string[] (no comp) reverse-sorted p=8 n=5000000 :3.546244 s 0.395822 s relative speed 8.959189
string[] (no comp) mostly sorted separate p=8 n=5000000 :1.485870 s 0.320982 s relative speed 4.629138
string[] (no comp) mostly sorted mixed p=8 n=5000000 :2.114791 s 0.484057 s relative speed 4.368888
string[] (comp) essentially random p=8 n=5000000 :1.493962 s 1.341368 s relative speed 1.113760
string[] (comp) pre-sorted p=8 n=5000000 :0.097033 s 0.081324 s relative speed 1.193166
string[] (comp) reverse-sorted p=8 n=5000000 :3.022029 s 0.411889 s relative speed 7.336999
string[] (comp) mostly sorted separate p=8 n=5000000 :1.293284 s 0.281310 s relative speed 4.597362
string[] (comp) mostly sorted mixed p=8 n=5000000 :2.002046 s 0.473519 s relative speed 4.228016

This is a straight win against quicksort for std::string[] (should be the same for std::vector<std::string>), with the best relative speed easily twice as good as the worst relative speed elsewhere (well, before prospect theory, of course...)!