Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.
1695 Discussions

## algorithms

Beginner
228 Views

Hello,

look down the the following link...
it's about parallel partition...

http://www.redgenes.com/Lecture-Sorting.pdf

I have tried to simulate this parallel partition method ,
but i don't think it will scale cause we have to do a merging,
which essentially is an array-copy operation but this array-copy
operations will be expensive compared to an integer compare
operation that you find inside the partition fuinction, and it will still
be expensive compared to a string compare operation that you find
inside the partition function. So since it's not scaling i have abondoned
the idea to implement this parallel partition method in my parallel
quicksort.

I have also just read the following paper about Parallel Merging:

http://www.economyinformatics.ase.ro/content/EN4/alecu.pdf

And i have implemented this algorithm just to see what is its performance.
and i have noticed that the serial algorithm is 8 times slower than the
merge
function that you find in the serial mergesort algorithm.So 8 times slower,
it's too slow.

So the only way to do a somewhat better parallel sorting algorithm,
it's to use the following algorithm;

http://www.drdobbs.com/parallel/parallel-merge/229204454?queryText=parallel%2Bsort

The idea is simple:

Let's assume we want to merge sorted arrays X and Y. Select X
median element in X. Elements in X[ .. m-1] are less than or equal to
X. Using binary search find index k of the first element in Y greater
than X. Thus Y[ .. k-1] are less than or equal to X as well.
Elements in X[m+1..] are greater than or equal to X and Y[k .. ]
are greater. So merge(X, Y) can be defined as
concat(merge(X[ .. m-1], Y[ .. k-1]), X, merge(X[m+1.. ], Y[k .. ]))
now we can recursively in parallel do merge(X[ .. m-1], Y[ .. k-1]) and
merge(X[m+1 .. ], Y[k .. ]) and then concat results.

"Parallel hybrid merge algorithm was developed that outperformed
sequential simple merge as well as STL merge by 0.9-5.8 times overall
and by over 5 times for larger arrays"

http://www.drdobbs.com/parallel/parallel-merge/229204454?pgno=3

This paper as you have noticed has fogot to tell that this method is
dependant on the distribution of the data

Read for exemple this:

"Select X median element in X. Elements in X[ .. m-1] are less than
or equal to X. Using binary search find index k of the first element in
Y greater than X."

So if "median element:" of X is not near or equal to the "median
element" of Y so this method can have bad parallel performance
and it may not scale as you think.

There is another parallel method for parallel partition, here it's:

http://www.cs.sunysb.edu/~rezaul/Spring-2012/CSE613/CSE613-lecture-9.pdf

but as you will notice it's still too expensive, causeyou have to create
3 arrays and copy in them:

3. array B[ 0: n ? 1 ], lt[ 0: n ? 1 ], gt[ 0: n ? 1 ]

You can use SIMD instructions on the parallel-prefix-sum function

8. lt [ 0: n ? 1 ] ¬ Par-Prefix-Sum ( lt[ 0: n ? 1 ], + )
:
But the algorithm is still expensive i think on a quad or eight cores or
even
16 cores, you have to have much more than 16 cores to be able to benefit
from this method i think.

Bucket sort is not a sorting algorithm itself, rather it is a
procedure for partitioning data to be sorted using some given
sorting algorithm-a "meta-algorithm" so to speak.

Bucket sort will be better, when elements are uniformly distributed
over an interval [a, b] and buckets have not significantly different
number of elements.

bucketsort sequential computational complexity using quicksort is
= p × (n/p) log(n/p) = n log(n/p)

bucket sort parallel computational complexity using quicksort
= (n/p) log(n/p)

Parallel BucketSort gave me also 3x scaling when sorting strings on a
quad cores, it scales better than my parallel quicksort and it can be
much faster than my parallel quicksort.

I have thought about my parallel quicksort , and i have found
that parallelquicksort is fine but the problem with it is that the
partition function is not parallelizable , so i have thought about this
and i have decided to write a parallel BucketSort,and this parallel
bucketsort can give you much better performance than parallel quicksort
when sorting 100000 strings or more, just test it yourself and see,
parallel bucketsort can sort just strings at the moment, and it can use
quicksort or mergesort, mergesort is faster.

I have updated parallel bucketsort to version 1.02 , i have
changed a little bit the interface, now you have to pass
to the bucketsort method four parameters: the array,
a TSortCompare function and a TSort1 function and a constant

ctDescending or ctAscesending to sort in ascending or descending order..

Here is a small example in Object Pascal:

==
program test;

uses parallelbucketsort,sysutils,timer;

type

TStudent = Class
public
mystring:string;
end;

var tab:Ttabpointer;
myobj:TParallelSort;
student:TStudent;
i,J:integer;
a:integer;

function comp1(Item1:Pointer): string;
begin
result:=TStudent(Item1).mystring ;
end;

function comp(Item1, Item2: Pointer): integer;
begin
if TStudent(Item1).mystring > TStudent(Item2).mystring
then
begin
result:=1;
exit;
end;
if TStudent(Item1).mystring < TStudent(Item2).mystring
then
begin
result:=-1;
exit;
end;

if TStudent(Item1).mystring = TStudent(Item2).mystring
then
begin
result:=0;
exit;
end;
end;

begin

myobj:=TParallelSort.create(1,ctQuicksort); // set to the number of cores...

setlength(tab,1000000);
?
for i:=low(tab) to high(tab)
do
begin
student:=TStudent.create;
student.mystring:= inttostr(i);
tab[high(tab)-i]:= student;
end;

HPT.Timestart;

myobj.bucketsort(tab,comp,comp1,ctAscending); // use ctAscending or CtDescending.
//myobj.qsort(tab,comp);
writeln('Time in microseconds: ',hpt.TimePeriod);

writeln;
writeln('Please press a key to continu...');

for i := LOW(tab) to HIGH(Tab)-1
do
begin
if tstudent(tab).mystring > tstudent(tab[i+1]).mystring
then
begin
writeln('sort has failed...');
halt;
end;
end;

for i := LOW(tab) to HIGH(Tab)
do
begin
writeln(TStudent(tab).mystring,' ');
end;

for i := 0 to HIGH(Tab) do freeandnil(TStudent(tab));

setlength(tab,0);
myobj.free;

end.

===