Hi,
I want to implement the merge sort algorithm using TBB.
And how can I make the implementation efficient for any number of cores and for any array size (the cutoff check against 50000 is on line 14)?
[cpp]void mergeSort(float* someArray, int firstEl, int lastEl)
{
    if(firstEl < lastEl)
    {
        int q = (int) floor((firstEl + lastEl) / 2.0);
        mergeSort(someArray, firstEl, q);
        mergeSort(someArray, q + 1, lastEl);
        merge(someArray, firstEl, q, lastEl);
    }
}

void parallelMergeSort(float* someArray, int firstEl, int lastEl)
{
    if(lastEl - firstEl >= 50000)
    {
        int q = (int) floor((firstEl + lastEl) / 2.0);
        parallel_invoke(
            [=] { mergeSort(someArray, firstEl, q); },
            [=] { mergeSort(someArray, q + 1, lastEl); }
        );
        merge(someArray, firstEl, q, lastEl);
    }
    else
    {
        mergeSort(someArray, firstEl, lastEl);
    }
}[/cpp]
But the standard serial algorithm is faster than my parallel one. How can I improve the performance?
I apologize for my English.
2 Replies
In your sample you just divide the whole job into two parts with parallel_invoke. To make the code scalable, consider using another parallel mechanism that allows the runtime to spread the work over any number of available cores.
The implementation of the "merge" function is not shown here, but standard implementations contain several "for" loops. You can change them to tbb::parallel_for; the loops will then be executed in parallel and the data will be divided automatically between worker threads. In that case you can remove parallel_invoke.
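For example, a copy loop inside "merge" could be converted like this (just a rough sketch, since your actual "merge" is not shown; parallelCopy is only an illustrative name):
[cpp]#include "tbb/parallel_for.h"

// Rough sketch only: copies src[firstEl..lastEl] into dst with tbb::parallel_for,
// so the iterations are split automatically between TBB worker threads.
void parallelCopy(float* dst, const float* src, int firstEl, int lastEl)
{
    tbb::parallel_for(firstEl, lastEl + 1, [=](int i) {
        dst[i] = src[i];
    });
}[/cpp]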
Another option is to use TBB tasks (for example via tbb::task_group), though that may be a bit more complicated.
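Just to show the idea, a task-based variant could look roughly like this (a sketch only; taskMergeSort is a made-up name, the cutoff value is a guess, and it reuses your mergeSort and merge):
[cpp]#include "tbb/task_group.h"
#include <cmath>

// Sketch: sort each half in a separate TBB task, recursing until the range
// is small enough to be handled by the serial mergeSort.
void taskMergeSort(float* someArray, int firstEl, int lastEl)
{
    if(lastEl - firstEl < 500)                 // cutoff value is only a guess
    {
        mergeSort(someArray, firstEl, lastEl); // your serial version
        return;
    }
    int q = (int) std::floor((firstEl + lastEl) / 2.0);
    tbb::task_group g;
    g.run([=] { taskMergeSort(someArray, firstEl, q); });
    g.run([=] { taskMergeSort(someArray, q + 1, lastEl); });
    g.wait();                                  // both halves must be sorted before merging
    merge(someArray, firstEl, q, lastEl);
}[/cpp]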
Also, what is the reason for the comparison with 50000 on line 14? Have you tried changing this value, as well as the overall array size?
Invoking parallel processing brings some overhead. If the amount of processed data is small, the overhead can outweigh the benefit and you will see a slowdown. Parallelism works efficiently only when the amount of data is big enough, so try increasing it; you may see an improvement.
Regards,
Kirill
In addition to what Kirill suggested, I would also try recursive parallelism; for a first attempt, it can be as simple as replacing mergeSort with parallelMergeSort in the lambda functions passed to parallel_invoke.
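In code, that change would look something like this (a sketch based on your own function; only the lambdas change to call parallelMergeSort recursively):
[cpp]#include "tbb/parallel_invoke.h"
#include <cmath>

void parallelMergeSort(float* someArray, int firstEl, int lastEl)
{
    if(lastEl - firstEl >= 50000)              // see the note below about this cutoff
    {
        int q = (int) std::floor((firstEl + lastEl) / 2.0);
        tbb::parallel_invoke(
            [=] { parallelMergeSort(someArray, firstEl, q); },     // recurse in parallel
            [=] { parallelMergeSort(someArray, q + 1, lastEl); }
        );
        merge(someArray, firstEl, q, lastEl);
    }
    else
    {
        mergeSort(someArray, firstEl, lastEl); // serial sort below the cutoff
    }
}[/cpp]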
Also, the cutoff threshold of 50000 is likely too big. For example, the TBB implementation of parallel_sort uses a threshold of 500. You may run some experiments, but I'd guess that a value of about 1000 should be sufficient, and it can possibly be reduced even further.
