
Maximum recursion depth

chrisfeist
Beginner
Hello,

I have a fairly simple question, but haven't been able to find an answer so far.

I have a merge sort algorithm which is partly parallelised. I haven't programmed a parallel merge yet, since I wanted to see if I could also speed things up while dividing my array. It's giving me a couple of seconds of speedup, but only really if I put a cap on the recursion level. Since I'm sorting 100,000,000 numbers here, I'm probably just generating a load of overhead at some point.
My question: is there any standard technique to fight this side effect? Without the upper bound I also get some speedup, but much less than with it. Here I have set it to 1024. But what if I want to sort just 1024 numbers? Then I wouldn't be doing anything in parallel at all. I'd say: well, that's not a lot of data, why go parallel when the speedup is tiny? And if I wanted to sort 1024 numbers very often, I'd use a cilk_for loop, which calculates the grainsize for me anyway, so I don't really have to worry about overhead there.
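Just for reference, this is the kind of cilk_for usage I mean (only a sketch; the grainsize pragma is Cilk Plus's way to override the automatic choice, and 512 is just a placeholder value, not a recommendation):

#include <cilk/cilk.h>

// cilk_for picks a grainsize automatically; the pragma below is the
// optional manual override (placeholder value shown)
void scale_all(int* a, int n)
{
    #pragma cilk grainsize = 512
    cilk_for (int i = 0; i < n; ++i)
        a[i] *= 2;
}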

Here is the part where I spawn the strands:

void merge_sort(int* c, int* a, int length)
{
    if (length == 1)
        ...
    else
    {
        int* tmp = new int[length];   // scratch buffer for the two sorted halves

        if (length > 1024)
            cilk_spawn merge_sort(tmp, a, length/2);
        else
            merge_sort(tmp, a, length/2);
        merge_sort(tmp + length/2, a + length/2, length - length/2);
        cilk_sync;

        .....MERGE.....   // merge the two halves of tmp into c

        delete[] tmp;
    }
}

Parallelising the merge function will probably help quite a bit as well. Doing that with a divide-and-conquer strategy will probably also be faster if the maximum recursion level is set to a fixed number.

Is there maybe some kind of magic number?

I hope someone can help me clear things up.

Cheers,
Chris

p.s. What I have also noticed: if I spawn strands like so:

cilk_spawn mergeSort(c, a, n/2);
cilk_spawn mergeSort(c + n/2, a + n/2, n - n/2);
cilk_sync;

instead of:

cilk_spawn mergeSort(c, a, n/2);
mergeSort(c + n/2, a + n/2, n - n/2);
cilk_sync;

I'm just generating more overhead with the first version, right?
jimdempseyatthecove
Honored Contributor III
Each spawn and sync has overhead. The overhead is usually coarse-grained, depending on a few factors:

first time
spawn when other thread(s) are sleeping
spawn when other thread(s) are searching for a task
spawn when other thread(s) are busy with a task

The amount of overhead will vary amongst different hardware configurations
(number of threads, cache size, RAM speed, etc.).

Your best tuning parameter would likely be based on a function of:

hardware configuration (number of threads, cache size, RAM speed, etc.)
nest level
length
representative order of the data

As to what values and what function to use, that will be for you to find out through experimentation.

int len_cutoff = 1024;   // you determine this
extern int nThreads;     // number of worker threads, determined elsewhere

bool doSplit(int nest, int len)
{
    if (len < len_cutoff) return false;  // too little work to be worth a spawn
    if (nThreads == 1) return false;     // nothing to gain on a single thread
    // your nest vs. number of threads test
    if (nest > (4 << nThreads)) return false;
    return true;
}
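For illustration, a hypothetical way of wiring doSplit() into the recursion (a sketch, not code from this thread), threading the nest level through as an extra parameter; __cilkrts_get_nworkers() is one way to obtain the worker count:

#include <cilk/cilk.h>
#include <cilk/cilk_api.h>

int nThreads = __cilkrts_get_nworkers();  // worker count from the Cilk runtime

void merge_sort(int* c, int* a, int length, int nest)
{
    if (length <= 1)
        return;   // base case elided, as in the original post

    if (doSplit(nest, length))
        cilk_spawn merge_sort(c, a, length/2, nest + 1);
    else
        merge_sort(c, a, length/2, nest + 1);
    merge_sort(c + length/2, a + length/2, length - length/2, nest + 1);
    cilk_sync;

    // ... merge the two sorted halves here ...
}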

Jim Dempsey
Barry_T_Intel
Employee

As a rough guide, you should break your problem into approximately 10P "chunks", where P is the number of cores on your system. This will usually give you sufficient "slackness": if the amount of work isn't equal across chunks, there are still enough of them to keep all of the cores busy while the big chunk is being worked on.
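For illustration only (a sketch of this rule of thumb, not anything the runtime does for you), you could derive a spawn cutoff from the worker count:

#include <cilk/cilk_api.h>

// sketch: stop spawning below n / (10 * P), so the top-level problem
// breaks into roughly 10P chunks
int spawn_cutoff(int n)
{
    int P = __cilkrts_get_nworkers();   // number of Cilk workers
    int cutoff = n / (10 * P);
    return cutoff > 1 ? cutoff : 1;
}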

Of course, the goal of Cilk is to keep you from having to worry about things like how many processors you've got available. We've discussed ways to have the code generated for the spawn sense the deque depth and automatically convert to a simple call, but haven't done anything more than talk about it so far.

For now, the best advice is to have "enough" slackness, so you work around the "big chunk" problem, but not too much, since that adds overhead. Unfortunately there is no magic number I can tell you.

- Barry

BradleyKuszmaul
Beginner
@Jim Dempsey: In my experience, the overhead of a spawn is independent of whether it's the first time and of what the other threads are doing. I'm also not sure what you mean by "nThreads", but it really isn't a meaningful variable to use that way in Cilk. I'd suggest doing some actual Cilk programming if you haven't.

@chrisfeist: Barry's advice is good, but maybe you would like something more concrete. Here are some ideas that may help you improve your performance further:

I have observed that, with icc, to speed up the base case of a recursion I need to completely separate the base case into its own function.
You are doing the following, which seems reasonable but turns out to be slow. I'll rearrange it a little to make it easier to explain, so you can see the similarity between what you did and what I am recommending. You wrote:
[cpp]
void sort(int *to, int *from, int n) {
    if (n == 1)
        *to = *from;
    else if (n < 1024) {
        int *tmp = new int[n];
        sort(tmp, from, n/2);
        sort(tmp + n/2, from + n/2, n - n/2);
        merge(to, tmp, tmp + n/2, n/2, n - n/2);
    } else {
        int *tmp = new int[n];
        cilk_spawn sort(tmp, from, n/2);
        sort(tmp + n/2, from + n/2, n - n/2);
        cilk_sync;
        merge(to, tmp, tmp + n/2, n/2, n - n/2);
    }
}
[/cpp]
But what you need to do is this:
[cpp]
void sort_base(int *to, int *from, int n) {
    if (n == 1)
        *to = *from;
    else {
        int *tmp = new int[n];
        sort_base(tmp, from, n/2);
        sort_base(tmp + n/2, from + n/2, n - n/2);
        merge(to, tmp, tmp + n/2, n/2, n - n/2);
    }
}

void sort(int *to, int *from, int n) {
    if (n < 1024)
        sort_base(to, from, n);
    else {
        int *tmp = new int[n];
        cilk_spawn sort(tmp, from, n/2);
        sort(tmp + n/2, from + n/2, n - n/2);
        cilk_sync;
        merge(to, tmp, tmp + n/2, n/2, n - n/2);
    }
}
[/cpp]
Is it clear what I did? The key is that the coarsened recursion (n < 1024) has no Cilk keywords in it at all. When icc compiles a function that has any Cilk keywords in it, all calls to that function seem to be a little more expensive, even calls that don't use cilk_spawn.
Now, if you really want to make this sort run fast, you need to use a faster sort for the base case. For example, quicksort is often about twice as fast as merge sort, so it would be a better choice for the serial base case.
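(As one possible variant, a sketch of mine rather than anything from this thread: the serial base case could simply copy the input and call std::sort, which is typically an introsort built on quicksort.)
[cpp]
#include <algorithm>

// hypothetical serial base case: copy, then sort in place with
// std::sort (typically introsort: quicksort with a heapsort fallback)
void sort_base(int *to, int *from, int n)
{
    std::copy(from, from + n, to);
    std::sort(to, to + n);
}
[/cpp]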
And, as you have pointed out, you need to parallelize the merge to get more speedup. This code with a serial merge has parallelism O(log n). For sorting a billion numbers, log n is only about 30. Harking back to Barry's advice: if you have P processors, you need about 10P parallelism, so with 30-fold parallelism you may only be able to keep three processors busy. If you parallelize the merge, the program will have parallelism something like O(n/log n) [if I recall correctly], which for n = 1e9 is about 3e7, which is plenty.
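If it helps, here is a sketch of one standard divide-and-conquer parallel merge (my illustration, not code from this thread; SERIAL_MERGE_CUTOFF is just a tuning guess): split the larger input at its median element, binary-search that element's position in the other input, place it, and merge the two remaining pieces in parallel.
[cpp]
#include <algorithm>
#include <cilk/cilk.h>

const int SERIAL_MERGE_CUTOFF = 2048;  // guess; tune experimentally

// merges sorted a[0..na) and sorted b[0..nb) into out[0..na+nb)
void parallel_merge(int *a, int na, int *b, int nb, int *out)
{
    if (na < nb) {                        // keep a as the larger input
        std::swap(a, b);
        std::swap(na, nb);
    }
    if (na + nb < SERIAL_MERGE_CUTOFF) {  // small inputs: plain serial merge
        std::merge(a, a + na, b, b + nb, out);
        return;
    }
    int ma = na / 2;                                         // median of a
    int mb = (int)(std::lower_bound(b, b + nb, a[ma]) - b);  // split point in b
    out[ma + mb] = a[ma];                                    // pivot lands here
    cilk_spawn parallel_merge(a, ma, b, mb, out);
    parallel_merge(a + ma + 1, na - ma - 1, b + mb, nb - mb, out + ma + mb + 1);
    cilk_sync;
}
[/cpp]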
Please let me know if this is unclear (and I'd also be happy to hear about success...)
-Bradley
SergeyKostrov
Valued Contributor II
Hi,

As far as I remember, we had a really good discussion regarding optimization of the Merge Sort algorithm back in March 2012:

http://software.intel.com/en-us/forums/showthread.php?t=103715&o=a&s=lr

>> Maximum recursion depth

I've done lots of testing with the Merge Sort algorithm, and my version always splits the input data set down
to 2. A threshold of 1024 didn't speed up sorting of large data sets in my case.

>> Is there any standard technique to fight this side effect?

In the case of the Merge Sort algorithm I do an 'Exchange Pass' before sorting starts. Here is an example:

Input array:    10 9 8 7 6 5 4 3 2 1
Exchange Pass:  9 10 7 8 5 6 3 4 1 2   (note: consider this as some kind of data mining)
Execute Merge Sort with a threshold of 2
Output array:   1 2 3 4 5 6 7 8 9 10

It really increases performance, and I have tested that "trick" with data sets up to 128MB.
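A minimal sketch of such an exchange pass (my reading of the description above, not Sergey's actual code): one linear pre-pass that puts every adjacent pair in order, so the merge phase can start from sorted runs of length 2.

#include <algorithm>

// sorts each adjacent pair in place, producing sorted runs of length 2
void exchange_pass(int *a, int n)
{
    for (int i = 0; i + 1 < n; i += 2)
        if (a[i] > a[i + 1])
            std::swap(a[i], a[i + 1]);
}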
chrisfeist
Beginner
Thanks for the great replies and tips! This really helps me a lot!

Cheers,
Chris
SergeyKostrov
Valued Contributor II
Hi Chris,

Here is a graph that demonstrates the performance of different sorting algorithms:

[graph: sorting time for Heap, Quick, and Merge Sort across array sizes]

As you can see, for a 64MB array the Merge Sort algorithm significantly outperforms the Heap and Quick Sort algorithms.

Best regards,
Sergey