I have a fairly simple question, but I haven't been able to find an answer so far.
I have a merge sort algorithm that is partly parallelised. I haven't programmed a parallel merge yet, since I wanted to see if I could also speed things up while dividing my array. It gives me a couple of seconds of speedup, but really only if I put a cap on the recursion level. Since I'm sorting 100,000,000 numbers here, I'm probably just generating a load of overhead at some point.
My question: Is there any standard technique to fight this side effect? Without the upper bound I still get some speedup, but much less than with it! Here I have set the bound to 1024. But what if I want to sort just 1024 numbers? Then I wouldn't be doing anything in parallel. I'd say: well, that's not a lot of data, why go parallel when the speedup is tiny? And if I wanted to sort 1024 numbers very often, I'd use a cilk_for loop, which calculates the grainsize for me anyway, so I don't really have to worry about overhead there.
Here is the part where I spawn the strands:
void merge_sort(int* c, int* a, int length)
{
    if (length == 1)
        ...
    else
    {
        int* tmp = new int[length]; // temporary buffer for the merge (renamed: it must not shadow the parameter c)
        if (length > 1024)
            cilk_spawn merge_sort(tmp, a, length/2);
        else
            merge_sort(tmp, a, length/2);
        merge_sort(tmp + length/2, a + length/2, length - length/2);
        cilk_sync;
        .....MERGE....
        delete[] tmp;
    }
}
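For readers without a Cilk toolchain, here is a minimal runnable sketch of the same cutoff idea, with std::async standing in for cilk_spawn and future::get() playing the role of cilk_sync. The function name, the single shared temporary buffer, and the 1024 cutoff are assumptions for illustration, not the poster's exact code:

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

// Cutoff-based parallel merge sort: spawn the left half only while the
// subarray is larger than the cutoff, otherwise recurse serially.
void merge_sort_cutoff(std::vector<int>& a, std::vector<int>& tmp,
                       int lo, int hi) {
    int length = hi - lo;
    if (length <= 1) return;
    int mid = lo + length / 2;
    if (length > 1024) {
        // Spawn the left half; the parent continues with the right half.
        auto left = std::async(std::launch::async, merge_sort_cutoff,
                               std::ref(a), std::ref(tmp), lo, mid);
        merge_sort_cutoff(a, tmp, mid, hi);
        left.get(); // plays the role of cilk_sync
    } else {
        merge_sort_cutoff(a, tmp, lo, mid);
        merge_sort_cutoff(a, tmp, mid, hi);
    }
    // Serial merge of the two sorted halves via the temporary buffer.
    std::merge(a.begin() + lo, a.begin() + mid,
               a.begin() + mid, a.begin() + hi, tmp.begin() + lo);
    std::copy(tmp.begin() + lo, tmp.begin() + hi, a.begin() + lo);
}
```

Note that std::async spawns an OS thread per call rather than queueing work on a work-stealing deque the way Cilk does, so its per-spawn overhead is higher and the cutoff matters even more here.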
Parallelising the merge function will probably help quite a bit as well. Doing that with a divide-and-conquer strategy will probably also be faster if the maximum recursion level is set to a fixed number.
Is there maybe some kind of magic number?
I hope someone can help me clear things up.
Cheers,
Chris
P.S. Something else I have noticed: if I spawn strands like so:
cilk_spawn mergeSort(c, a, n/2);
cilk_spawn mergeSort(c + n/2, a + n/2, n - n/2);
cilk_sync;
instead of:
cilk_spawn mergeSort(c, a, n/2);
mergeSort(c + n/2, a + n/2, n - n/2);
cilk_sync;
I'm just generating more overhead in the first version, right?
The spawn overhead differs between cases:
- the first spawn
- a spawn when other thread(s) are sleeping
- a spawn when other thread(s) are searching for a task
- a spawn when other thread(s) are busy with a task

The amount of overhead will also vary amongst different hardware configurations (number of threads, cache size, RAM speed, etc.). Your best tuning parameter would likely be a function of:
- nest level
- length
- representative order of the data

As to what values and function to use, that is for you to find out through experimentation.
int len_cutoff = 1024; // you determine this
bool doSplit(int nest, int len)
{
    if (len < len_cutoff) return false;
    if (nThreads == 1) return false;
    // your nest vs. number of threads test
    if (nest > (4 << nThreads)) return false;
    return true;
}
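A sketch of how such a predicate could be wired up and used. Here nThreads is assumed to come from std::thread::hardware_concurrency() (the original leaves its source unspecified), and the recursion shown in the comment passes a nest-level argument that the poster's merge_sort does not yet have:

```cpp
#include <cassert>
#include <thread>

const int len_cutoff = 1024; // tune experimentally, per the advice above
const int nThreads =
    (int)std::thread::hardware_concurrency() > 0
        ? (int)std::thread::hardware_concurrency()
        : 1; // hardware_concurrency() may return 0 when unknown

// Decide per call whether splitting off a strand is worth the overhead.
bool doSplit(int nest, int len) {
    if (len < len_cutoff) return false; // too little work to split
    if (nThreads == 1) return false;    // nothing to run in parallel
    if (nest > (4 << nThreads)) return false; // deep enough already
    return true;
}

// Inside merge_sort the nest level would be passed down, e.g.:
//
//   if (doSplit(nest, length))
//       cilk_spawn merge_sort(c, a, length/2, nest + 1);
//   else
//       merge_sort(c, a, length/2, nest + 1);
```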
Jim Dempsey
As a rough guide, you should break your problem into approximately 10P "chunks", where P is the number of cores on your system. This usually gives you sufficient "slackness": even if the amount of work isn't equal in each chunk, there are enough chunks to keep all of the cores busy while the big chunk is being worked on.
Of course, the goal of Cilk is to keep you from having to worry about things like how many processors you've got available. We've discussed ways to have the code generated for the spawn sense the deque depth and automatically convert to a simple call, but haven't done anything more than talk about it so far.
For now, the best advice is to have "enough" slackness, so you work around the "big chunk" problem, but not too much, since that adds overhead. Unfortunately there is no magic number I can tell you.
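The 10P rule of thumb can be turned into a cutoff directly: stop spawning once a subarray is smaller than roughly n / (10 * P). A minimal sketch, assuming the function name and that P is obtained from std::thread::hardware_concurrency():

```cpp
#include <cassert>
#include <thread>

// Serial cutoff per the ~10P-chunks rule of thumb: aim for roughly
// ten chunks per core, so subarrays below n / (10 * P) recurse serially.
int serial_cutoff(int n) {
    int P = (int)std::thread::hardware_concurrency();
    if (P == 0) P = 1;             // hardware_concurrency() may return 0
    int cutoff = n / (10 * P);
    return cutoff < 1 ? 1 : cutoff; // never cut off below one element
}
```

For the 100,000,000-element sort above on, say, a 4-core machine this would suggest a cutoff near 2,500,000 rather than the fixed 1024, trading some slackness for fewer spawns.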
- Barry
As far as I remember we had a really good discussion regarding optimization of the merge sort algorithm back in March 2012:
http://software.intel.com/en-us/forums/showthread.php?t=103715&o=a&s=lr
>> Maximum recursion depth
I've done lots of testing with the merge sort algorithm, and my version always splits the input data set down to 2. A threshold of 1024 didn't speed up sorting of large data sets in my case.
>> Is there any standard technique to fight this side effect?
In the case of the merge sort algorithm I do an 'Exchange Pass' before sorting starts. Here is an example:
Input array: 10 9 8 7 6 5 4 3 2 1
Exchange pass: 9 10 7 8 5 6 3 4 1 2 (consider this some kind of data mining)
Then execute merge sort with a threshold of 2.
Output array: 1 2 3 4 5 6 7 8 9 10
It really increases performance, and I have tested that "trick" with data sets up to 128 MB.
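Judging from the example above, the exchange pass sorts each adjacent pair in place so the merge phase starts from sorted runs of length 2. A sketch of that interpretation (the function name is made up):

```cpp
#include <cassert>
#include <utility>

// One exchange pass: compare-and-swap each adjacent pair in place,
// leaving sorted runs of length 2 for the merge phase to start from.
void exchange_pass(int* a, int n) {
    for (int i = 0; i + 1 < n; i += 2)
        if (a[i] > a[i + 1])
            std::swap(a[i], a[i + 1]);
}
```

On the example input {10, 9, 8, 7, 6, 5, 4, 3, 2, 1} this produces {9, 10, 7, 8, 5, 6, 3, 4, 1, 2}, matching the post.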
Cheers,
Chris
Here is a graph that demonstrates the performance of different sorting algorithms:
As you can see, for a 64 MB array the merge sort algorithm significantly outperforms the heap sort and quick sort algorithms.
Best regards,
Sergey