Cache performance of quicksort

prasun_gera · ‎03-30-2010

Hello,
I ran the quicksort code which comes with the TBB book. (It can be downloaded from http://softwarecommunity.intel.com/isn/downloads/examples.zip) In its current implementation, it uses parallel_for and a custom range with the splitting constructor doing the partioning. The function body itself is just a serial sort. Now, the partitioning often splits the range into unequal partitions. However, because of the nature of parallel_for and the splitting constructor, the thread that partitions the range consumes the left partition next and the right partition is pushed to the task queue to be stolen. Can a better decision be made by using explicit tasks? At any stage the following scenarios are possible:
1) Both the partitions fit in the cache
2) Only the smaller partition fits in the cache
3) Neither of them fits fully in the cache

If both of them fit in the cache, it would be better to execute the bigger one next and push the smaller one on the queue right? I'm not sure what a good strategy would be for the other two cases. Comments?

ARCH_R_Intel · ‎04-08-2010

That's an intriguing observation. I believe it is possible to implement by minor modifications of the parallel_sort.h header; that is, no explicit tasking is necessary. The reason is that there is no requirement that a splitting constructor construct the right subrange. It's just a common convention to do so. You could make the splitting constructor swap the constructed subrange and"old"subrange (modified argument range).

Whether it has a noticeable effect can only be answered by experimentation. If you run the experiment, please post the results.

prasun_gera · ‎04-09-2010

I had written a task-based code which was a rehash of the example code already. So, I'm contining with that. I feel that I'll have more control for further experimentation if I use explicit tasks. I hope there is no serious downside to using explicit tasks (given that I use continuation, recycling, bypass etc.) apart from losing the ability to use a partitioner to make the decision about when to split the range. Anyway, for the time being I can compare the performace to the parallel_for version with a simple partitioner. Also, after some tinkering, I've managed to map a thread's core affinity (which I'm assigning explicitly using pthread_setaffinity_np ) to the affinity_id assigned by tbb. So, I can specify the affinity to the core that I wish to dispatch a task to. Of course, its just a hint but it'll do for the time being. It'll be interesting to try out different things for a shared cahce architecture. Having done this preliminary work, I now want to monitor cache behaviour on a per task execution basis. As of now I'm trying to use Simics, which is a full system simulator, to do this. Are there any better alternatives for doing this kind of profiling?