I'm new here and got redirected to here by a stackoverflow user.
I'm a beginner when it comes down to shared-memory programming.
I'm implementing a parallel version of the QuickHull algorithm (http://en.wikipedia.org/wiki/QuickHull) to find the convex hull of a set of data points. (http://en.wikipedia.org/wiki/Convex_hull). And I wanted to get an implementation by both using OpenMP and TBB, to get familiarized with both of them.
Basically the following tasks can be done in parallel:
Note that part 3 and 4 are done recursively until every subset is empty.
First I've done this with OpenMP using mostly the #pragma omp parallel for an I get a nice speedup. Secondly, I've made an implementation using Intel TBB. This implementation resulted in a negative speedup (even for large data sets). I used both tbb::parallel_for() and tbb::parallel_reduce().
Here is the link the source code: http://pastebin.com/fggJd3zP
I'll start by adding some benchmark results. Note that I left out the runtime results for 100 - 10,000 points as it is quite trivial that no speedup is gained with these.
Just to be able to situate my problem a small comparision : Running the algorithm using OpenMP I got a 3x speedup with 4-8 threads. OpenMP did not gave a positive speedup with < 10,000 elements.
Mister Arch D. Robinson of stackoverflow (who is quite familiar himself with TBB) adviced:
"To get parallel speedup for the QuickHull algorithm requires extreme care with data structures. The top-level partitioning does indeed dominate the running time in most cases, so it should be parallel. However, using concurrent_vector::push_back introduces much cross-thread cache contention as different threads push onto the same cache line (so called "false sharing") and those threads contend for the cache line that has the size of the concurrent_vector.
In the TBB distribution, look in directory examples/parallel_reduce/convex_hull/ for a complete QuickHull algorithm that does speed up. It uses parallel_reduce (not parallel_for) to do the top-level partitioning. It does use and concurrent_vector, but batches up points before copying them to the concurrent_vector, to minimize the false sharing problem."
As can be seen in the sourcecode I've added, I did what he told to fix the false sharing problem (using portions of the code from the quickhull example). As this did not fix my speedup I decided to benchmark the parallel portions seperatly. Here I found some odd things. At the top-level, the tbb::parallel_for() call taking an InitialSplitter instance took 0.071055 second (for this single parallel_for!) to run whilst the entire OpenMPI implementation runs in 0.067581. Note that this is even before going into recursion. The work done inside the operator()-method is not very intensive.
This is strange as a significant speedup can be noticed when timing inside the InitialSplitter-object. The for-loop which is iterated in the InitialSplitters' operator()-method shows a speedup that is expected when the number of threads are increased (near-linear), but when timing outside the parallel_for() the runtime stays the same for any number of threads.
Can anyone help me out with why my code results in a negative speedup and more specifically, why at the top level a single parallel_for() taking a InitialSplitter instance is so slow?
Thanks in advance!
I would characterise Arch D. Robison (sic) as more than just "quite familiar" with TBB (I am surprised about his characterisation of "false sharing" for competition on the size of the vector, though).
But your problems are a lot more serious than just cache issues: FindExtremum::fPoints is copied (instead of just shared), and there are only disadvantages to making it a tbb::concurrent_vector instead of a std::vector (because its contents do not change). There may be other things, but try to change that, look for any similar issues, and you should already see a meaningful improvement.