I am currently the head of the Luminance HDR project (http://qtpfsgui.sourceforge.net/) and I am considering introducing TBB into the project. TBB would mainly be used for simple vector-vector operations (scaling, element-wise arithmetic, and colorspace conversion), which in general have fairly simple "kernels". In the past I tried OpenMP, but the unreliability of the software on Windows and bugs in GCC 4.2 convinced me that this was not the way to go. TBB seems able to cope both with really easy loop parallelizations and with much more complex algorithms.
My first experiment was a parallel version of an element-wise square root of a vector (a really easy "kernel"). Unfortunately, I already have a problem with that. Comparing the speed-up with the same example written in OpenMP, I can see that OpenMP performs better than TBB on a dual-core (my current machine). At the same time, for small vectors, TBB runs even slower than the serial version. Running the same example on an 8-core machine, things look quite different: TBB and OpenMP behave the same way, both with a constant speed-up of 2x.
My questions are:
1. Is there a minimum problem size below which it is better to just run the serial version?
2. Am I doing something wrong?
Source code is attached to this post (uses CMake for the build).
(Added 2011-04-29) What I mean is: please make those changes, plus whatever other changes they inspire, and then present whatever problems remain (free of those red herrings).
I made a small modification to my test program to keep the vector size fixed and pass the chunk size as a parameter. Unfortunately, I cannot see any real difference in performance when the chunk size changes, and honestly I wonder why. I wasn't expecting huge improvements, but at least something measurable.
Regarding dotprd_tbb, it was just an experiment. In that case the problem is different: results don't match between the serial and the TBB version, probably because of floating-point rounding, since the parallel version accumulates partial sums in a different order. I guess so.
DotPrd also evaluates range.end() inside the loop condition, which may prevent some compiler optimisations.
I read in the reference manual that grainsize matters only with the simple_partitioner, so that is in fact the partitioner I used.
I'm quite convinced that for this kind of easy operation, OpenMP is slightly lighter-weight than TBB.
However, in the past week I worked on a more complex algorithm: a function that computes the vertical and horizontal gradients of an image. In this case, after reshaping the code to make the memory accesses as linear as possible, I can clearly see an improvement over OpenMP.
I am now looking at DotPrd.
OpenMP, in general, ought to be more stable than TBB, simply because it has been around much longer. This is not to say TBB is unstable; it is just a generalization that stability increases with maturity.
If you have unstable OpenMP applications, that is indicative of thread-safety issues in your own code. Switching to TBB (Cilk++, etc.) will not fix an underlying problem, unless the port gives you an opportunity to fix a bug in your code or you replace a piece of your code with a solid multithreaded library (e.g. MKL, IPP).
Where TBB has an advantage over OpenMP is when your problem can be decomposed application-wide into tasks. In other words, the entire application is a collection of tasks (with synchronization here and there).
Where OpenMP (pre-V3, before task sets) works well is where components of the application benefit from an n-way fork and join (e.g. a parallel for, where the application is single-threaded before and after the parallel region). Nested parallel regions with nowait can improve the parallelization to some extent, and OpenMP V3 tasks can extend it further, but the two tasking systems differ widely. As to which is better... that depends on the extent to which you want to introduce parallelism into your code.
A secondary concern is:
Are you shipping a complete application?
Or, a utility library to be integrated by the user into their application?
If the former, then the choice of threading model is yours. If the latter, then the end user will want to choose the threading model (which means you may need to support multiple threading models in your library).
A third concern is:
Is the HDR working on a single still image?
Or, are you HDR-ing a video file?
If the former, then you will want to place your parallelization inside each (the only) frame.
If the latter, then you will want to keep each frame single-threaded and run multiple frames in separate threads (parallel_pipeline on a frame-by-frame basis).
Luminance HDR is an HDR tool for still images. It currently ships with a few external libraries, while some (badly maintained and really small) became part of the main trunk. In the future I would like to strip those libraries out again in order to share with the community all the good work done on them. This is the part where most of the parallelism will be introduced.
My application is mainly based on Qt, and I mostly use its threading facilities when I need multithreading (from QThread to QRunnable). In many cases I create a separate QThread and let it do the image processing, while the GUI stays in its own thread (the well-known pattern for keeping the GUI responsive). However, how QThread is implemented is out of my control, and this seems to be a problem, because the processing part usually calls functions with OpenMP pragmas inside. So OpenMP code runs INSIDE a QThread.
This link [ http://www.qtcentre.org/threads/20079-QThread-and-OpenMP-on-Mac-problem ] shows a quick example that reproduces the problem I'm fighting against. GCC 4.2 seems to have a well-known bug here, and I'm sure it is fixed by now (4.3 should already be free of it), but I can't replace GCC 4.2 on my Mac machine because I have other requirements.
For reasons that I don't know (if you can help me understand, I'll be glad), GCC 4.4 under MinGW shows exactly the same behaviour. And I can't use VS Express because it does not support OpenMP. Obviously, I don't want to buy a VS Professional license for an open source project: I get nothing out of it!
So here it is, my situation. Stuck. :)
With earlier versions of gcc, linking against Intel's libiomp5 improved results significantly. Apparently, the Intel TBB team deserves some of the credit for certain versions of libiomp5. Too bad no way was found to make TBB and OpenMP compatible.