I am currently the head of the Luminance HDR project (http://qtpfsgui.sourceforge.net/) and I am considering introducing TBB into the project. TBB would mainly be used for simple vector-vector operations (scaling, element-wise arithmetic, and colorspace conversion), which in general have fairly simple "kernels". In the past I tried OpenMP, but the unreliability of the software on Windows and bugs in GCC 4.2 convinced me that this was not the way to go. TBB seems able to cope both with really easy loop parallelizations and with much more complex algorithms.
My first experiment was a parallel version of an element-wise square root of a vector (a really easy "kernel"). Unfortunately, I already have a problem with that. Comparing the speed-up with the same example written in OpenMP, I can see that OpenMP performs better than TBB on a dual-core (my current machine). At the same time, for small vectors, TBB runs even slower than the serial version. Running the same example on an 8-core machine, things look quite different: TBB and OpenMP behave the same way, both with a constant speed-up of 2x.
My questions are:
1. Is there a minimum problem size below which it is better to just run the serial version?
2. Am I doing something wrong?
Source code is attached to this post (uses CMake for the build).
(Added 2011-04-29) What I mean is: please make those changes, plus whatever other changes they inspire, and then present whatever problems remain (free of those red herrings).
I made a small modification to my test program to keep the vector size fixed and pass the chunk size as a parameter. Unfortunately, I cannot see any real difference in performance when the chunk size changes, and honestly I wonder why. I wasn't expecting huge improvements, but at least something measurable.
Regarding dotprd_tbb, it was just an experiment. In that case the problem is different: results don't match between the serial and the TBB version, probably because of floating-point rounding, since the parallel version accumulates partial sums in a different order. I guess so.
DotPrd also evaluates range.end() inside the loop condition, which may prevent some compiler optimisations.
I read in the reference manual that grainsize matters only with the simple_partitioner, so that is in fact the partitioner I used.
I'm quite convinced that for this kind of easy operation, OpenMP is slightly lighter-weight than TBB.
However, in the past week I worked on a more complex algorithm: a function that computes the vertical and horizontal gradients of an image. In this case, after reshaping the code to make the memory accesses as linear as possible, I can clearly see an improvement over OpenMP.
I am now looking at DotPrd.
OpenMP, in general, ought to be more stable than TBB, simply because it has been around much longer. This is not to say TBB is unstable; it is just a generalization that stability increases with maturity.
If you have unstable OpenMP applications, that is indicative of thread-safety issues in your own code. Switching to TBB (Cilk++, etc.) will not fix an underlying problem, unless the port gives you an opportunity to fix a bug in your code or you replace a piece of your code with a solid multithreaded library (e.g. MKL, IPP).
Where TBB has an advantage over OpenMP is when your problem can be decomposed application-wide into tasks. In other words, the entire application is a collection of tasks (with synchronization here and there).
Where OpenMP (pre-V3, before task sets) works well is where components of the application benefit from an n-way fork and join (e.g. a parallel for, where the application is single-threaded before and after the parallel region). Nested parallel regions with nowait can improve the parallelization to some extent, and OpenMP V3 tasks can extend it further, but the two tasking systems differ widely. As to which is better... that depends on the extent to which you want to introduce parallelism into your code.
A secondary concern is:
Are you shipping a complete application?
Or, a utility library to be integrated by the user into their application?
If the former, then the choice of threading model is yours. If the latter, then the end user will want to choose the threading model (which means you may need to support multiple threading models in your library).
A third concern is:
Is the HDR working on a single still image?
Or, are you HDR-ing a video file?
If the former, then you will want to place your parallelization inside each (the only) frame.
If the latter, then you will want to keep each frame single-threaded and run multiple frames in separate threads (parallel_pipeline on a frame-by-frame basis).
Luminance HDR is an HDR tool for still images. It currently ships with a few external libraries, while some (badly maintained and really small) became part of the main trunk. In the future I would like to strip those libraries out again in order to share with the community all the good work done on them. This is the part where most of the parallelism will be introduced.
My application is mainly based on Qt, and I mostly use its threading facilities when I need multithreading (from QThread to QRunnable). In many cases I create a separate QThread and let it do the image processing, while the GUI stays in its own thread (the well-known pattern for keeping the GUI responsive). However, how QThread is implemented is out of my control, and this seems to be a problem, because the processing part usually calls functions with OpenMP pragmas inside. So OpenMP code runs INSIDE a QThread.
This link [ http://www.qtcentre.org/threads/20079-QThread-and-OpenMP-on-Mac-problem ] shows a quick example that reproduces the problem I'm fighting against. GCC 4.2 seems to have a well-known bug here, and I'm sure it is fixed by now (4.3 should already be free of it), but I can't replace GCC 4.2 on my Mac machine because I have other requirements.
For reasons that I don't know (if you can help me understand, I'll be glad), GCC 4.4 under MinGW shows exactly the same behaviour. And I can't use VS Express because it does not support OpenMP. Obviously, I don't want to buy a VS Professional license for an open source project: I get nothing out of it!
So here it is, my situation. Stuck. :)
With earlier versions of gcc, linking against Intel's libiomp5 improved results significantly. Apparently, the Intel TBB team deserves some of the credit for certain versions of libiomp5. Too bad no way was found to make TBB and OpenMP compatible.