The code should be able to show how the cache and other resources are better utilized, or how the program's running time is reduced.
The trouble with most people who evaluate is that they don't know how to go about evaluating. I understand that evaluations are problem-dependent, but even then, some sample programs could be created, couldn't they?
What exactly are you trying to look at for evaluation purposes? What are you trying to see?
Correct question, and that's my point. In order to evaluate, I need to know what I should be looking for. That's the first thing I don't know.
I've tried array addition in a parallel_for with varying numbers of threads, for small to huge (1 million elements) array sizes, but I couldn't come to a conclusion. The results were confusing: when I ran the program serially and when I ran it in parallel with one thread, I got varying results (serial was sometimes faster, sometimes slower), and two or more threads showed no improvement in performance. I've seen Alexy's blog post explaining why serial programs can sometimes be faster, and I've also seen a Dr. Dobb's article on calculating Julia points to evaluate performance.
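For reference, the kind of micro-benchmark described above can be sketched as follows. This is a minimal sketch using std::thread and std::chrono rather than TBB's parallel_for, so it stays self-contained; the function names are mine and illustrative, not from the original experiment.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Minimal sketch of the experiment above: element-wise addition
// c[i] = a[i] + b[i], run with a given number of worker threads.
// Uses std::thread instead of TBB's parallel_for so it is self-contained.
std::vector<double> add_parallel(const std::vector<double>& a,
                                 const std::vector<double>& b,
                                 unsigned nthreads) {
    std::vector<double> c(a.size());
    std::vector<std::thread> workers;
    const std::size_t chunk = (a.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t lo = t * chunk;
            const std::size_t hi = std::min(a.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i) c[i] = a[i] + b[i];
        });
    }
    for (auto& w : workers) w.join();
    return c;
}

// Wall-clock time of one run, in milliseconds.
double time_add_ms(const std::vector<double>& a, const std::vector<double>& b,
                   unsigned nthreads) {
    auto t0 = std::chrono::steady_clock::now();
    volatile double sink = add_parallel(a, b, nthreads)[0];  // keep result live
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

For a memory-bound kernel like this, one thread versus four often time out about the same on a million elements, which matches the confusing result described above: the memory bus, not the CPU, is the bottleneck.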
I have also seen MIT OpenCourseWare videos where they suggest trying out algorithms like merge sort to evaluate performance.
The main questions that still remain are: what should one be looking for when deciding on a parallel programming library? And what would be a good way to evaluate it, so as to have some measurable results at hand? I'm mainly comparing with OpenMP.
The questions below may seem amateurish, but I've had a mental block about this for a long time and would really appreciate some help with it.
If anybody could help with these...
If I needed to check if OpenMP and TBB were using the cache effectively, how would I do that?
What would be a good way to determine whether OpenMP or TBB does array/data-structure operations faster? (The main issue here is deciding on the right number of threads.)
In general, what parameters would you suggest to compare both of them?
The best evaluation would be something that resembles the application which you plan to parallelize.
Also, unless you have a lot of short parallel regions, the time spent in the parallel runtime library (its overhead) will likely be negligible compared to the real computations. What matters more for performance are high-level properties of the implementation, such as data locality, amount of synchronization, load balancing, etc.
I also suggest reading the article "Intel TBB, OpenMP, or native threads?" if you have not already.
Thank you. I was contemplating building a scaled-down version of the actual program, but peers thought it would be better to try out small programs. Your reply gives me a better general direction to proceed in.
I had seen the article earlier, and although it says that OpenMP performs better for applications with a lot of array operations, my little array-addition program could not confirm that: TBB was faster.
This led me to conclude that I'm evaluating it incorrectly.
I can't go by the claims on a website. I need to see numbers that prove to me that one program really is faster than another.
Even granting that data locality and the like govern performance, is there a way to know for sure (by measuring) that TBB's or OpenMP's optimizations for load balancing or data locality really work?
Right now the only measurement tool I have is the program's running time. Is there some other way to compare, or am I headed in the wrong direction in the first place?
For comparison, however, try to ensure that all auxiliary code (which is usually serial) is the same or equivalent. Or, measure the time spent in parallel regions, in addition to the total time. This way you will have more information to judge. Also, I'd recommend doing several runs for each implementation, to see whether the performance results are stable, or vary.
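As a sketch of the "measure the time spent in parallel regions, in addition to the total time" suggestion, here is a tiny hypothetical helper (the RegionTimer name is mine, not from any library) that accumulates wall-clock time per named region, so serial setup and the parallel kernel can be reported separately:

```cpp
#include <chrono>
#include <map>
#include <string>

// Scope-based timer: accumulates elapsed seconds per named region.
// Wrap the serial setup and the parallel region in separate instances
// to see how total time splits between them.
class RegionTimer {
public:
    explicit RegionTimer(const std::string& name)
        : name_(name), start_(std::chrono::steady_clock::now()) {}
    ~RegionTimer() {
        auto end = std::chrono::steady_clock::now();
        totals()[name_] += std::chrono::duration<double>(end - start_).count();
    }
    // Global map of accumulated times, keyed by region name.
    static std::map<std::string, double>& totals() {
        static std::map<std::string, double> t;
        return t;
    }
private:
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};
```

Usage: `{ RegionTimer t("setup"); /* allocate and fill arrays */ }` followed by `{ RegionTimer t("parallel_add"); /* the parallel loop */ }`, then print `RegionTimer::totals()` after several runs to see whether results are stable.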
Unfortunately, there is no single parallel solution that works for all applications. Both OpenMP and TBB provide some knobs that in some (but not all) cases can lead to more performance (OpenMP has more such knobs, TBB has some but generally relies more on good defaults). E.g. there are choices available for data partitioning - 'schedule' clause in OpenMP, and partitioner classes in TBB.
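To make the "knobs" concrete, here is a sketch of the OpenMP side: the schedule clause picks how loop iterations are handed to threads. The chunk size of 1000 is an arbitrary illustration; TBB's rough counterparts are its partitioner classes (auto_partitioner, affinity_partitioner) plus the grainsize argument of parallel_for. Built without OpenMP support, the pragma is ignored and the loop simply runs serially with the same result.

```cpp
#include <vector>

// Illustrative loop with an OpenMP scheduling knob. schedule(static)
// gives each thread one contiguous chunk up front; schedule(dynamic, 1000)
// lets threads grab 1000-iteration chunks on demand, which helps when
// iteration costs vary. The reduction clause combines per-thread sums.
double sum_of_squares(const std::vector<double>& x) {
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 1000) reduction(+ : total)
    for (long i = 0; i < static_cast<long>(x.size()); ++i)
        total += x[i] * x[i];
    return total;
}
```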
My belief is that if one spends enough time and effort, in most cases approximately the same level of performance can be obtained. The question then is which solution requires less time and effort; that is hard to answer a priori, and "your mileage may vary". That's why, unless you know a benchmark that is similar to what you want to do, you had better get your hands dirty with some prototyping. Then you'll see what the performance benefits are, whether performance scales as more cores are added, and whether the result is satisfactory; if not, you have code that can be profiled with tools such as Parallel Amplifier, PTU, gprof, or others depending on the platform. In many cases, rethinking the (initially serial) algorithm and/or data structures with parallelism in mind (i.e. a different algorithmic optimization) gives the most significant performance boost.
For the particular case of array addition, I guess the main issue is too little computation per memory operation. Such programs are called "memory bound" or sometimes "bus bound", as their performance is primarily limited by the throughput of the CPU-to-memory interconnect. For array summation, all loop iterations have basically the same amount of work (so load balancing is not a concern), iterations are independent (so no locking is necessary), and the arrays should be rather large (for the code to take enough time for meaningful performance measurement and comparison), which means they are unlikely to fit in cache, so cache locality does not play a significant role. Also, for such codes, vectorization has a huge impact on performance. I saw some cases where a TBB-based solution lost much performance to an OpenMP-based one because the compiler was unable to vectorize the code; after the roadblocks were removed, performance was on par.
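The vectorization point can be illustrated with a toy pair of loops (the function names are mine). The first has fully independent iterations, the kind of loop compilers auto-vectorize; the second has a loop-carried dependence, which blocks vectorization regardless of which parallel library wraps the surrounding code.

```cpp
#include <cstddef>

// Independent iterations: no element of c depends on another, so the
// compiler can process several elements per instruction with SIMD.
void add_vectorizable(const double* a, const double* b, double* c,
                      std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// Loop-carried dependence: each c[i] needs c[i-1], so iterations must
// run in order and the loop cannot be vectorized as written.
void prefix_sum_scalar(const double* a, double* c, std::size_t n) {
    if (n == 0) return;
    c[0] = a[0];
    for (std::size_t i = 1; i < n; ++i) c[i] = c[i - 1] + a[i];
}
```

Compilers can usually report which of these they vectorized (e.g. via their optimization-report flags), which is one concrete way to check whether a TBB or OpenMP version lost performance to a vectorization roadblock.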
Hope that helps, at least to some extent.
You haven't provided many details about your problem domain, so I'm somewhat guessing here at your intent for the evaluation. If I hear you correctly, you are looking at the raw speed of TBB vs. OpenMP (and perhaps other libraries) for manipulations of some unspecified data structure. In particular, I hear that you are looking at "only the numbers", which I assume means wall-clock time for the execution of your benchmark.
I sense that you are looking for some form of marketing whitepaper that shows, to anyone willing to believe the marketing hype, that product A is significantly better than product B, complete with a convincing graph. Am I right?
Let me first start off by saying that numbers lie. I read one paper in particular which compared a sort routine on two different software platforms. On platform A (the product that vendor had for sale), the graph demonstrated a huge speed boost, while platform B showed really pathetic performance, thus "proving" that platform A was superior. It turns out (in fine print) that the benchmark was rigged. Platform A was run on brand-new top-of-the-line hardware, with lots of processors, using an O(n) algorithm. Platform B was run on older hardware, with a couple of processors, using an O(n log n) algorithm. The benchmark also pushed the data sizes out to the point where the difference in complexity was really significant. You can't compare the performance of the two platforms at all from that particular paper. My point is that benchmarks can be rigged, and you have to watch out for that; it may not always be obvious. The paper I mentioned came from a reputable journal, not just some marketing website, so be careful out there.
Parallel computing is a wonderful thing: it promises to run your applications much faster than before. However, parallel computing is also going to push out the size of the data you are looking at, and this means that algorithmic complexity becomes very important. From my own personal experience: I came across a very nice sequential algorithm with O(n) complexity. No matter how hard I tried to parallelize my application, it all came down to the fact that I couldn't get rid of that O(n) algorithm. I could do it in O(n*n) very easily, and in parallel; with some tricks I could maybe even get it down to O(n log n) in parallel; but I spent months trying to get a parallel O(n) algorithm out of it. My point is that if I had looked at the amount of time the scheduler contributed to my runtime with an O(n log n) algorithm, it would have told me nothing. I should have spent my time finding a better algorithm, not looking at what the scheduler does (which only adjusts a hidden constant). If you are in a position where you absolutely know there is no better parallel algorithm in terms of complexity, then I very much envy your position. Remember that even if you have an O(n) algorithm that internally does 5 parallel loops over all the data, for a total work of 5N, getting that down to 4.5N or 4N will probably make a bigger difference in runtime than the time spent by the scheduler balancing load. You can come up with examples where the scheduler is just going to eat up all your time, but I'm assuming you are not going to do something silly like have a parallel algorithm that just adds 1 to an array of integers, with a grainsize of 1.
I know that it's easy to get caught up with all of the cool things you can do to get better performance out of your program. If you have very good cache locality, you can get a performance boost. If you have parallelization, you can get a boost. If you have an amazing scheduler, you can get a boost. Just remember that all of that doesn't matter when it comes down to the overall complexity of your problem. You can have the most parallel, memory optimized, amazing application that internally runs an O(n^3) algorithm... if someone comes along with an O(n) algorithm... well, it doesn't matter so much if it's sequential, misses the cache all the time, etc. (unless of course, if you have many many many more processors to throw at the problem... then you might catch up for performance). All of these cool things you can do for increased performance are very important... and when you combine them with good algorithmic complexity, it doesn't matter as much (within reason) how good the scheduler of system A is vs. system B.
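The "5N down to 4N" remark can be made concrete with a toy example of my own: computing a mean of squares with two full passes over the data versus one fused pass. Both are O(n), but for memory-bound code the number of full sweeps over the array is exactly the constant worth attacking before any scheduler tuning.

```cpp
#include <cstddef>
#include <vector>

// Two full sweeps over the data: one to square, one to sum.
// Total memory traffic is roughly 2N reads plus N writes.
double mean_of_squares_two_pass(const std::vector<double>& x) {
    std::vector<double> sq(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) sq[i] = x[i] * x[i];  // pass 1
    double s = 0.0;
    for (double v : sq) s += v;                                      // pass 2
    return s / x.size();
}

// One fused sweep: same result, roughly N reads and no temporary array.
double mean_of_squares_fused(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += v * v;
    return s / x.size();
}
```

Both versions parallelize equally well; the fused one simply moves less data, which is usually where the real win is for loops like this.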
I hope that this has in some way provided you with insight to your own problem.
I'm very surprised to hear that OpenMP has more such 'knobs'. I've been through a lot of OpenMP material but didn't know how these could be tweaked or how they related to performance. I guess this is because TBB is better documented and explained.
Your reply has been very helpful. Thank you very much. I will follow up on the pointers and post back with my findings, because I'm sure this thread will be of help to others too.
As for tools, the two I have at hand are Purify and Quantify, and both have trouble finding the .so file of TBB. Steph apparently didn't have any problem with it, but I couldn't contact Steph. If anyone has solved this problem, please share.
I don't have VTune, but will go with Valgrind.
Yes, raw time, but NOT looking for a paper which shows whether OpenMP or TBB is better. Spending many years on planet Earth acquaints one with marketing hype :) Which is why I'm evaluating for myself.
I'm aware of platform inconsistencies. I've already seen a huge difference when I've tested with GCC and Intel C++ compilers on an Intel Dual Core with different versions of Linux and 32bit and 64 bit OS'es. Also tried with a normal single processor, as a control experiment. The results were indeed puzzling.
Thank you very much for detailing your experiences with performance tuning.
Both Alexy and you have given me quite a lot of information to work with already. What I need now is a TBB to parallelize my brain to sort out all this info :)
Will be back (after n number of days) to share my experiences :)
Below I've listed many documented examples of OpenMP knobs that affect performance.
Search for these pages using the button on the upper left corner of the page:
OpenMP* Environment Variables
and in particular: KMP_AFFINITY, KMP_BLOCKTIME, KMP_LIBRARY, OMP_SCHEDULE, OMP_NUM_THREADS, OMP_DYNAMIC, OMP_NESTED, OMP_THREAD_LIMIT, OMP_MAX_ACTIVE_LEVELS
Thread Affinity Interface
This one addresses thread placement for multi-core and NUMA architectures (like Core i7 and Nehalem) and is very extensively documented.
OpenMP* Run-time Library Routines
Execution Environment Routines section in particular
Intel Extension Routines to OpenMP*
In particular see these sections: Execution Environment Routines, Memory Allocation, & Thread Sleep Time
OpenMP* Options Quick Reference
Note that all of these options have additional links to more complete descriptions.
OpenMP* Support Libraries
In particular the section marked Execution Modes, but the whole page may be useful if you want inter-compiler compatibility.
For more detailed information about the OpenMP standard environment variables, API routines, and directives and clauses, and many examples of usage, see the OpenMP specification v3.0 here:
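One small practical note on the environment-variable knobs listed above: they are ordinary process environment entries, so a benchmark program can record which settings it actually ran with. A minimal sketch (the helper name is mine; the OMP_* names are from the OpenMP spec and KMP_* are Intel extensions):

```cpp
#include <cstdlib>
#include <string>

// Read an OpenMP-related environment variable, falling back to a default.
// Example knobs: OMP_NUM_THREADS, OMP_SCHEDULE, KMP_AFFINITY.
std::string omp_env(const char* name, const char* fallback) {
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string(fallback);
}
```

Logging, say, `omp_env("OMP_SCHEDULE", "(unset)")` alongside each timing result makes benchmark runs reproducible, since two runs with different schedules are not comparable.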
Hope that helps.
I've done extensive searching (through Google) for OpenMP, but never came across the page you pointed me to. I assume that either Google gave it a low page rank because not many people have accessed it, or you didn't allow search engines to index it.
Nevertheless, the website is going to be extremely helpful. Thanks a million! :)
And yup, been thru the OpenMP specifications. Thanks :)
In my case, I don't look only at the speed of running algorithms. For me it is mandatory that the library be portable. I need the library to work in a predictable and reliable way on different platforms (at least Windows, Linux, and Mac), and I don't want the library to be compiler-dependent. And of course it must be usable, to make developers' lives as productive as possible.
As you can see, depending on what you need, one library can be wonderful, or a waste of time.
Thanks for the feedback!
I did a followup too and found that what you said is true.
Now I'm not complaining, I'm just seeing if I can help. One thing I noticed is that even when searching for "Openmp Examples", this page only shows up on the second page of the Google search results, and the content isn't relevant to what I searched for.
Second point: even when the page does show up, I'm stuck at that page. It has no navigation (I know there are some links, but I mean something like the home button at the top of this forum) through which I can go to another page on the same website or find the info I'm looking for. So the first thing I'd do as a user is go back to the Google results and look for the information on some other website.
Hope this helps.