Performance comparison between Intel TBB task_list, openMP task and parallel for

Muhammad_M_ · ‎01-23-2015

I am planning on parallelizing a hotspot in a project. And I would like to know your opinion between the performance evaluation between parallel for, omp single followed by task and intel TBB task_list, under ideal conditions where number of threads are equal to computation items and when computation are much greater than available threads to see scheduling overhead(in order to evaluate the most efficient scheduler). I will also, be writing some sample test programs to evaluate myself but I also wanted to know if anybody had previously made these evaluations.

Thanks in advance.

Jeongnim_K_Intel1 · ‎01-23-2015

It is difficult to make a very general statement about your question. However, these are some empirical observations I made with various cases.

All of them should perform similarly if each task performs similar workloads, i.e., fairly well-balanced tasks. There are some differences when the load balance per task is not perfect. Also, it can depend on if the LB is dynamic or static.

Anton_M_Intel · ‎01-24-2015

As as I said at SO, this is not right way for such a performance comparison and probably for your specific problem too. If you really need to create all the tasks from the same single thread for some untold reason, there are at least different ways to do so in TBB, i.e. don't use task_list. task_list is not intended for scalability, it provides very limited improvement over consequent spawn calls by reducing overheads for entering TBB library and space reservation in the task pool. However, for big number of the tasks, caching all the tasks in the task_list before spawning them will prevent other threads from stealing ready tasks while other are being allocated and thus it will enlarge the serial part and so reduce the overall performance according to the Amdahl law. It is not the case for your OpenMP-based scheme since each task becomes available for other threads instantly on creation.

So, you compare different approaches. And even if you exclude task_list from your scheme, and compare "apples to apples", this is still rather a corner case with inherently limited scalability which will not give you a general picture about the schedulers.

RafSchietekat · ‎01-25-2015

There is a Note in the documentation all right, warning that: "Spawning a long linear list of tasks can introduce a bottleneck, because tasks are stolen individually. Instead, consider using a recursive pattern or a parallel loop template to create many pieces of independent work."

But perhaps it would be more useful in the section about task_last, not (just) about the spawn(task_list&) overload, to counter the understandable assumption that something that gets its own object class has some special significance? It's really more for the fine details, like using spawn_and_wait_for_all() instead of spawn() followed by wait_for_all() etc.

It also seems somewhat like a remnant from a previous implementation of the task pool (an array of singly linked lists), with inconsistent meaning in the present implementation. Tasks are spawned by pushing them onto the tail of the deque, with an intermediate reversal in the case of a task_list, so now, relative to the task_list order, local use is front to back and stealing back to front (right?), where previously both uses were front to back (stealing was governed by depth of a list, not so much location in a list at a particular depth).

I'm not sure how useful a task_list still is: there's some overhead compared to when they could just be spliced in, and I can think of better ways to spawn large numbers of tasks...

(2015-01-26 Corrections, additions both in italics) I had forgotten that tasks are spawned by pushing them to the tail of the queue, not the front, and about the order reversal. Suggesting deprecation was a bit rash.

jimdempseyatthecove · ‎01-29-2015

Muhammad,

It sounds like it would be easy enough for you to run your tests as you state you intend to do. In considering between OpenMP and TBB I suggest you consider what you intend to do after you parallelize this hot spot. If the answer is none, or not much, then I would consider using OpenMP.

There was no indication as to the number of hardware threads and tasks you intend to run. Both threading paradigms have tuning capability that may differ or may be similar. The choice of OpenMP, and programming considerations, may provide you with being able to take advantage of affinitizing your threads.

Also, be mindful that your first attempt at comparing paradigms might not be optimal. Posting a sketch of your code here might yield some helpful suggestions.

Jim Dempsey

RafSchietekat · ‎02-04-2015

I'd like to come back to this, because, other than presumably maintaining status quo for the local thread, I don't understand the decision to first execute the task that was least recently added to the task_list (unless I'm mistaken about that, of course), which seems to run counter to the rationale about executing depth-first.

Either the tasks in a task_list are all of equivalent size (and should maybe be automatically converted into a task tree), or you would successively spawn half the work until a base case remains: this is natural if each spawn call is for one task, but if you want to use a task_list you would first have to collect all the tasks in an intermediate vector, then pour its contents back to front into a task_list, and then spawn the task_list. What's the use of that?

I even replaced fast_reverse_vector with vector to see what would happen, and the test suite still passed. So is it essential or not? Perhaps if there's only one hardware thread, as apparently mentioned in Fibonacci.cpp, but then why doesn't the test suite verify this? And if it is not essential, shouldn't the current behaviour be inverted for better performance by getting rid of fast_reverse_vector as I did?

Or did I overlook something?

Anton_M_Intel · ‎02-05-2015

Raf Schietekat wrote:

I don't understand the decision to first execute the task that was least recently added to the task_list (unless I'm mistaken about that, of course), which seems to run counter to the rationale about executing depth-first... shouldn't the current behaviour be inverted for better performance by getting rid of fast_reverse_vector as I did?

task_list is mostly a legacy interface which came from the first design of the task scheduler where it was quite beneficial. With current scheduler design, it is not so relevant but parallel_do and parallel_while are still depend on it. And it is important to preserve the execution order on one thread for these algorithms thus the fast_reverse_vector.

RafSchietekat · ‎02-05-2015

I see... perhaps it would be useful to have a test for that, though. I see several options now:

Deprecate task_list and only keep it for internal use and backward compatibility.

Document the order of execution, provide an extra task_list::push_front() operation, and recommend that instead of push_back() in situations where spawn order should not be inverted.

Document the order of execution, provide an additional operation or overload that avoids fast_reverse_vector, and recommend that in situations where spawn order should not be inverted.

Then there's the issue that all these tasks may have to be stolen one by one, which isn't good for performance. Obviously recursive parallelism is better, but maybe there should be an equivalent for parallel_for() at the task level, taking either a task_list or a new type, and automatically constructing a task tree with the provided tasks as leaves?

(Added) Stealing one by one is OK for exponentially sized tasks, but otherwise, if this weren't an issue, there wouldn't be a need for parallel_invoke() to try and construct trees from its relatively few arguments, either.

(2015-02-07 Added) TBB could also just evaluate the current number of workers before deciding whether to do an intermediate reversal (only when there are no workers), but that would probably still be more confusing when workers are present than seeing generally the same order from parallel_while/do() with occasional stealing.

RafSchietekat · ‎02-06-2015

Just for fun, and for what it's worth, I threw out the vector I earlier substituted for fast_reverse_vector (which would fail for any test that verified the single-thread execution order of parallel_while/do, unless TBB would differentiate there):

/** Conceptually, this method should be a member of class scheduler.
    But doing so would force us to publish class scheduler in the headers. */
void generic_scheduler::local_spawn( task& first, task*& next ) {
    __TBB_ASSERT( governor::is_set(this), NULL );
    if ( &first.prefix().next == &next ) { // TODO: still useful optimization even without dynamic allocation involved?
        // Single task is being spawned
        size_t T = prepare_task_pool( 1 );
        my_arena_slot->task_pool_ptr = prepare_for_spawning( &first );
        commit_spawned_tasks( T + 1 );
    }
    else {
        // Task list is being spawned
        task* t = &first;
        for (bool end = false; !end; ) {
            // Task list is being spawned
            task* tasks[min_task_pool_size];
            task** tasks_end = tasks; // as in begin/end
            task** const tasks_max = tasks + min_task_pool_size;
            do {
                // If t is affinitized to another thread, it may already be executed
                // and destroyed by the time prepare_for_spawning returns.
                // So milk it while it is alive.
                end = &t->prefix().next == &next;
                task* const t_next = t->prefix().next;
                *tasks_end++ = prepare_for_spawning(t);
                t = t_next;
            } while (!end && tasks_end != tasks_max);
            const size_t num_tasks = tasks_end - tasks;
            const size_t T = prepare_task_pool( num_tasks );
            memcpy( my_arena_slot->task_pool_ptr + T, tasks, num_tasks * sizeof( task* ) );
            commit_spawned_tasks( T + num_tasks );
        };
    }
    if ( !in_arena() )
        enter_arena();
    my_arena->advertise_new_work</*Spawned=*/true>();
    assert_task_pool_valid();
}

Anton_M_Intel · ‎02-06-2015

Raf, it looks good. It is not clear how it affect performance, but if it removes excessive entity from the code, it makes sense. Could you contribute it please?

RafSchietekat · ‎02-06-2015

It only applies to code that does not need the legacy order of execution, and would therefore have to be accessed through an extension of the API, since legacy parallel_while/do() relies on spawn(task_list&), which must therefore not be changed. A real solution depends on your evaluation of what I wrote in #8, and only then could I work out a proper proposal, not just a partial proof of concept. To put it more clearly: it does not remove the need for fast_reverse_vector, and I have not evaluated the impact on performance (if it is not beneficial, task_list::push_front() makes more sense, and that would probably have been a good addition in the original implementation as well).

RafSchietekat · ‎02-06-2015

Here's the task_list::push_front() that maybe should have been there all along:

    //! Push task onto back of list (FIFO local execution, like individual spawning in the opposite order).
    void push_back( task& task ) {
        task.prefix().next = NULL;
        *next_ptr = &task;
        next_ptr = &task.prefix().next;
    }

    //! Push task onto front of list (LIFO local execution, like individual spawning in the same order).
    void push_front( task& task ) {
        if (empty()) {
            push_back(task);
        } else {
            task.prefix().next = first;
            first = &task;
        }
    }

(Added) I'm struggling a bit with the descriptions...

(2015-02--08 Added) I think that the solution is probably a combination of adding task_list::push_front() and keeping task_list sizes comfortably within the min_task_pool_size stack-allocated initial capacity of fast_reverse_vector. If the list becomes longer, the associated scalability problems are probably more severe than any inefficiency in dynamic allocations for fast_reverse_vector. So then the question becomes whether and, if so, how to allow large numbers of tasks being spawned at once. The easy solution is to just discourage the user (there should then be a warning in the documentation!), and this is not even unreasonable given that this is a low-level interface and that recursive parallelism is recommended. Otherwise, to support situations where recursive task creation might be more difficult than creating them all at once, TBB could provide functionality to implicitly build a task tree, and then the question is whether to do that on any task_list (magically fixes scalability in some existing code), or only on request, or on a separate type with random access. But any of that is orthogonal to the decision whether to add task_list::push_front().

Anton_M_Intel · ‎02-09-2015

Raf, as we discussed internally, changing behavior of task_list-based spawn methods and documenting the new ordering (which wasn't clearly stated anyway) is fine. That is the fast_reverse_vector should be removed and the new task_list::push_front() added. Then, we can change parallel_do and parallel_while implementations to use push_front instead. This is a minimum.

Ideally, I'd throw intermediate array at all and move task pointers directly to the task_pool but it might require additional counter inside task_list which will indicate the number of tasks in the list in order to prepare_task_pool() for this number initially. But it is trickier because of the backward compatibility which will force us to support counter-less variant anyway. Thus I'd not like to spent time on this.

RafSchietekat · ‎02-09-2015

Hmm, I only mentioned that option because I didn't know any better when I was still just investigating, to see if there was anything in TBB that currently relied on that order, and apparently there is no test for local execution order of parallel_while/do() that would have been my first clue.

You could do that if you are willing to sacrifice in-order local execution for parallel_while/do() compiled against any previous release, as well as possibly causing other "surprises" in code that is not recompiled for the occasion, so it's an issue for backward compatibility, but I'm also not sure that it's the best way forward, i.e., even if code is recompiled but not adapted: it could help some code that naively assumed that pushing tasks onto a task_list or spawning them individually in the same order made no difference, but would also sabotage any code whose author looked a bit closer and did the right thing for depth-first local execution (performing an intermediate reversal for lack of a task_list::push_front() to simplify his code), so it would be a bit of a betrayal. fast_reverse_vector would indeed be removed, but I don't think it has any noticeable overhead (unless compared to a hypothetical task_list with an internal element count), except for code that shouldn't be using task_list in the first place (spawning lots of tasks at once is not good for scalability, even through use of a task_list).

It's your call, of course, but I would just add task_list::push_front() and then possibly concentrate on scalable spawning for many tasks at once, instead, for situations where true recursive parallelism is annoyingly complicated, if you can confirm that there is such a need.

RafSchietekat · ‎02-10-2015

I just did a quick test with:

class TaskSpawner : tbb::internal::no_copy {
private:
    typedef std::deque<task*> Container;
    typedef Container::iterator Iterator;
    typedef tbb::blocked_range<Iterator> Range;
public:
    void push_back(tbb::task* t) { tt.push_back(t); }
    void push_front(tbb::task* t) { tt.push_front(t); }
    void spawn() { AuxTask::spawn(Range(tt.begin(), tt.end(), 8)); }
private:
    Container tt;
    struct AuxTask : public tbb::task {
        Range r;
        AuxTask(const Range r_) : r(r_) {}
        static void spawn(/*var*/ Range r) {
            tbb::task_list l;
            while (r.is_divisible()) {
                const Range r2(r, tbb::split()); // r2 gets the latter part
                l.push_front(*new(tbb::task::allocate_root()) AuxTask(r2));
            }
            tbb::task::spawn(l); // l is implicitly cleared afterwards
            for (Iterator it = r.begin(); it != r.end(); ++it) {
                l.push_back(**it);
            }
            tbb::task::spawn(l);
        }
        task* execute() /*override*/ {
            if (!r.empty()) {
                spawn(Range(r.begin() + 1, r.end(), r.grainsize()));
                return *r.begin();
            } else {
                return NULL;
            }
        }
    };
};

and calling that... from inside parallel_do()! The tests completed as normal, although there's no test for order (did I get it right?). Note the use of task_list::push_front() as well as push_back(), just to show them off... max_arg_size in parallel_do.h was increased to 40 for the occasion, but I have no idea what that might mean for performance, although it seems likely that performance might improve if more elements are scalably spawned at the same time in situations where feeding is limited.

The idea was not (so much) to change the implementation of parallel_do(), but rather to use it as a test bed to verify a solution for spawning many tasks that does not rely on low-level changes to the scheduler. Any programmer can do this without modifying TBB itself, although it might be nice to have a prepackaged solution. Any overhead related to std::deque and tbb::parallel_for()-related tasks should be offset by better scalability, I think.

Did I overlook anything?

(2015-02-11 Added) Note that the deque has to stay alive until all AuxTask instances have finished executing, which is of course guaranteed when all pushed tasks have finished executing, but that might not always be known. I have an idea to make things more foolproof, but it would be more fun to try that out after some intermediate feedback...

Anton_M_Intel · ‎02-11-2015

Raf, we don't think the TaskSpawner makes sense for the following reasons: it does not seem useful internally for TBB algorithms, and on the user side, parallel_for applied on a array of the tasks (or simpler things) will work fine as well. Moreover and more importantly, a big part of the scalability problem resides in task allocation since after execution, tasks are returned in the free list of the original thread. Thus, it always makes sense to implement recursion for the whole process, not only the spawn part.

RafSchietekat · ‎02-11-2015

Absolutely, recursive parallelism where tasks are created "lazily" is the best solution, no doubt about that, and if you can use a ready-made algorithm like parallel_for then that's the way to go. Note that I initially did try parallel_for with a task-spawning Body, but that solution would also already start executing some of those tasks, which could interfere with any expectation of LIFO local execution.

It's just that I see parallel_invoke building a tree from as few as 3 arguments (although it also does pass along several arguments at once for recursive task creation), while task_list is unbounded but makes no attempt to facilitate work distribution, and it didn't even provide for spawning in the correct order for local depth-first execution, which partitioners might use if offer_work() were to return a task instead of spawning it individually (no, I didn't try that myself). Of course task_list itself should never be modified in that way, because that would sabotage existing code that somehow filled it with exponential amounts of work, but it would still be nice to have another efficient way to distribute lots of tasks with similar amounts of work, I thought.

Are you saying that a simple parallel_do(), without feeder use, or other potential uses, would derive negligible benefit from this, even with something like 63 workers to keep busy, otherwise all attacking each other as well as the master like piranhas in a feeding frenzy? It's not as if those tasks would not all be allocated from the master thread, either, unless parallel_do() were adapted to dole out multiple elements per spawned task like in the implementation of parallel_invoke(). But would the latter be the only possible improvement because returning nonlocal allocations would allegedly undo any benefit from more efficient work distribution?

Anyway, as this forum thread suggests, there are going to be those who want to put lots of tasks in a task_list, and the documentation does not dissuade them from such an attempt, so at least they can now potentially find and use the code from my little experiment above (as long as task_list::push_front() gets added), but I'll leave it as it is instead of letting that deque be managed automatically (using dependencies).

RafSchietekat · ‎02-15-2015

Is anybody interested to prove me wrong about increased scalability of no-feeder parallel_do? That seems a better model than if I were to also provide the benchmark and the testing, which would have confirmation bias etc. Of course the real reason is that I'd rather generate ideas... :-)

The hypothesis is that scalable spawning/stealing is valuable even if one thread allocates all the tasks.. There's a back-up plan that only has a logarithmic number of locally allocated tasks, which should be better still and could leverage parallel_for(), so I'm surprised that this is not the current implementation.

Another intended upside of increasing max_arg_size is to amortise the impact of the effective barrier between each block of iterations. Although I'm suspicious that this is not already done so in the current implementation: am I overlooking something obvious?

Of course with lots of feeder activity, all of this is less relevant, but that is not necessarily how parallel_do() is being used most of the time.

Some non-goals at this time: sliding window (at block granularity), asynchronous input.