Parallel tree processing

Dmitry_Vyukov · ‎07-07-2010

Assume I have an unbalanced N-ary tree, and I need to do some processing for every tree node. Also assume that processing per node is very small, let it be 100 cycles.
What is the best way to implement it with TBB?
TIA

Alexey-Kukanov · ‎07-07-2010

If the tree implementation is array based, and all nodes can be processed independently (including those that have "parent-child" relation in the tree), then I would do it with parallel_for.

If not - well, then the problem asks for some other way to efficiently group processing of several nodes, or at least quickly assess whether it makes sense to split processing. For example, if every subtree root can have a count of how many nodes are in the subtree, it might help to quickly make the decision which subtrees are worth a task.

Dmitry_Vyukov · ‎07-07-2010

Quoting Alexey Kukanov (Intel)
If not - well, then the problem asks for some other way to efficiently group processing of several nodes, or at least quickly assess whether it makes sense to split processing. For example, if every subtree root can have a count of how many nodes are in the subtree, it might help to quickly make the decision which subtrees are worth a task.

The question is exactly about that "other ways to efficiently group processing". The problem is somehow typical, so I thought it should have some canonical solution.

Alexey-Kukanov · ‎07-07-2010

I think that efficient grouping of computations is impossible without application-specific knowledge. I do not even say "grouping of tasks", because at that level of granularity you strive to avoid overhead associated with tasks as much as possible. Due to that, I am afraid it can not be done very efficiently in TBB; but I am open for suggestions.

Let's speculate what can be done with this regard. Make processing of every node as a separate task, but spawn and steal and executethose only in bunches of a given size? That would still have per-task allocation cost. If all job functions are of the same type (to avoid indirections, e.g. virtual calls), it might be possible to make an aggregator classof (presumably) fixed size, which would collect several jobs and spawn those as a single task. It would still incur some overhead however; do you think it would be affordable? Any other/better ideas?

Alexey-Kukanov · ‎07-07-2010

I just got a thought that such an aggregator as I described would not know whether a node tree processing creates a lot more job for subtree nodes, or it has none. So it would not be very efficient for load balancing I guess. And I have no idea how to balance the load efficiently without havinginformation of subtree sizes.

Dmitry_Vyukov · ‎07-07-2010

Quoting Alexey Kukanov (Intel)

And I have no idea how to balance the load efficiently without havinginformation of subtree sizes.

The problem is that sizes of sub-trees may be unknown, or even sub-trees may not exist yet (consider that processing *is* building of sub-trees, for example, a game state space exploration).

As for suggestions, I have no suggestions. Currently I am using a work-requesting scheduler for that, also I have an asymmetric work-stealing scheduler, both have per-task overheads on the order of handful of cycles, so I am able to express a parallel algorithm in it's natural form (i.e. have task at hand, spawn it). I am curious how that can be handled with TBB w/o incurring 2-20x slowdown.

jimdempseyatthecove · ‎07-13-2010

Presumably it is possible for you to know (remember) the size of the tree.

Have one thread traverse the tree without doing work other than placing a pointer/reference to each node into a vector/array. As you are filling up this vector/array, the other threads can process the nodes referenced/pointed to/by the being filled vector/array. e.g. one thread traverses the tree and fills the vector/array and 3 threads monitor the progress and take every 3rd pointer/referece and process those nodes. If your node traversal an insertion time into the vector/array is under 33 cycles then this will work fine. You can signifiy end of tree by inserting a NULL pointer/reference into the vector/array. The other threads monitor the end() or fill pointer (make volatile), and issue _mm_wait() when they overrun the fill pointer.

You could keep this vector/array around for future use. There is no reason why your "data base" cannot have a tree structure .AND. a "flat unsorted index" structure at the same time.

Jim Dempsey

RafSchietekat · ‎07-14-2010

If the tree is not pathologically asymmetrical,have you considereda parallel_for with its default auto_partitioner and a bespoke range object that attempts a breadth-first traversal? The range records a current father and the indices to the start and end of the father's child nodes under consideration, but there's no apparent need for a globally consistent linear index, so I don't see why this wouldn't be a good fit. If you have already tried this, what were your findings?

Potentially the constant factor in auto_partitioner might need to be tuned from the skewedness of the tree (just guessing), and then an out-of-the-box auto_partitioner wouldn't do anymore, but that doesn't seem like an immediate concern. It is also not tunable for linear ranges with highly non-uniform work distributions (the headlight regions in a recently mentionedraytracing problem), so it's not exactly a tree-specific issue either.

Dmitry_Vyukov · ‎07-19-2010

Quoting Raf Schietekat

If the tree is not pathologically asymmetrical,have you considereda parallel_for with its default auto_partitioner and a bespoke range object that attempts a breadth-first traversal? The range records a current father and the indices to the start and end of the father's child nodes under consideration, but there's no apparent need for a globally consistent linear index, so I don't see why this wouldn't be a good fit. If you have already tried this, what were your findings?

It's not pathologically asymmetrical (list like), but asymmetrical enough for this method to produce some basically empty parallel tasks and some huge parallel tasks (potentially single task will contain all the useful work).

Dmitry_Vyukov · ‎07-19-2010

Quoting jimdempseyatthecove

Presumably it is possible for you to know (remember) the size of the tree.

Have one thread traverse the tree without doing work other than placing a pointer/reference to each node into a vector/array. As you are filling up this vector/array, the other threads can process the nodes referenced/pointed to/by the being filled vector/array. e.g. one thread traverses the tree and fills the vector/array and 3 threads monitor the progress and take every 3rd pointer/referece and process those nodes. If your node traversal an insertion time into the vector/array is under 33 cycles then this will work fine. You can signifiy end of tree by inserting a NULL pointer/reference into the vector/array. The other threads monitor the end() or fill pointer (make volatile), and issue _mm_wait() when they overrun the fill pointer.

You could keep this vector/array around for future use. There is no reason why your "data base" cannot have a tree structure .AND. a "flat unsorted index" structure at the same time.

That won't work.

A tree may be not yet built (processing is-a building). So once you create the array, all useful work is already done.

Or if a tree is already built, it's more economically expedient to complete with a node once it's reached than to incur additional memory load in another thread.

And if I would have an array in advice, I would name the topic "parallel array processing", answer to myself, and close it :)

Dmitry_Vyukov · ‎07-19-2010

Note that such parallel tree processing works perfectly is task spawning overheads is some few cycles (work requesting, asymmetric work stealing, and potentially work balancing).

jimdempseyatthecove · ‎07-19-2010

>>A tree may be not yet built (processing is-a building). So once you create the array, all useful work is already done.

The original post stated there was some work to be done per node (as opposed to traversing the tree doing no work).

The suggestion I made was to incorporate a "pathfinder" thread that quickly traverses the tree (each time you use the task processing the tree), building a shared list of pointers/references to the nodes (or sublist of nodes of interest). Cocurrent with the building of this list, the remaining taskteam members process the nodes, pulling the pointers/references out of the list while it is being built (as opposed to after it is built).

[----time to scan tree and build vector of nodes---]
[node0][node3]...
[node1][node4]...
[node2][node5]...

Where node0 is the time to compute the 0'th entry in the vector/array of node references/pointers (as opposed to the 0'th node in the tree).

The above shows one thread building the list, and 3 threads picking and processingfrom the list (while the list is being built). The thread additional threads each process 1/3rd of the work.

In the even that the work portion is large, then when the thread building the list completes (and/or gets ahead of the worker threads), it can join in on the processing of the nodes.

The threads can coordinate their work by monitoring progress of each other through shared memory variables. Each of the worker nodes can see the index pointer in the array being built. And when warranted, the thread building the list can monitor the progress of the worker threads. When the list builder gets far in advance of the worker threads then it can process a node instead of adding it to the list to be processed. What this mean then is the list of nodes to be processed can be a smaller ring buffer. (or n ring buffers).

Jim Dempsey

Dmitry_Vyukov · ‎07-19-2010

Quoting jimdempseyatthecove

>>A tree may be not yet built (processing is-a building). So once you create the array, all useful work is already done.

The original post stated there was some work to be done per node (as opposed to traversing the tree doing no work).

The suggestion I made was to incorporate a "pathfinder" thread that quickly traverses the tree (each time you use the task processing the tree), building a shared list of pointers/references to the nodes (or sublist of nodes of interest). Cocurrent with the building of this list, the remaining taskteam members process the nodes, pulling the pointers/references out of the list while it is being built (as opposed to after it is built).

[----time to scan tree and build vector of nodes---]
[node0][node3]...
[node1][node4]...
[node2][node5]...

Where node0 is the time to compute the 0'th entry in the vector/array of node references/pointers (as opposed to the 0'th node in the tree).

The above shows one thread building the list, and 3 threads picking and processingfrom the list (while the list is being built). The thread additional threads each process 1/3rd of the work.

In the even that the work portion is large, then when the thread building the list completes (and/or gets ahead of the worker threads), it can join in on the processing of the nodes.

The threads can coordinate their work by monitoring progress of each other through shared memory variables. Each of the worker nodes can see the index pointer in the array being built. And when warranted, the thread building the list can monitor the progress of the worker threads. When the list builder gets far in advance of the worker threads then it can process a node instead of adding it to the list to be processed. What this mean then is the list of nodes to be processed can be a smaller ring buffer. (or n ring buffers).

> The original post stated there was some work to be done per node (as opposed to traversing the tree doing no work).

For sure, the work is: calculate number of child nodes and their parameters, allocate memory for them, link them to current node, repeat for each child node. Consider that all that (w/o recursive part of course) takes roughly 100 cycles.

I do not see how to apply your approach here. If pointers to child nodes collected, then all useful work is already done.

RafSchietekat · ‎07-19-2010

"It's not pathologically asymmetrical (list like), but asymmetrical enough for this method to produce some basically empty parallel tasks and some huge parallel tasks (potentially single task will contain all the useful work)."
Maybe not "degenerate", but it seems quite pathological to me if a modification of auto_partitioner (trading some efficiency for more resilience to tree skewedness by increasing an internal factor) would still not find enough parallelism. (Or does "pathological" also have a reserved meaning?)

Hmm, I tried to find something about "work requesting" and found a link from 2008 where one "Dmitriy V'jukov" declared "The more I am thinking about work-requesting, the more I like work-stealing."... Did you change your mind (like about the appropriate transliteration of your name :-) ), or does it just depend on the specific problem?

Dmitry_Vyukov · ‎07-19-2010

Quoting Raf Schietekat

"It's not pathologically asymmetrical (list like), but asymmetrical enough for this method to produce some basically empty parallel tasks and some huge parallel tasks (potentially single task will contain all the useful work)."
Maybe not "degenerate", but it seems quite pathological to me if a modification of auto_partitioner (trading some efficiency for more resilience to tree skewedness by increasing an internal factor) would still not find enough parallelism. (Or does "pathological" also have a reserved meaning?)

A tree may contain a kind "bottle neck" (or several of them), if auto partitioner cuts the tree above it, then most work is contained within a single parallel task.

Dmitry_Vyukov · ‎07-19-2010

Quoting Raf Schietekat

Hmm, I tried to find something about "work requesting" and found a link from 2008 where one "Dmitriy V'jukov" declared "The more I am thinking about work-requesting, the more I like work-stealing."... Did you change your mind (like about the appropriate transliteration of your name :-) ), or does it just depend on the specific problem?

No, I didn't. There always was 2 spellings of my name "Dmitry Vyukov" and "Dmitriy V'jukov"... well and my ISN name is actually intended to be "Dmitriy V'jukov", but I had to remove the apostrophe to prevent ISN platform from freaking the hell out.

Regarding scheduling, yes, it depends on a problem type. I was talking about a scheduler for a general-purpose library like TBB (it definitely must include a stealing scheduler to prevent users from shooting their limbs off).

Here is an except from my write-up on the TC2010 Skip List problem, which superficially covers schedulers types (btw, Bradley Kuszmaul reported 25% slowdown from parallelization with Cilk++ which has ~5 times lower overheads than TBB):

Scheduling Strategy
There are 4 main strategies for a fine-grained distributed dynamic task scheduling:
Work-stealing. That's a reactive asynchronous strategy. The essence: when a thread is out work, it randomly chooses a victim thread and asynchronously tries to steal some work from it.
Work-requesting. That's a reactive synchronous strategy. The essence: when a thread is out of work, it randomly chooses a victim thread and sends a synchronous request to it; the victim receives the request, and sends some work back (if any).
Work-distribution. That's a proactive synchronous strategy. The essence: during submission of a new work, it's divided and proactively distributed to some threads (idle or lightly loaded).
Work-balancing. That's a proactive asynchronous strategy. The essence: dedicated thread (or potentially one of the worker threads) periodically collects information about load of all worker thread, then calculates optimal distribution of work, and then re-distributes work among them.

It's worth noting that a scheduler may employ several (or even all of the) above strategies. Reactive strategies (stealing and requesting) deal with inevitable dynamic load imbalance; but usually have very limited local information about a system's state, so make sub-optimal decisions. Proactive strategies (distribution and balancing), on the other hand, have information about a system's state, so make one-shot optimal scheduling decisions; but unable to cope with inevitable dynamic load imbalance.
A scheduler must employ at least one of the reactive strategies in order to cope with continuous and inevitable dynamic load imbalance, and optionally include one or both proactive strategies in order to cut down stealing/requesting costs. So, the general recipe for a scheduler is:

SCHEDULER = (STEALING ^ REQUESTING) [+DISTRIBUTION] [+BALANCING]

Work-Stealing vs. Work-Requesting
So we have to choose between work-stealing and work-requesting. Work-stealing has a serious advantage over work-requesting due to it's asynchronous nature: a thief thread is able to get some work, even if a victim thread is busy processing a user task or even de-scheduled by an OS. With work-requesting a thief thread is able to get some work only if a victim thread condescends to send it (which it is just unable to do if it is de-scheduled by an OS). However, in our case it's not a problem, because tasks are small and limited in size, plus we are not going to over-subscribe processors with threads.
There are also 2 problems with work-stealing due to it's asynchronous nature. First, it inherently incurs some observable per-task overhead, because every pop operation from a thread's work deque must be synchronized with a potential asynchronous steal operation from another thread. Stealing is rare, but one still has to pay that price on every pop operation. The price is at least a single store-load style memory fence (MFENCE instruction for x86 architecture), or a single atomic RMW operation (LOCK prefixed instruction on x86). Here is an illustration of the problem:

Work-requesting is free of the problem. Task deque is completely local to a thread and requires no synchronization.
The second problem has similar nature and relates to a join phase of parallel algorithms (game state backtracking in our case). Traditional handling of task completion involves decrement of a pending child counter of a parent task. Due to asynchronous nature of work-stealing, the decrement has to be synchronized with other potential concurrent decrement operations. Here is an illustration of the problem:

Work-requesting is free of the problem. During execution of work-requesting protocol a victim thread can mark a parent task with a 'has_stolen_children' flag, and synchronization will be used only for such tasks. While great bulk of tasks will proceed without any synchronization.
Due to afore-mentioned problems I've chosen a work-requesting protocol for task scheduling. Basically, it allows us to cut 2 atomic RMW operations per task.

Work-Distribution and Work-Balancing
Now, we must decide as to whether to use work-distribution and/or balancing in addition to work-requesting. Work-balancing is too cumbersome to implement and my educated guess is that it's not going to provide any significant speedup (it's more suitable for work DAGs with no busy leaves property).
Work distribution is a different story. It's easier to implement, and may provide some speedup. I decided to implement work distribution only on initial stage. Namely, all initial game states are evenly distributed across all worker threads in a round-robin fashion. So each worker thread has quite a lot of work to begin with, and work-requesting comes into play towards an end of computation.

Scheduler Algorithm
Here is a bit simplified code of my scheduler (some functions are omitted for brevity). Full source code is presented in the Appendix B.

[cpp]class worker_thread
{
    // global index of current thread
    size_t                      my_index;
    // local tasks
    deque               tasks;
    // 1-element queue for work requests from other threads
    atomic              steal_req;
    // 1-element queue for work request acknowledges from other threads
    atomic              steal_ack;

    // main thread loop
    void loop()
    {
        for (;;)
        {
            // while we have local work process it
            while (tasks.size() != 0)
            {
                // pop task in LIFO order
                task_t task = tasks.back();
                tasks.pop_back();

                // user processing
                execute(task);

                // check for steal requests from other threads
                if (steal_req.load(memory_order_relaxed) != no_requests)
                    process_work_request();
            }

            // try to get some work from other threads
            if (false == send_steal_request())
                break;
        }
    }

    void process_work_request()
    {
        // get thief descriptor
        size_t thief_idx = steal_req.load(memory_order_relaxed);
        worker_thread& thief = get_thread(thief_idx);
        if (tasks.size())
        {
            // pop task in FIFO order
            task_t task = tasks.back();
            tasks.pop_back();

            // synchronous user processing
            steal(task);

            // give it to the thift
            thief.tasks.push_back(task);
        }
        // notify the thief that the operation is completed
        thief.steal_ack.store(1, memory_order_release);
    }

    bool send_steal_request()
    {
        for (;;)
        {
            // choose a victim
            size_t victim_idx = choose_randomly();
            worker_thread& victim = get_thread(victim_idx);

            // send a request to it (if it's not busy processing another request)
            steal_ack.store(0, memory_order_relaxed);
            size_t cmp = victim.steal_req.load(memory_order_relaxed);
            for (;;)
            {
                if (cmp != (size_t)-1)
                    break;
                if (victim.steal_req.compare_exchange_strong(cmp, my_index, memory_order_acq_rel))
                    break;
            }
            // request is sent?
            if (cmp == no_requests)
            {
                // wait for ack
                while (steal_ack.load(memory_order_acquire) == 0)
                    Sleep(0);
                // check as to whether we got some work or not
                if (tasks.size())
                    return true;
            }
            // termination condition
            if (no_work_in_the_system())
                return false;
        }
    }
};[/cpp]

RafSchietekat · ‎07-19-2010

Thanks for the info. I'm still in love with the idea of actually stealing work, as opposed to having a worker thread carefully prepare tasks for a "helper" to pick up (calling it a "thief" seems a bit silly in this context) and then probably mostly execute its own tasks because no helper ever arrived. It would be both asynchronous (like stealing) and only dividing work as needed (like requesting). It would be based on a lockable range or perhaps atomics (for admittedly limited applicability), where a worker regularly nibbles an optimisable number of iterations, and a thief would just divide it down the middle. Unfortunately there is this huge mystery (for me) that synchronisation has to occur through common locks/atomics (fences aren't enough), but cost is still assigned to those fences even if the lock is never accessed by another thread in the meantime, which seems somewhat contradictory, so I have no idea how it would compare to auto_partitioner even if simple_partitioner should be beatable. Maybe I should just accept that the synchronisation operations are prohibitively expensive even if they have no counterpart in another thread. Otherwise I may have to actually try it... :-)

Dmitry_Vyukov · ‎07-20-2010

Quoting Raf Schietekat

Thanks for the info. I'm still in love with the idea of actually stealing work, as opposed to having a worker thread carefully prepare tasks for a "helper" to pick up (calling it a "thief" seems a bit silly in this context) and then probably mostly execute its own tasks because no helper ever arrived.

The situation is actually inside-out.

>and then probably mostly execute its own tasks because no helper ever arrived

With stealing a thread must carefully prepare for each "own" task to be asynchronously stolen. While with requesting there is a careful preparation only for a single actually "stolen" task.

So it's not actually real stealing, you know, in real-life one at least does not have to prepare each thing for potential robbery. It would be quite barefaced if one would have to spend his time for the benefit of a thief.

Dmitry_Vyukov · ‎07-20-2010

Quoting Raf Schietekat

Maybe I should just accept that the synchronisation operations are prohibitively expensive even if they have no counterpart in another thread.

That's just a serious defect in x86 instruction set. Fenceless atomic RMW operations (like on SPARC) can be made basically costless *and* enough for most algorithms.

RafSchietekat · ‎07-20-2010

So how about a new (separate) instruction without all the baggage, Intel?