What is the best way to implement it with TBB?
If not - well, then the problem asks for some other way to efficiently group processing of several nodes, or at least quickly assess whether it makes sense to split processing. For example, if every subtree root can have a count of how many nodes are in the subtree, it might help to quickly make the decision which subtrees are worth a task.
The question is exactly about that "other ways to efficiently group processing". The problem is somehow typical, so I thought it should have some canonical solution.
I think that efficient grouping of computations is impossible without application-specific knowledge. I do not even say "grouping of tasks", because at that level of granularity you strive to avoid overhead associated with tasks as much as possible. Due to that, I am afraid it can not be done very efficiently in TBB; but I am open for suggestions.
Let's speculate what can be done with this regard. Make processing of every node as a separate task, but spawn and steal and executethose only in bunches of a given size? That would still have per-task allocation cost. If all job functions are of the same type (to avoid indirections, e.g. virtual calls), it might be possible to make an aggregator classof (presumably) fixed size, which would collect several jobs and spawn those as a single task. It would still incur some overhead however; do you think it would be affordable? Any other/better ideas?
The problem is that sizes of sub-trees may be unknown, or even sub-trees may not exist yet (consider that processing *is* building of sub-trees, for example, a game state space exploration).
As for suggestions, I have no suggestions. Currently I am using a work-requesting scheduler for that, also I have an asymmetric work-stealing scheduler, both have per-task overheads on the order of handful of cycles, so I am able to express a parallel algorithm in it's natural form (i.e. have task at hand, spawn it). I am curious how that can be handled with TBB w/o incurring 2-20x slowdown.
Have one thread traverse the tree without doing work other than placing a pointer/reference to each node into a vector/array. As you are filling up this vector/array, the other threads can process the nodes referenced/pointed to/by the being filled vector/array. e.g. one thread traverses the tree and fills the vector/array and 3 threads monitor the progress and take every 3rd pointer/referece and process those nodes. If your node traversal an insertion time into the vector/array is under 33 cycles then this will work fine. You can signifiy end of tree by inserting a NULL pointer/reference into the vector/array. The other threads monitor the end() or fill pointer (make volatile), and issue _mm_wait() when they overrun the fill pointer.
You could keep this vector/array around for future use. There is no reason why your "data base" cannot have a tree structure .AND. a "flat unsorted index" structure at the same time.
Potentially the constant factor in auto_partitioner might need to be tuned from the skewedness of the tree (just guessing), and then an out-of-the-box auto_partitioner wouldn't do anymore, but that doesn't seem like an immediate concern. It is also not tunable for linear ranges with highly non-uniform work distributions (the headlight regions in a recently mentionedraytracing problem), so it's not exactly a tree-specific issue either.
It's not pathologically asymmetrical (list like), but asymmetrical enough for this method to produce some basically empty parallel tasks and some huge parallel tasks (potentially single task will contain all the useful work).
That won't work.
A tree may be not yet built (processing is-a building). So once you create the array, all useful work is already done.
Or if a tree is already built, it's more economically expedient to complete with a node once it's reached than to incur additional memory load in another thread.
And if I would have an array in advice, I would name the topic "parallel array processing", answer to myself, and close it :)
The original post stated there was some work to be done per node (as opposed to traversing the tree doing no work).
The suggestion I made was to incorporate a "pathfinder" thread that quickly traverses the tree (each time you use the task processing the tree), building a shared list of pointers/references to the nodes (or sublist of nodes of interest). Cocurrent with the building of this list, the remaining taskteam members process the nodes, pulling the pointers/references out of the list while it is being built (as opposed to after it is built).
[----time to scan tree and build vector of nodes---]
Where node0 is the time to compute the 0'th entry in the vector/array of node references/pointers (as opposed to the 0'th node in the tree).
The above shows one thread building the list, and 3 threads picking and processingfrom the list (while the list is being built). The thread additional threads each process 1/3rd of the work.
In the even that the work portion is large, then when the thread building the list completes (and/or gets ahead of the worker threads), it can join in on the processing of the nodes.
The threads can coordinate their work by monitoring progress of each other through shared memory variables. Each of the worker nodes can see the index pointer in the array being built. And when warranted, the thread building the list can monitor the progress of the worker threads. When the list builder gets far in advance of the worker threads then it can process a node instead of adding it to the list to be processed. What this mean then is the list of nodes to be processed can be a smaller ring buffer. (or n ring buffers).
> The original post stated there was some work to be done per node (as opposed to traversing the tree doing no work).
For sure, the work is: calculate number of child nodes and their parameters, allocate memory for them, link them to current node, repeat for each child node. Consider that all that (w/o recursive part of course) takes roughly 100 cycles.
I do not see how to apply your approach here. If pointers to child nodes collected, then all useful work is already done.
"It's not pathologically asymmetrical (list like), but asymmetrical enough for this method to produce some basically empty parallel tasks and some huge parallel tasks (potentially single task will contain all the useful work)."
Maybe not "degenerate", but it seems quite pathological to me if a modification of auto_partitioner (trading some efficiency for more resilience to tree skewedness by increasing an internal factor) would still not find enough parallelism. (Or does "pathological" also have a reserved meaning?)
Hmm, I tried to find something about "work requesting" and found a link from 2008 where one "Dmitriy V'jukov" declared "The more I am thinking about work-requesting, the more I like work-stealing."... Did you change your mind (like about the appropriate transliteration of your name :-) ), or does it just depend on the specific problem?
No, I didn't. There always was 2 spellings of my name "Dmitry Vyukov" and "Dmitriy V'jukov"... well and my ISN name is actually intended to be "Dmitriy V'jukov", but I had to remove the apostrophe to prevent ISN platform from freaking the hell out.
Regarding scheduling, yes, it depends on a problem type. I was talking about a scheduler for a general-purpose library like TBB (it definitely must include a stealing scheduler to prevent users from shooting their limbs off).
Here is an except from my write-up on the TC2010 Skip List problem, which superficially covers schedulers types (btw, Bradley Kuszmaul reported 25% slowdown from parallelization with Cilk++ which has ~5 times lower overheads than TBB):
There are 4 main strategies for a fine-grained distributed dynamic task scheduling:
Work-stealing. That's a reactive asynchronous strategy. The essence: when a thread is out work, it randomly chooses a victim thread and asynchronously tries to steal some work from it.
Work-requesting. That's a reactive synchronous strategy. The essence: when a thread is out of work, it randomly chooses a victim thread and sends a synchronous request to it; the victim receives the request, and sends some work back (if any).
Work-distribution. That's a proactive synchronous strategy. The essence: during submission of a new work, it's divided and proactively distributed to some threads (idle or lightly loaded).
Work-balancing. That's a proactive asynchronous strategy. The essence: dedicated thread (or potentially one of the worker threads) periodically collects information about load of all worker thread, then calculates optimal distribution of work, and then re-distributes work among them.
It's worth noting that a scheduler may employ several (or even all of the) above strategies. Reactive strategies (stealing and requesting) deal with inevitable dynamic load imbalance; but usually have very limited local information about a system's state, so make sub-optimal decisions. Proactive strategies (distribution and balancing), on the other hand, have information about a system's state, so make one-shot optimal scheduling decisions; but unable to cope with inevitable dynamic load imbalance.
A scheduler must employ at least one of the reactive strategies in order to cope with continuous and inevitable dynamic load imbalance, and optionally include one or both proactive strategies in order to cut down stealing/requesting costs. So, the general recipe for a scheduler is:
SCHEDULER = (STEALING ^ REQUESTING) [+DISTRIBUTION] [+BALANCING]
Work-Stealing vs. Work-Requesting
So we have to choose between work-stealing and work-requesting. Work-stealing has a serious advantage over work-requesting due to it's asynchronous nature: a thief thread is able to get some work, even if a victim thread is busy processing a user task or even de-scheduled by an OS. With work-requesting a thief thread is able to get some work only if a victim thread condescends to send it (which it is just unable to do if it is de-scheduled by an OS). However, in our case it's not a problem, because tasks are small and limited in size, plus we are not going to over-subscribe processors with threads.
There are also 2 problems with work-stealing due to it's asynchronous nature. First, it inherently incurs some observable per-task overhead, because every pop operation from a thread's work deque must be synchronized with a potential asynchronous steal operation from another thread. Stealing is rare, but one still has to pay that price on every pop operation. The price is at least a single store-load style memory fence (MFENCE instruction for x86 architecture), or a single atomic RMW operation (LOCK prefixed instruction on x86). Here is an illustration of the problem:
Work-requesting is free of the problem. Task deque is completely local to a thread and requires no synchronization.
The second problem has similar nature and relates to a join phase of parallel algorithms (game state backtracking in our case). Traditional handling of task completion involves decrement of a pending child counter of a parent task. Due to asynchronous nature of work-stealing, the decrement has to be synchronized with other potential concurrent decrement operations. Here is an illustration of the problem:
Work-requesting is free of the problem. During execution of work-requesting protocol a victim thread can mark a parent task with a 'has_stolen_children' flag, and synchronization will be used only for such tasks. While great bulk of tasks will proceed without any synchronization.
Due to afore-mentioned problems I've chosen a work-requesting protocol for task scheduling. Basically, it allows us to cut 2 atomic RMW operations per task.
Work-Distribution and Work-Balancing
Now, we must decide as to whether to use work-distribution and/or balancing in addition to work-requesting. Work-balancing is too cumbersome to implement and my educated guess is that it's not going to provide any significant speedup (it's more suitable for work DAGs with no busy leaves property).
Work distribution is a different story. It's easier to implement, and may provide some speedup. I decided to implement work distribution only on initial stage. Namely, all initial game states are evenly distributed across all worker threads in a round-robin fashion. So each worker thread has quite a lot of work to begin with, and work-requesting comes into play towards an end of computation.
Here is a bit simplified code of my scheduler (some functions are omitted for brevity). Full source code is presented in the Appendix B.
// global index of current thread
// local tasks
// 1-element queue for work requests from other threads
// 1-element queue for work request acknowledges from other threads
// main thread loop
// while we have local work process it
while (tasks.size() != 0)
// pop task in LIFO order
task_t task = tasks.back();
// user processing
// check for steal requests from other threads
if (steal_req.load(memory_order_relaxed) != no_requests)
// try to get some work from other threads
if (false == send_steal_request())
// get thief descriptor
size_t thief_idx = steal_req.load(memory_order_relaxed);
worker_thread& thief = get_thread(thief_idx);
// pop task in FIFO order
task_t task = tasks.back();
// synchronous user processing
// give it to the thift
// notify the thief that the operation is completed
// choose a victim
size_t victim_idx = choose_randomly();
worker_thread& victim = get_thread(victim_idx);
// send a request to it (if it's not busy processing another request)
size_t cmp = victim.steal_req.load(memory_order_relaxed);
if (cmp != (size_t)-1)
if (victim.steal_req.compare_exchange_strong(cmp, my_index, memory_order_acq_rel))
// request is sent?
if (cmp == no_requests)
// wait for ack
while (steal_ack.load(memory_order_acquire) == 0)
// check as to whether we got some work or not
// termination condition
The situation is actually inside-out.
>and then probably mostly execute its own tasks because no helper ever arrived
With stealing a thread must carefully prepare for each "own" task to be asynchronously stolen. While with requesting there is a careful preparation only for a single actually "stolen" task.
So it's not actually real stealing, you know, in real-life one at least does not have to prepare each thing for potential robbery. It would be quite barefaced if one would have to spend his time for the benefit of a thief.