Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

iteration through task_list would be desirable

Walter_D_
Beginner
422 Views

here is a use case:

I have written a class (see attached) which automates the common recycle-as-parent continuation-passing style (found in many examples). The design uses CRTP and works best with C++11 (because this allows a variadicly templated member function to enter child tasks into a task list) and  for any number (including none) of (possibly un-balanced) child tasks (which are created by the user-defined derived class). After all children have been created and before they are spawned, each is visited in order to set an appropriate split_count to balance further recursive task splitting.

I've used this for potentially highly unbalanced task splitting into up to 8 child tasks: it works very well.

Because tbb::task_list does not allow iteration through its elements, I currently have to pop_front() each task from the initial list and push_back() into a new list. If tbb::task_list would allow for iteration, this could be avoided. A full standard-compliant iterator support would not be required, but merely two trivial inline functions such as

tbb::task* tbb::task_list::first_task() const { return first; }

tbb::task* tbb::task::next_task() const { return prefix().next; }

If public visibility of the members tbb::task_list::first and tbb::task::prefix().next is an issue, that can be avoided by a forward-iterator.

0 Kudos
15 Replies
RafSchietekat
Valued Contributor III
422 Views

Could you maybe first give an "executive summary" about your motivation? Normally TBB prefers recursive parallelism over task lists (far better scalability), with sufficient parallel slack of course, and then the scheduler just figures out what to do from there. An unbalanced tree is not a big issue for TBB.

0 Kudos
Walter_D_
Beginner
422 Views

Sorry, but I don't understand the sentence starting "Normally...": "prefers" over what? "far better scalabity" than what?

An unbalanced tree can be a *serious problem for TBB*: it results in many tiny tasks, each requiring concurrent overheads. As an example, consider the range [0,N), when each split only split's off the last element. In this case tbb *considerably slower* than serial execution, I just tried it (see attached code).

Perhaps an executive summary could be:

A task_list is a convinient way to collect child tasks for spawning (or other usage). Some information helpful *before* spawning, such as the relative work load of each task or some algorithmically relevant data, will only be available *after* collecting all children. In order to administer this information to each child to be spawned, it is necessary to iterate over the collected children before spawning them. This is not conveniently possible with the current implementation.

0 Kudos
RafSchietekat
Valued Contributor III
422 Views

Unbalancedness and parallel overhead are not necessarily linked: each can occur without the other. TBB was designed for automatic workload distribution, as long as there's enough parallel slack, so your extreme example goes somewhat overboard in that regard. Normally you don't need special planning: roughly down the middle is good enough for recursive parallelism, and if one split doesn't work out very well that is likely to be compensated by other splits that do.

What is the inherited information that you think requires distribution to the child tasks before they are spawned, and why?

Perhaps there's a better solution...

(Added) For the advantage of recursive parallelism over lists of well-balanced tasks, consider that each task in that list will still have to be stolen first.

0 Kudos
Walter_D_
Beginner
422 Views

Sorry, I'm still confused by "For the advantage of recursive parallelism over lists of well-balanced tasks". What is the difference? If I spawn a task_list, aren't all tasks in that list subjected to (recursive) parallelism?

Okay, so you say that it doesn't really matter that the tasks in a task_list are well balanced. Accepted.

However, the tasks in a task list may require some additional algorithmic information (related to the specific task), which can only be obtained after all tasks have been created. Such information must be provided to the tasks in the list before spawning it. This cannot be done conveniently with the current implementation. I currently don't have such a situation, but this may change in the future.

Moreover, while I understand your reluctance to consider changes to the existing code from a principle point of view, I like to point out that

- there is no good reason (AFAIK) why a task list should not allow iteration over its task

- the change would not invalidate or otherwise interfere with existing code and only affects the header files.

0 Kudos
RafSchietekat
Valued Contributor III
422 Views

A task_list is prepended to the local deque, and for the work to be distibuted each task has to be stolen individually. If you recursively spawn nodes in a task tree, each act of theft counts for a far larger portion of the total work, so that work can be distributed in logarithmic time instead of linear time, which can make a significant difference.

With recursive parallelism, ideally the first task you spawn is roughly half the work, then roughly half of the rest, etc. If you do that in one spawn action, that's quite an unbalanced task_list, I would say, but it's just what TBB likes best.

It's hard to argue with hypothetical requirements. :-)

Sometimes less is more, if it nudges you in the right direction.

0 Kudos
Walter_D_
Beginner
422 Views

Okay, now I understand you. So, for my problem there are naturally (up to) 8 child tasks (NOT 2). Would it be beneficial to group the children into 2 sibling-group tasks and spawn only those (to get logarithmic time)? I would guess that stealing mostly occurs only in the very first splits, so the effect is rather minor (because 8 is not very large). Correct?

If I want to implement that, I have to make the decision only after I know how many child tasks there are, i.e. after they have already been created using this->allocate_child(). But the sibling-group tasks would now become direct children and the original children have the sibling-tasks as parent. Is it sufficient to child->set_parent(sibling-group) to alter the parenthood after creation?

Btw, this is best implemented if tbb::task_list allows a split() ... :-)

0 Kudos
RafSchietekat
Valued Contributor III
422 Views

Well, if you must use tasks and not the prepackaged algorithms, then you would spawn a task for 4, a task for 2, a task by itself, and start executing the remaining one, or something to that effect. I'm curious myself about actually observed performance differences, but parallel_invoke() seems to believe in it, and that's only for up to 10 things or thereabouts.

Yes, you can use set_parent() before any spawning has taken place, but those descendants are still all going to be in the deque of the initial thread, where they almost all have to be stolen first, so that's not going to help at all. I still don't see what it would be that makes you postpone structuring the descendants until they have all been created, and then construct a list (or a tree) from the initial thread... If those are small tasks and you're doing this a lot, you are also working against a second goal of recursive parallelism, which is to strike the iron while it's hot (in cache), because it's all being brought into the cache of the first thread initially, possibly overflowing it, and then it has to move into the other caches. (Well, I gather most of the cost of stealing is right there, and the goals are largely the same.)

If you're not going to spawn() the lot anyway (and you shouldn't), you might as well use a vector...

0 Kudos
Walter_D_
Beginner
422 Views

sorry, but I'm lost again. Why do you think that "but those descendants are still all going to be in the deque of the initial thread"? They have not been spawned yet, only created. Does that already put them into the deque of the (initial) parent?

"I still don't see what it would be that makes you postpone structuring the descendants until they have all been created, and then construct a list (or a tree) from the initial thread". I don't know in advance how many children there will be. It could be any number in [0,8] inclusive.

0 Kudos
jimdempseyatthecove
Honored Contributor III
422 Views

Why [0,8]? You only expecting 8 HW threads?

Why not track the total weight of all nodes.
Then divide up the work by example total weight / nHWthreads / 2 chunks. (or /4),
using nHWthreads * 2 (or *4) std::vector's of node*
Then queue each vector as a seperate task

Effectively what you are doing but with two passes of your storage (list or tree).

Jim Dempsey

0 Kudos
RafSchietekat
Valued Contributor III
422 Views

I'm somewhat lost, myself. Would the descendants get created in the first thread but then spawned in another? How?

There's no requirement to prefer optimal over simply convenient, of course...

0 Kudos
jimdempseyatthecove
Honored Contributor III
422 Views

Raf,

You create a vector of vectors to nodes in the first (main) thread (which can be preserved for subsequent use). Then use perhaps parallel_for over the vector of vectors. In a perfect system (not usually the case), the number of vectors of vectors of nodes == number of HW threads, and workload of each vector of nodes is equal. This (workload equal) not normally the case, you create more vector of nodes than number of HW threads (2x, 3x, ...).

Jim Dempsey

0 Kudos
RafSchietekat
Valued Contributor III
422 Views

Ah, Jim, now I see I missed your previous posting! I was reponding to Walter's, instead.

Indeed, if there were a case for laying out the work in advance, why would one skip high-level parallel_for and resort to low-level task_list instead?

Note that task_list may just be a relic from the past, before the era of the ready pool as a deque. Previously it was implemented as a directly accessible stack (indexed by "depth" for stealing at shallower levels) of linked stacks (implemented exactly like task_list), where it would take only constant time to spawn a task_list. Now there may still be an advantage over repeated spawning of single tasks, but there's already a linear-time component to it, if we may for the moment neglect the possibly greater penalty of stealing later on. So what would be the use case for adding features to task_list at this time, if that might also "enable" suboptimal programs?

Come to think of it, perhaps task_list should just be deprecated, in favour of an iterator-based API (that can also be invoked on an array or a vector). Hmm, is std::copy guaranteed to specialise to memcopy-level performance for arrays or vectors of POD types? If so, just the one template function would suffice, otherwise it may have to be specialised or overloaded. But this would also be "enabling", of course.

(Added) Just to avoid the wrong impression, I think that the current task_list is still sufficient for its sole purpose of efficiently and conveniently spawning a reasonable number of tasks. If you really want to manipulate a collection of tasks, use a standard collection type like std::vector, and link those tasks up into a task_list before spawning them together.

0 Kudos
RafSchietekat
Valued Contributor III
422 Views

Something strange about a task_list is that it is FIFO for the current thread and LIFO for thieves, so, to optimise recursive parallelism, while you can just spawn() from big to small (in the current implementation anyway), you would have to fill a task_list from small to big, probably by using an intermediate vector. You would then populate the task_list from it in reverse order... and inside the scheduler the task_list will be reversed again. :-)

0 Kudos
jimdempseyatthecove
Honored Contributor III
422 Views

The problem with task list is they require at least one atomic operation to obtain the next task (could be several or use of critical section). The vector of vectors of nodes has no atomic operation, except for the launch phase of the parallel_for.

A problem with the vector of vectors of nodes is it assumes that all participating threads are available (common problem with all parallel_for). The usual MO in TBB is to partition into (significantly) more pieces than threads, where each piece is taken using atomics (or critical section) al la task list.

Jim Dempsey

0 Kudos
RafSchietekat
Valued Contributor III
422 Views

A task_list is just an intrusive singly-linked list of tasks, there are no atomic operations of any kind.

0 Kudos
Reply