Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Using Tasks as a child of parallel_for

ndepalma
Beginner
414 Views
This seems like a fairly normal need, yet there is no way (as the body of a parallel_for) to get a handle on the root task, start_for. This means you can't allocate a child task.

A simple example (that has come up) is a parallel loop that pushes samples from an array that satisfy a predicate. If pushing a sample takes a substantial amount of time, a mutex won't be ideal; spawning/recycling a new task to handle it would be best. Originally, I had designed it so that there was a task that just ran a for loop and pushed across a boundary, but this doesn't seem like the best solution.

This is more of a feature request - is this even possible now, and if not, is there a reason it won't be in the future?

Nick
0 Kudos
5 Replies
Alexey-Kukanov
Employee

If you just need a task instance to allocate and spawn children, then tbb::task::self() is there for you. It returns a reference to the task currently being executed, so you can refer to "this" task from parallel_for bodies etc.

It's possible to obtain the root task (or, more exactly, the root continuation task, as the original root task was recycled); for this you should start from self() and go up the parent() chain: self().parent(), self().parent()->parent(), etc. A task whose parent() is NULL is the parent of both the root and the root continuation.
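The parent-walk Alexey describes can be sketched as follows. Note this uses the classic TBB task API (tbb::task, parent()), which was later removed in oneTBB, so this is an illustrative fragment only, not something a modern toolchain will compile:

```cpp
// Classic TBB task API (removed in oneTBB) -- illustrative fragment only.
#include <tbb/task.h>

// Inside a parallel_for body:
tbb::task& me = tbb::task::self();  // the task currently executing this body

// Walk up the tree; stop at the last task whose parent() is non-NULL.
tbb::task* t = &me;
while (t->parent() != NULL)
    t = t->parent();
// 't' now points at the root continuation of the running algorithm.
```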

In case you are going to use this complex method and not task::self(), could you please explain in more detail why it seems better to you? If there are good motivating cases, we might consider adding a method to get the "root" task.

robert-reed
Valued Contributor II

Hi, Nick. Perhaps you can describe further the environment in which your code is operating? When you say the code uses a parallel_for, are you speaking of the conventional use of a splittable range of array values and a recursive splitting of tasks over that range? Or perhaps your code just uses a parallel_for to create a set of independent tasks that you'd like to manage further?

When you say "pushing," are you being literal, as in "pushing an item onto a queue," or do you mean something more? You describe a condition where "pushing" will require some time, so a mutex wouldn't be ideal but spawning/recycling a new task would be best. If you need a mutex or a similar synchronization object to write a protected array element, I'm not sure how you avoid the need for a lock just by deferring the computation to another task. Ultimately the data need to be written, and some sort of synchronization step might still be needed.

If you need to be able to add tasks to the tree, you might consider managing initial task allocation in your program rather than relying on the parallel_for; then you could have all the flexibility you desire. parallel_for is targeted at the domain of splittable ranges. Adding hooks to parallel_for for such an esoteric use as you suggest may be stretching the concept too far.

Alexey-Kukanov
Employee

mad reed:
parallel_for is targeted at the domain of splittable ranges. Adding hooks to parallel_for for such an esoteric use as you suggest may be stretching the concept too far.

I agree; adding anything like this specifically to parallel_for is probably a no-go, mostly because of interface complications.

The method of going up the task tree I described above could be legitimized as part of the general tasking interface, like task::self(). The semantics of this method would be hard to describe, though, as the result for a particular algorithm would depend on the algorithm's implementation; and this is quite a serious concern against it. Thus really good motivating cases are necessary to consider it.

ARCH_R_Intel
Employee

I think the right approach might be a new loop template that supports loops whose body is pure continuation-passing-style code. If done right, that would enable composing loops in a nested fashion where no loop body in the nest leaves anything on the run-time stack when it invokes the loop inside itself. I've thought about this from time to time, but never worked out the details.

As Alexey points out, task::self() can be used to get to the task running the loop body; task::self().parent() would get you to the task in the task graph to which you could add an additional child using task::allocate_additional_child_of. Using the parent, and not the root, is probably best for scalability: hanging all additional children off the root makes all processors contend for the same cache line when they update the reference count.
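Arch's suggestion might look roughly like this, again against the classic TBB task API (removed in oneTBB, so this fragment is illustrative only; SideWork is a made-up task type standing in for the slow insertion):

```cpp
// Classic TBB task API (removed in oneTBB) -- illustrative fragment only.
#include <tbb/task.h>

struct SideWork : tbb::task {
    tbb::task* execute() /*override*/ {
        // ... perform the slow insertion here ...
        return NULL;
    }
};

// Inside the parallel_for body: hang a new child off the *local* parent,
// not the root, so reference-count updates don't all hit one cache line.
// allocate_additional_child_of bumps the parent's ref count atomically.
tbb::task& parent = *tbb::task::self().parent();
tbb::task& child =
    *new(tbb::task::allocate_additional_child_of(parent)) SideWork;
tbb::task::spawn(child);
```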

Be warned that we never intended parallel_for to be introspected that way, and so we cannot guarantee that hack will work in the future. But if it is a generally useful pattern, we could look into supporting the pattern, perhaps with a different template.

ndepalma
Beginner
Wow, everyone has been very, very helpful! I really didn't know about task::self(); that's really good to know!

As to the posts from rreed & adrobiso: I can't really explain the code, as it is IP I don't want to post to the forums, but as a broad generalization, it's a producer-consumer problem where the producer has a design pattern very similar to a for loop (hence why I used parallel_for). The consumer is more of a FIFO of things that the for loop has produced. The order doesn't really matter, though, as everything is being added to a huge data structure that sorts on insertion (hence why it takes so long). The consumer is highly optimized.
The solution that I came up with (spawning a task to do the insert whenever the producer produces) is kind of lame, as you all have pointed out.
adrobiso pointed out that I should use a nested for loop (and by god, I have actually created such a structure!). Though there are many limitations to my implementation, we are using such a (in my case) whacked-out design with much success.

Whatever the case, I will be trying a more producer/consumer oriented implementation and hopefully get some better results.

On the same note, with a parallel_for as a producer, I have also considered teaming up the parallel_for with the pipeline, which would produce something similar to what adrobiso was talking about: a producer that creates something and immediately pushes it down a pipe. This would be rather ideal for my case, as the FIFO would be inherent in the design!