I have spent the past semester researching how to combine data parallelism with TBB, so that TBB can offset data tasks to a GPU when available. My work is nearly complete, and now I need to ask the TBB community for some help. My approach is to separate data parallelism from task parallism, which involves the creation of a new type of task: the data task. For the purposes of my work, a data task represents an operation which can utilize data parallelism. For example, multiplying a large vector by a scalar, or a matrix by a scalar.
There are two work pools in my design. There is the task pool which remains completely controlled by TBB. I also have added a single shared data task pool, from which the GPU will consume data tasks. If a CPU has no work to be done, that is normal TBB tasks to consume, it will steal data tasks from the data task pool. Ordinary TBB tasks can only be executed on the CPU, whereas data tasks may be executed on either the GPU or CPU. A data task is very simple, it only provides an virtual execute() function for now.
I would like to integrate data task with the TBB task scheduler in a simple way. Data tasks can be run from within TBB tasks. I would like to allocate a task that spawns that data tasks, putting them into the data task pool. Rather than blocking until the data task is complete, I would like for the worker thead (which ran the tbb::task that spawned data tasks) to continue to process other TBB tasks as normal, and steal tasks as required. When all data tasks are complete, the allocated parent task is ready to be re-executed.
So if there wasn't a mixture of data tasks and tbb::tasks, I would allocate a parent, spawn children, and wait for all to complete. This problem is slightly more complex, because some of these children are of a different type, and are placed into a data task pool instead of the task pool used by TBB. If a program utilizes a large amount of data parallelism which the GPU can help with, TBB worker threads should steal data tasks for themselves too.
Thanks for your input, and as always I'm on #tbb on irc.freenode.net for a chat.
How about having a tbb::task that has the "task" to get a steal a data-task and process it. This was you keep the actual tasks pools separate, so data-task pool implementation can be kept simple. Then what I'd do is have place this steal-task at the top of the tbb::task pool, so when all normal tasks run out this steal task get run. And for the sake of if I'd make this steal-task respawn itself (recycling as per the manual) so its always on the stack.
Sorry if my diagram below is confusing, but something like this:
tbb::tasks: (steal) <- child <-child <-child [CPU]
data tasks: task, task, task, task, task [GPU/CPU]
I will likely change the internal's of TBB's scheduler for the actual theft of data tasks. So a worker thread will try to steal tasks from other threads, and if that doesn't work I'll hit the data task pool. My problem is more so that I want to allocate a task that represents the data parallelism, and once that task has been allocated the worker thread should enter into a work-stealing mode to try to steal TBB tasks from other threads. Otherwise it hits the data task pool for data tasks. That's what I'm thinking.
I kinda want to have a way of forcing a worker-thread to go into task-stealing-mode, and then back to normal once the parent of the data task is finished. Maybe I should just add something like that to the scheduler interface, at which point I'd need help with where to put the logic in the code :-)
Although I haven't touched the corresponding API at all, this situation sounds similar to the mailboxes interface for tbb::tasks. Maybe mailboxing is enough to get the job done, or at least a good choice for an entry point to extending the scheduler. But like I said, I haven't looked into mailboxes API much so it's just a guess. Maybe someone more familiar with mailboxing can give some feedback on this.
Right, that's my intention. Before a worker thread yields because there is no work to be done, it will grab some GPU tasks.
The problem is this... it's possible for a TBB task to have data task children. Data tasks will always be leaf nodes, after all if a data task has a tbb::task as a child, it's a tbb::task by definition (violates what a data parallel task is). Suppose someone writes this code for a parallel_for:
void operator()(Range& x) const
// let's assume X, Y, U are defined
X = matrix_multiply(X, Y);
Vector Z = matrix_multiply(X, U);
So when this runs, each matrix_multiply() spawns a bunch of data parallel tasks, which go into the data task pool. However, ParallelWorker's body is now blocked... it can't proceed until each matrix_multiply() operation completes. This is no different than nested parallelism. Rather than having the worker thread block, I'd like to implement each matrix_multiply in such a way that it lets the worker thread steal tasks. It would steal from other task pools, and as a last resort as you said, get data tasks from the GPU. When the matrix_multiply operation is completed, the parent should be ready to be resumed. This corresponds in concept to having a parent task spawn a number of children, but these children have all been stolen so the worker thread now has to search for work.
Hopefully this all makes sense.
If I am not missing something, you can just spawn a number of data tasks with spawn_root_and_wait() (there is a overload that accepts a list of tasks), all other will be provided by current TBB scheduler - current thread will be blocked until all child tasks will be completed, current thread will be stealing from other threads.
Right, that would mean that only CPUs are consuming data parallel tasks. In this case, the GPU is available so I would like for the GPU to consume these tasks. Thus the need for a separate data task pool, and handling of data tasks.
Oh, I forgot to mention that you still have to incorporate 2 task pools, and incorporate some priority stealing for CPU threads.
I meant that only the following will be solved automatically:
"The problem is this... it's possible for a TBB task to have data task children. [...] I'd like to implement each matrix_multiply in such a way that it lets the worker thread steal tasks. It would steal from other task pools, and as a last resort as you said, get data tasks from the GPU. When the matrix_multiply operation is completed, the parent should be ready to be resumed."
So the complete solution is:
- create additional task pool for data tasks
- incorporate priority stealing for CPU threads (normal tasks first)
- spawn data tasks with spawn_root_and_wait() (while thread is waiting for these tasks to complete it will be stealing other tasks - preferably normal tasks)
This solution sounds like it adds a lot more complexity, although I'm not familiar enough with the TBB scheduler to know just how complex the implementation would actually be. What about the approach suggested by Raf? Have worker threads steal from the GPU as a last resort?
My thought is to create a data_task, which has within it a method get_tbb_task() or something, which will return a TBB compatible task representing that data task. This way the GPU handles data_task objects itself, and TBB will have to call a method to convert a data_task into something it can handle. So it will steal a data task, then create TBB tasks by calling data_task.get_tbb_task().
Adding priority sounds like a complex solution, rather than only stealing from the GPU as an absolute last resort to do useful work. I like your suggestion of doing a spawn and wait. How can I incorporate a spawn and wait with a different type of task? Rather than having actual TBB tasks enter the pool, I'd like for the thread to try to call its work theft functions... stealing tasks from other worker threads, or from the GPU as a last resort. This way both TBB tasks and GPU tasks can run concurrently on the same system.
I had thought that perhaps I could come up with a solution using empty tasks... put the worker thread into the position that it can't do any more useful work without theft, at which point it will try to steal work from other CPUs... and finally hit the GPU for work.
"The problem is this... it's possible for a TBB task to have data task children." Just an idea, and I don't know if this will work because a lot about TBB is still fuzzy to me, but if the parent allocates an equal number of empty_tasks in TBB as it creates "data tasks" outside of TBB, and accounts for them in the ref_count but doesn't actually spawn them before calling wait_for_all(), a watcher tbb_thread could perhaps follow up each "data task" completion by spawning an additional task of the parent, whose only action would be to destroy the associated empty_task. That way the only change to TBB itself would be the one described earlier.
Permit me to rephrase your request then you can tell me if this is what you are thinking.
You want to determine if a GPU is present and if present you set a flag indicating it is available.
When GPU is available you want to seperate the various tasks into two classes:
CPU only and GPU prefered
The CPU only are straitforward tasks
The GPU preferred tasks are in a queue that can be executed by either GPU or CPU. The selection chriteria might be when the GPU task list has n pending requests redirect next tasks to CPU. Or if your task requests to GPU contain an estimated processing weight and as GPU tasks are enqueued the estimated weight is added to a tally and when a GPU task ends the estimated weight is removed from the tally then when the tally exceeds a threshold the CPU route steals GPU tasks.
Also, since you may have one or more GPUs you may want multiple GPU processing tasks.
Is this what you had in mind?
Also, I tried to figure out the irc.freenode.net and could not find the #tbb area.
I'm glad you took a shot at joining freenode. You have to join the #tbb room... I'm not sure what irc client you tried, but you can run "/join #tbb" to join the TBB chat room.
What I'm thinking is to have two separate task pools, one for the CPU, and one for the GPU as you mentioned. The CPU task pool is just ordinary TBB, but the GPU task pool holds a special type of task: the data_task. This type of task may be converted to a TBB task.
So you have touched on what I have in mind. Integrating this approach with TBB is my current task.
"I'm not sure what would happen if I allocate empty_tasks to count toward the ref_count but don't spwan them before I wait... what should happen?" I'm not sure either, but it's what I would try first if I didn't have to be 100% sure in advance about whether it would work or about its reliability (please don't use this to control something like a nuclear reactor). I leave it to you to test it first, which you can do before changing TBB itself or doing anything with the GPU(s), and be prepared to have to revise it later on. But maybe I have to take the question literally: the idea is that TBB thinks that some other thread is working on those threads, and that it just goes out stealing like it normally does, until at some point it no longer thinks that and can continue with the parent, while the GPU-oriented tasks are always preferentially handled by the GPU. I see no reason why you couldn't combine this with Jim's idea of using a (tunable?) threshold for stealing work from the GPU.
"This is exactly what I want to do... the question is: how?" Well, just as I said, you pretend that you've allocated and spawned N tasks, even though you've only allocated them in parallel with "spawning" an equal number of GPU tasks outside of TBB, so you set the ref_count to N+1, allocate N empty_task children, then you wait_for_all() and hope that TBB will take the bait. The GPU stuff has to be handled by whatever means available, and then the dedicated tbb_thread will at some point have to detect that the GPU task has completed, and decrement the ref_count of its TBB "parent" (some housekeeping will be required) in the way I described. You can try this first without any actual GPU work (just a mock-up to validate the TBB side of this approach), then add the GPU work, and later you can modify the scheduler to allow GPU-task stealing by TBB, optionally with a threshold.
"Alright, thanks. I'll try this tonight." Note that the additional-child indirection may very well not be required, but I got spooked somehow about calling destroy() directly from an unrelated thread, and it may play a role if the parent isn't already executing. Anyway, I hope it does the trick, and that I haven't forgotten anything actually relevant instead.