I'm a TBB newbie, but am familiar with concurrent programming etc... I've looked over the docs and read a few threads in the forum, so I am aware that I'm not the first to ask questions about DAGs.
I'm looking to use TBB to evaluate a complex directed graph of operations. I've seen a few suggestions on how to do this (e.g. the white/grey/black edge colouring) which look reasonable, but involve quite a bit of bookkeeping to implement. I was wondering if there is anything equivalent to a task-level mutex/lock which would simplify the implementation, and if not, whether that would be a reasonable feature request.
For my purposes, we have a graph of operations. How I see it is that a TBB task is generated for each node in our graph, computes whatever it needs to do and shoves the result back into the graph node. So if we have a simple graph, where node A depends on nodes B and C, this would be fairly straightforward. We recurse backwards up the dependency chain from A, first generating a task for A, which generates tasks for B and C and waits for them (or rather a continuation task does). Tasks B and C do what they need to do and then task A runs. All happy and easy enough with TBB.
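To make that concrete, here is a bare-bones sketch of the simple A/B/C case using the continuation-passing style from the tbb::task API. The Node layout and the task names are invented purely for illustration, not code from any real project:
[cpp]// Sketch only: Node, NodeTask and JoinTask are made-up names for illustration.
#include "tbb/task.h"

struct Node {
    Node*  dep[2];   // the two dependencies, or NULLs for a leaf
    double value;    // result slot filled in by the task
};

// Continuation that runs once both dependencies are done.
struct JoinTask : public tbb::task {
    Node& node;
    JoinTask(Node& n) : node(n) {}
    tbb::task* execute() {
        node.value = node.dep[0]->value + node.dep[1]->value;  // stand-in for real work
        return NULL;
    }
};

// Task for one node: evaluate the dependencies, then let the continuation run.
struct NodeTask : public tbb::task {
    Node& node;
    NodeTask(Node& n) : node(n) {}
    tbb::task* execute() {
        if (!node.dep[0])              // leaf: value assumed to be set already
            return NULL;
        JoinTask& join = *new(allocate_continuation()) JoinTask(node);
        NodeTask& b = *new(join.allocate_child()) NodeTask(*node.dep[0]);
        NodeTask& c = *new(join.allocate_child()) NodeTask(*node.dep[1]);
        join.set_ref_count(2);
        spawn(b);
        return &c;                     // scheduler bypass for the second child
    }
};

// Usage: tbb::task::spawn_root_and_wait(*new(tbb::task::allocate_root()) NodeTask(A));[/cpp]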
However, if we have another node D, which both B and C depend on, it gets complex. We have a diamond dependency, so B's and C's tasks will both attempt to generate a task for D. Boo, what do we do here? We could...
- compute D twice and waste a thread and associated resources,
- block one thread and wait for D to be computed and waste a thread,
- implement the edge colouring idea to traverse through D, for which the glue would be quite complex to implement.
My suggestion is that there be TBB task-level mutexes and locks; with them it would be comparatively trivial to implement this traversal. When a task gets to D it locks such a mutex, checks to see if the value is there, and if not, generates the task for D, waits for it to complete, then unlocks the mutex. So the first task computes D, and the second one stops. However, being a task-level lock, instead of blocking the entire thread it blocks only the calling task. The thread running that task suspends it at the point of the lock, goes away and attempts to do other work, every so often checking whether the lock has now been gained, and then resumes the task at that point. So if there were any parallel tasks needed to produce D, our blocked thread 'teleports' around the block and continues to do useful work on other tasks.
I can't imagine you could implement such a thing client side at all, as it needs to play with the TBB scheduler. Without peering into the source, I can't see this as being too hard to implement (always optimistic, me): a fairly simple lock class and a few tweaks to the task queues. Admittedly, this is effectively voluntary preemption, which goes against the design philosophy of TBB. If that is out of the question, something similar could be done with a 'lock continuation' task. Either way, the key point is not to have the entire thread waiting on a mutex, but only to have a task wait.
Does anyone else see this as a practical idea at all, or see a better way of implementing this in TBB?
tx
b
I believe I did it all on the client side. Off the top of my head, there are three pieces of state to maintain:
A tbb::task_list can be used to keep the waiting list of tasks.
A tbb::spin_mutex to use for some brief internal locking (not locking the region of interest).
A flag to indicate whether a task is working on the locked region.
The logic for the ACQUIRE/RELEASE operations looks something like the following, which assumes that the task trying to acquire the lock supplies a continuation task that will process the critical section and release the lock.
[cpp]ACQUIRE:
    acquire internal spin lock
    if critical section is busy then
        put continuation task on list
    else
        mark critical section as busy
        spawn continuation task
    end
    release internal spin lock

RELEASE:
    acquire internal spin lock
    if list is not empty then
        pop task from list and spawn it
    else
        mark critical section as not busy
    end
    release internal spin lock[/cpp]
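For concreteness, here is roughly how that recipe could look in code. This is only a sketch under assumptions: task_lock and its methods are invented names, not a TBB facility, and the static tbb::task::spawn call assumes TBB 2.2 or later. The thread that fails to get the lock never blocks; its continuation just sits on the list until the current holder's release() spawns it.
[cpp]// Illustrative sketch only; task_lock is a hypothetical class, not part of TBB.
#include "tbb/task.h"
#include "tbb/spin_mutex.h"

class task_lock {
    tbb::task_list waiters;     // continuations waiting for the critical section
    tbb::spin_mutex internal;   // brief internal lock, not the region of interest
    bool busy;                  // is some task inside the critical section?
public:
    task_lock() : busy(false) {}

    // The continuation runs the critical section and must call release() at the end.
    void acquire(tbb::task& continuation) {
        tbb::task* runnable = NULL;
        {
            tbb::spin_mutex::scoped_lock guard(internal);
            if (busy)
                waiters.push_back(continuation);  // park the task, not the thread
            else {
                busy = true;
                runnable = &continuation;
            }
        }
        if (runnable)
            tbb::task::spawn(*runnable);          // spawn outside the spin lock
    }

    void release() {
        tbb::task* next = NULL;
        {
            tbb::spin_mutex::scoped_lock guard(internal);
            if (!waiters.empty())
                next = &waiters.pop_front();      // hand the lock straight to a waiter
            else
                busy = false;
        }
        if (next)
            tbb::task::spawn(*next);
    }
};[/cpp]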
As you suggest, there is a way to create a lock-like class where a blocked task hangs a continuation task on the lock, and the continuation task is executed when the lock unblocks. I coded one a few years ago, but alas did not keep the code around.
Somewhat related is the example in TBB/2.1/examples/parallel_do/parallel_preorder, which performs a preorder traversal of a graph.
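For readers without the example handy, the gist of it (as I recall) is to keep a predecessor count per node and feed a node into the traversal only when its last predecessor has finished, which also resolves the diamond on D without any locks. A rough sketch of the idea, where the Node type and the compute() body are placeholders rather than the actual example code:
[cpp]// Sketch of the ref-counting idea behind parallel_preorder; Node and compute()
// are placeholders, not the code from the TBB example itself.
#include "tbb/parallel_do.h"
#include "tbb/atomic.h"
#include <vector>

struct Node {
    std::vector<Node*> successors;  // nodes that depend on this node
    tbb::atomic<int>   pending;     // predecessors not yet computed
    Node() { pending = 0; }
    void compute() { /* node-specific work; result stored in the node */ }
};

struct ProcessNode {
    void operator()(Node* n, tbb::parallel_do_feeder<Node*>& feeder) const {
        n->compute();
        // A successor is fed exactly once, by whichever predecessor finishes
        // last, so a diamond node such as D is computed only once.
        for (size_t i = 0; i < n->successors.size(); ++i)
            if (--n->successors[i]->pending == 0)
                feeder.add(n->successors[i]);
    }
};

void evaluate(std::vector<Node*>& roots) {
    // roots = nodes with no predecessors; pending must be preset to the
    // number of predecessors for every other node.
    tbb::parallel_do(roots.begin(), roots.end(), ProcessNode());
}[/cpp]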
Bruno,
This may look like a shameless plug for vaporware. In the threading tool I am building, which is nearing completion to the point of WhatIf-like alpha testing, your problem could be handled by:
[cpp]// placed in object or make static
qtControl qtControl_D;

void TaskD()
{
    ... D code ...
}

void DoD()
{
    if(InterlockedIncrement(&qtControl_D.Count) == 1)
    {
        TaskD();    // call direct
        InterlockedDecrement(&qtControl_D.Count);
    }
    parallel_wait(&qtControl_D);
}

void TaskB()
{
    ... prolog ...
    DoD();
    ... epilog ...
}

void TaskC()
{
    ... prolog ...
    DoD();
    ... epilog ...
}

// --- or ---

void DoD()
{
    if(InterlockedIncrement(&qtControl_D.Count) == 1)
    {
        parallel_task(&TaskD);  // enqueue as task
        InterlockedDecrement(&qtControl_D.Count);
    }
    parallel_wait(&qtControl_D);
}[/cpp]
The second method (though not detailed in the sample code) would permit TaskD to run on a core other than the one that issued the task enqueue of TaskD. The reason for this might be that you wish to run TaskD on a different NUMA node than TaskC or TaskB. The thread run restrictions are settable in the qtControl structure.
Constructing diamond-like dependencies is trivial given the right threading toolkit.
Making posts like this might provide the TBB programmers some talking points.
Jim Dempsey
Task generation has its overhead; there are some specialized techniques you can use to reduce it.
One technique is a series of ring-buffer queues. Using non-interlocked write operations, and with defensive code to detect packet loss, you can have relatively fast enqueue/dequeue operations as long as you can live with out-of-order processing, and with an occasional long latency during lost-packet recovery operations.
I haven't written the code, but I think it can be done using six non-interlocked writes (to two different locations) as opposed to two interlocked operations. It should run faster, as the interlocked operations are very long (relatively speaking).
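To illustrate just the "no interlocked operations" part: a plain single-producer/single-consumer ring can be enqueued and dequeued with ordinary loads and stores only. The sketch below, written with C++11 std::atomic for clarity, is a standard SPSC ring, not the multi-queue, loss-detecting scheme described above:
[cpp]// Minimal SPSC ring: no locked RMW instructions, only plain loads and stores.
#include <atomic>
#include <cstddef>

template<typename T, size_t N>        // N must be a power of two
class spsc_ring {
    T buf[N];
    std::atomic<size_t> head;         // written only by the consumer
    std::atomic<size_t> tail;         // written only by the producer
public:
    spsc_ring() : head(0), tail(0) {}

    bool push(const T& item) {        // producer thread only
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N)
            return false;             // full
        buf[t & (N - 1)] = item;
        tail.store(t + 1, std::memory_order_release);  // publish with a plain store
        return true;
    }

    bool pop(T& item) {               // consumer thread only
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;             // empty
        item = buf[h & (N - 1)];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};[/cpp]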
Jim