Asynchronous message processing while maintaining single threaded support.

michaelmarcin · ‎09-09-2008

I have a publish/subscribe system which is designed to allow for low coupling asynchronous communication between components. The best practices recommended by Intel, Herb Sutter, and others is to write code that can work single threaded but scale to multi/many threads without changing the code's logic.

TBB tasks and algorithms provide a nice way to achieve this for most jobs but AFAIK your work isn't guaranteed to get done until you wait for it much like futures. This makes a lot of sense to me but messages handling can be heavyweight and its completion doesn't cause any signals so I really don't care when it completes as long as it completes sometime and therefor I don't want to wait for it to do so when I publish it.

Since message handlers can publish other messages I can't safely run messages at the publish call in a single threaded program because this recursive publishing could cause the stack to overflow.

I have a main loop where I could wait for pending messages to complete (process to a high water mark?) publishing but this might create an unnecessary synchronization bottleneck for the multi-thread multi/many-core case. However, this is the best idea I can come up with.

I haven't worked much with multi-threaded or asynchronous operations before so I would like to get some feedback and perhaps suggestions for alternative designs.

Dmitry_Vyukov · ‎09-09-2008

MichaelMarcin:

TBB tasks and algorithms provide a nice way to achieve this for most jobs but AFAIK your work isn't guaranteed to get done until you wait for it much like futures. This makes a lot of sense to me but messages handling can be heavyweight and its completion doesn't cause any signals so I really don't care when it completes as long as it completes sometime and therefor I don't want to wait for it to do so when I publish it.

This is not true. TBB tasks, since spawned, start executing in best effort manner, no matter waiting you for them to complete or not.

But here is another caveat. Execution order of tasks is not guaranteed. So for example, you spawn task which recursively and infinitely respawns itself. Then one day this task also spawns second task which must stop whole processing. Second task is not guaranteed to run at all. So you will end up with a kind of deadlock.

michaelmarcin · ‎09-09-2008

randomizer:
This is not true. TBB tasks, since spawned, start executing in best effort manner, no matter waiting you for them to complete or not.

I realize this for the multi-threaded case it will probably start executing immediately but what about the single-threaded case? If TBB spawns no worker threads does this mean that the task executes at spawn or at wait? I don't see any other options.

randomizer:
But here is another caveat. Execution order of tasks is not guaranteed. So for example, you spawn task which recursively and infinitely respawns itself. Then one day this task also spawns second task which must stop whole processing. Second task is not guaranteed to run at all. So you will end up with a kind of deadlock.

So task scheduling is not fair. I can understand that and probably prefer it for most things but I think I can get around that deadlock by having 2 empty tasks for synchronization that I use to double buffer all message handler tasks which would be their children. The active empty task would wait until its children complete and any new messages would create tasks to the back buffer empty task. After the active task completes its wait it would atomically swap itself with the back buffered empty task which could then wait for its children to complete. This gives me the high water mark and I think is somewhat fair.

Dmitry_Vyukov · ‎09-09-2008

MichaelMarcin:

I realize this for the multi-threaded case it will probably start executing immediately but what about the single-threaded case? If TBB spawns no worker threads does this mean that the task executes at spawn or at wait? I don't see any other options.

They will be executed when current task will finish it's execution and give control back to TBB scheduler.

MichaelMarcin:

So task scheduling is not fair. I can understand that and probably prefer it for most things but I think I can get around that deadlock by having 2 empty tasks for synchronization that I use to double buffer all message handler tasks which would be their children. The active empty task would wait until its children complete and any new messages would create tasks to the back buffer empty task. After the active task completes its wait it would atomically swap itself with the back buffered empty task which could then wait for its children to complete. This gives me the high water mark and I think is somewhat fair.

Hmmm... I don't completely get your point. Can this scheme load more than one processor?
Anyway you must be very careful implementing your own message queueing and scheduling. TBB has very efficient low-overhead distributed scheduler. If you will be implementing your own scheduler on top of TBB's scheduler, it will be very easy to destroy performance and scalability of TBB's scheduler.

michaelmarcin · ‎09-09-2008

The point is to be functional in a single threaded environment while scaling to a multi-threaded environment. I don't care about the single threaded performance but it has to function as if it was running in an efficient multi-threaded manner.

The scheme is designed to allow a message A to invoke a handler which sends message A to not infinitely block a single-threaded program.

Since I'm spawning the child tasks immediately I expect that TBB can run the children of a empty synchronization task concurrently. I would also expect that TBB would run the children of both topmost tasks concurrently. However I wouldn't expect it to guarantee that the children finish running until I call wait on the parent and it returns.

I would rather not have this complexity at all. In fact I would like to just spawn a task and know that it will finish. I can't see how I could do that in a single-threaded program however without some sort of call to essentially run all the message handling tasks I have created. And I can't see how that call can not block indefinitely on a recursive message handler unless I limit the work it can do in one call.

michaelmarcin · ‎09-11-2008

I've done some work on the implementation and understand the problem a bit more so I'll try to explain the solution I think I need one more time.

When I publish a message I find the range of subscribers interested in that message and invoke the message handlers on each of them. The caveat is that I don't want to wait for those message handlers to complete, at least not there, and especially not in that call stack.

So I execute the handlers in a task very similar to a non-blocking parallel_for and give it a parent task that I can use to wait on later. When I do this wait later on I create a new parent task which I swap with the current parent task. This new task will be the parent of future non-blocking parallel_for message handling tasks. Then I wait on the old parent task and destroy it when it finishes without ever spawning it (as inspired by the Advanced Task Programming chapter's long running task example in the TBB book).

This is designed to prevent starvation and allow correct execution in a single-threaded environment while minimizing blocking. It also should prevent infinite recursion from handlers that publish messages which spawn the same handlers and can lead to stack overflow. It should also prevent the wait from being too greedy and preventing the main thread from doing work other than processing messages if there are always messages ready to be processed (as the result of publishing messages from handlers).

Does this explanation/proposed implementation make sense?

Dmitry_Vyukov · ‎09-15-2008

If you don't need to wait for tasks to complete, then you can just spawn tasks. That is all. No extra things needed. No parents, no fake tasks, no possibility of stack overflow. Or I am missing something?

michaelmarcin · ‎09-15-2008

If I just spawn the tasks and don't wait for them in a single-threaded program then they will never run right? I suppose I'm trying to prevent starvation of tasks.

Alexey-Kukanov · ‎09-15-2008

Michael, you are right that in order to process spawned tasks in a single-threaded program you need wait_for_all. Your design makes sense to me. In case it can be easily separated from essential IP of your code,I even recommend you to publish it at the forum and/or contribute.

michaelmarcin · ‎09-16-2008

Once I get it stable I will consider generalizing it and contributing it if my employer permits. I don't think they will have any problem.

Thanks.

Dmitry_Vyukov · ‎09-16-2008

MichaelMarcin:
If I just spawn the tasks and don't wait for them in a single-threaded program then they will never run right? I suppose I'm trying to prevent starvation of tasks.

Oh, I see. I was thinking that by 'single threaded environment' you mean situation when you start TBB's worker threads, but there will be only 1 worker thread on single-core machine. It seems that you are talking about situation when you don't start TBB's worker threads at all, only your main thread, right?
Maybe the solution is to start TBB's worker threads? Then all spawned tasks will be executed automatically. I think it can simplify design of your application, because then you don't need to handle 2 different cases. Just single case: your main thread + N worker threads.

Alexey-Kukanov · ‎09-16-2008

Michael,

I thought more of your design, and I think there is some caveat I should warn you of. The TBB scheduler is not FIFO with regard to tasks, which might matter for message processing. In particular, a running task scheduler(including the one starting in wait_for_all) treats its own task pool as LIFO and starts processing most recent tasks first. It might be fine, but I think you'd better be aware of that. Also, a task scheduler is oblivious to parent tasks while draining the pool, which means that the scheduler, once started, won't exit until all tasks in its local pool complete; so if you use two WorkManager instances in the same thread in your application, then call WaitForWork for one of those, in fact all tasks initiated in the second manager will also complete except those that were stolen.

One particularly bad consequence for your design with two swapping roots, I think, is that after roots were swapped, new tasks that use the new root will be executed mainly before tasks that use the previous root, which completion you are waiting for. So waiting could take longer than expected. If this might become a problem, I suggest you use some flag to postpone spawning new tasks after swapping, till the moment wait_for_all finishes. Tasks being added by AddWork can be kept in some shared dynamic data structure (for TBB 2.0 & 2.1, a task_list can be used to both keep the tasks and then spawn them all at once, but in future versions spawning a task_list might become obsolete or inefficient due to possible changes in thetask pool representation).

michaelmarcin · ‎09-16-2008

Ok so this means that if I have a task which always results in spawning another task and it doesn't happen to be stolen by another pool then wait_for_all will never return?

This is indeed very distressing. Thanks for bringing this up.
Waiting to spawn the other tasks would be unfortunate because there might be work that could be accomplished by idle workers while the thread waits for the last few work items to complete.

I may have to think about this problem some more. It seems TBB tasks might not be the correct solution.

P.S.

In the process of implementing the task to recursively do the work I wanted to support auto_partitioner to split the work up. It appears however that auto_partitioner cannot be supported by user made tasks because it only exposes its data via friendship with a select few internal task classes. This seems unfortunate.

RafSchietekat · ‎09-16-2008

Maybe I'm missing something, but why wouldn't concurrent_queue and parallel_do be any help? Otherwise you'll just have to do something with tasks directly what these do together behind the scenes, I presume, but if the scheduler favours performance over fairness you should expect to have to intervene sometimes. To prevent starvation, handle a limited number of tasks locally for efficiency (you can spawn with a counter), and overflow to the back of the queue, perhaps in batch, i.e., assemble overflow stuff locally and then add a container to the back of the queue. Just brainstorming here... does it make any sense?

Dmitry_Vyukov · ‎09-17-2008

MichaelMarcin:

Ok so this means that if I have a task which always results in spawning another task and it doesn't happen to be stolen by another pool then wait_for_all will never return?

Indeed!
http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/30263843/30263253/ShowThread.aspx#30263253
Even worse. If your task recursively spawns itself, then there is no guarantee not only than wait_for_all will return, but also than other tasks will be eventually executed.

As for now TBB suites well only for HPC, i.e. order of execution doesn't matter, no priorities, all tasks must be executed. This is the reason why .NET TPL Team considers changing strict lifo scheduler to something other, it's just too non-intuitive for user in some situations.
http://blogs.msdn.com/pfxteam/archive/2008/08/01/8800195.aspx

ARCH_R_Intel · ‎09-18-2008

TBB uses an approximation of Cilk-style scheduling (work LIFO; steal FIFO).It has good locality and load balancing properties while making certain space guarantees. It enables clean nesting of software components, analogous to the way serial code composes by nested subroutine calls.

FIFO has terrible locality and often has no space guarantees, which often leads to complex throttling mechanisms. Nonetheless FIFO scheduling definitely has its uses, notably for soft real-time. Finding a rationale marriage ofCilk-style and FIFO scheduling remains an open problem. If done rigorously it's a worthyfor aPh.D. thesis.

The "two roots" idea suggests that perhaps said marriage could be accomplished by some kind of epoch mechanism, where each task belongs to an epoch and older epochs are given precedence. There's a long way between a notion and a practical system.

RafSchietekat · ‎09-21-2008

"If done rigorously it's a worthyfor aPh.D. thesis." Shortest thesis ever: keep successor tasks in local queue, append to global FIFO structure at task steal time, only steal from it if no normal tasks are left?

(Added) Well, no local hidden queue (substitute participation in merge tree), only stealing should be delayed.

Dmitry_Vyukov · ‎09-21-2008

Raf_Schietekat:
"If done rigorously it's a worthyfor aPh.D. thesis." Shortest thesis ever: keep successor tasks in local queue, append to global FIFO structure at task steal time, only steal from it if no normal tasks are left?
(Added) Well, no local hidden queue (substitute participation in merge tree), only stealing should be delayed.

Do I get your idea correctly?

__declspec(thread) work_stealing_deque private_deque;

fifo_queue  global_queue;


void spawn_task(task* t)

 global_queue.push(t);


task* get_next_task_to_process()

 task* t;

 if (t = private_deque.pop())

 return t;

 if (t = steal_from_other_private_deques())

 return t;

 if (t = global_queue.pop())

 return t;

 return 0;

RafSchietekat · ‎09-22-2008

That would be the general idea, but there are obviously ways to trade absolute fairness for better performance. Using a distributed queue could be one. The merge could be done with a coarse timestamp granularity (did somebody say "epochs"?), and an idle thread could take a bigger bite than just the front task. A thread could keep tasks local until the end of the execute() or beyond (based on a maximum delay), as long as a determined idle thread can still get at them (so that would be an additional level in get_next_task_to_process()). It should not be too difficult to improve on the current situation where a user has to be creative with data structures to get anywhere.

What would the PhD title be for? Interprocess scheduling? Cooperation with the kernel? Realtime guarantees? A variable number of worker threads related to blocking behaviour? Condition variables? Or did I overlook something?

RafSchietekat · ‎09-26-2008

This thread should probably rejoin the fork LIFO vs FIFO.