Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Support for actor programming model

Mukesh_B_Intel
Employee

Hi,

I was wondering if we can implement the actor programming model using TBB.

Thanks in advance

Mukesh

12 Replies
Dmitry_Vyukov
Valued Contributor I

I was wondering if we can implement the actor programming model using TBB.

It depends on what you mean by "using TBB" and on the functional and performance requirements on the actor model implementation.

The simplest way is to create a thread (tbb::tbb_thread) per agent and use a tbb::concurrent_queue for message passing.
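
For illustration, a minimal sketch of this thread-per-agent approach could look as follows (the Message type and the processing loop are assumptions made up for the example; tbb::concurrent_bounded_queue provides the blocking pop, and std::thread stands in for tbb::tbb_thread on current compilers):

[cpp]#include <tbb/concurrent_queue.h>
#include <thread>
#include <string>
#include <iostream>

struct Message { std::string text; bool stop; };

struct Agent
{
    tbb::concurrent_bounded_queue<Message> mailbox;
    std::thread worker; // tbb::tbb_thread on older toolchains

    Agent() : worker(&Agent::run, this) {}

    void send(const Message& m) { mailbox.push(m); }

    void run()
    {
        Message m;
        for (;;)
        {
            mailbox.pop(m);            // blocks until a message arrives
            if (m.stop) return;
            std::cout << "received: " << m.text << "\n";
        }
    }

    ~Agent() { send(Message{"", true}); worker.join(); }
};

int main()
{
    Agent a;
    a.send(Message{"hello", false});
    a.send(Message{"world", false});
}[/cpp]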

robert_jay_gould
Beginner
Quoting - Dmitriy Vyukov

It depends on what you mean by "using TBB" and on the functional and performance requirements on the actor model implementation.

The simplest way is to create a thread (tbb::tbb_thread) per agent and use a tbb::concurrent_queue for message passing.

I remember trying this, but it wasn't too good. I then switched from queues to front-buffer and back-buffer vectors (just as with swapping video buffers): the actor would process one buffer while the other was being filled with new tasks, and once the actor finished its current buffer it would swap to the other one. This greatly reduced locking and increased locality and concurrency, and it worked fine. I had a few thousand (I think 10,000) actors running in near real time. This was just an experiment I did about a year ago, and I didn't have time to continue with the approach, but it worked decently. So if someone has the time to invest in this approach, it seems reasonably viable.

robert_jay_gould
Beginner

I remember trying this, but it wasn't too good. I then switched from queues to front-buffer and back-buffer vectors (just as with swapping video buffers): the actor would process one buffer while the other was being filled with new tasks, and once the actor finished its current buffer it would swap to the other one. This greatly reduced locking and increased locality and concurrency, and it worked fine. I had a few thousand (I think 10,000) actors running in near real time. This was just an experiment I did about a year ago, and I didn't have time to continue with the approach, but it worked decently. So if someone has the time to invest in this approach, it seems reasonably viable.

Just remembered: I didn't use one thread per actor. I had a container of actors and 2x threads (2 per processor), which would iterate through the groups of actors, updating them in batches. This way I reduced the context switching, and thread overhead in general, to support the thousands-of-actors scenario.
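
A rough sketch of such a front/back-buffer mailbox (the Task type and all names are illustrative, not taken from the original experiment) might look like this:

[cpp]#include <vector>
#include <mutex>
#include <functional>

typedef std::function<void()> Task;

class double_buffered_mailbox
{
    std::vector<Task> front_;  // drained by the owning worker thread, no lock needed
    std::vector<Task> back_;   // filled by producers under the lock
    std::mutex back_mutex_;
public:
    void post(Task t)
    {
        std::lock_guard<std::mutex> lock(back_mutex_);
        back_.push_back(std::move(t));
    }
    // Called by the worker that owns the actor: swap buffers once,
    // then process the whole batch without further locking.
    void process_batch()
    {
        {
            std::lock_guard<std::mutex> lock(back_mutex_);
            front_.swap(back_);
        }
        for (Task& t : front_) t();
        front_.clear();
    }
};[/cpp]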

Dmitry_Vyukov
Valued Contributor I

I remember trying this, but it wasn't too good.

No, it's not that it's "not too good"; it depends on the requirements :)

Your approach (1) is definitely harder and takes longer to implement, (2) strictly speaking is not based on TBB and requires manual creation of synchronization primitives, and (3) prohibits blocking in actor callbacks.

Mukesh_B_Intel
Employee
Quoting - Dmitriy Vyukov

It depends on what you mean by "using TBB" and on the functional and performance requirements on the actor model implementation.

The simplest way is to create a thread (tbb::tbb_thread) per agent and use a tbb::concurrent_queue for message passing.

Both, actually:

1) Using TBB (which you have answered)

2) Improving TBB to provide better support for this model.

robert_jay_gould
Beginner
Quoting - Dmitriy Vyukov

No, it's not that it's "not too good"; it depends on the requirements :)

Your approach (1) is definitely harder and takes longer to implement, (2) strictly speaking is not based on TBB and requires manual creation of synchronization primitives, and (3) prohibits blocking in actor callbacks.

I had forgotten what the problem was with my experiment back then, and why I dropped it, but you got it:

"(3) prohibits blocking in actor callbacks."

This was the stumbling block of my implementation. So yes, anyone trying to do this has to be careful about this point in particular.

RafSchietekat
Valued Contributor III

Actors getting callbacks... we're still talking computers, are we?

How big a problem are blocking callbacks in actual practice? Is it perhaps too difficult to use continuations instead? Is anything being done with cluster detection to improve locality?

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat

How big a problem are blocking callbacks in actual practice? Is it perhaps too difficult to use continuations instead?

If you want to access Oracle or PostgreSQL, it is quite difficult to use continuations.

In order to force the user to use only continuations, you have to provide async wrappers for everything: network IO, disk IO, COM port IO, every database's IO, etc.

RafSchietekat
Valued Contributor III
Well... I still don't like the idea of having so many threads (stack space, locality). I would prefer a solution that operates a minimal number of preferably mutually exclusive software threads per hardware thread (maybe user threads?), and then it would be nice to have actors migrate so that the set served almost in batch by each hardware thread correlates well with any communication clustering, corrected for fairness. So my answer to the original question would be: er, no? But maybe I should stick to atomics for now. :-)
Dmitry_Vyukov
Valued Contributor I

I remember trying this, but it wasn't too good. I then switched from queues to front-buffer and back-buffer vectors (just as with swapping video buffers): the actor would process one buffer while the other was being filled with new tasks, and once the actor finished its current buffer it would swap to the other one. This greatly reduced locking and increased locality and concurrency, and it worked fine.

There is a well-known algorithm for this, called something like the "IBM freelist with reversing". The cost per enqueue operation is a single CAS, and the consumer executes a single XCHG operation per *batch* of nodes. The consumer is totally wait-free, and producers are lock-free with respect to each other. Sounds nice...

[cpp]// multi-producer/single-consumer causal-fifo queue
struct agent_queue
{
    struct node
    {
        node*       next;
        void*       data;
    };

    atomic<node*>   tail;                   // producers prepend here
    char            pad [cache_line_size];  // keep producer and consumer fields on separate cache lines
    node*           head;                   // consumer-private list, already in FIFO order

    void enqueue(node* n)
    {
        // single CAS per enqueue: push the node onto the producer-side stack
        n->next = tail.load(relaxed);
        while (!tail.compare_exchange(n->next, n, release));
    }

    node* dequeue()
    {
        // fast path: consume from the private, already-reversed list
        if (head)
        {
            node* n = head;
            head = n->next;
            return n;
        }
        if (tail.load(relaxed) == 0)
            return 0;
        // slow path: grab the whole batch with one XCHG, then reverse it into FIFO order
        node* n = tail.exchange(0, acquire);
        node* prev = 0;
        for (;;)
        {
            node* next = n->next;
            if (next == 0)
                break;
            n->next = prev;
            prev = n;
            n = next;
        }
        head = prev;    // the rest of the reversed batch becomes the private list
        return n;       // n is the oldest node of the batch
    }
};
[/cpp]

Dmitry_Vyukov
Valued Contributor I
Quoting - Dmitriy Vyukov
There is a well-known algorithm for this, called something like the "IBM freelist with reversing". The cost per enqueue operation is a single CAS, and the consumer executes a single XCHG operation per *batch* of nodes. The consumer is totally wait-free, and producers are lock-free with respect to each other. Sounds nice...

... however, it can be further improved. I've designed a FIFO queue especially for actor/agent-oriented libraries. First of all, it removes the nasty "reversing" operation (which seems not very cache friendly); second, producers are wait-free with respect to each other; third, the consumer executes XCHG even more rarely. The only unpleasant thing is that producers can potentially block the consumer, but the probability of this is extremely small, and an agent-oriented run-time can tolerate it just by switching to another agent.

Here is basic algorithm:

[cpp]// node type: intrusive single link plus payload (definition assumed, not in the original snippet)
struct mpscq_node_t
{
    mpscq_node_t* volatile  next;
    void*                   data;   // payload, illustrative
};

// XCHG denotes an atomic exchange (e.g. InterlockedExchangePointer,
// __sync_lock_test_and_set, or a std::atomic exchange)

struct mpscq_t
{
    mpscq_node_t* volatile  head;   // producers XCHG new nodes in here
    mpscq_node_t*           tail;   // consumer-private end
    mpscq_node_t            stub;
};

#define MPSCQ_STATIC_INIT(self) {&self.stub, &self.stub, {0}}

void mpscq_create(mpscq_t* self)
{
    self->head = &self->stub;
    self->tail = &self->stub;
    self->stub.next = 0;
}

void mpscq_push(mpscq_t* self, mpscq_node_t* n)
{
    n->next = 0;
    mpscq_node_t* prev = XCHG(&self->head, n);  // serialization point, wait-free for producers
    //(*) a producer pre-empted here leaves the chain momentarily broken,
    //    which is the small window in which the consumer can be blocked
    prev->next = n;
}

mpscq_node_t* mpscq_pop(mpscq_t* self)
{
    mpscq_node_t* tail = self->tail;
    mpscq_node_t* next = tail->next;
    if (tail == &self->stub)            // skip the stub node if it is at the front
    {
        if (0 == next)
            return 0;                   // queue is empty
        self->tail = next;
        tail = next;
        next = next->next;
    }
    if (next)
    {
        self->tail = next;
        return tail;
    }
    mpscq_node_t* head = self->head;
    if (tail != head)
        return 0;                       // a producer is in the (*) window; try again later
    mpscq_push(self, &self->stub);      // re-insert the stub to detach the last real node
    next = tail->next;
    if (next)
    {
        self->tail = next;
        return tail;
    }
    return 0;
}[/cpp]

You can find more details here:

http://groups.google.com/group/lock-free/browse_frm/thread/55df71b87acb8201

and here:

http://groups.google.com/group/lock-free/browse_frm/thread/8b4784dbe03b7880

robert_jay_gould
Beginner
Quoting - Dmitriy Vyukov
Quoting - Dmitriy Vyukov
There is a well-known algorithm for this, called something like the "IBM freelist with reversing". The cost per enqueue operation is a single CAS, and the consumer executes a single XCHG operation per *batch* of nodes. The consumer is totally wait-free, and producers are lock-free with respect to each other. Sounds nice...

... however, it can be further improved. I've designed a FIFO queue especially for actor/agent-oriented libraries. First of all, it removes the nasty "reversing" operation (which seems not very cache friendly); second, producers are wait-free with respect to each other; third, the consumer executes XCHG even more rarely. The only unpleasant thing is that producers can potentially block the consumer, but the probability of this is extremely small, and an agent-oriented run-time can tolerate it just by switching to another agent.

Indeed it looks like a good implementation as far as I can see; if I get around to reviving my actors experiment I might give it a try. My implementation also allowed producers to block the consumer, but since the blocking is so short and unlikely, I had no problems with it in practice.

However, I had the issue with blocking operations in my original design, which sent everything to the freezer until I have time to revisit it.

Anyways, as mentioned, the asynchronous solution to blocking (using a database or other blocking system) is to have a message sent to save X, and then get a callback to the actor's message queue. The problem with this approach is that the actors become too dependent on these other systems (breaking encapsulation of responsibilities), and it forces the user to think too much about the asynchronicity of the system (defeating the point of a nice API).

One off-the-bat solution I had, but didn't implement yet, was to keep a normal thread pool and another pool for blocking actions. Then the actor's methods/message handlers would be functors that specify their policy as either blocking or non-blocking. This way the runtime could choose which actor tasks to run on which pool, thus giving a bit more intelligence to the system:

[cpp]Actor robert = ActorFactory();
robert.attach(new handler(callbackA));
robert.attach(new handler(callbackB));[/cpp]

The advantage would be that something like this would allow the user to use lambdas in place of the callbacks, while message sorting and callback tables would be done through template specialization.
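
A hypothetical sketch of how such policy-tagged handlers might look (all names are invented for illustration and differ slightly from the snippet above; this is not an existing TBB or actor-library API):

[cpp]#include <functional>
#include <vector>
#include <utility>

enum class policy { non_blocking, blocking };

struct handler
{
    policy p;
    std::function<void()> fn;
};

struct actor
{
    std::vector<handler> handlers;
    void attach(policy p, std::function<void()> fn)
    {
        handlers.push_back(handler{p, std::move(fn)});
    }
};

// A runtime could then route non_blocking handlers to the compute pool and
// blocking ones (database, disk IO, etc.) to a separate pool reserved for them.[/cpp]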

Anyways as I said, I didn't actually try this other approach, but it seemed like a decent idea.
