Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Eventcount (gate) proposal

Dmitry_Vyukov
Valued Contributor I

I've implemented a sketch of an eventcount for TBB. It can be used as a replacement for Gate in the current TBB scheduler, and/or as the blocking/signalling logic in concurrent_queue, and/or it can be exposed as a public API, and/or a portable condition variable can be implemented on top of it and exposed as a public API. I'd like to know what you think about the implementation and these usages, and whether it's worth it for me to finish the implementation and submit it as an official contribution.

The eventcount is portable, fine-grained, fairly efficient, general-purpose and reusable.
Portable in the sense that it requires only semaphore, mutex and full-memory-fence primitives; no Win32 events and no futexes.
Fine-grained in the sense that it supports notification of an interesting subset of threads, not only the coarse-grained notify_one() and notify_all().
Fairly efficient in the sense that on the fast path the producer overhead is a single load, a single test, a single conditional jump, and possibly a single full memory fence (controlled by the user depending on the algorithm), while the consumer overhead on the fast path is a no-op.
General-purpose in the sense that it includes no management of user state; this simplifies reasoning and implementation and allows reusability.
Reusable in the sense that it can be used with basically any predicate (user algorithm).
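
To give the flavor of the intended usage, here is a rough sketch of the waiting protocol (try_get() and publish() are placeholders for the user's own predicate check and state change; the eventcount member names follow the examples later in this thread, and details may differ in the attached code):

[cpp]// Consumer side: the usual prepare / re-check / block protocol.
void* consume()
{
    for (;;)
    {
        void* data = try_get();
        if (data)                  // fast path: no eventcount involvement at all
            return data;
        ec_.prepare_wait(0);       // register this thread in the waitset (ctx unused in this sketch)
        data = try_get();          // re-check after registration
        if (data)
        {
            ec_.retire_wait();     // state already changed, no need to block
            return data;
        }
        ec_.wait();                // block until notified
    }
}

// Producer side: publish the state change, then notify.
void produce(void* data)
{
    publish(data);                 // make the consumer's predicate true
    // A full memory fence may be needed between publishing and inspecting the waitset;
    // whether it is issued by the caller or implied by notify is up to the user.
    ec_.notify_one();              // or notify_all(), or notify(pred)
}[/cpp]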

The implementation takes into account Arch's requirements:
http://software.intel.com/en-us/forums/showpost.php?p=71895

I've attached the implementation and will post it along with several examples in the following posts.
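
For example, a portable condition variable could be layered on top of it roughly like this (just a sketch; mutex_t stands for any user lock with lock()/unlock(), and the eventcount members follow the examples in this thread):

[cpp]class condition_variable
{
    eventcount ec_;
public:
    template<typename mutex_t>
    void wait(mutex_t& mtx)
    {
        ec_.prepare_wait(0);   // register in the waitset while still holding the user lock
        mtx.unlock();          // release the user lock before blocking
        ec_.wait();            // block until notified (spurious wake-ups remain possible,
                               // so the caller still re-checks its predicate in a loop)
        mtx.lock();            // re-acquire the user lock before returning
    }
    void notify_one() { ec_.notify_one(); }
    void notify_all() { ec_.notify_all(); }
};[/cpp]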

Dmitry_Vyukov
Valued Contributor I
I've found an error regarding management of the user ctx passed into prepare_wait(). If the consumer thread takes the optimized path in commit_wait() (i.e. it does not block on the semaphore, but just notices that its in_waitset_ variable == false), then the producer and consumer do not correctly synchronize on the in_waitset_ variable. This can lead to races on the user's ctx.
In order to fix it, the modification of in_waitset_ in the notify_* functions must be made with memory_order_release semantics, and the first unprotected load of in_waitset_ in cancel_wait() must be made with memory_order_acquire semantics.
Another way to fix it is to remove the unprotected load of in_waitset_ from cancel_wait(); then in_waitset_ will be synchronized solely by means of the mutex. This variant is nice because it does not introduce additional fences into the notify_* functions (not relevant for x86).
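In code, the first variant of the fix would look roughly like this (a sketch using std::atomic just for illustration; the actual member layout is in the attached file, and the helper functions below are made up to show where the fences go):

[cpp]#include <atomic>

struct ec_thread
{
    std::atomic<bool> in_waitset_;
    // ... semaphore, user ctx, etc. as in the attached file
};

// In notify_*(), still under the mutex: a releasing store, so that the consumer's
// acquire load below also sees any writes the producer made to the user's ctx.
void remove_from_waitset(ec_thread* th)
{
    th->in_waitset_.store(false, std::memory_order_release);
}

// In cancel_wait(): the first, unprotected load becomes an acquire load.
bool notified_before_blocking(ec_thread* th)
{
    return th->in_waitset_.load(std::memory_order_acquire) == false;
}[/cpp]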


Dmitry_Vyukov
Valued Contributor I
Note that it's possible to pass information not only from consumer to producer, but from producer to consumer too - the producer may just modify the consumer's ctx before notification. This way the consumer can say what it is waiting for, and the producer can say what it is notifying about. For example, the consumer says "I am waiting for a state change of items 7, 19 and 47" and the producer says "I am notifying you because of a state change of item 47".
Consider the TBB task scheduler. Assume we have 128 processors/worker threads. After notification, a worker thread starts feverishly checking all the other workers' deques in order to find the item. Assume that a check of a remote deque incurs 2 cache misses, let's say 600 cycles. In the worst case the thread will find the item only after checking all 127 remote deques, i.e. roughly 76,000 cycles, or about 25 microseconds.
With a fine-grained eventcount, the producer can explicitly pass the index of the deque where the item resides, so after notification the thread will check that deque first; a sketch of how this could look follows.
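
A sketch of how this could look (the ctx layout and the field name hint_deque_ are made up for illustration; the predicate signature is the one used in the queue example in this thread):

[cpp]// Illustrative ctx a worker thread could pass to prepare_wait().
struct worker_ctx
{
    int hint_deque_; // the producer writes here the index of the deque that just received an item
};

// Producer-side predicate: select a waiter and, before it is woken,
// leave it a hint about where to look first.
struct notify_with_hint
{
    int my_deque_;
    bool operator () (void* ctx, size_t count, size_t idx)
    {
        ((worker_ctx*)ctx)->hint_deque_ = my_deque_;
        return true; // wake this waiter
    }
};

// Consumer side, after wait() returns: start checking from ctx.hint_deque_
// instead of scanning all 127 remote deques from scratch.[/cpp]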

Dmitry_Vyukov
Valued Contributor I
Another improvement that I am thinking of is to allow the producer to decide whether it wants to increase the eventcount's epoch or not:

[cpp]    template<typename predicate_t>
    void notify_relaxed(predicate_t pred, bool precise)
    {
        ...
        if (precise == false)
        {
            // non-precise notification: bump the epoch so that threads currently
            // between prepare_wait() and commit_wait() will not block
            unsigned ep = epoch_.load(memory_order_relaxed);
            epoch_.store(ep + 1, memory_order_relaxed);
        }
        ...
    }
[/cpp]

This is relevant only to the notify(pred) version; notify_one() and notify_all() must increase the epoch unconditionally.

What will this give us? The 'precise' flag determines whether the selective notification is only an optimization and it's OK if some other thread actually consumes the item (the TBB scheduler case), or the selective notification is precise and other threads are unable to consume the item (the queue example above). Informally, 'precise=false' allows spurious wake-ups, which is good for the TBB scheduler because it can increase reactivity, and bad for the queue because spurious wake-ups just add overhead. 'precise=true' does not allow spurious wake-ups; a thread wakes up IFF it is selected by the predicate.

Along with correct synchronization on the in_waitset_ variable (described above), precise notifications allow us to rewrite the queue example in the following way:

[cpp]void enqueue(void* data)
{
    int idx = ++producer_idx_; // atomic
    buffer_[idx] = data;

    struct local
    {
        int idx_;
        bool operator () (void* ctx, size_t count, size_t idx)
        {
            return idx_ == *(int*)ctx;
        }
    } pred = {idx};
    ec_.notify(pred, true); // <--- precise predicate/notification
}

void* dequeue()
{
    int idx = ++consumer_idx_; // atomic
    void* data = buffer_[idx];
    if (data)
        return data;
    //for (;;) // <--------- no more loops
    //{
    ec_.prepare_wait(&idx);
    data = buffer_[idx];
    if (data)
    {
        ec_.retire_wait();
        return data;
    }
    ec_.wait();
    data = buffer_[idx];
    assert(data); // <--------- item MUST BE there
    return data;
    //if (data)
    //{
    //    return data;
    //}
    //}
}
[/cpp]
So spurious wake-ups are totally eliminated, and there are no more loops, as opposed to condition variables, which simply force you to use loops.
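
For contrast, here is what the same dequeue() looks like with a condition variable: the re-check loop cannot be removed, because a condition variable is allowed to wake up spuriously (a minimal mutex-based sketch, not equivalent to the lock-free code above; mtx_ and cv_ are assumed std::mutex/std::condition_variable members):

[cpp]void* dequeue_cv()
{
    std::unique_lock<std::mutex> lock(mtx_);
    int idx = ++consumer_idx_;
    while (buffer_[idx] == 0)   // the loop is unavoidable with a condition variable
        cv_.wait(lock);
    return buffer_[idx];
}[/cpp]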

What do you think?
ARCH_R_Intel
Employee
It's an intriguing variation. Are there measurable performance differences for the queue example?

The only hazard I see is portability of the interface. Precise wakeups might be difficult to implement in alternative implementations. For example, perhaps Eventcount might be implemented quite differently on a thousand-thread system. But then whatever is using Eventcount might be implemented quite differently too.

Dmitry_Vyukov
Valued Contributor I
It's an intriguing variation. Are there measurable performance differences for the queue example?

The only hazard I see is portability of the interface. Precise wakeups might be difficult to implement in alternative implementations. For example, perhaps Eventcount might be implemented quite differently on a thousand-thread system. But then whatever is using Eventcount might be implemented quite differently too.


No, I don't have any performance data. I believe it's possible to construct a synthetic test where the performance difference will be significant. But as for real usages... it's difficult to say... The window between prepare_wait() and commit_wait() is quite small, so probably we may remove epochs completely and rely only on in_waitset_, so that commit_wait() will look like:
[cpp]void commit_wait(ec_thread* th)
{
    if (th->in_waitset_)
        th->sema_.wait();  
    else  
        cancel_wait();  
}  [/cpp]
This, in turn, may hurt the usage in the scheduler (increased latency). But, once again, what will the performance difference be?

Another possibility is to remove the check in commit_wait() altogether:
[cpp]void commit_wait(ec_thread* th)
{
    th->sema_.wait();  
}  [/cpp]
Anyway, the thread has just re-checked the predicate, so what's the sense in checking epoch_ or in_waitset_ a few moments later?

As for portability, I think the eventcount may simply NOT document the absence of spurious wake-ups, and just say that they are reduced.
