TBB 2.1 (20080825)
Consider the following code from concurrent_queue.cpp:
void concurrent_queue_base_v3::internal_push( const void* src ) {
    [...]
    // it's a sequentially consistent atomic RMW,
    // so we can consider it to be followed by a #StoreLoad-style memory fence
    ticket k = r.tail_counter++;
    [...]
    // it's a relaxed load
    if( r.n_waiting_consumers>0 ) {
        EnterCriticalSection( &r.mtx_items_avail );
        // signal consumers
        [...]
        LeaveCriticalSection( &r.mtx_items_avail );
    }
    [...]
}
void concurrent_queue_base_v3::internal_pop( void* dst ) {
    [...]
    // it is really empty.. go to sleep FOREVER!
    EnterCriticalSection( &r.mtx_items_avail );
    // it's a relaxed load, followed by a relaxed store (or a relaxed pseudo RMW)
    r.n_waiting_consumers++;
    // it's a load-acquire
    while( r.tail_counter<=k ) {
        // block thread
        [...]
    }
    LeaveCriticalSection( &r.mtx_items_avail );
    [...]
}
Here are several problems.
I. The most serious problem: this logic can lead to DEADLOCK.
Consider the following execution sequence:
1. The producer increments tail_counter.
2. The producer loads n_waiting_consumers; it sees n_waiting_consumers==0, so it does not signal consumers.
3. The consumer increments n_waiting_consumers.
4. The consumer loads tail_counter; the load returns a stale value, so the consumer goes to sleep forever.
To fix this you must issue a #StoreLoad-style memory fence after incrementing n_waiting_consumers.
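A minimal sketch of what the fixed protocol could look like — this is a standalone model of the signaling path, not TBB's actual code; the names tail_counter, n_waiting_consumers, mtx_items_avail, and the condition-variable wait merely stand in for TBB's internals, and C++11 atomics replace the Win32 primitives:

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical minimal model of the push/pop signaling path (not TBB's code).
static std::atomic<int> tail_counter{0};
static std::atomic<int> n_waiting_consumers{0};
static std::mutex mtx_items_avail;
static std::condition_variable items_avail;

void model_push() {
    tail_counter.fetch_add(1);                      // seq_cst atomic RMW
    // On x86 a locked RMW already implies a full fence; an explicit fence
    // makes the #StoreLoad ordering portable across architectures.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (n_waiting_consumers.load(std::memory_order_relaxed) > 0) {
        std::lock_guard<std::mutex> lock(mtx_items_avail);
        items_avail.notify_all();                   // signal consumers
    }
}

void model_pop(int k) {
    std::unique_lock<std::mutex> lock(mtx_items_avail);
    n_waiting_consumers.fetch_add(1, std::memory_order_relaxed);
    // THE FIX: a #StoreLoad-style fence so the increment above becomes
    // globally visible before tail_counter is re-checked below. Without it,
    // the producer can read a stale 0 while the consumer reads a stale tail.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    while (tail_counter.load(std::memory_order_relaxed) <= k)
        items_avail.wait(lock);                     // block thread
    n_waiting_consumers.fetch_sub(1, std::memory_order_relaxed);
}

// One producer, one consumer: exactly the execution sequence above.
bool run_once() {
    tail_counter.store(0);
    n_waiting_consumers.store(0);
    std::thread consumer([] { model_pop(0); });
    std::thread producer([] { model_push(); });
    producer.join();
    consumer.join();
    return tail_counter.load() == 1 && n_waiting_consumers.load() == 0;
}
```

The two fences form the classic Dekker-style pattern: if the producer reads 0 waiters, the consumer's increment had not yet completed its fence, so the consumer's subsequent load of tail_counter is guaranteed to observe the producer's increment and skip the wait.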
II. Because n_waiting_consumers participates in a lock-free algorithm, i.e. it is not always protected by the mutex, it would be better declared volatile, in order to calm the compiler down.
III. n_waiting_consumers and n_consumers_to_wakeup are declared as uint16_t, so the maximum number of waiting threads is about 65k. Well, that's enough for any sane situation for a couple of years. But it can be insufficient even today for all those insane situations in which an insane user decides to use TBB. On my 32-bit system with 1 GB of physical memory I am able to create 30719 threads.
I see no reason why n_waiting_consumers and n_consumers_to_wakeup are declared as uint16_t rather than uint32_t: they are not used with CAS, and they don't affect critical data layout.
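To illustrate the hazard, here is a hypothetical sketch (the helper name is mine, not TBB's): if the waiter count ever reaches 65536, a uint16_t wraps to 0, so a producer checking n_waiting_consumers>0 would see no waiters and never signal, even though every one of those threads is blocked:

```cpp
#include <cstdint>

// Hypothetical illustration of uint16_t wraparound in a waiter counter.
// Each blocking consumer increments the count; at exactly 65536 waiters
// the counter reads 0 again, and producers would skip the wakeup.
uint16_t count_waiters_u16(unsigned waiters) {
    uint16_t n_waiting_consumers = 0;
    for (unsigned i = 0; i < waiters; ++i)
        ++n_waiting_consumers;   // wraps modulo 65536
    return n_waiting_consumers;
}
```

The 30719 threads mentioned above still fit in 16 bits, but nothing in the queue enforces that bound.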
And exactly the same situation applies to the reverse signaling, i.e. when a consumer signals a producer that the queue's size has decreased.
Btw, Relacy Race Detector can easily reveal such errors (item I). Moreover, the unit test can be as simple as: 2 threads, thread 1 enqueues a single item, thread 2 dequeues a single item. On this simple unit test Relacy will detect the deadlock within a second. (SPIN/Promela won't detect it, because it doesn't support relaxed memory models.)
Oh, sorry, this one must go to the TBB forum.