Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Possible concurrent_queue improvement

Elad_G_
Beginner

Hi,

I was wondering why the concurrent queue uses compare-and-swap to get the next ticket instead of fetch-and-increment (in concurrent_queue_base_v3::internal_pop_if_present and concurrent_queue_base_v3::internal_push_if_not_full).

Calling compare-and-swap may harm the performance of the queue under high contention; the reasons for this are best explained by Dave Dice here: https://blogs.oracle.com/dave/entry/atomic_fetch_and_add_vs

So if the call to compare_and_swap( tk+1, tk ) were changed to fetch_and_inc(tk), the queue would become more scalable.
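To make the suggestion concrete, here is a minimal sketch of the two ways to claim a ticket, written with C++11 std::atomic rather than TBB's internal atomics (the names are illustrative, not TBB code):

#include <atomic>

std::atomic<unsigned long> tail_counter{0};   // shared ticket counter

// CAS-based claim: under contention the CAS can fail, forcing a retry loop.
unsigned long claim_ticket_cas() {
    unsigned long tk = tail_counter.load();
    while (!tail_counter.compare_exchange_weak(tk, tk + 1)) {
        // tk has been reloaded with the current value; try again
    }
    return tk;
}

// Fetch-and-increment claim: a single atomic operation that always succeeds.
unsigned long claim_ticket_fai() {
    return tail_counter.fetch_add(1);
}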
Any thoughts?

Thanks,
Elad. 

11 Replies
Alexey-Kukanov
Employee

Hi Elad,

The very use of shared atomic counters (for the sake of ordering guarantees) makes the implementation inherently limited in scalability. Nevertheless, it is worth an experiment to see whether changing the CAS to an atomic increment helps concurrent_queue scalability on modern hardware, and by how much.

RafSchietekat
Valued Contributor III

And maybe throw in an atomic_backoff?

Elad_G_
Beginner

Alexey, 
You are right that the FIFO property means the scalability is limited. However, I do believe that using fetch-and-inc instead of CAS will give better performance.

Raf, 
If fetch-and-inc is used, there will be no need for atomic_backoff, since it always succeeds. This is its main advantage over CAS, and the reason it is likely to be more scalable: there is no need to retry when contention is high.

RafSchietekat
Valued Contributor III

There's still a loop around that: as long as you're evaluating performance in that much detail, you might as well have a quick look? I keep wondering how to prevent threads that were unsuccessful a few times from remaining at a disadvantage (they might even get starved!), but it still seems somehow more polite than just hammering away...

Wooyoung_K_Intel
Employee

Hi Elad,

Thanks for the suggestion. As a matter of fact, tbb::concurrent_queue already uses atomic increment (based on tbb::fetch_and_inc()) in its push()/pop() routines.

Note that head_counter/tail_counter are declared as tbb::atomic<ticket>, and their values are incremented with '++' in internal_push/internal_pop.

Now, I think you found the uses of compare_and_swap in internal_pop_if_present/internal_push_if_not_full. These are for bounded queues; as such, we would like a calling thread to return immediately if the condition is no longer met. With xchg, things get really messy (and I am not sure a correct implementation is even feasible), because one of the assumptions the concurrent queue is built upon is that these counters are monotonically increasing.
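Roughly speaking (a sketch with C++11 atomics and made-up names, not the actual TBB source), the bounded try-push has to couple the capacity check with claiming the ticket, which is exactly what the CAS buys:

#include <atomic>

std::atomic<unsigned long> head_counter{0}, tail_counter{0};
const unsigned long capacity = 1024;   // illustrative bound

// Succeed only if the queue is not full. The CAS makes "check the bound"
// and "take the ticket" one atomic step; an unconditional fetch-and-increment
// would advance tail_counter even when we must bail out, breaking the
// assumption that the counters only move forward.
bool try_claim_push_ticket(unsigned long& tk) {
    tk = tail_counter.load();
    do {
        if (tk - head_counter.load() >= capacity)
            return false;              // full: give up without touching the counter
    } while (!tail_counter.compare_exchange_weak(tk, tk + 1));
    return true;                       // tk is our ticket
}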

Thanks.

 

RafSchietekat
Valued Contributor III

Sorry I missed that! I'm trying to concentrate on something else right now, and I've already failed by getting distracted by uses of backoff. :-)

But while on that theme, what is the view about backoff in the queue? I see there's one for pushing, but it's mainly about concurrent_monitor, which seems to go straight to a kernel-level semaphore (right?). I seem to remember that this was a change from an earlier spin-only implementation, but shouldn't there be some middle ground? Just curious...

jimdempseyatthecove
Honored Contributor III

Raf, the general idea is:

Assume you have two queues. Each queue is a ring buffer whose size (rbSize) is greater than max(producers, consumers, arbitrarySize). NULL is used to indicate a cell that is not occupied. One buffer is a list of tokens (addresses of structs/nodes) that your thread must possess in order to insert into the second queue. The second queue is the DoWork queue. IOW, in order to enqueue a node into the DoWork queue, the thread must first obtain a node (into which it places the work items or references). Each DoWork queue has an exclusive free-node queue. Pseudo code:

Enqueue:
index = mod(XCHGADD(&fillIndex, 1), rbSize)
while(rb[index]) WaitOrWorkStealOrDoNothing; // Debug only: an occupied slot should never happen
rb[index] = node
end Enqueue

Dequeue:
index = mod(XCHGADD(&emptyIndex, 1), rbSize)
while(rb[index] == NULL) WaitOrWorkStealOrDoNothing; // or, after the while, use CAS to back off and return NULL
node = rb[index]
rb[index] = NULL
end Dequeue
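For concreteness, here is a rough C++11 rendering of the above (all names are mine, not from any library; counter wrap-around and memory reclamation are ignored for simplicity):

#include <atomic>
#include <cstddef>
#include <vector>

struct Node;   // the work-item node type, defined by the user

// Ring of node pointers as in the pseudo code; rbSize must exceed
// max(#producers, #consumers) so an enqueuer always finds a free slot.
class TokenRing {
    std::vector<std::atomic<Node*>> rb;
    std::atomic<std::size_t> fillIndex{0}, emptyIndex{0};
    std::size_t rbSize;
public:
    explicit TokenRing(std::size_t n) : rb(n), rbSize(n) {
        for (auto& slot : rb) slot.store(nullptr);
    }
    void enqueue(Node* node) {
        std::size_t index = fillIndex.fetch_add(1) % rbSize;   // one XCHGADD, never fails
        while (rb[index].load() != nullptr) { /* WaitOrWorkStealOrDoNothing */ }
        rb[index].store(node);
    }
    Node* dequeue() {
        std::size_t index = emptyIndex.fetch_add(1) % rbSize;  // one XCHGADD, never fails
        Node* node;
        while ((node = rb[index].load()) == nullptr) { /* WaitOrWorkStealOrDoNothing */ }
        rb[index].store(nullptr);
        return node;
    }
};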

Jim Dempsey

RafSchietekat
Valued Contributor III

I'm afraid I don't recognise any of that in TBB's concurrent_queue.

jimdempseyatthecove
Honored Contributor III

Raf, it is not in concurrent_queue; it is used in place of concurrent_queue, though in some respects it behaves like concurrent_queue.

Using TBB currently, Raf might obtain a node from the scalable allocator or heap, insert data, then enqueue it into the current concurrent_queue. Then, on the other side, extract the node from concurrent_queue, do the work, and return the node to the scalable allocator or heap.

Using the suggested system, Raf would obtain a node from a private pool used exclusively by new_concurrent_queue (you change the names to protect the innocent). The new_concurrent_queue object would be constructed to contain the private pool of free tokens (nodes). Enqueue the node to new_concurrent_queue (the work queue). Then, on the other side, extract the node from the work queue, do the work, and return the node to the private queue.

Enqueue can never overflow and will always have a free slot, whose slot number can be obtained with one XCHGADD (no failure as with CAS).
Dequeue of a free node could potentially come up empty-handed, but if coded well it will always/mostly succeed, again at the expense of one XCHGADD.
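As a usage sketch of that round trip (reusing the hypothetical TokenRing and Node from the pseudo code above; none of this is TBB code, and pre-filling the pool with real nodes is omitted):

struct Node { int payload; };            // placeholder node contents
void do_work(int /*payload*/) {}         // placeholder work function

// Hypothetical new_concurrent_queue: a work ring plus its exclusive free-node ring.
struct WorkQueue {
    TokenRing freeNodes;   // private pool of free tokens (nodes)
    TokenRing work;        // the DoWork queue
    explicit WorkQueue(std::size_t n) : freeNodes(n), work(n) {}
};

// Producer side: two XCHGADDs (one to take a token, one to publish work).
void produce(WorkQueue& q, int payload) {
    Node* n = q.freeNodes.dequeue();   // obtain a node from the private pool
    n->payload = payload;              // fill in the work item
    q.work.enqueue(n);                 // enqueue it into the DoWork queue
}

// Consumer side: two more XCHGADDs, i.e. four per full cycle.
void consume(WorkQueue& q) {
    Node* n = q.work.dequeue();        // extract work
    do_work(n->payload);               // process it
    q.freeNodes.enqueue(n);            // return the node to the pool
}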

The question then becomes twofold:

a) When the queue is not under contention, are 4 XCHGADDs less overhead than: allocate a node from the scalable allocator or heap + concurrent_queue fill + concurrent_queue empty + return the node to the scalable allocator or heap?

b) Under highly contentious circumstances, are 4 XCHGADDs less overhead than numerous failures on CASes as implemented in the current scheme of things?

Code would have to be written before one could answer both questions; however, my gut feeling is that a) will be less overhead, and b) will be orders of magnitude less overhead.

Jim Dempsey

RafSchietekat
Valued Contributor III

I see, you were not correcting my analysis but proposing an alternative... Well, the CAS is only for non-blocking try_push() and try_pop() with concurrent_queue, if I'm not mistaken, so b) might have to be reformulated?

(2013-04-03) Shortened.

jimdempseyatthecove
Honored Contributor III

An additional optimization would be for each thread to have a private queue of free nodes (for use with the new_concurrent_queue), where the ring buffer is greater than the sum of the sizes of the private free-node queues. Using this technique, a full cycle through new_concurrent_queue would take a hit of only 2 XCHGADDs.
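A hedged sketch of that refinement (building on the hypothetical WorkQueue above; how nodes migrate back to the producing threads is left open here, just as in the post, e.g. threads that both produce and consume, or occasional rebalancing):

#include <vector>

// Per-thread private pool of free nodes: only the owning thread touches it,
// so taking and returning a node needs no atomic operation at all.
thread_local std::vector<Node*> myFreeNodes;   // pre-filled per thread

// Produce side now costs one XCHGADD...
void produce_tls(WorkQueue& q, int payload) {
    Node* n = myFreeNodes.back();      // no atomic operation
    myFreeNodes.pop_back();
    n->payload = payload;
    q.work.enqueue(n);                 // 1 XCHGADD
}

// ...and the consume side the other: 2 XCHGADDs per full cycle.
void consume_tls(WorkQueue& q) {
    Node* n = q.work.dequeue();        // 1 XCHGADD
    do_work(n->payload);
    myFreeNodes.push_back(n);          // no atomic operation
}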

Jim Dempsey
