Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Single-producer/single-consumer shared queue

scott_meyer
Beginner

I'm looking at parallelizing a family of algorithms which traffic in 8-byte integers.

The single threaded form of these algorithms is a function-call tree in which the function calls all return 8-byte values. Something like:

long long get_next(...) { static long long x; x += 3; return x; }

except that the actual computation would be somewhat more complex (say 50x) than a single integer addition. The objective is to use multiple cores to evaluate such a tree more rapidly than a single core can.

There is substantial natural parallelism. In particular, at most points where the tree branches, all branches can be evaluated in parallel. The problem is the cost of moving 8-byte values from one CPU to another. Any sort of value-at-a-time atomic mechanism is going to be hundreds of times slower than the function call, probably prohibitive given the cost of what we do in an individual function. However, we can use a bounded shared queue (circular buffer). And now the thread title becomes clear.

What I'm after is the current best practice for such structures on Intel hardware, Xeons to be specific.

-Scott

14 Replies
jimdempseyatthecove
Honored Contributor III

Scott,

Single producer/single consumer ring buffer (two threads) can be handled without interlocked operations.

Ring buffer: each element is cache-line sized and initialized to 0's. The producer's fill index is set to 0, cache aligned, with nothing else in its cache line (padded). The consumer's empty index is likewise set to 0, cache aligned, with nothing else in its cache line. (Cache lines are currently 64 bytes; that may differ later.) Each index, when used, is taken MOD the buffer size.

Producer, on filling, examines the buffer to assure the current value is 0 (typically you have a pointer or value at the front of a cache-line sized struct). If not 0, then spin (_mm_pause() or SwitchToThread()). When 0, write to the buffer at [index], then increment the index mod the buffer size.

Consumer reads the entry at [index]. If 0, then spin (_mm_pause() or SwitchToThread()). When not 0, obtain and save the value, then write 0 to the buffer at [index], then increment the index mod the buffer size.

Note, if passing 0 as data then choose a different value for "empty".
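In code, the scheme might look roughly like this. A minimal sketch only: the type names, the 1024-entry size, and the reliance on a 64-bit build (so the 8-byte store is one instruction), x86 store ordering, and volatile as a compiler barrier are all my assumptions here:

[cpp]#include <emmintrin.h>   // _mm_pause

enum { CACHE_LINE = 64, RING_SIZE = 1024 };

// One 8-byte value per cache line; 0 means "slot is empty".
struct slot
{
    volatile long long  value;
    char                pad [CACHE_LINE - sizeof(long long)];
};

// The struct must be allocated 64-byte aligned so that buf,
// fill and empty each land on their own cache lines.
struct spsc_ring
{
    slot        buf [RING_SIZE];    // zero-initialized
    unsigned    fill;               // producer's index, own line
    char        pad0 [CACHE_LINE - sizeof(unsigned)];
    unsigned    empty;              // consumer's index, own line
    char        pad1 [CACHE_LINE - sizeof(unsigned)];

    void enqueue(long long v)       // producer thread only
    {
        while (buf[fill].value != 0)    // slot not yet drained
            _mm_pause();
        buf[fill].value = v;
        fill = (fill + 1) % RING_SIZE;
    }

    long long dequeue()             // consumer thread only
    {
        while (buf[empty].value == 0)   // slot not yet filled
            _mm_pause();
        long long v = buf[empty].value;
        buf[empty].value = 0;           // hand the slot back
        empty = (empty + 1) % RING_SIZE;
        return v;
    }
};[/cpp]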

Jim Dempsey

scott_meyer
Beginner

Quoting - jimdempseyatthecove

Single producer/single consumer ring buffer (two threads) can be handled without interlocked operations. [...]

Right, that's why I'm looking at it. No need to Heimlich the pipeline gratuitously.

Running a simple test with a 10-element ring buffer, I show an Opteron as running about 54x slower than the corresponding function call. The Xeon is about 80x slower and, unlike the Opteron case, the numbers are very erratic and don't seem to improve along the expected axes. Increasing the size of the buffer, for example, doesn't seem to help much.

OK, obviously, cache lines are key and we'd like to keep the head and tail in separate cache lines as much as we can. Etc., etc.

However, before I start off a long series of experiments, I thought it might be worth asking what the best practice for this sort of thing is. Intel has been known to distribute highly optimized assembler libraries for doing interesting things, and I was hoping that this was one of them...

-Scott

Chris_M__Thomasson
New Contributor I
Quoting - scott_meyer

Right, that's why I'm looking at it. No need to Heimlich the pipeline gratuitously. [...]

I don't have time to give a proper response...
;^(...
However, you can take a look at my unbounded spsc-queue here:
There are several improvements that can be applied to this data structure. For one, it's not optimized for head/tail segregation across separate cache lines.
One neat thing, it does indeed have an eventcount algorithm I invented here:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/aa8c62ad06dbb380

This primitive allows one to wait on boundary conditions; e.g., queue empty, or perhaps queue full.
I am planning an AppCore tweak that should be ready in a week. It will include several improvements on the queue as a whole.
I discuss programming/implementation details about AppCore on `comp.programming.threads'.

Dmitry_Vyukov
Valued Contributor I
Quoting - scott_meyer

There is substantial natural parallelism. [...]

What I'm after is the current best practice for such structures on Intel hardware, Xeons to be specific.

I submitted exactly this kind of data structure to the ISN Knowledge Base a few days ago:

http://software.intel.com/en-us/articles/single-producer-single-consumer-queue

It is cache friendly and uses no atomic RMW operations nor heavy memory fences on the fast path. It is, however, unbounded and uses dynamically allocated nodes.
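The core idea is roughly the following. A simplified sketch only, not the article's actual code (the real version also avoids calling new/delete on every operation by recycling nodes); it assumes x86 store ordering and volatile as a compiler barrier:

[cpp]template<typename T>
class spsc_node_queue
{
    struct node
    {
        node* volatile  next;
        T               value;
    };

    node*   head_;                      // consumer side
    char    pad_ [64 - sizeof(node*)];  // keep producer side on
    node*   tail_;                      // its own cache line

public:
    spsc_node_queue()
    {
        node* stub = new node;
        stub->next = 0;
        head_ = tail_ = stub;           // starts with one dummy node
    }

    void enqueue(T v)                   // producer thread only
    {
        node* n = new node;
        n->next = 0;
        n->value = v;
        tail_->next = n;                // publish: one plain store
        tail_ = n;
    }

    bool dequeue(T& v)                  // consumer thread only
    {
        node* n = head_->next;
        if (n == 0)
            return false;               // queue is empty
        v = n->value;
        delete head_;                   // retire the old dummy
        head_ = n;                      // n becomes the new dummy
        return true;
    }
};[/cpp]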

As Chris noted, it's possible to make it blocking, i.e. so that pop() on empty queue will block and wait.

Dmitry_Vyukov
Valued Contributor I
Quoting - Chris_M__Thomasson

However, you can take a look at my unbounded spsc-queue here: [...]

Chris, can you please review it and check that I have done everything correctly and have not made some foolish mistake:

http://software.intel.com/en-us/articles/single-producer-single-consumer-queue

Dmitry_Vyukov
Valued Contributor I
Quoting - scott_meyer

[...] However, before I start off a long series of experiments, I thought it might be worth asking what the best practice for this sort of thing is. Intel has been known to distribute highly optimized assembler libraries for doing interesting things, and I was hoping that this was one of them...

To the best of my knowledge, none of Intel's libraries contains such a queue. TBB contains only a multi-producer/multi-consumer queue, which is very slow compared to a function call (there are several atomic RMW operations, and several cache-line transfers, per operation).

Cache lines are the key! Undoubtedly! If you make even a single error in data layout, you will get a 100x slower solution! No joke!

You can post your implementation here. With an array-based unbounded queue I was able to achieve performance of around 15 cycles per enqueue+dequeue pair.

Dmitry_Vyukov
Valued Contributor I

Quoting - jimdempseyatthecove

Single producer/single consumer ring buffer (two threads) can be handled without interlocked operations. [...]

I think it will be the fastest solution, if boundedness is OK.

Also it's crucial to place consumer index and producer index at different cache lines.

And since we dedicate a whole cache line to each value, I think that we can just add a dedicated flag to mark the cell full/empty. Something like:

[cpp]template<typename T>
class queue
{
    static size_t const cache_line_size = 64;

    struct cell
    {
        T       value;
        bool    is_full;
        char    pad [cache_line_size - sizeof(T) - sizeof(bool)];
    };
    //...
};[/cpp]
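The matching operations would then be roughly as follows. A sketch only: cells_, size_, head_ and tail_ are assumed member names, is_full should be volatile so the compiler neither caches nor reorders it, and x86's store ordering is relied on to keep the value visible before the flag:

[cpp]void enqueue(T const& v)        // producer thread only
{
    cell& c = cells_[head_ % size_];
    while (c.is_full)           // wait until consumer drained it
        _mm_pause();
    c.value = v;                // write the value first...
    c.is_full = true;           // ...then publish the flag
    head_ += 1;
}

T dequeue()                     // consumer thread only
{
    cell& c = cells_[tail_ % size_];
    while (!c.is_full)          // wait until producer filled it
        _mm_pause();
    T v = c.value;
    c.is_full = false;          // hand the cell back
    tail_ += 1;
    return v;
}[/cpp]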

Dmitry_Vyukov
Valued Contributor I
Quoting - Chris_M__Thomasson

One neat thing, it does indeed have an eventcount algorithm I invented here:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/aa8c62ad06dbb380

This primitive allows one to wait on boundary conditions; e.g., queue empty, or perhaps queue full.

Btw, Chris, there is a nasty thing wrt the spsc-queue and eventcount: the producer must execute a #StoreLoad style fence (x86's mfence) in every enqueue().

One possible way around is to use an asymmetric eventcount:

http://groups.google.com/group/comp.programming.threads/browse_thread/thread/39a5a91c029c1c51

jimdempseyatthecove
Honored Contributor III

Scott,

Your thread title "Single-producer/single-consumer shared queue" implies a two-thread system (accessing a given queue), but your thread body states "There is substantial natural parallelism". Seeing as how many systems now have 4 cores or more, wouldn't your runtime requirements dictate the use of more cores?

Depending on your requirements you could use a group of s-p/s-c queues between each thread combination (i.e. as a private pipe between threads). But then the producer would have to select a queue for insertion. Round robin might be OK if the work loads are approximately equal. If the work loads are not equal then you might consider other techniques to balance the loads.
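For example, round-robin selection over a set of private pipes might look like the following. A hypothetical sketch: the fan_out name is mine, and it reuses the spsc_ring type sketched earlier in the thread:

[cpp]// Round-robin fan-out over per-consumer private pipes.
struct fan_out
{
    spsc_ring*  pipes;      // one private pipe per consumer thread
    unsigned    count;      // number of consumers
    unsigned    next;       // producer-local cursor, never shared

    void send(long long v)
    {
        pipes[next].enqueue(v);         // deliver to one consumer
        next = (next + 1) % count;      // rotate for rough balance
    }
};[/cpp]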

If we had a sketch of what you wish to do it might be helpful to us in producing a suggestion that makes sense.

Jim Dempsey

scott_meyer
Beginner

Quoting - jimdempseyatthecove

[...] Depending on your requirements you could use a group of s-p/s-c queues between each thread combination (i.e. as a private pipe between threads). But then the producer would have to select a queue for insertion. Round robin might be OK if the work loads are approximately equal. If the work loads are not equal then you might consider other techniques to balance the loads.

You're on the right track. I want to turn a function call tree into thousands, possibly hundreds of thousands, of single-producer/single-consumer queues; then have driver loops (1 or more per CPU) which service each queue in turn. Lots of interesting issues in the driver loop wrt. minimizing scheduling overhead, but these are all downstream of the basic function-call comparison performance requirement.

If you want a bigger picture, think of an actor system where the connection between actors is a shared queue. Quite similar to a process in Erlang but highly optimized for large numbers of 64-bit messages. Erlang uses locked cmpxchg to deliver a more general message structure and is thus hopelessly slow for my application.

Anyway... the net of the discussion thus far is that we need to keep head and tail in separate cache lines and allocate queue storage in N cache-line multiples, where N is probably at least 3, then try to keep head and tail always in different cache lines. Hmm... that's at least 5 cache lines total, or 40 long longs, per queue. Runtime memory expense++... I was hoping for something more on the order of 6 long longs, but the memory expense may be supportable if the performance is good enough.

Thanks for the comments,

-Scott

Dmitry_Vyukov
Valued Contributor I
Quoting - scott_meyer

You're on the right track. I want to turn a function call tree into thousands, possibly hundreds of thousands, of single-producer/single-consumer queues; then have driver loops (1 or more per CPU) which service each queue in turn. [...]

Interesting!

First of all, if you have hundreds of thousands of queues, and a worker thread has to poll all of them... well... it doesn't look like a very fast solution, even if the queues themselves are very fast... unless I am completely missing something.

Work-stealing scheduling won't help here, because typical implementations execute a cmpxchg (or a heavy memory fence) in the pop() function. Though it's possible to create some heuristics taking into account the very large number of very small work-items.

Also I don't think that you need such a high number of spsc queues, because they are intended for transferring data between threads. If you have, for example, 4 threads and 100,000 queues, it looks quite senseless: you can connect all threads with just 12 queues (each of the 4 threads gets a one-way pipe to each of the other 3). You must do your best to keep work-items local to their thread, and not send them to other threads at the first opportunity. Transferring work-items between threads has huge overheads, even if no cmpxchg is executed.

Dmitry_Vyukov
Valued Contributor I
Quoting - scott_meyer

Anyway... the net of the discussion thus far is that we need to keep head and tail in separate cache lines and allocate queue storage in N cache-line multiples, where N is probably at least 3, then try to keep head and tail always in different cache lines. Hmm... that's at least 5 cache lines total, or 40 long longs, per queue. Runtime memory expense++... I was hoping for something more on the order of 6 long longs, but the memory expense may be supportable if the performance is good enough.

You can consider the following variant. Brief pseudo-code:

[cpp]struct spsc_queue_body
{
...
};

struct spsc_queue_producer_part
{
  void* producer_index;
  spsc_queue_body* body;
};

struct spsc_queue_consumer_part
{
  void* consumer_index;
  spsc_queue_body* body;
  
  void connect(spsc_queue_producer_part* producer)
  {
    ...
  }
};
[/cpp]

The point is that you can place the consumer and producer parts of the queue manually wherever you want, and then "connect" them into a single queue. So you can allocate a dense array of 100,000 consumer parts and waste no memory.

Further, it's possible to "compress" the consumer and producer parts to a single pointer each. If the "body" of the queue is allocated with proper alignment, then it's possible to detect the moment when the index wraps around without any additional data. Something like:

[cpp]void producer_part::enqueue(uint64_t v)
{
  // current_pos is the whole producer part: a single pointer into a
  // 64-byte aligned body that encodes both the buffer address and
  // the position; 0 marks an empty slot
  while (current_pos[0])                    // wait until consumer drains it
    backoff();
  current_pos[0] = v;
  current_pos += 1;
  if (((uintptr_t)current_pos % 64) == 0)   // crossed a cache-line boundary?
    current_pos -= 64 / sizeof(uint64_t);   // wrap back to the line start
}[/cpp]

Also, it's not necessary to place every item on a separate cache line; if your producer and consumer usually produce and consume in batches, then it's even better to place all items in one cache line.

scott_meyer
Beginner
Quoting - Dmitriy V'jukov

Also I don't think that you need such a high number of spsc queues, because they are intended for transferring data between threads. [...] You must do your best to keep work-items local to their thread, and not send them to other threads at the first opportunity.

For any specific case, one could probably come up with a mapping of a tree onto several threads that would be more efficient than creating a separate "actor" for every node in the call tree. However, doing this in general is very hard and involves implementing two forms of interaction (message send and function call) at every level and being able to switch back and forth. What happens when a function call has to wait for a message to arrive?

In contrast, creating an actor for everything is an easy transform, and the code need only support message sending. The question is, can the hardware be coaxed into executing the message-send form fast enough to make it worthwhile? If not, we stick with function calls and sell bandwidth instead of latency.

-Scott

Dmitry_Vyukov
Valued Contributor I
Quoting - scott_meyer

For any specific case, one could probably come up with a mapping of a tree onto several threads that would be more efficient than creating a separate "actor" for every node in the call tree. [...] What happens when a function call has to wait for a message to arrive?

Well, I've created several run-times for agent-oriented programming, and I can say that it's not obligatory to create a queue per agent (unless you need strict FIFO guarantees), nor is it obligatory to be able to switch between message-passing and direct function calls.

You just have to pack the data along with a pointer to the agent and put this "message" into a per-thread queue/stack. A thread polls only 1 queue, fetches the data and the pointer to the agent, and then calls the appropriate method of the agent, passing the data in.

This way you need only NUMBER_OF_THREADS queues, and you still have the commonality of message-passing.
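Roughly like this. A sketch only: the agent interface and message layout are illustrative names, and the mailbox reuses the spsc_node_queue sketched earlier in the thread:

[cpp]// One mailbox per worker thread instead of one queue per agent.
struct agent
{
    virtual void accept(long long data) = 0;    // agent's handler
};

struct message
{
    agent*      target;     // which agent the data is for
    long long   data;       // the 8-byte payload
};

void worker_loop(spsc_node_queue<message>& mailbox)
{
    message m;
    for (;;)
    {
        while (!mailbox.dequeue(m))     // this thread's only queue
            _mm_pause();
        m.target->accept(m.data);       // dispatch as a plain call
    }
}[/cpp]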

If you want I can go into more details.
