Intel® Moderncode for Parallel Architectures

Multi-producer/multi-consumer SEH-based queue

Dmitry_Vyukov
Valued Contributor I
I've posted a novel multi-producer/multi-consumer queue algorithm here:
http://software.intel.com/en-us/blogs/2009/08/11/multi-producermulti-consumer-seh-based-queue/
The algorithm has quite appealing characteristics:
- intrusive
- producers: 1 XCHG, wait-free
- consumers: 1 CAS on the common path, mostly lock-free
- producers and consumers do not contend with each other (until the queue is empty)
- no need for safe memory reclamation
I bet it will beat the Michael and Scott queue to death in every way.
If you are interested, we may discuss it here.
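
For readers who have not followed the blog link, here is a minimal, hedged sketch of the kind of single-XCHG intrusive enqueue that the "producers: 1 XCHG, wait-free" claim refers to. The names (node, mpmc_queue, stub_) are illustrative only, and the SEH-based consumer path and ABA-counted head from the blog post are deliberately omitted; this is not the post's code, just the general shape of the producer side.

[cpp]#include <atomic>

struct node
{
    std::atomic<node*> next_ {nullptr};
};

struct mpmc_queue
{
    node stub_;                              // dummy node so tail_ is never null
    std::atomic<node*> tail_ {&stub_};

    // Producer: one atomic exchange plus one store to link the predecessor.
    // There are no loops, so the producer path is wait-free.
    void enqueue(node* n)
    {
        n->next_.store(nullptr, std::memory_order_relaxed);
        node* prev = tail_.exchange(n, std::memory_order_acq_rel); // the single XCHG
        prev->next_.store(n, std::memory_order_release);           // publish the link
        // A consumer that reaches 'prev' before the store above sees
        // next_ == nullptr; handling that window is what the consumer-side
        // machinery described in the blog post (SEH) is about.
    }
};
[/cpp]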

Chris_M__Thomasson
New Contributor I
Quoting - Dmitriy Vyukov
I've posted a novel multi-producer/multi-consumer queue algorithm here:
http://software.intel.com/en-us/blogs/2009/08/11/multi-producermulti-consumer-seh-based-queue/
[...]

FWIW, here is a link to a previous discussion in `comp.programming.threads':

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c43d1197c8168b4e


The intrusive nature of the queue is extremely attractive indeed. Also, the wait-free nature of push is basically unbeatable. No PDR == very NICE! With regard to comparing it to the Michael and Scott queue, well, it's going to devastate it on all fronts. I am sure of that; no testing is even needed!

;^)


Also, FWIW, here is the version I created, which does require memory management and is not intrusive:

http://software.intel.com/en-us/forums/showthread.php?t=66573

And here is a full-blown example implementation in C#:

http://cpt.pastebin.com/f72cc3cc1

And of course, the implementation in Relacy Race Detector:

http://relacy.pastebin.com/f59d6ff2b


The non-intrusive nature of my version makes it simpler; however, that does not mean it's better...
Chris_M__Thomasson
New Contributor I
Quoting - Dmitriy Vyukov
I've posted a novel multi-producer/multi-consumer queue algorithm here:
http://software.intel.com/en-us/blogs/2009/08/11/multi-producermulti-consumer-seh-based-queue/
[...]

Please note that there is a race condition in the non-intrusive version of the algorithm I posted IF, and ONLY IF, you read the tail pointer BEFORE the tail ABA counter. Here is an explanation, plus the Relacy source code that proves it:

http://groups.google.com/group/comp.programming.threads/msg/62527c65a440516a


http://relacy.pastebin.com/f4b57bda2


I am pretty sure that this will affect Dmitriy's algorithm as well, since he is loading the pointer BEFORE the ABA counter. However, I need to model his algorithm in Relacy before I can give a definitive answer.
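
To make the ordering issue easier to see without following the links, here is an illustrative-only snippet showing how two separate loads of a pointer+counter pair can yield a mixed snapshot; all names here (head_t, node_t, snapshot_pointer_first) are hypothetical and this is not the code from the linked posts. It is the same effect that the Relacy model below simulates with rl::rand.

[cpp]#include <atomic>

struct node_t;  // opaque; only pointers are used here

// Hypothetical double-width anchor: pointer plus ABA counter.
struct head_t
{
    node_t*  ptr_;
    unsigned cnt_;
};

std::atomic<head_t> head;  // assumes a lock-free two-word atomic on the target

head_t snapshot_pointer_first()
{
    // The pattern under discussion: the pointer is read in one step and the
    // counter in another, instead of taking one consistent snapshot.
    head_t h;
    h.ptr_ = head.load(std::memory_order_acquire).ptr_;  // (1) pointer first
    h.cnt_ = head.load(std::memory_order_acquire).cnt_;  // (2) counter second
    // Any number of dequeues can complete between (1) and (2), so h may pair
    // a pointer from an old queue state with a counter from a newer one.
    // Whether that mixed pair is harmful depends on what is done with h.ptr_
    // before a DWCAS re-validates it; the linked Relacy test is what exhibits
    // the failing interleaving for the non-intrusive queue.
    return h;
}
[/cpp]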
Dmitry_Vyukov
Valued Contributor I

I am pretty sure that this will affect Dmitriy's algorithm as well, since he is loading the pointer BEFORE the ABA counter. However, I need to model his algorithm in Relacy before I can give a definitive answer.

Here is the test I used for verification:
http://relacy.pastebin.com/f5df9dcfa

The test does not detect any problems.
Unfortunately it uses new, not-yet-published syntax, but at least you may use it as a base.
I express the load as:

[cpp]        // Load the head twice, then randomly pair the pointer from one
        // load with the counter from the other; this models reading the two
        // halves of the (pointer, counter) pair in either order.
        head_t h1 = head_.next_.load(std::memory_order_acquire);
        head_t h2 = head_.next_.load(std::memory_order_acquire);
        head_t h (0, 0);
        if (rl::rand(2) == 0)
        {
            h.ptr_ = h1.ptr_;
            h.cnt_ = h2.cnt_;
        }
        else
        {
            h.ptr_ = h2.ptr_;
            h.cnt_ = h1.cnt_;
        }
[/cpp]


Chris_M__Thomasson
New Contributor I
Quoting - Dmitriy Vyukov

Here is the test I used for verification:
http://relacy.pastebin.com/f5df9dcfa
[...]

I will try to convert the example code to Relacy 1.5.0. BTW, how did you eliminate the need for `$'? That's really cool! Anyway, I have one comment. I notice that you are performing the "out-of-order" load only one time; then you enter the "DWCAS loop" and never perform it again upon DWCAS failure. That will not give much of a window for the race, if any, to occur. Might I suggest that you rework the algorithm such that the "out-of-order" load occurs on every iteration of the loop? Something like:


[cpp]    node_t* dequeue()
    {
        for (;;)
        {
            head_t h1 = head_.next_.load(std::memory_order_acquire);
            head_t h2 = head_.next_.load(std::memory_order_acquire);
            head_t h (0, 0);
            if (rl::rand(2) == 0)
            {
                h.ptr_= h1.ptr_;
                h.cnt_ = h2.cnt_;
            }
            else
            {
                h.ptr_= h2.ptr_;
                h.cnt_ = h1.cnt_;
            }
            // [...]
        }
        // [...]
    }[/cpp]

Or perhaps even:
[cpp]
    node_t* dequeue()
    {
        rl::backoff b;
        for (;;)
        {
            head_t h (0, 0);
            head_t h1 = head_.next_.load(std::memory_order_acquire);
            h.ptr_ = h1.ptr_;
            b.yield();
            head_t h2 = head_.next_.load(std::memory_order_acquire);
            h.cnt_ = h2.cnt_;
            // [...]
        }
        // [...]
    }[/cpp]
Dmitry_Vyukov
Valued Contributor I
I will try to convert the example code to Relacy 1.5.0. BTW, how did you eliminate the need for `$'? That's really cool!

#define memory_order_XXX mo_XXX, $

The same is done for all WinAPI/POSIX functions, e.g.:

#define WaitForSingleObject(obj, timeout) rl::rl_WaitForSingleObject(obj, timeout, $)

There are still some places where one has to use macros:
1. Declaration/accesses to plain vars:
DECL_VAR(int) x;
VAR(x) = 1;
2. Declaration/accesses to thread local vars:
DECL_TLS(int) y;
TLS(y) = 1;
3. Memory management functions:
NEW(int) (0);
DEL(p);
(hmm... I think I may redefine new/delete too)
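
As a standalone illustration of the preprocessor trick described above (this is not Relacy's actual headers; checked_atomic, DBG, and debug_info are made-up names for the sketch), this is roughly how a macro can make a standard-looking call site pass an extra trailing debug-info argument to an instrumented overload, so the user never writes `$':

[cpp]#include <cstdio>

struct debug_info
{
    const char* file_;
    int         line_;
};

// Stand-in for Relacy's '$': captures the call site.
#define DBG debug_info{__FILE__, __LINE__}

enum mo { mo_relaxed, mo_acquire, mo_release };

// The trick from the post: a single identifier expands to "enum value, debug info".
#define memory_order_acquire mo_acquire, DBG

template<class T>
struct checked_atomic
{
    T v_;
    // The instrumented overload receives the extra trailing parameter.
    T load(mo order, debug_info info) const
    {
        std::printf("load at %s:%d\n", info.file_, info.line_);
        (void)order;
        return v_;
    }
};

int main()
{
    checked_atomic<int> x{42};
    // Reads like standard C++, but expands to x.load(mo_acquire, debug_info{...}).
    int r = x.load(memory_order_acquire);
    return r == 42 ? 0 : 1;
}
[/cpp]

In Relacy's case the same idea is applied to the real memory_order names and to the WinAPI/POSIX wrappers quoted above.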

Btw, I am going to publish the next version under a BSD-style license w/o any additional clauses.

Dmitry_Vyukov
Valued Contributor I
Anyway, I have one comment. I notice that you are performing the "out-of-order" load only one time; then you enter the "DWCAS loop" and never perform it again upon DWCAS failure. That will not give much of a window for the race, if any, to occur. Might I suggest that you rework the algorithm such that the "out-of-order" load occurs on every iteration of the loop?

Yes, of course, this will be better. I was just lazy :)
