Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

consistent but stale reads

bank_kus
Beginner
Hi
I'm wondering, in the sequence below, whether Core 2 on Socket 3 can return a stale value of x = 10 long after x = 20 has happened on the other core. Nothing in this sequence seems to violate TLO-CC (from Rick's YouTube presentation).
I understand that with MESI-like coherence protocols there is a single write owner, but I'm wondering whether processors have internal optimizations that cheat and return older reads as long as causality and total lock order are not violated.
Regards
Banks
=====================
Core 0 Socket 0
=====================
store(x, 10)
mfence
store(y, 1)
mfence
......
store(x, 20)
mfence
store(y, 2)
mfence
=====================
Core 2 Socket 3
=====================
r0 = load(x) // returns 10
r1 = load(y) // returns 1
... consistent but stale. Is this possible?
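
For concreteness, here is the same sequence as a minimal C++11 sketch using std::atomic (illustrative only; the default seq_cst ordering stands in for the store/mfence pairs above):

#include <atomic>

std::atomic<int> x{0}, y{0};

void core0_socket0()                // Core 0, Socket 0
{
    x.store(10);                    // store(x, 10); mfence
    y.store(1);                     // store(y, 1);  mfence
    // ......
    x.store(20);                    // store(x, 20); mfence
    y.store(2);                     // store(y, 2);  mfence
}

void core2_socket3()                // Core 2, Socket 3
{
    int r0 = x.load();              // can this still return 10 ...
    int r1 = y.load();              // ... together with 1, long after the second pair of stores?
    (void)r0; (void)r1;
}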
jimdempseyatthecove
Honored Contributor III
How are you asserting your assumptions about time phasing between cores/sockets?
You cannot rely on the time of a breakpoint to assert time phase between processors.
A debug break is not instantaneous across processors.

When r0 (x) returns 10, r1 (y) could see:

a) unknown state prior to store(y,1)
b) 1 after store(y,1) and prior to store(y,2)
c) 2 after store(y,2) and preceding a subsequent store(y,??)
d) ?? after store(y,??) following store(y,2) above

All four returns are consistent, and not necessarily stale.

Jim Dempsey
bank_kus
Beginner
Hi Jim
I asked in the context of SEQLOCKs, but perhaps I should have asked this directly instead of trying to draw a similar example.
SEQLOCK:
=====
Writer:
=====
mutex_lock
incr_version // WRITER in, reader please don't advance if you see this (is_odd test)
mfence
..... modify data
incr_version
mfence
mutex_unlock
======
Reader:
======
do {
v0 = getversion()
if (v0 is odd) continue;
d = snapshot_data
v1 = getversion()
} while (v0 != v1)
queue(d) // to a queue protected by a lock
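
For concreteness, a minimal C++11 sketch of this reader/writer pair, assuming the snapshot is a single int kept in a std::atomic so the example is free of formal data races (the fence placement follows the usual seqlock pattern and only approximates the mfence positions above):

#include <atomic>
#include <mutex>

struct SeqLockedValue {
    std::mutex writer_mutex;                    // mutex_lock / mutex_unlock
    std::atomic<unsigned> version{0};
    std::atomic<int> data{20};

    void write(int v) {
        std::lock_guard<std::mutex> lock(writer_mutex);
        version.fetch_add(1, std::memory_order_relaxed);   // incr_version (now odd: writer in)
        std::atomic_thread_fence(std::memory_order_release);
        data.store(v, std::memory_order_relaxed);          // ..... modify data
        std::atomic_thread_fence(std::memory_order_release);
        version.fetch_add(1, std::memory_order_relaxed);   // incr_version (even again)
    }

    int read() {
        for (;;) {
            unsigned v0 = version.load(std::memory_order_acquire);   // v0 = getversion()
            if (v0 & 1) continue;                                    // writer in, retry
            int d = data.load(std::memory_order_relaxed);            // d = snapshot_data
            std::atomic_thread_fence(std::memory_order_acquire);
            unsigned v1 = version.load(std::memory_order_relaxed);   // v1 = getversion()
            if (v0 == v1) return d;                                  // consistent snapshot
        }
    }
};

Each reader would then queue(d) the returned snapshot into its lock-protected queue, as above.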
---------------
Let's put in some numbers: initially version = 0, data = 20
Writer(0) Version = 1 ; data = 30; Version = 2
Writer(1) Version = 3 ; data = 50; Version = 4
Writer(2) Version = 5 ; data = 70; Version = 6
.....
Question:
Can I expect the queue to contain monotonically non-decreasing values, assuming several readers are running on separate cores?
Regards
Banks
jimdempseyatthecove
Honored Contributor III
The way your write loop is written, a reader could observe:

Version = 1, data = 50, Version = 6 (or a longer interval)
(sampled at Writer(0)+, Writer(1)+, Writer(2)+ respectively)

As well as the snapshot_data being a blur of writer versions.

The IA-32 and Intel 64 processors will assure that the write sequence issued by one core/HW thread/processor is observed by other cores/HW threads/processors either in order or simultaneously, without mfence (i.e., writes combined to the same cache line).

If you have one writer and multiple readers, then this is called Single-Producer Multi-Consumer (SPMC). It makes little sense to use SPMC if the "work" performed by your consumers is solely to insert into a locked queue. You did not indicate whether the consumers' queue is to be ordered or not. If it is unordered, and queue insertion is intermittently relatively long, then the multiple consumers could buffer copies of the writer's data through the intermittent delay period.

*** However, as you sketched the code, the writer is oblivious as to whether a reader has captured the data.

A better route to take (there are several routes) might be for the writer to use a ring buffer.

struct YourWritersData
{
    ...
};

struct YourWritersRingBuffer_SPMC
{
    // YourRingBufferSize is a compile-time constant of your choosing;
    // InterlockedCompareExchange comes from <windows.h> (or <intrin.h>)
    volatile long nextFillIndex;
    YourWritersData RingBuffer[YourRingBufferSize];
    volatile long nextEmptyIndex; // assumes RingBuffer is larger than a cache line
    YourWritersRingBuffer_SPMC() { nextFillIndex = nextEmptyIndex = 0; }

    // get pointer to next fill buffer, returns NULL on buffer overrun
    // called by the single writer
    YourWritersData* getBuffer() {
        if ((nextFillIndex + 1) % YourRingBufferSize == (nextEmptyIndex % YourRingBufferSize))
            return NULL; // buffer overrun (ring full)
        return &RingBuffer[nextFillIndex % YourRingBufferSize]; } // no advance!!!

    // indicate buffer filled in, get next buffer
    // called by the single writer
    YourWritersData* nextBuffer() {
        nextFillIndex = (nextFillIndex + 1) % YourRingBufferSize; // advance after fill complete
        return getBuffer(); } // return buffer* or NULL on buffer overrun

    // pop item from buffer, return NULL if empty
    // called by multiple readers
    YourWritersData* pop() {
        for (;;) {
            if (nextFillIndex == (nextEmptyIndex % YourRingBufferSize))
                return NULL; // ring empty
            long copyNextEmptyIndex = nextEmptyIndex;
            if (InterlockedCompareExchange(
                    &nextEmptyIndex,          // location
                    copyNextEmptyIndex + 1,   // exchange *** NOT reduced by % YourRingBufferSize
                    copyNextEmptyIndex) == copyNextEmptyIndex)
                return &RingBuffer[copyNextEmptyIndex % YourRingBufferSize];
        } // for(;;)
    } // pop
};
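
A hypothetical usage sketch of the above; the loop structure and the spin-on-NULL policy are illustrative, not part of the post:

// single writer thread
void WriterThread(YourWritersRingBuffer_SPMC& rb)
{
    YourWritersData* slot = rb.getBuffer();
    for (;;) {
        while (slot == NULL)             // ring full: wait for readers to drain
            slot = rb.getBuffer();
        // ... fill *slot with the next item ...
        slot = rb.nextBuffer();          // publish this slot, try to obtain the next one
    }
}

// any number of reader threads
void ReaderThread(YourWritersRingBuffer_SPMC& rb)
{
    for (;;) {
        YourWritersData* item = rb.pop();
        if (item == NULL)                // ring empty: spin (or yield/sleep)
            continue;
        // ... consume *item, e.g. insert a copy into the locked queue ...
    }
}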

Note the modulus usage of nextEmptyIndex (and its copy). Keeping nextEmptyIndex itself monotonic, and only reducing it modulo YourRingBufferSize when indexing, provides practical protection against a consumer thread being preempted for a span of intervening fill/empty cycles that is a multiple of YourRingBufferSize. On IA-32 the preemption period would have to last ~4 billion such insertions before a possible adverse situation (over a day of preemption at 1 µs/insertion). This should not occur unless the consumer thread:

has crashed
is waiting at a prompt
is stopped at a debug breakpoint
or the O/S is severely overloaded

If you must have higher protection, use a 64-bit value for nextEmptyIndex and its copy, and the 64-bit InterlockedCompareExchange (4 billion days at a 1 µs insertion rate, 4 million days at a 1 ns insertion rate).
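
For illustration, the 64-bit variant changes only the index type and the compare-exchange call (assuming the Win32 InterlockedCompareExchange64 intrinsic is available), shown here in the same changed-lines style:

volatile LONGLONG nextEmptyIndex;            // was: volatile long

// inside pop():
LONGLONG copyNextEmptyIndex = nextEmptyIndex;
if (InterlockedCompareExchange64(
        &nextEmptyIndex,                     // location
        copyNextEmptyIndex + 1,              // exchange, still not reduced by % YourRingBufferSize
        copyNextEmptyIndex) == copyNextEmptyIndex)
    return &RingBuffer[copyNextEmptyIndex % YourRingBufferSize];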

There are other strategies for avoiding the dequeue Interlocked... operation, but these may have their own set of issues with respect to thread preemption latencies (so does your writer thread, unless it runs on a dedicated processor).

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
After looking at my code, it would be advisable to add a flag to the YourWritersData struct to indicate that the reader is done with the buffer; the same flag can be used by the writer to indicate that the buffer is full of new data.

struct YourWritersData
{
    volatile bool WaitingForConsumer;
    YourWritersData() { WaitingForConsumer = false; }
    ...
};

// get pointer to next fill buffer, returns NULL on buffer overrun
// called by the single writer
YourWritersData* getBuffer() {
    if (RingBuffer[nextFillIndex % YourRingBufferSize].WaitingForConsumer)
        return NULL; // buffer overrun (reader not yet done with this slot)
    return &RingBuffer[nextFillIndex % YourRingBufferSize]; } // no advance!!!

// indicate buffer filled in, get next buffer
// called by the single writer
YourWritersData* nextBuffer() {
    RingBuffer[nextFillIndex % YourRingBufferSize].WaitingForConsumer = true;
    nextFillIndex = (nextFillIndex + 1) % YourRingBufferSize; // advance after fill complete
    return getBuffer(); } // return buffer* or NULL on buffer overrun

// pop item from buffer, return NULL if empty
// called by multiple readers
YourWritersData* pop() {
    for (;;) {
        if (nextFillIndex == (nextEmptyIndex % YourRingBufferSize))
            return NULL; // ring empty
        long copyNextEmptyIndex = nextEmptyIndex;
        if (InterlockedCompareExchange(
                &nextEmptyIndex,          // location
                copyNextEmptyIndex + 1,   // exchange *** NOT reduced by % YourRingBufferSize
                copyNextEmptyIndex) == copyNextEmptyIndex)
            return &RingBuffer[copyNextEmptyIndex % YourRingBufferSize];
    } // for(;;)
} // pop

void ReaderDone(YourWritersData* b) { b->WaitingForConsumer = false; }
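
A hypothetical reader loop against this revised version, showing where ReaderDone() would be called (the processing step is a placeholder):

void ReaderThread(YourWritersRingBuffer_SPMC& rb)
{
    for (;;) {
        YourWritersData* item = rb.pop();
        if (item == NULL)            // ring empty: spin (or yield)
            continue;
        // ... copy/consume *item ...
        ReaderDone(item);            // release the slot back to the writer
    }
}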

Jim Dempsey
bank_kus
Beginner
I think we're completely missing the point here, and I'm honestly asking this of Intel architects and hardware engineers.
Does MESI truly guarantee that all reads to the same address return the very last write, or can there be optimizations such that the second-to-last write is returned? The definition of the last write should be fairly modest: the last core to own that cache line in the "E" state?
I'm not trying to find a good way to do MPSC or SPMC; rather, I'm asking whether SEQLOCKs, implemented as they are today, can suffer from returning stale but consistent reads.
Regards
Banks
jimdempseyatthecove
Honored Contributor III
>>such that second last write is returned.

I will assume you meant the second-to-last write.

Writer      Reader 1      Reader 2

write 0
            read 0        read 0
            observe 0     (interrupt)
write 1
            read 1
write 2
            observe 1
write 3
                          observe 0
----
So yes, you can observe prior write states (as well as the current state) in what you read.
---------------------------------
Near-simultaneous:

Writer      Reader

write 0
write 1     read 0

where the read occurs immediately prior to cache invalidation (in the reader's cache system), which may occur after the fetch of the write instruction, which occurs prior to the write to cache/RAM/cache eviction on other cores/processors. This makes it subjective as to what is defined as being first.

What is assured is:

Writer                       Reader
x = 0
(start)                      (start)
loop:                        loop:
  inc x                        x0 = x
  if (x < limit) goto loop     x1 = x
                               assert(x0 <= x1)
                               if (x1 < limit) goto loop

The reader assert should never trigger. You can insert whatever you want in the reader between the samples of x0 and x1, provided you do not modify x.

*** Note, the above represents the generated assembly code; it is not representative of source code, where the compiler may optimize variables into registers and/or reorder instructions.
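
For example, in C++ source the same guarantee can be kept by making x a std::atomic, which prevents the compiler from caching it in a register or reordering the two reads (a sketch; the bound of 1000000 is arbitrary):

#include <atomic>
#include <cassert>

std::atomic<long> x{0};

void writer()
{
    for (long i = 0; i < 1000000; ++i)
        x.fetch_add(1);                 // inc x
}

void reader()
{
    for (;;) {
        long x0 = x.load();             // x0 = x
        long x1 = x.load();             // x1 = x
        assert(x0 <= x1);               // should never fire: reads follow the write order
        if (x1 >= 1000000) break;
    }
}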

The reader may observe x0 == x1, x0 == (x1-1), ..., x0 == (x1-n), i.e., x0 <= x1.
It should never observe x0 > x1.

This can be stated as: the read sequence order (ascending) follows the write sequence order (ascending), although the observed sequences are not necessarily the same (writer 1, 2, 3, 4, 5, ...; reader 1, 3, 6, ...).

Jim Dempsey


