Boris_D_
Beginner

Draining store buffer on other core


Hello,

I've a weird question:

As I understand it, the MFENCE instruction causes the store buffer to be drained on the core on which it was executed.

Is there some way for a thread on core A to cause draining of the store buffer of core B, without running on core B? Maybe some dirty trick like simulating I/O or exception interrupts?

 

Thanks!


11 Replies
jimdempseyatthecove
Black Belt

That is an interesting question/prospect. Typically the cache coherency system takes care of this on a cache line by cache line basis, but not in toto.

What you seem to be asking about is a hypothetical XMFENCE (external memory fence) instruction that you could use in lieu of MFENCE. As an observation (hypothesis), lightly synchronized programs may end up issuing many more MFENCEs (orders of magnitude more) than are actually required. In those cases, it might be beneficial for an external core to request that the MFENCE be performed when needed. In order to reduce the number of unnecessary store-buffer drains, perhaps the exploratory XMFENCE would be supplied an address or range (like PREFETCH or MONITOR). Then any/all other cores holding the cache line in their store buffers would flush up to (or through) the specified cache line. In essence this is what is done by reading the cache line, so it needs to be justified why you need the instruction as opposed to reading a location on the cache line.

FWIW you might want to experiment with MONITOR/MWAIT, it might effectively do this type of operation (though not as a documented use).

Jim Dempsey

McCalpinJohn
Black Belt

From a purely architectural perspective, the MFENCE instruction does not have to directly cause the draining of the store buffers on a processor -- it simply has to prevent any subsequent memory references from being executed before the prior loads and stores have reached global visibility.   Of course in practice it makes sense to have the MFENCE instruction cause the store buffers to be drained -- otherwise the processor would simply stall until some other mechanism caused the store buffers to drain.  Intel's choice to make MFENCE drain the store buffers is confirmed in Section 11.10 of Volume 3 of the Intel SW Developer's Guide, where the following list of store-buffer-draining events is included:

  • Generation of an Exception and/or Interrupt
  • Execution of a serializing instruction (CPUID, IRET, and RSM are the only non-privileged serializing instructions)
  • Execution of an I/O instruction
  • Execution of a LOCK operation
  • Execution of the BINIT operation (an external reset operation using the BINIT pin)
  • Execution of an SFENCE instruction
  • Execution of an MFENCE instruction

Clearly interrupting the other core will cause the store buffers to drain, but this seems like an awfully expensive operation compared to simply waiting for the store buffers to drain on their own.

 

Of course the question might make more sense with some more context about the proposed use case....

 

MFENCE instructions are not needed very often in the x86 architecture.  I can think of three cases:

1. Forcing Sequential Consistency by preventing loads from executing before any prior (in program order) stores.   This is discussed in Chapter 4 of "A Primer on Memory Consistency and Cache Coherence" by Sorin, Hill, and Wood.   This deviation from Sequential Consistency is not part of any commonly used shared-memory synchronization constructs, most of which rely on either atomic Read/Modify/Write operations or on the ordering of multiple stores.

2. Forcing ordering when transitioning between "normal" and "streaming" stores. 

3. Forcing ordering for CLFLUSH instructions.

These are all cases for which MFENCE is (sometimes) required for correctness.

It is possible that MFENCE instructions might help performance in producer/consumer codes.  In my testing on a two socket Xeon E5-2680 system (Sandy Bridge EP) adding the MFENCE after a pair of stores (data & flag) actually slowed down the producer/consumer handoff for two cores running on the same chip (by about 10%: from 52.8 ns per one-way handoff to 58.2 ns per one-way handoff), but it did speed up the producer/consumer handoff for two cores on different chips (by about 15%: from 244.4 ns per one-way hand-off to 212.1 ns per one-way handoff).

Boris_D_
Beginner

Hi, Thanks for your replies

"Hypothetical XMFENCE" is exactly what I need. As far as I understand, reading the cache line on core A doesn't flush the store buffer of core B; correct me if I am wrong...

The use case is quite simple: I need to know the value of a certain variable at time T on thread A (this happens only once). The value is only changed by thread B, and after point in time T it should not be changed. Thus I need either an MFENCE in thread B (which is slow), some echoing mechanism, or a global XMFENCE. An echoing mechanism is not affordable for me, as thread B can theoretically halt or exit without echoing.

 

Thanks.

jimdempseyatthecove
Black Belt

>>only changed by thread B, and after point in time T should not be changed

Then what is your reason for concern about one MFENCE during the life of the application?

One option you can explore is to have thread A perform an XADD (exchange and add) of the variable and 0. If the cache line is dirty in core B, this will cause B to flush the cache line; thread A will then read the most recent value (and add 0, and write it back). It will also have the side effect of evicting the cache line from core B, which may introduce other latencies in B if/when other data co-resides in that cache line.

John D., would a LOCK prefix on a mov from memory to register cause the flush as well? The LOCK might be elided for this seemingly useless combination (note that per the reference manual, LOCK is only valid on memory-modification instructions).

Jim Dempsey

jimdempseyatthecove
Black Belt

Now then, I am going to make an assumption on what you are not telling us.

Your application has a producer thread that writes to this variable, or updates this variable, or writes a series of values into a buffer starting at a beginning variable and continuing through memory, or ring buffer. It is a requirement (desired) that the producer thread NOT be hindered by activities of other threads (to as much of a degree as possible).

The requirement of the other thread is to have as little latency as possible in processing the data being produced by the producer thread. Please note that latency and throughput are often on opposing sides of an issue.

One way to handle this is to select a processor with Hyper-Threading. Have one thread on the core be the producer thread; then, assuming you would not otherwise use the second hardware thread on the producer's core, have that second thread read the location(s) to be filled by the producer thread (possibly using MONITOR/MWAIT or PAUSE to reduce core load). Once the change is observed, it issues an SFENCE. This flushes the cache line held by the core, so that a consumer thread (a third thread) residing in a different core can see the update and perform the processing.

As mentioned earlier, latency and throughput are often on opposing sides of an issue. You will have to assess whether reducing the latency adversely affects the throughput.

Jim Dempsey

McCalpinJohn
Black Belt

I think I am a bit confused by the use case... 

1. The MFENCE instruction is not particularly slow -- in isolation it only requires 3 uops, and even if there are lots of them, Agner Fog reports a throughput of one MFENCE per 33 to 36 cycles (10-15 ns) for Sandy Bridge, Ivy Bridge, or Haswell cores.    But I can't think of a reason why there might be a lot of them:

     a. If there is a specific variable that threadB updates and you want to make sure that this is visible to threadA as soon as possible, then simply add an MFENCE in threadB after updating that variable.

     b. If you are updating many variables, this will automatically flush the store buffer for all but the final few stores, so you only need one MFENCE at the end of the sequence to push everything out.

2. Without more detailed information, it is hard to tell why this matters.   When threadA reads the variable it will either get the old value or the updated value.  Additional logic is required to determine whether the value read by threadA is the old value or the new one.  

     a. In some cases (e.g., monotonically increasing values), this information can be derived from monitoring changes in the value.  This requires that threadA "spin" on the value until it changes.  It is guaranteed to happen eventually (provided that threadB actually executes the store causing the variable to be updated), but there certainly may be cases where threadA will get the value sooner if threadB executes an MFENCE (or an SFENCE, which may be cheaper) after updating the variable.

     b.  In most cases an additional "flag" or "version number" variable is required to tell threadA whether the variable has been updated, and the performance depends on how this is implemented.

 

McCalpinJohn
Black Belt

Jim -- are you sure that a locked RMW operation will force the value to be flushed from the remote processor's store buffer?   I don't see this as a documented mechanism for forcing store buffer flushing.

jimdempseyatthecove
Black Belt

If it did not, then you could not possibly have two threads performing XADD without getting out of sequence. IOW, the 2nd XADD could not evict the first thread's pending write (unless it could roll back the other thread, as in TSX).

If the remote thread has a pending write the local thread could not perform an XADD using stale data (without corruption).

Jim

McCalpinJohn
Black Belt

Locked instructions may behave differently than normal instructions in this case.

Since we are talking about WB memory, only one core can have write permission to a line at any given time.   Any store from CoreA will first require a coherence transaction (RWITM) that invalidates the line in all other caches and returns the current value of the line to the local cache.  The store cannot be committed to CoreA's cache until after the RWITM transaction completes, though it might write to the store buffer with the data marked as speculative. 

Look at some cases:

1. Suppose that another processor (CoreB) executes a load to the same cache line after CoreA's RWITM completes, but before CoreA has written the data from its store buffer to its cache.   One possible approach is to respond to the intervention request with the current version of the cache line that is in the cache -- i.e., without the updates that are held in the store buffer (though possibly with other updates that have already drained from the store buffer).  The line is downgraded from M to S (or similar) in CoreA's cache, and the data is copied to CoreB's cache in S state.   At some later time, CoreA needs to drain its store buffer -- but the corresponding cache line no longer has write permission!   So a second RWITM is sent out, the S state copy of the cache line is invalidated in CoreB's cache, and the S state copy of the line in CoreA's cache is upgraded to M.  This time the data from the store buffer is ready to write, so it will be written to the cache as soon as the RWITM transaction completes.   If there is a single update for the cache line in the store buffer, then it is guaranteed to be completed in one "extra" RWITM.  If there are multiple updates to the cache line in the store buffer that are interleaved with stores to other cache lines, then the process could involve additional pairs of intervention/downgrade and RWITM transactions.   Note that after the line is downgraded in CoreA's cache, it might be selected as a victim and silently dropped.  In that case, when the store buffer needs to write to the line, the RWITM will need to allocate a new line in the cache, possibly evicting another cache line.   This seems a bit scary in terms of deadlock/livelock, so one could imagine an implementation that does not perform remote snoops on the store buffer, but which does snoop the store buffer before dropping a line from the L1 Data cache -- forcing the RWITM and store buffer drain to precede victimization of the line.
I am not sure that this is any easier, but it is a reminder that some of the state machines are probably very ugly!  Also note that since the store buffer MUST drain in FIFO order, any mechanism that allows snooping must drain from the oldest entry in the store buffer to the target entry.   These entries might cover several cache lines, all of which might have been downgraded in the cache!

This sequence of events may seem perverse, but I think the only way to avoid it is to snoop the store buffer on interventions.  If you snoop the store buffer on interventions, then the store buffer is effectively just part of the cache and you go back to having a Sequentially Consistent model instead of the Total Store Order model that x86 supports.   So I think that something like the transaction sequence above must be allowed on Intel processors.

2. Suppose that CoreB executes a store to the same cache line after CoreA's RWITM completes, but before CoreA has written (all of the) data from its store buffer to its cache.   I have not worked through this case in detail, but it is clear that it is somewhat trickier -- if the intervention implied by CoreB's RWITM is allowed to proceed without snooping the store buffer, then the cache line will be invalidated in CoreA's data cache, making it very likely to be victimized.    It might be possible to handle this the same way as the previous case -- each core's stores are guaranteed to show up in program order, but the interleaving of those stores is not guaranteed.  Since no transactions are locked, there are no atomicity guarantees to be broken.

3. This all suggests that a simple (?) implementation of locked atomic operations will prohibit external interventions from completing on a cache line between the (locked) RWITM (which causes the data to be read) and the delivery of the updated data to the cache.     Section 11.10 of Volume 3 of the SW Developer's Guide already says that locked operations cause the store buffer to drain, which presumably includes both data from prior stores as well as the data from the locked store.   Any load (intervention) request that arrives while the locked operation is in progress will be either NACK'd or simply delayed until the locked operation is complete.  The latter case is plausible because locked operations are limited to a very small subset of instructions that are guaranteed to complete within a few cycles.


jimdempseyatthecove
Black Belt

Arbitration of coincidental RWITMs could be problematic in terms of performance, unless the cores know in advance which has preferential treatment.

In the mid-1980s I helped design a shared memory system which had a selectable prioritization: a) by CPU number, and b) a very fast shift register that rotated a token amongst competing CPUs. Method a was priority-arbitration, and b was fair-arbitration. Method b is somewhat like a token-ring system. The number of CPUs was small (8 max), the arbitration was somewhat like that of a SASI / SCSI bus.

A lot has happened in the intervening 30 years (to wit John's sketch above).

Jim Dempsey

Sweeney__Tim
Beginner

This is a topic I've been very interested in.  Consider the case of a nonblocking concurrent garbage collector.  A smart pointer type would implement a "write barrier" operation, and user code would execute it tens of millions of times per second across many different threads.  Then a garbage collection thread would run tens of times per second, and implement non-blocking synchronization with user threads.

We want to maximize the efficiency of the 10M's-of-times-per-second user code, and are happy to sacrifice lots of performance in the 10's-of-times-per-second library code.  In this case, the write barrier is:

   mov [thread_flag],1 # tell the garbage collector that it's not allowed to free memory
   sfence # make sure the garbage collector sees the flag before we do the following
   mov eax,[pointer] # load the pointer
   lock inc [eax] # increment atomic reference count
   mov [thread_flag],0 # unblock the garbage collector

The sfence is needed here because the CPU may perform the "mov eax,[pointer]" load before draining the "mov [thread_flag],1" store, and in the meantime the garbage collector may have modified [pointer] and freed the memory underneath it.

This code takes 30 cycles, but it would run in 15 cycles if we could eliminate the sfence in this frequent operation, and instead use the postulated XMFENCE instruction (forcing all other cores to drain their store buffers before continuing) during the infrequent operation on other threads.

Is there another efficient way to accomplish this with just a single atomic and no fence?  In TSX land, we can use multi-instruction atomicity to avoid this complication entirely. Omitting xbegin-failure housekeeping:

   xbegin
   mov eax,[pointer]
   lock inc [eax]
   xend

Thanks for any insight!
