Does anyone know how exactly memory barrier is implemented in modern processors?
Same question for lock prefixed operations
This is a large topic with numerous subtleties. Unfortunately the eloquent and comprehensive reply that I just spent the last hour composing was eaten by the !*&!@#% web site when it decided that I had to login again (even though I was already logged in).
The short answer is that vendors are unlikely to discuss the details of their actual implementations, for several reasons.
The best overview I have seen of the high-level issues is "A Primer on Memory Consistency and Cache Coherence" by Sorin, Hill, and Wood.
Although Intel has published a number of documents on memory ordering, I think that the intent is that the "official" statements are now contained in Chapter 8 of Volume 3 of the Intel Architectures SW Developer's Manual (document 325384, revision 053, January 2015).
Of course many of the issues that apply to the implementation of memory barriers also apply to the implementation of lock-prefixed operations. Since both of these involve memory ordering, many of the low-level implementation issues are the same, and involve many of the same undocumented hardware features.
thank you for your answers, guys.
Let me ask a narrower question. I'm just curious: how does it happen that lock-prefixed operations (for example, lock add) are more lightweight than a memory barrier? To my current understanding, lock-prefixed operations act as a full memory barrier as well, so logically they should be heavier than just a memory barrier. But tests show that lock-prefixed ops are faster. Can anyone tell me how to resolve my confusion? :)
Thanks in advance
As discussed in Section 8.2.2 of Volume 3 of the Intel SW Developer's Manual, locked instructions are prevented from being reordered with respect to any other loads and stores; however, they do not force draining of the store buffers the way a memory barrier does (as discussed in Sections 8.2.5 and 8.3).
As long as the locked add is operating on data that is fully contained within a single cache line, it can be executed very efficiently using the "cache locking" mechanisms mentioned in Section 8.1.
The observation that the locked add instruction is faster than the barrier suggests that the cache locking mechanism is less expensive than draining the store buffers. This does not seem surprising from an implementation perspective. Even if the two operations were of comparable difficulty, I would expect the designers to put a lot more effort into optimizing the performance of the heavily used locked atomic operations than into optimizing the performance of infrequently used memory barriers.
It took me a bit to find the links, but there are two other recent forum topics where I discussed the details of pieces of the memory consistency/ordering model:
>>> As long as the locked add is operating on data that is fully contained within a single cache line, it can be executed very efficiently using the "cache locking" mechanisms mentioned in Section 8.1.
I suppose that the cost of a lock-prefixed instruction can be higher when the data is located in memory, at least initially, when the variable is not yet cached.