Does anyone know how exactly memory barrier is implemented in modern processors?
Same question for lock prefixed operations
This is a large topic with numerous subtleties. Unfortunately the eloquent and comprehensive reply that I just spent the last hour composing was eaten by the !*&!@#% web site when it decided that I had to login again (even though I was already logged in).
The short answer is that vendors are unlikely to discuss the details of their actual implementations, for several reasons.
The best overview I have seen of the high-level issues is "A Primer on Memory Consistency and Cache Coherence" by Sorin, Hill, and Wood.
Although Intel has published a number of documents on memory ordering, I think that the intent is that the "official" statements are now contained in Chapter 8 of Volume 3 of the Intel Architectures SW Developer's Manual (document 325384, revision 053, January 2015).
Of course many of the issues that apply to the implementation of memory barriers also apply to the implementation of lock-prefixed operations. Since both of these involve memory ordering, many of the low-level implementation issues are the same, and involve many of the same undocumented hardware features.
thank you for your answers, guys.
Let me ask a narrower question. I'm just curious: how does it happen that lock-prefixed operations (for example, lock add) are more lightweight than a memory barrier? To my current understanding, lock-prefixed operations act as a full memory barrier as well, so logically they should be heavier than just a memory barrier. But tests show that lock-prefixed ops are faster. Can anyone tell me how to resolve my confusion? :)
Thanks in advance
As discussed in Section 8.2.2 of Volume 3 of the Intel SW Developer's Manual, locked instructions are prevented from being reordered with respect to any other loads and stores; however, they do not force draining of the store buffers the way a memory barrier does (as discussed in Sections 8.2.5 and 8.3).
As long as the locked add is operating on data that is fully contained within a single cache line, it can be executed very efficiently using the "cache locking" mechanisms mentioned in Section 8.1.
The observation that the locked add instruction is faster than the barrier suggests that the cache locking mechanism is less expensive than draining the store buffers. This does not seem surprising from an implementation perspective. Even if the two operations were of comparable difficulty, I would expect the designers to put a lot more effort into optimizing the performance of the heavily used locked atomic operations than into optimizing the performance of infrequently used memory barriers.
It took me a bit to find the links, but there are two other recent forum topics where I discussed the details of pieces of the memory consistency/ordering model:
>>> As long as the locked add is operating on data that is fully contained within a single cache line, it can be executed very efficiently using the "cache locking" mechanisms mentioned in Section 8.1.
I suppose that the cost of a lock-prefixed instruction can be higher when the data is located in memory, at least initially, when the variable is not yet cached.