Concurrent stores seen in a consistent order

T__H__Black · ‎01-09-2013

The Intel Architectures Software Developer's Manual, Aug. 2012, vol. 3A, sect. 8.2.2:

Any two stores are seen in a consistent order by processors other than those performing the stores.

But can this be so?

The reason I ask is this: Consider a dual-core Intel i7 processor with HyperThreading. According to the Manual's vol. 1, Fig. 2-8, the i7's logical processors 0 and 1 share an L1/L2 cache, but its logical processors 2 and 3 share a different L1/L2 cache -- whereas all the logical processors share a single L3 cache. Suppose that logical processors 0 and 2 -- which do not share an L1/L2 cache -- write to the same memory location at about the same time, and that the writes go no deeper than L2 for the moment. Could not logical processors 1 and 3 (which are "processors other than those performing the stores") then see the "two stores in an inconsistent order"?

To achieve consistency, must not logical processors 0 and 2 issue SFENCE instructions, and logical processors 1 and 3 issue LFENCE instructions? Notwithstanding, the Manual seems to think otherwise, and indeed supports its opinion with a seemingly clear example in sect. 8.2.3.7.
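For readers without the Manual at hand, the example in sect. 8.2.3.7 can be sketched roughly as below. Note that the Manual's example uses two distinct locations, x and y, rather than a single one: two processors each store 1 to one location, and two other processors read both locations in opposite orders. The outcome r1 == 1, r2 == 0, r3 == 1, r4 == 0 is the one the Manual says is not allowed. This is only a sketch under assumptions: the function name `count_inconsistent_orders` is hypothetical, and `std::memory_order_seq_cst` is used so that the C++ language itself forbids the outcome, whereas the Manual's claim concerns plain loads and stores at the hardware level.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper (not from the Manual): runs the four-processor
// test `iterations` times and counts how often the two readers
// disagree about the order of the two stores.
int count_inconsistent_orders(int iterations) {
    int inconsistent = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = 0, r2 = 0, r3 = 0, r4 = 0;

        // "Processors 0 and 1": one store each, to different locations.
        std::thread writer_x([&] { x.store(1, std::memory_order_seq_cst); });
        std::thread writer_y([&] { y.store(1, std::memory_order_seq_cst); });

        // "Processors 2 and 3": read both locations in opposite orders.
        std::thread reader_xy([&] {
            r1 = x.load(std::memory_order_seq_cst);
            r2 = y.load(std::memory_order_seq_cst);
        });
        std::thread reader_yx([&] {
            r3 = y.load(std::memory_order_seq_cst);
            r4 = x.load(std::memory_order_seq_cst);
        });

        writer_x.join();
        writer_y.join();
        reader_xy.join();
        reader_yx.join();

        // One reader saw x before y, the other saw y before x.
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0) ++inconsistent;
    }
    return inconsistent;
}
```

Calling `count_inconsistent_orders(200)` should return 0. A test like this can only fail to find a violation; it cannot prove the guarantee, which is why the Manual's wording matters.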

Does this mean that every cache at every level snoops all writes to every other cache, even across cores, even across packages? If so, would this not imply that every store to a valid local line of cache must lock the global address bus? This does not sound right. Is it right? After all, what is the point of multithreading, what is the point of caching, when ordinary stores must always lock global resources? I am confused.

(Incidentally, I have indeed checked the Manual's latest errata, which do not seem to address the matter.)

TimP · ‎01-09-2013

As soon as a processor updates its cache, a copy of that cache line in another processor's L1 or L2 becomes invalid and cannot be accessed (presents a cache miss) until all updates are completed. If a thread is reading a cache line which is modified by another, false sharing occurs, presenting a likely serious performance issue. Snoops have to deal only with updates to last level cache on other packages.
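As a rough illustration of the false-sharing point, in the usual sense where two threads update different variables that happen to sit on the same cache line, the sketch below gives each counter its own line. The 64-byte line size is an assumption (typical for these processors, but not stated in this thread), and the names `PaddedCounter` and `run_two_counters` are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>

// Assumption: cache lines are 64 bytes on the target processor.
constexpr std::size_t kAssumedLineSize = 64;

// Aligning the struct to the line size also pads its size up to a full
// line, so two adjacent counters never share a cache line.
struct alignas(kAssumedLineSize) PaddedCounter {
    std::atomic<long> value{0};
};

// Hypothetical helper: two threads each increment their own counter.
// Without the alignment, each increment could invalidate the other
// thread's copy of the line even though the data is not shared.
long run_two_counters(long increments) {
    PaddedCounter counters[2];
    auto work = [increments](PaddedCounter& c) {
        for (long i = 0; i < increments; ++i)
            c.value.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread a(work, std::ref(counters[0]));
    std::thread b(work, std::ref(counters[1]));
    a.join();
    b.join();
    return counters[0].value.load() + counters[1].value.load();
}
```

The result is the same with or without the padding; only the amount of coherence traffic between the two cores changes, which would show up in timing rather than in the returned sum.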

SergeyKostrov · ‎01-13-2013

>>...Does this mean that every cache at every level snoops all writes to every other cache, even across cores, even across packages?...

In documentation it is clearly stated that in case of Level 2 cache that '...Level 2 cache enables efficient data sharing between two cores to reduce memory traffic to the system bus....'. It is impossible for Logical Processors 3 and 4 to have access to cache lines Level L1&L2 of Logical Processors 1 and 2 ( I indexed CPUs from 1 to 8 ) according to the manual.

Note: Please take a look at page 43 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1: Basic Architecture ( Order Number: 253665-044US ), August 2012.

Bernard · ‎01-13-2013

>>>In documentation it is clearly stated that in case of Level 2 cache that '...Level 2 cache enables efficient data sharing between two cores to reduce memory traffic to the system bus....'.

It is impossible for Logical Processors 3 and 4 to have access to cache lines Level L1&L2 of Logical Processors 1 and 2 ( I indexed CPUs from 1 to 8 ) according to the manual>>>

I think that in the case of desktop CPUs, cache coherency is maintained at the physical core level and does not cross physical core boundaries. For Xeon CPUs, cache coherency may be maintained differently; one would need to consult the software manual.