Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Multi-Threading, Multi-sockets and cache coherency

Brenden_T_
Beginner

I got into a debate with someone on Stack Overflow and I want to make sure I've got my facts straight.

The question concerns cache coherency.  The paper linked below discusses maintaining cache coherency and mentions Intel as one of the manufacturers implementing it:

http://acg.cis.upenn.edu/papers/cacm12_why_coherence_nearfinal.pdf

However, it talks about on-chip cache coherency.  I assume that on a multi-socket system the caches of separate CPUs in separate sockets are NOT kept coherent, hence the need for a memory-barrier opcode.  Is my assumption correct?

I think the question is clear, but I'll clarify if needed.  I believe the bandwidth required would be too great to implement cache coherency between physically separate CPUs; that's what the front-side bus is for.  I suppose some systems may attempt some kind of multi-CPU cache coherency, but I don't see how they could be common, and they are certainly not universal.

Anyway, thanks for your time and for clearing up any misconceptions I may have.

McCalpinJohn
Honored Contributor III

Multi-socket Intel systems are cache coherent between/across sockets.    Very little software exists for systems that have memory that is shared but not guaranteed to be coherent.

Memory barriers are most commonly needed when ordering between "normal" and "nontemporal" stores is required.   The Intel compiler generates these automatically when they are needed. The memory consistency model used by Intel processors is strongly ordered, so no barriers are needed for the most common shared memory access patterns.   The one special case that needs barriers needs them because of store buffers, so that case applies to single-socket systems as well as to multi-socket systems.
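As an illustration of the nontemporal-store case described above, here is a minimal sketch (function and variable names are my own, not from the post): streaming stores are weakly ordered, so an SFENCE is needed before publishing a flag with a normal store.

```cpp
#include <emmintrin.h>   // _mm_stream_si32
#include <xmmintrin.h>   // _mm_sfence

// Hypothetical producer: fill a buffer with nontemporal (streaming) stores,
// then publish a flag with a normal store. Without the sfence, the streaming
// stores could become visible to another core *after* the flag, so a consumer
// could see ready == 1 but stale data.
void publish(int *buf, int n, volatile int *ready)
{
    for (int i = 0; i < n; ++i)
        _mm_stream_si32(&buf[i], i);   // nontemporal store, bypasses the cache

    _mm_sfence();   // order the streaming stores before the flag store
    *ready = 1;     // normal (strongly ordered) store
}
```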

Coherence traffic is carried over the QPI links.   With 2-socket systems the coherence traffic uses a modest fraction of the QPI bandwidth.   For 4-socket and 8-socket systems there are "snoop filters" to eliminate much/most of the cross-chip coherence traffic for the most common usage patterns.

McCalpinJohn
Honored Contributor III

As an aside, I find the paper's arguments to be too high-level to be convincing.    In theory we know how to scale cache coherence well enough to handle expected single-chip configurations.   In practice, on the other hand, cache coherence in multicore chips is becoming increasingly challenging, leading to increasing memory latency over time, despite massive increases in complexity intended to mitigate the issues.   

This does not mean that cache coherence will not be retained in future systems -- it means that I think it is the wrong approach, and that the penalties for maintaining cache coherence (in complexity, energy, latency, etc) are large enough that they block both incremental improvements and radical architectural changes (that could allow much larger improvements in low-level efficiency).

jimdempseyatthecove
Honored Contributor III

John,

Back in the old days (60's-70's) I programmed on the DEC PDP8 series of computers. True, these were single-processor systems. That said, they had an interesting characteristic that would apply to cache coherency on modern systems. On these systems the ALU was placed between the memory subsystem and whatever used it. This allowed not only the instruction stream but also the I/O bus devices to use the ALU. An example was a high-speed A/D converter performing an add directly to memory. There were other odd devices that would use AND, OR, and rotate. The DC02 Teletype device multiplexer could handle 128 ports with the virtual UART registers held inside the RAM of the PDP8I.

Back to the future... today's systems could incorporate some of this old technology to perform the most frequently used primitives.

XADD, BTS, BTC could be performed using a secondary ALU in the memory subsystem. This would handle MUTEX and atomic integer add.

CAS could be implemented as well; this would require sending both the compare element and the set element. That isn't too unusual, as memory subsystems have used a Page and Cell scheme (IOW a packet is sent as opposed to a single strobe).
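For reference, the software-visible semantics of the primitives Jim names map onto today's atomic read-modify-write operations. A minimal C++ sketch (my own illustration, not tied to any particular hardware implementation):

```cpp
#include <atomic>

void demo(std::atomic<long> &counter, std::atomic<unsigned> &bits)
{
    // XADD: fetch-and-add; on x86 this compiles to LOCK XADD.
    long old_count = counter.fetch_add(1);

    // BTS / BTC semantics via fetch_or / fetch_xor (x86 also has LOCK BTS/BTC).
    unsigned prev = bits.fetch_or(1u << 5);      // set bit 5, return the old word
    bool was_set  = (prev >> 5) & 1u;            // old value of the bit
    bits.fetch_xor(1u << 5);                     // complement (toggle) bit 5

    // CAS: the expected value and the new value travel together, which is why
    // a memory-side implementation would need to send two operands.
    long expected = old_count + 1;
    bool swapped = counter.compare_exchange_strong(expected, expected + 1);

    (void)was_set; (void)swapped;                // silence unused-variable warnings
}
```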

This system would not necessarily be slower on a single socket system, as the secondary ALU(s) could additionally be placed at appropriate cache levels.

We have gobs of silicon space now; why not use some of it wisely?

Jim Dempsey

McCalpinJohn
Honored Contributor III

The IBM POWER7 processors include fixed-point ALU functionality in the memory controllers.   The ALUs support ADD, AND, OR, XOR, Compare-and-Swap, and a few other operations on data sizes of 8, 16, 32, or 64 bits.  On the Power 775 ("PERCS") system, these operations can be launched to any memory location in the system with a user-mode instruction.  (There may be a few instructions to set up the 32 Byte command buffer, but a single instruction sends the command to any memory location in the system.)  This feature is used in the RandomAccess benchmark of the HPC Challenge benchmark suite to deliver the highest performance on that benchmark -- more than 4 times the performance of the full "K Computer", using 1/10th the cores and occupying about 1/40th the number of racks (~22 vs 864).    However, like the K computer, the Power 775 system is not inexpensive....

An opportunity with more potential for volume is provided by the Hybrid Memory Cube.  (Intel's acquisition of Altera will make Intel one of the "Developer Members" of the Hybrid Memory Cube Consortium, should Intel decide to continue with this Altera project.)   Because of the long-standing difficulty of implementing high-speed logic in semiconductor processes optimized for DRAM (and the converse difficulty of implementing DRAM in semiconductor processes optimized for high-speed logic), the Hybrid Memory Cube includes a die optimized for high-speed logic at the base of a stack of DRAM dies.   The Through-Silicon Vias connecting these dies force the logic die to be about as large as the DRAM dies, and although part of the logic from the DRAMs has been moved to the logic die (implementing a basic DRAM controller with a simplified external interface), my guess is that there is a lot of silicon area on the logic die that is unused.  (This is just a guess; I don't have any inside information on this.)   I know from some of my recent work that ALUs are tiny -- in a 45 nm process a fully-pipelined (1 GHz) 64-bit floating-point fused-multiply-add unit takes only a bit over 0.04 mm^2.  The DRAM bandwidth of a Hybrid Memory Cube is supposed to be up to 320 GB/s (from the DRAM to the logic layer -- external bandwidth is different), so you would need quite a few ALUs running at O(1) GHz to process the data at full bandwidth.    The Hybrid Memory Cube specification already includes a number of arithmetic operations in the logic layer, including 128-bit or paired 64-bit atomic adds, a set of boolean operations on 128-bit memory values (AND, NAND, OR, NOR, XOR), and five variations of Compare-And-Swap.

jimdempseyatthecove
Honored Contributor III

Do you have a link to a white paper for the Hybrid Memory Cube ALU functionality (not the specification)?

Jim Dempsey

McCalpinJohn
Honored Contributor III

Most of the material out there has more than a little bit of marketing slant, but it is useful as long as you remember that the comparisons may not be fair or reasonable.  Most of the stuff I see on the interwebs refers to version 1 of the HMC specification, not the current version 2 specification (released in November 2014). 

The original Hot Chips presentation is at http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.18.3-memory-FPGA/HC23.18.320-HybridCube-Pawlowski-Micron.pdf

Xilinx provides an overview of HMC in their white paper on the future of DRAM technologies: http://www.xilinx.com/support/documentation/white_papers/wp456-DDR-serial-mem.pdf

jimdempseyatthecove
Honored Contributor III

Unfortunately, neither of those presentations described ALU or other "cooking" functionality co-resident on the logic base plane of the HMC. There is some mention that ECC could be (is intended to be?) on the logic plane as opposed to in the memory controller inside the host CPU (or memory controller glue chip).

IMHO, placing an additional ALU inside the HMC is but one option for where to position it. A second place is at the LLC (it may not be as beneficial to extend this to the caches held inside the core).

Jim Dempsey

McCalpinJohn
Honored Contributor III

The atomic functions required by the HMC standard are described in the standard, but without much discussion of implementation issues.    Version 2 of the HMC standard is available without being a member of the HMC consortium, but members probably have access to more interesting discussions.

The main reason that I would want to put ALU functionality in the logic chip of the HMC stack is to avoid the energy cost of moving data between the logic chip and the processor chip.  The through-silicon vias inside the HMC package are very short and have very low capacitance, so moving data inside the stack is at least an order of magnitude cheaper in energy costs than moving the data between packages.

There may be other good reasons to put an ALU in the LLC.   The extra functionality that I would want to see on-chip is more related to communication and synchronization than arithmetic (which is already overprovisioned).   User-accessible hardware FIFOs and user-level barrier hardware could be very effective at improving the efficiency of fine-grained parallel applications on multicore processors.

jimdempseyatthecove
Honored Contributor III

>> User-accessible hardware FIFOs and user-level barrier hardware could be very effective at improving the efficiency of fine-grained parallel applications on multicore processors.

Precisely what's behind my reasoning.

For example, look at the current handling of writes to the same location or the same cache line. To some extent you can view this as the hardware already providing FIFO access to the cache levels and RAM. By placing ALUs elsewhere (plural) you could effectively perform an atomic increment/decrement/XADD/bit operation, and possibly CMPXCHG, all without the LOCK.

Now be aware that I am not a CPU designer (though I have helped design older processors), so there may be reasons why, in the general case, you would not want to do this. That said, there is no reason why it cannot be done in the specialized case. For example, the paging system already contains attribute bits to reflect the type and capability of whatever a page maps to. This could be extended to include a flag or flags indicating that the page has ALU support. The compiler could then be instructed to place, or the application could perform a special allocation to obtain, a page with the ALU capability. The programmer would place the mutexes and atomic variables in this area.

Before you reject the idea of special allocation, bear in mind that systems already have allocation routines for Non-Paged Pools and for NUMA node allocation, and shortly, when Knights Landing comes out, you will have a specialized allocator for Near Memory.
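As an analogy for this kind of special allocation, here is a minimal sketch using libnuma's node-specific allocator. The "ALU-capable page" attribute discussed above is only a proposal in this thread, so numa_alloc_onnode() is simply the closest existing example of an allocator that returns memory with a special property; the function name alloc_counters_on_node is my own.

```cpp
#include <numa.h>     // libnuma; link with -lnuma
#include <atomic>
#include <new>
#include <cstddef>

// Allocate an array of atomics from a specific NUMA node, in the spirit of
// placing mutexes/atomic variables into a specially allocated region.
std::atomic<long> *alloc_counters_on_node(std::size_t count, int node)
{
    if (numa_available() < 0)
        return nullptr;                                   // no NUMA support

    void *mem = numa_alloc_onnode(count * sizeof(std::atomic<long>), node);
    if (!mem)
        return nullptr;

    auto *ctrs = static_cast<std::atomic<long> *>(mem);
    for (std::size_t i = 0; i < count; ++i)
        new (&ctrs[i]) std::atomic<long>(0);              // construct each atomic in place

    return ctrs;                                          // release later with numa_free()
}
```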

Jim Dempsey

McCalpinJohn
Honored Contributor III

Jim Dempsey wrote:

To some extent you can view this as the hardware already providing FIFO access to the cache levels and RAM. By placing ALUs elsewhere (plural) you could effectively perform an atomic increment/decrement/XADD/bit operation, and possibly CMPXCHG, all without the LOCK.

I would certainly approve of placing functionality in various places in the memory hierarchy, but every time I have brought this up within a processor design team I have been shut down decisively.  The problem (to the extent that I understand it) is that the assumption that memory references can be speculative is pervasive.  Hardware FIFOs, barriers, and ALUs all have side effects, so they cannot be accessed speculatively. Of course processors must support IO space references with side effects in order to configure certain IO devices, but they do this using a very big hammer -- they shut down all out of order memory references, all concurrency in memory references, and all use of the cache hierarchy.  This is disastrous for performance.

To support memory references with side effects at high performance, one would have to redesign both the load/store functionality of the core and the entire cache+memory hierarchy.  All aspects of the system would need to conform to an architectural specification that includes memory references with side effects as a first-class feature.  The hardware would need to be able to support non-speculative accesses with ordered concurrency to the address ranges supporting these specialized memory-mapped functions.   As you noted, something similar is supported for store operations so that OOO processors correctly recover the serial semantics of the code, but extending this to the load side of the memory hierarchy requires a fundamentally different architecture, with a substantially different implementation.

I have argued for many years now that this direction is essential if we ever want parallel processing to become ubiquitous, but the combination of risk and cost is too high for a mainstream product.  If it does happen, it will have to start in a small specialized market and spread outward. 

jimdempseyatthecove
Honored Contributor III

>>I would certainly approve of placing functionality in various places in the memory hierarchy, but every time I have brought this up within a processor design team I have been shut down decisively.

This is the typical NIH syndrome, which often is not corrected until it is too late (meaning someone else has patented the technique and is beating the pants off you at some critical benchmark).

The significant point, which we agree on, is that implementing this requires no change to "legacy" code containing LOCKs (or RTM/TSX). In fact, the new feature might actually fix old bugs (in code that should have used LOCK but did not). In the new design, the LOCK prefix could be interpreted as a hint to use the distant ALU, which would eliminate (or augment) my earlier suggestion about using a page table attribute. The LOCK reinterpretation would be better because existing applications would run faster with no changes to the code. The page table attribute could be used as well, though some simulation would be required to determine the potential benefits.

Jim Dempsey

Travis_D_
New Contributor II

John D. McCalpin wrote:

Memory barriers are most commonly needed when ordering between "normal" and "nontemporal" stores is required.   The Intel compiler generates these automatically when they are needed. The memory consistency model used by Intel processors is strongly ordered, so no barriers are needed for the most common shared memory access patterns.

I don't think this is entirely accurate. The Intel x86-64 memory model (TSO with causal consistency, or whatever they are calling it) is indeed stronger than many models for past and current competing architectures, but it is not true that no barriers are needed for the most common shared memory access patterns. The one reordering that Intel does allow (StoreLoad - later loads can execute before earlier stores) in practice comes up in nearly all concurrent algorithms. It certainly comes up for the traditional mutex lock/unlock, since the lock side needs a StoreLoad barrier (in practice a LOCK CMPXCHG or LOCK XADD) to keep loads from moving out of the critical section.

Similarly, most lock-free algorithms need something like a sequentially consistent CAS operation, which is even stricter than StoreLoad.

Only a very limited number of concurrent algorithms can get away with "plain" (barrier-free) loads and stores, even on x86 - the fast path for double-checked locking and similar "initialize once" patterns are the only ones that immediately spring to mind.
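To make the lock/unlock asymmetry concrete, here is a minimal spinlock sketch (my own illustration, not from the post): the acquire side uses an atomic exchange, which on x86 compiles to an implicitly locked XCHG and supplies the StoreLoad barrier, while the release side can be an ordinary store.

```cpp
#include <atomic>

struct SpinLock {
    std::atomic<int> locked{0};

    void lock() {
        // Atomic RMW: compiles to XCHG (implicitly LOCK'd) on x86 and acts as
        // the full barrier that keeps loads from escaping the critical section.
        while (locked.exchange(1, std::memory_order_acquire) != 0)
            ;   // spin
    }

    void unlock() {
        // Plain release store: an ordinary MOV on x86, no barrier needed,
        // because x86 never reorders a store with earlier loads or stores.
        locked.store(0, std::memory_order_release);
    }
};
```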

John D. McCalpin wrote:

The one special case that needs barriers needs them because of store buffers, so that case applies to single-socket systems as well as to multi-socket systems.

I've often seen this statement, but it isn't correct. Well, formally speaking, the memory model describes permitted reorderings in an abstract enough way that you can't necessarily tell what underlying micro-architectural feature would cause them (although in some cases you can) - but the reorderings on Intel come from more than just the store buffer. The main source of reorderings (other than the store buffer) is that modern x86 CPUs will hoist loads before stores. This effect is not just due to the store buffers - a load can execute even before the value and address of a prior store are available. In the end, though, both the store buffer and the load hoisting have the effect of StoreLoad reordering, so the weakening they imply in the memory model is pretty much the same.

The load hoisting applies equally to single-socket systems, of course, in the same way as store-buffer reordering, as you pointed out. In fact that's pretty much true of all reorderings. Both reorderings are made possible by within-CPU (front-end and back-end) mechanisms, e.g., instruction scheduling and store buffers. Outside the CPU, the cache coherency protocol is responsible for keeping up its end of the illusion, and that applies to single-socket systems as well (e.g., because the L2 is not shared and the L3 is). Moving to multi-socket doesn't change any of the fundamental building blocks - it just makes things harder and slower (e.g., because the links have higher latency), so in practice a different solution may be chosen (e.g., to support more efficient snooping).
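The StoreLoad reordering described above is the one visible in the classic store-buffer litmus test. A minimal sketch with relaxed atomics (my own example; on x86 even plain MOVs show the same behavior):

```cpp
#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t1() { x.store(1, std::memory_order_relaxed); r1 = y.load(std::memory_order_relaxed); }
void t2() { y.store(1, std::memory_order_relaxed); r2 = x.load(std::memory_order_relaxed); }

int main()
{
    std::thread a(t1), b(t2);
    a.join(); b.join();
    // On x86 the outcome r1 == 0 && r2 == 0 is allowed (and observable if the
    // test is run in a loop): each core's store sits in its store buffer while
    // the following load executes early. Making either operation a LOCK'd RMW,
    // or inserting MFENCE between the store and the load, forbids that outcome.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```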

 

Travis_D_
New Contributor II

jimdempseyatthecove wrote:

XADD, BTS, BTC could be performed using a secondary ALU in the memory subsystem. This would handle MUTEX and atomic integer add.

CAS could be implemented as well; this would require sending both the compare element and the set element. That isn't too unusual, as memory subsystems have used a Page and Cell scheme (IOW a packet is sent as opposed to a single strobe).

In practice this would likely be dramatically slower than what we have today. What's confusing about the current LOCK operations is that their very name seems to imply that they have to take some kind of "bus lock", or at least trigger some kind of global coherence activity, which makes pushing them "down to RAM" sound like a good idea.

The reality today is that these operations are capable of completing entirely locally, and probably do just that 99.999% of the time (outside of true contention, in which case you are screwed either way). The cache protocols take care of the cross-core and cross-socket communication in a way that doesn't impose a per-operation cost even on these sequentially ordered operations. Once you have the line in the E (exclusive) state, the CPU can perform the operation completely locally (with some speculation or draining of the store buffers to account for "near misses"). The cost is generally around the cost of a branch mispredict, perhaps 10 to 20 cycles. It is the same cost on multi-socket systems (true contention cases may be more expensive on multi-socket systems).

In practice, even this 10 to 20 cycles could be brought right down, almost arbitrarily low - the remaining cost is just the store buffer drain cost, or something similar: with a bit more speculation and tracking, perhaps even that could be removed (x86 is also hurt here by the fact that all LOCK operations are very strongly ordered, even though some code may be fine with weaker semantics).

So the upshot is that today the typical cost of these operations is roughly on par with a single L2 access, faster than an L3 access, and an order of magnitude or more faster than a trip to RAM. Anything implemented in RAM itself would always pay at least the RAM access latency (even if we assume the operation itself and the additional arbitration take zero time). So we can't really improve on the current situation by going down that path.
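A rough way to see the uncontended cost quoted above is to time a long run of LOCK'd adds on a line the core already owns. A minimal sketch (my own; absolute numbers vary by CPU and clock speed):

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>

int main()
{
    constexpr long iters = 100000000;
    std::atomic<long> counter{0};

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        counter.fetch_add(1);                 // LOCK XADD on a line this core owns
    auto stop = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    // Expect a handful of ns per operation (the 10-20 cycle cost discussed
    // above), well below a DRAM round trip of roughly 60-100+ ns.
    std::printf("%.2f ns per uncontended fetch_add\n", ns / iters);
}
```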

You might then ask - what about the situation where there is actual sharing of data on a frequent basis between cores (or sockets)? Well again the cache coherency magic (including QPI and friends on multi-socket) causes that traffic to flow along a path that is generally better than a RAM miss. For example, on a single core system, most of that will flow through the L3 since that's first "shared" level between cores. On a multi-socket system it may flow over QPI for sharing between cores, which should be faster than RAM (although the devil is in the details in these kind of cases, e.g., you may saturate your coherency bandwidth and some point and see a large drop off in performance).

Mahdi_M_
Beginner

Hi, 
First of all, I should apologize if my question is not completely related to this topic; I am quite a beginner, so please bear with me :)

I was wondering whether I can set or change the cache coherency mechanism (snooping, directory-based, etc.) on my CPU, and if so, how? Also, is there any way to ask the CPU to follow a different approach, or is it basically hardware-based and not changeable?

I have the Xeon E5-2560 V4.

I want to do this in order to measure the impact of different cache coherency mechanisms on my application, which works at different levels of the cache and memory hierarchy. Best,

McCalpinJohn
Honored Contributor III

The Xeon E5 v4 series should support several snooping modes, but the documentation is not particularly enlightening.   The BIOS may expose options with names like "Early Snoop", "Home Snoop", "Opportunistic Snoop Broadcast", and "Cluster on Die".

These modes don't just change the snoop protocol, they also appear to change the allocation of buffers (particularly QPI buffers) to different types of transactions.  The combination of minimally documented changes to minimally documented coherence protocols with undocumented changes to undocumented buffers makes it challenging to draw "theoretical" conclusions from these sorts of experiments.  It is, of course, easy enough to measure performance under different modes to see which is friendliest to your applications of interest....
