What's CPU cache pipelining and how to use it ??

ArthurRatz · ‎09-30-2017

Dear colleagues,

I've got a question on what is CPU cache pipelines and if it's possible to increase the performance speed-up of my program by using CPU pipelines ?

If so, please advise me what particular libraries should be used to leverage CPU pipelines and what guidelines it's recommended to read before I can get started with performance optimization by the means of CPU pipeline using ?

Thanks a lot for your respond in advance.

Best, Arthur V. Ratz (@arthurratz).

McCalpinJohn · ‎10-04-2017

CPU execution pipelines are described in many places. Instructions are split into sub-steps that are executed in different "pipeline" stages. In many cases this allows the appearance of single-cycle execution even if the processor actually requires many cycles between the instruction fetch and the visibility of the instruction result in the register file.

CPU cache accesses can be pipelined in a similar way. Early processors had single-cycle L1 Data Cache access, but that is almost never possible in current designs. L1 Data Cache access times are typically 4-7 cycles in modern processors (depending on the data type/width and the instruction addressing mode). This means that even in the case of an L1 Data Cache hit, the fetch of the data must happen 4-7 cycles before it can be used.

Fortunately, this is almost always completely invisible to the user. The out-of-order execution mechanisms in the hardware will issue memory accesses as early as possible, and execute dependent instructions as soon as possible after the data is available.

Most of the time, the user only needs to be aware that the hardware needs independent work to have something to do while waiting for the data from cache accesses and while waiting for result values to come out the end of the execution pipelines. As a simple, but important, example, consider the task of summing all the values in an array.

If a single scalar register is used as an "accumulator", then you will only be able to execute one add every pipeline latency. For double-precision floating-point numbers, this is between 3 and 6 cycles depending on the processor model. For this example, I will assume a 4-cycle latency, so the performance will be limited to one floating-point add every 4 cycles = 0.25 FP operations/cycle.

There are two ways to improve throughput:

Increase the number of independent accumulators (allowing the additions to be interleaved in the pipeline stages of the Floating-Point-Unit, and/or allowing the additions to be executed in independent Floating-Point-Units (if available)).
Use SIMD instructions to allow a single register to act as accumulator for multiple values. For double-precision floating-point values, this gives 2 values in a 128-bit register, 4 values in a 256-bit register, or 8 values in a 512-bit register.

These two approaches can be combined (and must be combined to reach full speed on most processors).

Multiple Accumulators:
- If the summation is split into two parts (typically the first half of the array and the second half) with independent scalar registers used for the two halves, then the additions will be able to execute concurrently (either in independent FPUs or in alternating pipeline stages of the same FPU), giving a throughput of 2 adds every 4 cycles = 0.50 FP operations per cycle.
  - This can be extended to 4 accumulators (with a single FPU) or 8 accumulators (with two FPUs) to double or quadruple the performance (1.00 FP operations per cycle or 2.00 FP operations per cycle).
SIMD Accumulators:
- If a single vector register is used as an "accumulator", then you will only be able to execute one SIMD add every pipeline latency.
- With 256-bit (AVX/AVX2) registers, this is 4 64-bit adds, but it still takes 4 cycles, so performance will be 4 adds/4 cycles = 1.00 FP ops/cycle. For 512-bit (AVX-512) registers, this is 8 adds in 4 cycles, for 2.00 FP operations per cycle.
Combined:
- With two 512-bit FPUs (and still assuming 4-cycle latency), you need 8 accumulator registers (each holding 8 values) to fully overlap the execution pipeline delay.
  - The peak FP execution rate in this case is 64 FP additions every 4 cycles, or 16 FP operations per cycle -- a 64x speedup over the base case.
- You don't actually need to go this far because you will run into other performance limits before you reach full speed.
- Note that this is effectively the same as splitting the array into 64 independent parts and accumulating 64 independent partial sums. (You have to combine these into a single sum at the end.)
- That is a lot of independent work, and a software developer may need to keep in mind that a pretty significant amount of parallelism is required for the processor to stay busy.

The compiler may or may not make the transformations above to a piece of code that is accumulating a sum -- this depends on many factors that are beyond the scope of this note.

This does not directly address cache pipelining, but it describes an analogous issue. In this case the hardware overlaps the L1 Data Cache load latency with the arithmetic automatically and transparently if the array is large enough. If the array is really short, the latency to get the data loaded and the execution pipeline latency may not be negligible. The hardware will still try to overlap these latencies with preceding and/or following instructions, but there are a few cases where this is not possible. (Having the summation surrounded by poorly predicted branches is one such case....)

ArthurRatz · ‎10-22-2017

Thank for your reply and detailed explanation.