I have a question about what CPU cache pipelines are, and whether it's possible to speed up my program by using CPU pipelines.
If so, please advise which libraries I should use to leverage CPU pipelines, and which guidelines I should read before getting started with performance optimization by means of CPU pipelining.
Thanks a lot in advance for your response.
Best, Arthur V. Ratz (@arthurratz).
CPU execution pipelines are described in many places. Instructions are split into sub-steps that are executed in different "pipeline" stages. In many cases this allows the appearance of single-cycle execution even if the processor actually requires many cycles between the instruction fetch and the visibility of the instruction result in the register file.
CPU cache accesses can be pipelined in a similar way. Early processors had single-cycle L1 Data Cache access, but that is almost never possible in current designs. L1 Data Cache access times are typically 4-7 cycles in modern processors (depending on the data type/width and the instruction addressing mode). This means that even in the case of an L1 Data Cache hit, the fetch of the data must happen 4-7 cycles before it can be used.
Fortunately, this is almost always completely invisible to the user. The out-of-order execution mechanisms in the hardware will issue memory accesses as early as possible, and execute dependent instructions as soon as possible after the data is available.
Most of the time, the user only needs to be aware that the hardware needs independent work to do while it waits for data from cache accesses and for result values to come out of the end of the execution pipelines. As a simple but important example, consider the task of summing all the values in an array.
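For concreteness, the summation might be written as in the sketch below. The function name `sum_simple` and the use of `double` are assumptions made for illustration. The point is that each addition needs the result of the previous one, so the loop offers the hardware only one chain of dependent adds.

```c
#include <stddef.h>

/* Straightforward summation (hypothetical helper, for illustration only).
 * Every add depends on the previous value of sum, so consecutive adds
 * cannot overlap in the floating-point add pipeline. */
double sum_simple(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];   /* loop-carried dependency on sum */
    }
    return sum;
}
```

The loads of `a[i]` are independent of each other, so the hardware can issue them well ahead of the adds. It is the chain of additions into `sum` that is serialized.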
There are two ways to improve throughput:
These two approaches can be combined (and must be combined to reach full speed on most processors).
The compiler may or may not make the transformations above to a piece of code that is accumulating a sum -- this depends on many factors that are beyond the scope of this note.
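One common way to give the hardware independent work in a summation is to keep several partial sums and combine them at the end. The sketch below assumes this is the kind of transformation meant. The name `sum_unrolled4` and the choice of four accumulators are illustrative only, and the best count depends on the processor.

```c
#include <stddef.h>

/* Summation with four independent partial sums (hypothetical helper).
 * The four add chains do not depend on each other, so their adds can
 * be in flight in the execution pipeline at the same time.
 * Note: this changes the order of the floating-point additions, so the
 * rounded result may differ slightly from a strictly sequential sum. */
double sum_unrolled4(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: four independent accumulations per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Combine the partial sums, then handle the 0-3 leftover elements. */
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++) {
        sum += a[i];
    }
    return sum;
}
```

Because the reordering can change the rounded result, compilers generally do not apply it to floating-point sums unless given explicit permission to reassociate, which is one reason writing it by hand is sometimes necessary.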
This does not directly address cache pipelining, but it describes an analogous issue. In this case the hardware overlaps the L1 Data Cache load latency with the arithmetic automatically and transparently if the array is large enough. If the array is very short, the latency to load the data and the execution pipeline latency may not be negligible. The hardware will still try to overlap these latencies with preceding and/or following instructions, but there are a few cases where this is not possible. (Having the summation surrounded by poorly predicted branches is one such case.)