Software Tuning, Performance Optimization & Platform Monitoring

Cache Pipelining: Is It Possible to Issue Different Requests for Different Pipeline Stages?

KerimTurak
Beginner
Hello, when cache pipelining is implemented, is it possible to have requests for different addresses in, for example, both stages of a 2-stage cache pipeline at the same time?  Could this occur for both data and instruction caches?
3 Replies
McCalpinJohn
Honored Contributor III

The definition of pipelining implies that independent operations can proceed in consecutive cycles.  This is true of core functional unit pipelines as well as memory access pipelines.  For example, a Skylake Xeon processor can execute two loads per cycle, and the L1 Data Cache can deliver two cache lines per cycle.  The core+L1D cache can sustain two loads per cycle for L1D-resident data as long as none of the loads crosses a cache line boundary.  Given a minimum L1D latency of 4 cycles, full bandwidth requires that there be loads to 8 different cache lines outstanding (two accesses per cycle for each of the last 4 cycles).
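
As a minimal sketch of this point (assuming a Linux/POSIX clock_gettime() timer and an L1D-resident array; the sizes and repeat counts are arbitrary illustrative choices): independent loads can overlap in the pipelined L1D, while a serially dependent chain advances only one load-to-use latency per step.  A careful measurement would also pin the thread and guard against compiler optimizations.

/* Sketch: independent loads (overlap in the pipelined L1D) versus a
 * serially dependent chain (one load in flight at a time).
 * Sizes, repeat counts, and the POSIX timer are illustrative choices. */
#include <stdio.h>
#include <time.h>

#define N    512          /* 512 * 8 B = 4 KiB, comfortably L1D-resident */
#define REPS 1000000

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    static long   a[N];
    static size_t next[N];
    for (size_t i = 0; i < N; i++) {
        a[i]    = (long)i;
        next[i] = (i + 1) % N;    /* simple ring for the dependent chain */
    }

    /* Independent loads: consecutive iterations do not depend on each
     * other, so many L1D accesses can be outstanding at once. */
    double t0 = seconds();
    long sum = 0;
    for (int r = 0; r < REPS; r++)
        for (size_t i = 0; i < N; i++)
            sum += a[i];
    double t_indep = seconds() - t0;

    /* Dependent loads: each address comes from the previous load's
     * result, so the chain advances one load-to-use latency per step. */
    t0 = seconds();
    size_t p = 0;
    for (long s = 0; s < (long)REPS * N; s++)
        p = next[p];
    double t_dep = seconds() - t0;

    printf("independent: %.3f s   dependent: %.3f s   (sum=%ld, p=%zu)\n",
           t_indep, t_dep, sum, p);
    return 0;
}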

Some processor and/or cache implementations have a traditional fixed-latency pipeline -- the internal scheduler (or in some cases the compiler) can predict exactly when a result will be available and can schedule execution to avoid conflicts between the output of the instruction and the execution of other instructions that need to access the same port (of a register file or a cache or a functional unit).  Modern high-end processors have variable-latency pipelines, so instructions that depend on loads from memory will be issued for attempted execution, then rejected (for later retry) if any of the input arguments are not yet valid (because the data has not yet been delivered by the memory subsystem).  The simple fixed-latency pipelines are more common in low-end embedded processors and in digital signal processor (DSP) chips. 

KerimTurak
Beginner

I sincerely thank you for the satisfying answer to my question.

However, your answer has raised some questions for me.

Please excuse my lack of knowledge.

As far as I understand, throughput increases because different requests are processed in consecutive cycles.  There is one last point that confuses me.  When the CPU issues a request that misses in the level 1 cache and the miss is satisfied by the level 2 cache, is the response forwarded to the CPU in parallel with the write of that data into the cache (in the same cycle), or is it delivered to the CPU in the cycle after it is written into the cache?

My intuition is that if the L1 cache forwards the response to the CPU while writing it into itself, doing the same thing in the L2 cache would lengthen the critical path.  So when the L2 cache misses and fills itself with data that comes from lower in the hierarchy, it seems more reasonable for it to forward that data to the L1 cache in the cycle following the write.  Am I mistaken?

McCalpinJohn
Honored Contributor III

The details of the timing of the various components of the response in a multi-level cache are implementation-specific and are typically neither visible nor important to performance.

For most of these cases, the important number is the latency from the initial execution of the load to the execution of the next instruction that requires the value returned by the load.  In some cases this can be measured directly, and in some cases it is documented by the vendor.  From a user perspective, the latency value used is typically the one provided by pointer-chasing microbenchmarks.  It is good to be aware that the details of the timing depend on the size and alignment of the load(s), and may depend on whether any victims generated are dirty.

Once you get beyond the L2 cache there are many additional complications, depending on the complexity of the chip.  Examples include the varying distance from the core making the request to the L3 "slice" containing the data in processors with distributed shared L3 caches, and the need to cross a frequency-domain boundary between the Core/L1/L2 frequency domain and the L3 frequency domain.  Going to DRAM there are additional variations in distance when a chip has multiple memory controllers, and additional frequency-domain crossings between the L3 frequency domain and the DRAM frequency domain.
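
As a rough sketch of that kind of pointer-chasing microbenchmark (the array size, the random-permutation construction, and the clock_gettime() timing here are illustrative assumptions): each load's address depends on the previous load's result, so the chain advances one load-to-use latency per step, and varying the working-set size from L1-sized up through DRAM-sized exposes the latency of each level of the hierarchy.

/* Sketch of a pointer-chasing latency measurement.  A random cyclic
 * permutation keeps the hardware prefetchers from hiding the latency;
 * the 128 MiB working set is an arbitrary "larger than L3" choice. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1u << 24)     /* 16M elements * 8 B = 128 MiB */
#define STEPS (1L << 24)

int main(void)
{
    size_t *next = malloc((size_t)N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: a random single-cycle permutation, so the
     * chain visits every element once before repeating. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    srand(12345);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;              /* j < i keeps one cycle */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long s = 0; s < STEPS; s++)
        p = next[p];                                /* serially dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("average load-to-use latency: %.2f ns per load (p=%zu)\n", ns / STEPS, p);
    free(next);
    return 0;
}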
