Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

out-of-order (OOO) microarchitecture



I'm a newbie here so I've a basic question about microarchitecture supporting out-of-order execution.

Just to fix ideas consider the "simple" Intel Core microarchitecture (see for instance Core Execution core Reservation Station (RS) has 6 issue (or dispatch) ports used to issue microps (uop) for execution. In particular Port 3,4,5 are dedicated to issue Load/Store uops to the three Load Store units belonging to the execution unit.

Now consider the case in which RS issue a Load uop to the Load unit using Port 3. Latency to complete Load uop depends upon data availability in cache (L1/l2/L3) and in case of cache miss an (off-chip) memory access is needed (just to fix ideas suppose 100 clock cycles are needed)

In that interval of time (100 clock cycles) does Load unit remain stalled by that load operation or not ? If there is another Load uop stored in the RS buffer, can RS issue it towards Load unit using Port 3 on the next clock cycle ?

Basically the question try to understand if Load/Store units have a level of pipelining that allow to issue upos at successive clock cycles (regardless of the latency needed to complete each single load/store uop)


0 Kudos
8 Replies
Black Belt

This is described in a couple of places....

The Intel Optimization Reference Manual (document 248966, latest revision 031, September 2015) describes several Intel microarchitectures in Chapter 2.   Figure 2-2 shows the Haswell microarchitecture, and you should note the box in the upper left quadrant labelled "Load Buffers, Store Buffers, Reorder Buffers" and the box in the bottom center labelled "Line Fill Buffers".

Although it does not make it into a table in this chapter, the very end of section 2.2 notes that there are 72 load buffers and 42 store buffers to support "micro-ops execution in-flight".  Awkward wording, but the point is clear enough -- the processor can support up to 72 concurrent load micro-ops before stalling due to lack of load buffers.   

Note that these buffers are for "load micro-ops" and don't have any direct connection to cache misses.

L1 Data Cache misses are supported by the Line Fill Buffers.  The number of Line Fill Buffers is only mentioned for some of the microarchitectures, but Table 2-18 and the text in Section say that both Nehalem and Sandy Bridge have 10 Line Fill Buffers, and can therefore directly support up to 10 L1 Data Cache Misses.  My interpretation is that this has not changed in Haswell or Skylake, but I have not tested this explicitly.  

A processor core can have more than 10 requests to the L3 and/or memory only by way of the L2 Hardware Prefetchers.  Note that the L1 Hardware Prefetchers use the same 10 Line Fill Buffers as Demand misses, so L1 HW prefetching can provide better overlap of latency, and they can provide concurrency in cases where the core cannot (e.g., prefetching based on a series of dependent loads where the core can't issue out of order because it needs data from memory to compute the next address), but they don't increase the maximum amount of concurrency that the hardware supports.

I pay a lot of attention to concurrency because bandwidth is limited by concurrency in many situations.  For example on a Xeon E5-2660 v3 system (10 core, 2.6 GHz nominal, 4-channel DDR4/2133 DRAM), the memory latency on my test systems is just over 85 ns.    With a bandwidth of 4*8*2.133=68.2 GB/s, the amount of concurrent memory traffic required to fully overlap the latency is 85 ns * 68.2 GB/s = 5800 Bytes, or about 91 cache lines.   If a core can only manage 10 concurrent cache transfers, it will only get 10*64 Bytes/85 ns = 7.5 GB/s of sustained bandwidth.  Fortunately there are both L2 hardware prefetchers and multiple cores.  With Haswell, Intel has done a very good job of increasing the effectiveness of the L2 prefetching, so that for the STREAM benchmark, I get the maximum performance using 5 of the 10 cores, and I get within 3%-4% of maximum performance using 4 of the 10 cores. (Performance degrades slightly as I use more than 5 cores because of increasing DRAM bank conflicts, giving about the same performance for 10 cores as for 4 cores.)

Another very useful resource for learning about the microarchitecture of Intel, AMD, and VIA processors is Agner Fog's excellent write-up "microarchitecture.pdf" available at   ; This document is most helpful if you read it from front to back because he describes the earlier processors first and then describes the improvements and increased complexity for each of the following processors in historical order.  This incremental approach makes the complexity of the more recent processors a bit less overwhelming.

Black Belt

As always, John's comments are a valuable contribution to my understanding.

On the question of fill buffers, I think John may be influenced by their role in the ARM architecture, which is different from the role in Intel architectures beginning with Woodcrest, where they replaced the Write Combine Buffers of the NetBurst CPUs.

As John said, the number of Line Fill Buffers per core has remained at 10 for a number of years (except that MIC KNC returns to 8).  Beginning (I think) with Nehalem, they are "competitively shared" between logical processors, which may limit performance of HyperThreading.  Earlier CPUs gave each logical processor access to only half of the fill buffers, even if only one of the logical processors was in use.

The Intel optimization manual describes the use of the fill buffers only indirectly.  There is just one brief comment about their use for non-temporal "streaming" stores which (to me) muddies the waters for the more usual case of non-streaming stores.

The fill buffers come into play on data stores (move to memory).  They read from memory "read for ownership" in order to replicate the current data when allocating a buffer and not using streaming stores.  This would be the only type of possible cache miss involved with fill buffers.  When data are flushed from a fill buffer, they go directly to L1D.  Ideally, this happens only when full cache lines have been updated, but a loop which writes alternately to too many different cache lines may thrash the fill buffers.  In my past experience, for applications which operate in this mode, 6 to 9 write streams (with HT disabled) have appeared to be optimum.  Intel compilers make some effort at -O3 to split and fuse loops so as to fall into this range.

John may have more knowledge about how the number of fill buffers influences performance of streaming stores.  I suppose at least one of them needs to be sending data to memory while another is being filled.

On the NetBurst CPUs, the hardware attempted to keep 2 write combine buffers ready by initiating flush (write to L2) for the oldest buffers, regardless of whether they had been filled with updated data.  There has been nothing presented about any similar action for fill buffers.  It seems clear that attempting to write 10 streams already runs into bottlenecks, yet there are situations where the compiler correctly allows more than 10 streams by not splitting a loop when it would reduce register data locality.

On MIC, and AVX512, where optimized code stores full cache lines anyway, fill buffer thrashing should be a thing of the past.

Black Belt

If there is a read or prefetch in the cache line whose only valid copy is the one in the line buffer, that also has to be taken care of, possibly by flushing the buffer, or copying to a read stream buffer.  The optimization guide refers briefly to the case where prefetch has to check the line buffers, as we don't want to prefetch an invalid cache line.  The read after write to the same cache line might be handled better by this scheme of pushing the line buffer out to L1D than by the older scheme of flushing Write Combining buffer to L2, at least for the case where the same thread is writing and reading.  I didn't want to discuss these cases of false sharing as they go even further beyond what is adequately documented and into where performance problems are expected regardless of implementation.

With such sketchy documentation, some low level research such as John has engaged in might be needed to clarify the role of the line fill buffers.

Black Belt

The Line Fill Buffers are associated with pending load or store requests for cache lines that are not in the L1 cache, but do not have a 1:1 relation to load and store micro-ops.  

Pending load and store micro-ops are associated with the "load buffers" and "store buffers", and multiple load buffers and store buffers can point to each of the Line Fill Buffers.

For example, with 32-bit loads to consecutive addresses, you would end up with 16 "load buffers" pointing to a single Line Fill Buffer.    This is why the number of load buffers and store buffers needs to be so much larger than the number of Line Fill Buffers.

Black Belt

Yes, that description seems accurate.   In this case (assuming no TLB misses), the core would be able to issue the 10 loads to 10 different cache lines in 5 cycles.  

During and after these 5 cycles the core will continue fetching instructions, decoding them into uops, performing register renaming, sending the corresponding uops to the reservation stations, and then dispatching the uops to the execution ports.  This will continue until the core runs out of some other resource. 

According to, the "Reorder Buffer" holds 192 uops and the "Reservation Station" hold 60 uops.  At a rate of 4 uops/cycle, it will take 15 more cycles to fill the Reservation Station and 48 cycles to fill the reorder buffer.   Any of the instructions that make it as far as the Reservation Station (and that don't depend on the data from prior loads) can be executed and the result written back to the Reorder Buffer.  So the Reservation Station will fill up if the subsequent instructions depend on the data from the loads (so they can't execute), while the Reorder Buffer will fill up if the subsequent instructions are independent of the results of the loads (so they get executed and returned to the Reorder Buffer).

Assuming that the data requested by the 10 initial loads is actually in memory (not in cache), it is not going to show up for a long time.   For a single-socket system the latency may be as low as ~60 ns, so at 3.0 GHz the latency would be about 180 cycles.   Unless there is a serious problem with instruction fetch, the Reorder Buffer is going to fill up a long time before that.

After an instruction has been executed, its corresponding uops are removed from the Reorder Buffer at Retirement.  But for the program to be serially correct, the instructions must retire in program order.   So in any cycle, the processor will attempt to retire instructions starting with the *oldest* instruction in the Reorder Buffer (i.e., the instruction that appeared first in program order).   If the oldest instruction in the Reorder Buffer has not completed its execution, then nothing can be retired, and no uops can be removed from the Reorder Buffer.   With a 200 cycle memory latency and no more than ~50 cycles to fill the Reorder Buffer, a memory-latency-bound code can be stalled for a long time.

In this scenario, the 10 loads to different cache lines are going to be processed mostly concurrently.  The first data to be returned will be in about 200 cycles, but the next nine loads will get there data much more quickly (perhaps as fast as one load every 10 cycles, limited by the time required to transfer the cache lines from DRAM through the L3, L2, L1 and into the core) because most of the latency has been overlapped.  It is possible for the memory controller to return the cache lines out of order, which would allow the loads to execute out of order.  This, in turn, might allow the dispatch and execution of dependent operations in the Reservation Stations, but nothing can retire until the data from the first (in program order) load is returned and that load instruction is completed.  

Black Belt

Sorry for the confusion -- I was not as careful as I should have been in my description....

  1. I should start with a caveat: If you look at the block diagrams in Chapter 2 of the Intel Optimization Reference Manual, you will note that the specifics of the boxes and the details of the nomenclature differ.  They also differ in some details from the high-level diagram in Agner Fog's microarchitecture document and from the high-level block diagram in the link above.  This should be a reminder that these are representing the structure of the pipeline at a high level, and that we all have to be careful with drawing detailed conclusions based on the specifics of any of these diagrams.
  2. I picked 4 uops per cycle (sent from the Rename/Reorder unit) to the Reorder Buffer as a nominal value.  This is the nominal maximum -- the average value will typically be lower, but it depends on the previous fetch stages, the types of uops generated, etc.
  3. The Reorder Buffer is probably inclusive of the Reservation Stations, so there can be a maximum of 192 uops in flight.
    • Understanding the exact number of uops being transferred in each cycle is only going to be possible for very carefully constructed sequences of instructions that are repeated enough times to allow the performance counters to be used.   A discussion of such a case is at
    • There are some special cases for uops that are fused at different parts of the pipeline and unfused in other parts -- see Agner Fog's microarchitecture docs for a discussion of some of these cases.
    • uops that are "ready" are sent to the Reservation Stations.   (Roughly speaking, a uop is "ready" if the uops that define its input physical register(s) have already been sent to the scheduler.  The details depend on the implementation.)
    • So, as you noted, it should take a minimum of roughly 48 cycles to fill the Reorder Buffer (from an empty start) if the Rename unit is able to issue 4 uops per cycle.   Lots of cases will take more cycles than this (even without fetch & decode problems), due to limits in the number of registers that can be renamed per cycle.  (Fog's microarchitecture document says the Rename unit can rename 4 registers per cycle on Sandy Bridge, which is enough for 4 loads, but only enough for two 2-register instructions or one 3-register instruction.   It is not clear if Haswell/Broadwell or Skylake have increased this rename rate.)
  4. After a uop has been dispatched to an Execution Port and execution has completed, the results stay in the corresponding physical registers and the uop's entry in the Reorder Buffer is marked as "completed" (but not yet retired!).
  5. The Retirement Unit (closely associated with the Reorder Buffer logic) chooses the oldest completed uops for retirement. (The processor must also track retirement by instruction as well as by uop.)  Many uops can be retired in one cycle, as long as none of them are newer than any uncompleted uop in the Reorder Buffer.

Anyway, the point is that you can fill up the Reorder Buffer in a fairly small fraction of a memory latency (~50 cycles out of ~200 cycles).  So even if all of the subsequent instructions are independent of the data being fetched and can be executed out of order, you will only be able to tolerate a fairly small fraction of the memory latency.  More commonly many of the subsequent instructions will be dependent on the data coming back from the loads, which will allow even less effective overlap.  

But this is why the hardware has aggressive hardware prefetchers -- to try to get the data in the caches before the loads, so the latencies are not so long. 

Looking at the numbers, one would expect that these OOO mechanisms would be extremely effective for data that is in the L2 cache (~12-14 cycle latency), and quite effective for data that is in the L3 cache (~30-40 cycles latency).  Experience generally agrees -- the core can tolerate L2 latencies almost perfectly and L3 latencies very well.  Bandwidth is another issue -- if you only have 3 instructions for every word coming from L3 or memory, then your average instruction retirement rate will be limited by the sustained bandwidth.  (In other words, the ability to tolerate latency requires that you have work to do while waiting -- not all codes have extra work to do.)

Black Belt

I was using "in flight" and "in the Reorder Buffer" as synonyms.  

Without knowing anything about Intel's implementation, one categorization of items in the Reorder Buffer might be:

  • Arrived from the Rename unit, but not yet copied to the Reservation Stations.
  • Already copied to Reservation Stations, but not yet executed.
  • Executed and marked complete, but not yet Retired.

Other states may exist for speculatively issued instructions.  OOO and speculation have many similarities, but the implementations may need to be different.   It is also not clear what happens to uops that are dispatched to an execution port, but then rejected because the data is not ready.  These probably stay in the Reservation Station to be retried, but they could be kicked back to the Reorder Buffer.

Black Belt

When I say that "most of the latency has been overlapped", I simply mean that while each load takes about 200 cycles (from issue to completion or retirement), the elapsed time for 10 loads is much less than 10 loads * 200 cycles/load = 2000 cycles.  

The rate at which cache lines are returned depends on lots of complex interactions between the Core, L1, L2, L3, Ring, Memory Controller, DRAM Controller, DRAM, and (in multi-chip systems) cache coherence transactions.

The specific numbers will depend on the processor model (and any user-configurable options, such as Turbo Boost), the memory configuration (e.g., maximum frequency of installed DRAM, number of populated DRAM channels, number of DIMMs and ranks/DIMM in each channel), the details of the test pattern (e.g., contiguous vs strided vs random addresses, number of read and write streams), and the number of cores used for the test.

For an older processor like the 2006-era Core processor mentioned above, a common configuration might have a 1066 MHz Front Side Bus.  This has a peak bandwidth of 8*1.066=8.53 GB/s.  Assuming that an appropriate set of DIMMs is installed, the sustained bandwidth on the STREAM benchmark should be in the range of ~5.1 GB/s.  (e.g., for a similar Core2 Duo processor).   With 64 byte cache lines, 5.12 GB/s is 80 Million cache lines per second, or one cache line every 12.5 ns.  At a nominal 2 GHz frequency, this is one cache line every 25 core clock cycles.   This is a long-term average -- the detailed timing is likely to be quite complex.

A wide range of values can be obtained depending on the specifics of the configuration. 

  • A very low value (in cycles) might be obtained by using a single core on a processor with relatively low frequency, but relatively high memory bandwidth.  For example, a Core i3-4330TE (Haswell, 2.4 GHz dual-core) configured with 2 channels of DDR3/1600 DRAM has a peak memory bandwidth of 25.6 GB/s and should be able to sustain read rates of ~18 GB/s using one core.  18 GB/s / 64 B/cacheline = 0.28125 G lines/second = 3.56 ns/cacheline.  So the average latency between cache line returns to the core will be about 3.56ns * 2.4 GHz = 8.5 cycles.   The latency on this system should be about 70 ns, so the latency-bandwidth product is 70ns*25.6GB/s=1792 Bytes = 28 cache lines.   This is more than the 10 cache lines that a single core can directly manage, but if there are multiple accesses to each 4KiB page, the L2 hardware prefetchers will be able to create more concurrency.  I typically see an "effective" concurrency of up to 20 cache lines for contiguous reads from 2 different address streams.  Rearranging the latency/bandwidth/concurrency relationship says that 20 lines * 64 Bytes/line / 70 ns = 18.3 GB/s -- which is where my 18 GB/s single-threaded bandwidth assumption came from.
  • A very high value (in cycles) might be obtained using all of the cores on a processor with relatively high frequency, but relatively low memory bandwidth.  For example, a Core i7-4790K (Haswell, 4.0 GHz quad-core) configured with 2 channels of DDR3/1600 DRAM has a peak memory bandwidth of 25.6 GB/s (same as the previous example) and should be able to sustain ~22 GB/s using all four cores.  So each core will be receiving an average of 5.5 GB/s, or 0.086 Gcachelines/sec = one cache line every 11.6 ns.   At 4.0 GHz this is about 47 cycles between cache line returns for each core.  Again, these are long-term averages -- the detailed timings are likely to be complex and (without an inline DRAM probe connected to a logic analyzer) are not going to be visible to the user.

Certainly there are parts of the memory subsystem that are fully pipelined -- each cache line load takes exactly 8 transfer cycles (4 major cycles) on the DRAM bus, and typically internal transfers are also "atomic".   Consecutive transfers are only sometimes pipelined without gaps, and sometimes the gaps are quite large.  Because a large number of units that have to interact in the memory access path, there are many opportunities for stalls, and these cannot be understood in detail using publicly available information.  That is why measurements (such as STREAM) continue to be useful.