Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring
This community is designed for sharing of public information. Please do not share Intel or third-party confidential information here.

Does data from PCIe go into L1/L2 cache?


Hi community,


I am currently studying the memory architecture of Skylake-SP, and I am stuck trying to explain the system's behavior when an application requests data from a PCIe device. Unfortunately, I have had no luck finding documentation that fully resolves my questions.


So my situation is this:

- I have a Skylake-SP system with an FPGA device connected via PCIe.

- The FPGA contains a memory device that is memory-mapped (with a no-cache option) to a certain system address range.

- An application accesses data on the FPGA through a BAR (Base Address Register) plus an offset.


Correct me if I'm wrong, but as far as I know, the system follows the steps below.

1) A core requests data from an address in the PCIe memory address range.

2) The request is sent to the System Agent, the LLC, and the memory controller simultaneously.

3) The System Agent identifies that the address belongs to the PCIe device and responds.

4) The data is fetched.


Here is where I got stuck.

1. Would the data from my PCIe device be cached in the L1/L2 caches? If so, can it also end up in the LLC?

2. Are there any notable differences between the data path to/from PCIe and the path to/from the conventional memory controller (e.g., documented bypass routes)?



Thank you and best regards,


Black Belt

Lots of special cases here....  The most important features that (appear to be) well-defined are:

  • Loads from a memory region marked as UC will put data into the target register only, and will not put data in any caches.  Only the requested bytes will be accessed on the device.  No speculation is allowed -- so, for example, end-of-loop branch conditions must be fully resolved before the load(s) from the next loop iteration can be executed.
  • IO DMA to cacheable system memory will be placed in the L3 cache (if the IO device is attached to the same socket as the system memory buffer).
  • Intel allows special buffering of reads from PCIe BAR spaces if they are of type WC by using the MOVNTDQA instruction family.
Black Belt

Intel (and AMD) processors allow PCIe BAR spaces to be mapped with the WP or WT types.  These both allow reads to be cached, but maintaining coherence is completely up to the user code.  Implementations differ on which caches can be used, but all of the implementations that I know of allow WP and WT types to be cached in the L1 DCache.  

Both WT and WP types allow writes to be combined, but the implementation may not be the same as the write combining of the WC memory type.


Thank you very very much for the quick reply John!

A few more questions on the properties of UC, though.

By 'target register', are you referring to the general-purpose registers (e.g., eax, ebx)?

Does 'not putting data in any caches' imply an immediate flush upon each load/store,
or do you mean that there exists a direct bypass route to/from the PCIe BAR?

For example, if I were to load 4 bytes from a PCIe BAR space marked as UC in the MTRR, as below:

mov (%ebx), %eax /* load 4 bytes from the memory address in EBX into EAX */

Are you saying that the core fetches exactly 4 bytes from PCIe into eax without any operation on the L1/L2/LLC?

Can the MTRRs be modified at driver level via Linux's ioremap?

In other words, if I were to mark some region as UC in the MTRR at BIOS time,
and then ioremap the same region as WC with 'ioremap_wc' in the driver, would the MTRR entry be changed to WC?

If so, what happens if multiple drivers mark the same region with different ioremap options?

Are there any official materials on this matter that I can refer to?

Black Belt

1. A "MOV" instruction from any uncached memory region puts the data in the target register that you named in the MOV instruction.   I don't think there is any reason to believe that the data ever moves "through" the caches -- that would displace cached data, which would be undesirable.  Many/most of the data paths are likely the same, but that is an implementation detail that is going to be much harder to observe, since uncached accesses don't overlap with any other memory accesses.

2. MTRRs can be modified on a live system (assuming the BIOS has not locked them), but this has to be done carefully. MTRRs are described in Section 11.11 of Volume 3 of the Intel Architectures SW Developer's Manual, with Section 11.11.18 addressing specific issues related to maintaining MTRR consistency in multiprocessor systems.   I don't know if Linux automagically supports these rules for changing MTRRs.

The "effective memory type" is based on the memory type specified by the MTRR (Section 11.11), the memory type specified by the PAT (Section 11.12), and the memory type specified by the PCD and PWT bits in the page tables (Section 4.5 "4-Level Paging").   Tables 11-7 and 11-11 describe how these three sources are combined to produce an effective memory type.   I have only played with these features a little bit, and mostly stick with the simple cases of UC and WC for PCIe BAR regions.