Solved: Impact of "RdCode" on remote CPU via QPI

ZWang45 · ‎10-17-2015

Hi All,

I am now working on Intel's HARP system (CPU + FPGA, connected by QPI).

After reading https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf, I roughly understand the impact of "RdData" on the remote CPU via QPI.

However, the FPGA can also issue "RdCode" to the CPU via QPI, and I do not know the exact steps the CPU goes through in that case. Thanks.

Zeke

McCalpinJohn · ‎10-21-2015

I don't know anything about this system in particular, but in general a "Read Code" transaction coming in over a QPI link will just snoop the caches and not have any direct impact on the processor cores.

A "Read Code" transaction is very similar to a "Read Data" transaction, except that the "Read Data" transaction returns unshared cache lines in the "Exclusive" state (so that they can be written to without requiring additional coherence transactions), while the "Read Code" transaction will always return the line in the "Shared" state (because the hardware does not support storing to lines that are mapped in the instruction cache -- they have to be turned into data lines and modified in the data cache before being fetched back into the instruction cache).
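
To make that difference concrete, here is a minimal Python sketch of the state a requester would receive under a simplified MESI-style model. It is only an illustration of the rule described above, not Intel's actual protocol; the function name `granted_state` and its parameters are made up for this example.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def granted_state(request: str, other_caches_hold_line: bool) -> State:
    """State in which the requester receives the line (simplified toy model)."""
    if request == "RdCode":
        # Code reads are always granted Shared, even if nobody else holds the line.
        return State.SHARED
    if request == "RdData":
        # Data reads get Exclusive only when the line is not shared elsewhere,
        # so a later store needs no extra coherence transaction.
        return State.SHARED if other_caches_hold_line else State.EXCLUSIVE
    raise ValueError(f"unknown request type: {request}")
```

For an unshared line, `granted_state("RdData", False)` gives `State.EXCLUSIVE`, while `granted_state("RdCode", False)` still gives `State.SHARED`.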

The general-purpose processor in the HARP systems is a Xeon E5, so it has an inclusive L3 cache. If the "Read Code" transaction misses in the L3 cache, then the Home Agent that owns the corresponding physical address will fetch the cache line from DRAM and return the data to the requester. If the "Read Code" transaction hits in the L3 cache, the subsequent transactions will depend on the state that the line was found in. The L3 may be able to return the data directly, or it may need to fetch the data from an L2 or L1 cache, or (if the data is already in the shared state) it may simply allow the Home Agent to return the data. There are many possible transactions, and I don't think that Intel has publicly documented the details of the transactions. A very high-level overview is available at http://www.intel.com/content/dam/doc/white-paper/quick-path-interconnect-introduction-paper.pdf, and more details of the protocol can be derived (with varying levels of confidence) from the Uncore Performance Monitoring Guides for the Xeon E5 processors (document 327043 for Sandy Bridge EP, document 329468 for Ivy Bridge EP, and document 331051 for Haswell EP).
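
The cases above can be sketched as a small decision function. This is a rough toy model only: the real transaction flows are not publicly documented, and the mapping from L3 state to data source used here (in particular the `newer_copy_in_core` flag and the function name `rdcode_data_source`) is an assumption made for illustration.

```python
from typing import Optional


def rdcode_data_source(l3_state: Optional[str], newer_copy_in_core: bool = False) -> str:
    """Which agent supplies the data for an incoming Read Code (toy model).

    l3_state is None for an L3 miss, otherwise one of "M", "E", "S".
    newer_copy_in_core means an L1/L2 may hold a more recent copy than the L3
    (an assumption of this sketch, not a documented protocol detail).
    """
    if l3_state is None:
        # L3 miss: the Home Agent owning the address reads the line from DRAM.
        return "home agent (DRAM)"
    if l3_state == "S":
        # Line already shared: the L3 may simply let the Home Agent respond.
        return "home agent"
    if newer_copy_in_core:
        # The inclusive L3 tracks the line, but the data must come from a core cache.
        return "L1/L2 via L3"
    # Otherwise the L3 can return the data directly.
    return "L3"
```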

ZWang45 · ‎10-22-2015

Hi John, thanks for your helpful reply.

I do not understand the part "(because the hardware does not support storing to lines that are mapped in the instruction cache -- they have to be turned into data lines and modified in the data cache before being fetched back into the instruction cache)." Do you mean that the instructions can be modified by another program?

One more question: when a "Read Code" transaction issued by the FPGA (via QPI) misses in the L3 cache, the Home Agent will fetch the cache line from DRAM and return the data to the FPGA ("Shared" state in the FPGA cache). Does the CPU's L3 cache then hold the same cache line in the "Shared" state, or does the L3 cache not have a copy?

Thanks.

Zeke

McCalpinJohn · ‎10-22-2015

The instruction cache is a "read-only" cache. There are no "store instruction" instructions -- only "store data" instructions. So if you need to modify instructions (as in self-modifying code), the store instruction will cause the corresponding cache line to be invalidated from any caches (data or instruction) that are holding it. The line will then be copied into the data cache of the processor executing the store, and that processor will be given permission to modify the line. After the line has been modified, it can be fetched back into an instruction cache. There are subtleties and restrictions related to modification of code -- some of these are discussed in section 8.1.3 of Volume 3 of the Intel 64 and IA-32 Architectures Software Developer's Manual.
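
The sequence above can be sketched with a toy model of per-core instruction and data caches. The class `ToySystem` and its methods are hypothetical and written only for this illustration. The downgrade from "M" to "S" on a later code fetch is an assumption of the sketch, not a documented hardware detail.

```python
class ToySystem:
    """Per-core instruction caches (read-only) and data caches (line -> state)."""

    def __init__(self, n_cores: int):
        self.icache = [set() for _ in range(n_cores)]
        self.dcache = [dict() for _ in range(n_cores)]

    def fetch_code(self, core: int, line: int) -> None:
        # Assumption: a modified copy in any data cache is downgraded to Shared
        # before the line is handed to an instruction cache.
        for d in self.dcache:
            if d.get(line) == "M":
                d[line] = "S"
        self.icache[core].add(line)

    def store(self, core: int, line: int) -> None:
        # A store invalidates the line in every cache (data or instruction) ...
        for c in range(len(self.icache)):
            self.icache[c].discard(line)
            self.dcache[c].pop(line, None)
        # ... and the storing core's data cache gets the line with write permission.
        self.dcache[core][line] = "M"


system = ToySystem(n_cores=2)
system.fetch_code(0, 0x40)   # both cores are executing from line 0x40
system.fetch_code(1, 0x40)
system.store(1, 0x40)        # core 1 modifies the code
system.fetch_code(0, 0x40)   # core 0 fetches the updated line again
```

After the store, neither instruction cache holds the line and only core 1's data cache does; the final fetch brings it back into core 0's instruction cache.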

Normally the "Read Code" transaction is only generated by a processor core Instruction Cache miss (or perhaps by an Instruction Cache prefetch), but with an FPGA that supports QPI there might be more flexibility in the control of the generation of such transactions.

The L3 cache should only cache lines that are fetched by the local cores. If the FPGA in the other socket requests a cache line that is owned by a Home Agent on the Xeon E5, the L3 on the Xeon E5 will not create a new cached copy of that line. If there was already a copy of the cache line in the L3, then a copy might remain in the L3, depending on the transaction type. In the case of a "Read Code" transaction, if the line is already in the L3 cache on the Xeon E5, it should stay there (though the state might change).
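
As a rough sketch of that allocation rule (a toy model only; the function name `l3_after_read_code` is made up, and the change to "S" for an existing line is just one assumed example of how the state might change):

```python
def l3_after_read_code(l3: dict, line: int, requester_is_local: bool) -> dict:
    """Return the L3 contents (line -> state) after a Read Code, in a toy model."""
    updated = dict(l3)
    if requester_is_local:
        # Lines fetched by local cores are allocated in the L3.
        updated.setdefault(line, "S")
    elif line in updated:
        # Remote request hitting an existing copy: the copy stays,
        # but its state may change (assumed here to become Shared).
        updated[line] = "S"
    # Remote request that misses: no new copy is created in the L3.
    return updated
```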

ZWang45 · ‎10-23-2015

Thanks, John.