Core to Core communication behavior in Skylake-SP

Jaeyoung__Choi

Hi,

I am using a Xeon Gold 6132 processor and I am wondering about its core-to-core communication behavior.

I know that each core has an L3 cache slice and a private L2 cache.

So, assume the producer process has a small working set that fits into its L2 cache, and the consumer process tries to read the producer's data.

Is the data forwarded directly from the producer's private cache?

I couldn't find any document that describes this caching behavior.

If anyone knows this well, or knows of a document that describes this communication behavior, could you please let me know?

Thank you in advance.

Jae young

McCalpinJohn

There is very little documentation on the implementation of the protocols, but there are enough details in the Intel Xeon Processor Scalable Memory Family Uncore Performance Monitoring Reference Manual to enable many of the details to be discovered.   

The general flow is:

  • Requester loads an address, missing in its own L1 and L2 caches.
  • The system agent looks at the address, computes a hash function, determines which CHA/L3 agent "owns" the address, and forwards the request to that agent.
  • At the CHA/LLC slice, the address is looked up in the LLC and also in the Snoop Filter.
  • In your case, the Snoop Filter will determine that the requested line is owned in modified state in the private caches of the Producer core.
  • The CHA will send an intervention request to the Producer core to send the cache line to the consumer core.
    • Lots of details become a bit harder to guess at this point -- careful microbenchmarks are required.
    • Option 1: WB/downgrade, leave memory incoherent
      • Producer cache drops line from M to S (or F), Consumer cache takes line in F (or S) state.  F state does not have write permission, but it is dirty with respect to memory, so it must be written back to memory on eviction.
    • Option 2: WB/downgrade, update memory
      • Producer cache drops line from M to S, Consumer cache takes line in S state.  The modified data is written back to DRAM, so both Producer and Consumer caches are "clean".
    • Option 3: Migratory data transfer
      • Producer cache drops line from M to I, Consumer cache takes line in M state.
    • Options 4+: (Left as an exercise for the reader...)
    • In each of these cases, the Snoop Filter must also be updated, but the details of how this is coordinated with respect to the transactions in the Producer and Consumer caches are unclear.
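
The three options above can be written down as a small state table, which makes it easier to see what a microbenchmark would have to distinguish. This is only a toy sketch, not a description of what Skylake-SP actually does: the names `State`, `Outcome`, and `transfer` are hypothetical, and for Option 1 it arbitrarily picks the variant where the producer ends in S and the consumer in F.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    """MESIF cache-line states, as used in the discussion above."""
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"
    F = "Forward"


@dataclass
class Outcome:
    producer: State               # producer's state after the transfer
    consumer: State               # consumer's state after the transfer
    memory_updated: bool          # True if DRAM holds the new data afterwards
    dirty_holder: Optional[str]   # cache that must still write back on eviction


def transfer(option: int) -> Outcome:
    """Hypothetical result of a consumer load that hits a line held in M
    by the producer. Assumption: these are the candidate behaviors listed
    in the post, not confirmed hardware behavior."""
    if option == 1:
        # WB/downgrade, memory not updated: the F copy is dirty w.r.t. memory.
        return Outcome(State.S, State.F, False, "consumer")
    if option == 2:
        # WB/downgrade with a write to DRAM: both copies end up clean.
        return Outcome(State.S, State.S, True, None)
    if option == 3:
        # Migratory transfer: ownership (and the dirty data) moves over.
        return Outcome(State.I, State.M, False, "consumer")
    raise ValueError("only options 1-3 are modeled")


if __name__ == "__main__":
    for opt in (1, 2, 3):
        print(opt, transfer(opt))
```

In this model the options differ in two observable ways: whether a DRAM write occurs (only Option 2), and whether a later read by the producer would miss again (only Option 3). Those are the kinds of differences a directed test with uncore counters would look for.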

Table 3-1 in the uncore performance monitoring guide document is particularly valuable -- it lists the transactions available to the Coherence and Home Agents.   Lots of carefully directed testing is required to make sense of any of it.

Jaeyoung__Choi

Dear John,

You always give detailed answers to my questions.

I think it is clear now. I have learned a lot from you.

Thank you.
