Solved: About Caching Home agent

Jaeyoung__Choi · ‎09-03-2019

Hi,

I am trying to understand what the CHA does, but there is very limited information in Intel's documentation.

Is there a document that describes what the CHA does, especially in the Xeon Scalable family?

If so, could you please share it with me?

Thank you.

McCalpinJohn · ‎09-04-2019

In previous processors, coherence for lines hitting in the L3 cache was handled by the C-Box associated with each L3 slice. Coherence for transactions going to memory was handled by the Home Agent. Starting with Xeon Phi x200, the functionality of the Home Agent was split up and distributed around the mesh. With SKX, each mesh stop now has an L3 slice (LLC), a "Caching and Home Agent" (CHA), and a "Snoop Filter" (SF). Addresses are hashed and assigned to be processed by the LLC/CHA/SF at exactly one of the active mesh stops. For normal cacheable addresses:

The address is checked against the LLC to see if the line is cached in the L3.
The address is checked against the SF to see if the line is cached in one or more of the L1/L2 caches on the same chip.

The CHA monitors the LLC and SF checks, then if necessary coordinates any additional responses (such as sending the request to memory, monitoring for conflicting accesses, etc.).
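The flow described above can be sketched as a toy model. This is only an illustration: the `home_slice` hash below is a placeholder (Intel's real address hash is not documented), the names `home_slice` and `handle_request` are hypothetical, and the order of the LLC and SF checks in the code is arbitrary rather than a claim about the hardware.

```python
def home_slice(addr, n_slices):
    """Pick the one slice responsible for this address.

    Placeholder hash, NOT Intel's (undocumented) hash function.
    """
    return (addr >> 6) % n_slices  # 64-byte cache lines


def handle_request(addr, llc, sf, n_slices):
    """Toy version of what the CHA at the home slice decides.

    llc and sf are lists of sets, one set of cache-line numbers per slice.
    """
    s = home_slice(addr, n_slices)
    line = addr >> 6
    if line in llc[s]:
        return s, "LLC hit"
    if line in sf[s]:
        return s, "SF hit: line is in an L1/L2 on this chip"
    return s, "miss: CHA sends the request to memory"


# Illustrative state: 4 slices, one line cached in the L3, one tracked by the SF.
llc = [set() for _ in range(4)]
sf = [set() for _ in range(4)]
llc[1].add(1)   # line 1 (address 0x40) lives in slice 1's LLC
sf[2].add(2)    # line 2 (address 0x80) is held in some core's L1/L2

for addr in (0x40, 0x80, 0xC0):
    print(hex(addr), handle_request(addr, llc, sf, 4))
```

The point of the sketch is only that each address has exactly one home slice, and that the CHA there decides what else has to happen.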

The existence of this set of three "boxes" at each mesh stop was disclosed in Intel's 2017 presentation at the Hot Chips conference. The function of the boxes is described in "Intel Xeon Processor Scalable Memory Family Uncore Performance Monitoring Reference Manual" (Intel document 336274). Section 2.2.5 describes the performance monitoring events supported by the CHA, and provides pretty much all of the available information about the functionality of the CHA. Between the uncore performance monitoring manual and directed testing, a person who is experienced with different coherence protocols can figure out a lot (but almost certainly not all the details).

Jaeyoung__Choi · ‎09-04-2019

Dear John

Thank you for sharing the information.

But I still can't understand how the CHA sends messages over the mesh interconnect.

Recently, I worked out my core mapping by following your methodology, and I found that CHA0 and CHA2 are the CHAs adjacent to the IMC.

In the "Intel Xeon Processor Scalable Memory Family Uncore Performance Monitoring Reference Manual" there is an event "WPQ_CYCLES_FULL" in the IMC event section, and its definition contains the statement "This count should be similar count in the HA which tracks the number of cycles that the HA has no WPQ credits".

So I looked in the CHA event section and found the "WRITE_NO_CREDITS" event. I think this is the event that should be similar to WPQ_CYCLES_FULL.

I generated a lot of store requests and then monitored the "WPQ_CYCLES_FULL" and "WRITE_NO_CREDITS" events.

But I got a really strange result. WRITE_NO_CREDITS reports a value of about one million, but WPQ_CYCLES_FULL reports only about 161 cycles when I sum the values for channels 0, 1, and 2. Moreover, I cannot understand why all of the CHAs report about one million WRITE_NO_CREDITS events. As far as I know, a credit indicates buffer availability at the next node, so I expected only CHA0 and CHA2 to report the WRITE_NO_CREDITS event...

I have no idea what I have misunderstood.

McCalpinJohn · ‎09-05-2019

Intel does not make this easy, so I should probably be more careful about nomenclature....

On the 28-core SKX/CLX die, I found an invariant mapping of bit positions in CAPID6 to mesh stops with enabled CHA/SF/L3 slices. I will call this the "physical tile number". Bit 0 of CAPID6 refers to the tile in the upper left corner of the die (immediately above IMC0), bit 1 of CAPID6 refers to the tile immediately below IMC0, and physical tile numbers increase downward, from the left column to the right column, skipping over the two IMCs.

The MSR interface to the CHA/SF/L3 slices works by "logical tile number". The "logical tile numbers" are assigned using the same pattern (top to bottom, left to right) as the "physical tile numbers", but the numbering skips over tiles with disabled CHA/SF/L3. So for a processor with N (N<28) enabled CHA/SF/L3 slices, "logical tile numbers" 0 to N-1 are always the active ones, and "logical tile numbers" N to 27 are either mapped to the disabled "physical tile numbers" or mapped to NULL. I.e., the MSRs for "logical tile numbers" N to 27 can be read and written, but they don't appear to do anything.
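As a minimal sketch of the numbering rule just described, the logical-to-physical mapping can be derived from a CAPID6 value by listing the set bits in ascending order. How the CAPID6 value is read is not covered here, the function name is hypothetical, and the mask in the example is made up for an 8-tile toy die.

```python
from typing import List


def enabled_physical_tiles(capid6: int, n_tiles: int = 28) -> List[int]:
    """Return the physical tile numbers whose CAPID6 bit is set, ascending.

    The index into the returned list is the logical tile (CHA) number.
    """
    return [bit for bit in range(n_tiles) if (capid6 >> bit) & 1]


# Made-up mask for illustration: physical tiles 1 and 4 are disabled.
toy_mask = 0b11101101
for logical, physical in enumerate(enabled_physical_tiles(toy_mask, n_tiles=8)):
    print(f"logical CHA{logical} -> physical tile {physical}")
```

With N bits set, the list has N entries, which matches the statement that logical tile numbers 0 to N-1 are the active ones.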

So for the 28-core SKX/CLX die, the CHA/SF/L3 that is accessed as (logical) "CHA0" will be the one directly above IMC0 (unless bit 0 of CAPID6 is zero, indicating that the CHA/SF/L3 at that mesh point is disabled) and the CHA/SF/L3 that is accessed as (logical) "CHA1" will be the one directly below IMC0 (unless either bit 0 or bit 1 of CAPID6 is zero). On all the systems I have examined, CAPID6 splits its zero values evenly between the upper 14 bits (the right half of the die) and the lower 14 bits (the left half of the die). My interpretation of the product offerings is that Intel offers processors based on the 28-core die with 12, 14, 16, 18, 20, 22, 24, 26, or 28 active CHA/SF/L3 slices. With 12 slices enabled, 16 slices are disabled -- or 8 slices disabled in each half of the chip. There are probably some restrictions on the patterns of disabled slices, but for the 24-slice (gen1) parts, we have 122 different patterns across our 3472 Xeon Platinum 8160 processors, so it would take access to a whole lot of processors to be sure that some patterns were not allowable.
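The even split between the two halves is easy to check for a given CAPID6 value. The sketch below assumes, as stated above, that the lower 14 bits correspond to the left half of the die and the upper 14 bits to the right half. The function name and the example mask are hypothetical.

```python
def disabled_per_half(capid6: int, n_tiles: int = 28):
    """Count disabled slices (zero bits) in the lower and upper half of CAPID6.

    Returns (left_disabled, right_disabled), taking the lower bits as the
    left half of the die and the upper bits as the right half.
    """
    half = n_tiles // 2
    half_mask = (1 << half) - 1
    lower = capid6 & half_mask
    upper = (capid6 >> half) & half_mask
    left_disabled = half - bin(lower).count("1")
    right_disabled = half - bin(upper).count("1")
    return left_disabled, right_disabled


# Made-up 24-slice pattern: bits 3, 9, 17 and 25 cleared.
all_enabled = (1 << 28) - 1
example = all_enabled & ~((1 << 3) | (1 << 9) | (1 << 17) | (1 << 25))
print(disabled_per_half(example))
```

Running this over the CAPID6 values of many processors would show whether the even split holds on a given set of systems.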

I don't know of any descriptions of Intel's credit-based flow-control schemes, but it appears to me that there are different credit mechanisms for different kinds of traffic. The credits discussed in Sections 2.2.2 and 2.2.3 are for the "Common Mesh Stop" (CMS). These are credits that allow the agents that live at this mesh stop to insert various kinds of traffic onto the AD (Command and Address) meshes and BL (Block data) meshes.

The credits discussed in Sections 2.2.8 and 2.2.10 are credits that the CHA has for sending reads or writes to one of the IMCs. These are held by each CHA -- unrelated to the distance to the IMC -- and are used to make accesses more "fair". Without a distributed credit mechanism of this sort, the CHAs adjacent to the memory controllers would probably get more than their fair share of traffic, and the most distant CHAs would get starved.
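To illustrate why a per-CHA credit pool can run dry in every CHA regardless of its distance from the IMC, here is a toy model. It is not a description of Intel's implementation, which is not documented. The class name, the credit count, and the credit-return latency are all invented for the illustration.

```python
from collections import deque


class ChaWriteCredits:
    """Toy per-CHA pool of write credits for one IMC (illustrative only)."""

    def __init__(self, credits: int, return_latency: int):
        self.credits = credits
        self.return_latency = return_latency
        self.in_flight = deque()     # cycles at which spent credits come back
        self.no_credit_cycles = 0    # loosely analogous to a "no credits" count

    def tick(self, cycle: int, wants_to_write: bool) -> bool:
        """Advance one cycle; return True if a write was sent."""
        # Reclaim credits whose return latency has elapsed.
        while self.in_flight and self.in_flight[0] <= cycle:
            self.in_flight.popleft()
            self.credits += 1
        if not wants_to_write:
            return False
        if self.credits == 0:
            self.no_credit_cycles += 1
            return False
        self.credits -= 1
        self.in_flight.append(cycle + self.return_latency)
        return True


def simulate(n_chas=4, credits=2, return_latency=10, cycles=100):
    """Every CHA tries to write on every cycle; return stall counts per CHA."""
    pools = [ChaWriteCredits(credits, return_latency) for _ in range(n_chas)]
    for cycle in range(cycles):
        for pool in pools:
            pool.tick(cycle, wants_to_write=True)
    return [pool.no_credit_cycles for pool in pools]


print(simulate())
```

In this model each CHA stalls on its own credits, so all CHAs report similar "no credit" counts under a heavy store load, with no dependence on where they sit in the mesh.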

Traffic from the CHAs to the IMCs goes through the M2M box. It is possible that the discrepancy between the "WPQ_CYCLES_FULL" and "WRITE_NO_CREDITS" events can be cleared up by reviewing the related events in the M2M box(es): WPQ_CYCLES_REG_CREDITS, WPQ_CYCLES_SPEC_CREDITS, WRITE_TRACKER_CYCLES_FULL, WRITE_TRACKER_CYCLES_NE. It may be that the CHA WRITE_NO_CREDITS event is incrementing because the M2M box cannot accept more writes, and not because the IMC WPQ is full. It is also possible that the counter event is broken. It is also possible that I am completely misunderstanding the way these mechanisms are implemented....

Jaeyoung__Choi · ‎09-08-2019

Dear John

Thank you for sharing your knowledge.

I also have an equal distribution of cores between the left and right parts of the die (7 in each).

I saw the event TxC_BL_INSERTS in the M2M events section and noticed that it has umasks for "CMS-Near side" and "CMS-Far side".

So I expected that, for the M2M attached to IMC0, the near side would be the left part of the die and the far side would be the right part of the die.

To check this expectation, I used only IMC0 and pinned a process that generates a lot of load requests to each core in turn. (I have the tile-to-core mapping information.)

My experiment was very simple.

1. pin the process

2. count M2M-IMC0 TxC_BL_INSERTS ( Near side )

3. count M2M-IMC0 TxC_BL_INSERTS ( Far side )

For each core, only one of the two counts (near side or far side) was high.

But after I collected all of the results, I realized my expectation was totally wrong.

There was no relation between near/far side and the physical location of the core's tile...

May I ask whether you also see no relation between the core's physical location and near/far side?

Thank you.

McCalpinJohn · ‎09-09-2019

I can't remember if I ever figured out the meaning of Agent 0 and Agent 1 in this context. For the M2M, I would guess that one side of the mesh stop (i.e., one of the Agents) refers to transactions going to/from the M2M box, and the other side goes to everything else, but I don't think that I have tried to pin this down. I had trouble getting the M2M counters to work the first time I tried to use them, and I don't think I went back to this topic....