Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Question on L1 writeback performance counters on Westmere EP

McCalpinJohn
Honored Contributor III
Just in time for the weekend!

I am trying to figure out the memory access problems with an FFT code and can't quite decide what some of the Westmere EP (Xeon X5650, 06_2C) performance monitor events actually mean....

L1 writebacks show up in four different performance monitor events on this processor (Intel Architectures SW Developer's Manual, Vol. 3B, document 325384-042, Table 19-9):

(1) Event 28, Masks 01/02/04/08, L1D_WB_L2.*_STATE: counts the number of L1 writebacks to the L2 where the cache line being written is in the "*" state.
(2) Event B0, Mask 40, OFFCORE_REQUESTS.L1D_WRITEBACK: counts the number of L1D writebacks to the uncore.
(3) Event F0, Mask 10, L2_TRANSACTIONS.L1D_WB: counts L1D writeback operations to the L2.
(4) Event 51, Masks 04/08, L1D.M_*_EVICT: counts the number of modified lines evicted from the L1 data cache due to replacement (04) or snoop HITM intervention (08).
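
For concreteness, here is a minimal sketch (not from the measurements below) of how these raw events can be programmed on Linux via perf_event_open. The config values use the usual x86 raw encoding, (umask << 8) | event, with the event/umask values listed above, and run_fft() is a placeholder for the workload:

/* Sketch: count three of the L1 writeback events above with perf_event_open.
 * Raw x86 config = (umask << 8) | event_code.  Error handling is omitted. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

static int open_raw_counter(uint64_t umask, uint64_t event)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = (umask << 8) | event;
    attr.disabled = 1;
    attr.exclude_kernel = 1;   /* count user-mode work only */
    attr.exclude_hv = 1;
    /* measure this process on whatever CPU it runs on (pin the thread separately) */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

static uint64_t read_counter(int fd)
{
    uint64_t value = 0;
    if (read(fd, &value, sizeof(value)) != sizeof(value))
        perror("read");
    return value;
}

int main(void)
{
    int fd[3] = {
        open_raw_counter(0x10, 0xF0),  /* L2_TRANSACTIONS.L1D_WB         */
        open_raw_counter(0x40, 0xB0),  /* OFFCORE_REQUESTS.L1D_WRITEBACK */
        open_raw_counter(0x04, 0x51),  /* L1D.M_EVICT                    */
    };

    for (int i = 0; i < 3; i++) {
        ioctl(fd[i], PERF_EVENT_IOC_RESET, 0);
        ioctl(fd[i], PERF_EVENT_IOC_ENABLE, 0);
    }

    /* run_fft();  placeholder for the workload being measured */

    for (int i = 0; i < 3; i++)
        ioctl(fd[i], PERF_EVENT_IOC_DISABLE, 0);

    printf("L1D WB to L2: %llu  WB to uncore: %llu  L1D.M_EVICT: %llu\n",
           (unsigned long long)read_counter(fd[0]),
           (unsigned long long)read_counter(fd[1]),
           (unsigned long long)read_counter(fd[2]));
    return 0;
}

The same raw encodings also work from the command line, e.g. "perf stat -e r10f0,r40b0,r0451 ./a.out" (the binary name is of course a placeholder).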

For Event 28, it would probably be clearer if the text ended with "...is in the * state in the L2 cache". The L1 line will usually be in the M state (though I don't know how Intel handles "O" state lines).

For this code I sort of expect strange results because the power-of-2 strides in the FFT are likely to cause lots of cache conflicts, but I am not sure enough about the meaning of the counters to know if I am seeing evidence of this or not....

Normalizing the events to "counts per FFT element per FFT" gives reasonable numbers to look at. The raw values are in the range of 1 billion writebacks per execution of the code and are extremely stable across runs.

Event F0, Mask 10 gives 3.79 writebacks to the L2
Event 28, Mask 0F gives 3.79 writebacks to the L2
    Mask 01 gives 0.36 writebacks to I state lines (9.5%)
    Mask 02 gives 0.00 writebacks to S state lines (0.0%)
    Mask 04 gives 2.82 writebacks to E state lines (74.4%)
    Mask 08 gives 0.59 writebacks to M state lines (15.7%)
Event B0, Mask 40 gives 3.19 writebacks to the uncore (84.1% of the WB to the L2 given by Event F0)
Event 51, Mask 04 gives 3.79 L1 M state evictions due to replacement
Event 51, Mask 08 gives 3.79 L1 M state evictions due to snoop HITM
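
For reference, the normalization described above is just a division by the total number of elements processed; here is a small sketch with hypothetical problem sizes (fft_length and num_ffts are placeholders, not the actual configuration used here):

#include <stdint.h>
#include <stdio.h>

/* Convert a raw counter value into "counts per FFT element per FFT".
 * fft_length and num_ffts are hypothetical placeholders. */
static double per_element_per_fft(uint64_t raw_count,
                                  uint64_t fft_length,
                                  uint64_t num_ffts)
{
    return (double)raw_count / ((double)fft_length * (double)num_ffts);
}

int main(void)
{
    /* e.g., ~1 billion writebacks over 1024 FFTs of length 262144
     * would normalize to roughly 3.7 writebacks per element per FFT */
    printf("%.2f\n", per_element_per_fft(1000000000ULL, 262144ULL, 1024ULL));
    return 0;
}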

Questions:
(a) What causes a writeback to an I state line in the L2?
(I am running a single threaded workload pinned to a single core with HT disabled)
(b) What causes a writeback to the uncore?
(c) Is an L1 WB to the uncore a subset of writebacks to L2 or is it additive?
(d) Do counts in Event 51, Mask 04 imply "L1 replacement" (i.e., a "capacity" miss), or is the event more general (e.g., due to L2 or L3 replacements forcing L1 invalidation)?
(e) Do counts in Event 51, Mask 08 imply that there is something other than an L1 capacity miss happening? (Note that this is a single threaded workload pinned to a single core, so no interventions will come from other processor cores, but interventions could come from the L2 or L3.)

I am hoping that these counters give information that can be used to determine the number of L1 writebacks caused by L3 replacements that force L2/L1 invalidation (perhaps Event 51/Mask 08) and the number caused by L2 replacements that force L1 invalidation (perhaps Event B0/Mask 40).

It is more likely that the explanation is some combination of misinterpretation on my part and counters that don't count exactly what they are supposed to count, but I always like to learn --- maybe I can use these counters to learn something even more interesting than what I was originally looking for....
3 Replies
Hussam_Mousa__Intel_
New Contributor II

Hello jdmccalpin,

Apologies for the delay in responding to your message. As you probably realize, it was a very detailed and involved question, and it took me some time to consult internally on many of the subtle details.

Regarding the behavior of your software, my hunch, based on your reported numbers, is that 3.19 is the number actually attributable to your code; the remainder is probably prefetch-related writebacks. The event 0xB0/mask 0x04 excludes prefetching.

Event 0x51/umask 0x04 equals event 0x51/umask 0x08 because you are running single-threaded and are not getting any snoop requests from other caching units.

However, I am not sure I understand the direct benefit to software from understanding only the writeback behavior of the caching unit. Can you please provide some more context?

I am not sure what you mean by O state.

Below are some specific responses to your enumerated questions:


(a) What causes a writeback to an I state line in the L2?
(I am running a single threaded workload pinned to a single core with HT disabled)

Most likely the L2 prefetches have already evicted the lines that the L1 is writing back. The L2 need not contain the L1's lines because the L2 is not inclusive of the L1. Other factors, such as different replacement policies in the L1 and L2, can also leave the L2 copy in the I state.

(b) What causes a writeback to the uncore?

One reason among many is that the line is in the LLC but not in the L2. There are a variety of other reasons, and the details are generally implementation-specific. Recall that the uncore includes the LLC (which is the L3 in this architecture).

(c) Is an L1 WB to the uncore a subset of writebacks to L2 or is it additive?

They could be either.

(d) Do counts in Event 51, Mask 04 imply "L1 replacement" (i.e., a "capacity" miss), or is the event more general (e.g., due to L2 or L3 replacements forcing L1 invalidation)?

It typically implies L1 replacement (i.e., a "capacity" miss).

(e) Do counts in Event 51, Mask 08 imply that there is something other than an L1 capacity miss happening? (Note that this is a single threaded workload pinned to a single core, so no interventions will come from other processor cores, but interventions could come from the L2 or L3.)

L1D.M_SNOOP_EVICT (event 0x51, umask 0x08) is inclusive of L1D.M_EVICT (event 0x51, umask 0x04): it counts both replacement evictions and HITM snoop evictions, whereas the 0x04 event counts only replacement evictions. (The definition in the SDM was corrected for later architectures; see Table 19-3.)
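
In other words, under that definition the snoop-caused portion is just the difference between the two counts; a small sketch (the variable names are placeholders for values read from the two counters):

#include <stdint.h>

/* Snoop-only portion of the M-state evictions, assuming 0x51/0x08 counts
 * replacement evictions plus HITM snoop evictions and 0x51/0x04 counts
 * replacement evictions only. */
static uint64_t snoop_only_m_evictions(uint64_t m_snoop_evict, /* 0x51/0x08 */
                                       uint64_t m_evict)       /* 0x51/0x04 */
{
    return (m_snoop_evict > m_evict) ? (m_snoop_evict - m_evict) : 0;
}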

Hope this helps,
Hussam

McCalpinJohn
Honored Contributor III
Thanks for the wonderful response! I was afraid that this had disappeared into the bit bucket....

The bigger context is that I am trying to understand how much the power-of-two strides in various FFT algorithms impact cache effectiveness (and thereby application performance). Significant conflict misses typically indicate an opportunity for either code or data rearrangement to improve performance. Some of the cache behavior can be estimated with cache models, but real hardware typically has complexities that one does not anticipate when setting up a simulation, so hardware performance counters are often quite useful -- provided that you can figure out what they mean! (I worked in the HW design teams at SGI, IBM, and AMD, so I have usually had access to engineering resources to figure these things out. Now that I am back in academia it is more challenging to get that level of detail.)

I was operating under the (incorrect) assumption that the cache hierarchy on Westmere EP was inclusive.
Re-reading the SW Optimization guide (#248966, section 2.3.4) makes it clear that the L2 does not include the L1, but that the L3 includes all the lines in the L1 I&D caches and all the L2 caches. Fixing this incorrect assumption makes most of my confusion go away!

Of course "most" is not "all", and I still have plenty of work to do here. The large fraction of writebacks to the uncore suggests that L2 conflicts are a serious issue. I suspect that these are mostly due to demand misses with large power-of-two separations (larger than 4 KiB), rather than due to L2 prefetches (which operate within 4 KiB pages). I will need to run with large pages to eliminate the confounding factor of page coloring in the L2 & L3 (though I expect that to *decrease* performance in this case by making conflicts more consistent), and I will also need to run with the prefetchers disabled to get the "traditional" dumb cache behavior.
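
As a sketch of the large-page experiment (the helper name and fallback policy are mine, not anything specific to this code), the FFT buffers could be allocated from 2 MiB huge pages on Linux via mmap with MAP_HUGETLB, assuming hugetlb pages have been reserved beforehand (e.g., through /proc/sys/vm/nr_hugepages):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate a buffer backed by 2 MiB huge pages when possible, falling back
 * to normal 4 KiB pages if the hugetlb pool is empty or unavailable. */
static void *alloc_fft_buffer(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        fprintf(stderr, "huge-page mmap failed; falling back to 4 KiB pages\n");
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
    }
    return p;
}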

Thanks for the suggestion to look at B0/04 to get the offcore demand RFO requests (excluding L2 prefetch requests). There are a number of other counters of this sort that I probably ought to be looking at as well.

BTW, my reference to the "O" state was another confusion on my part. "O" (short for "Owned") is an extension to the MESI protocol used by some other processor architectures to indicate a line that is in a "shared" state but inconsistent with memory. It occurs when an "M" state line is downgraded by a read from another core without being written back to system memory. It acts like an "S" state line except that it is dirty, so it must be written back to an outer level of the cache or to memory when evicted. The MOESI protocol is used by AMD64 processors (http://en.wikipedia.org/wiki/MOESI_protocol), and a similar state (called "T") is used in IBM POWER processors, starting with POWER4, if I recall correctly. The 13 states of the IBM POWER6 and POWER7 cache coherence protocol are described in Table 2 of Sinharoy et al., "IBM POWER7 multicore server processor", IBM J. Res. & Dev., Vol. 55, No. 3, Paper 1, May/June 2011. (I worked on the design team for the POWER4 and POWER5 processors and thought they were more than complex enough, but POWER6 added a lot of additional states.)
Hussam_Mousa__Intel_
New Contributor II
I am glad this was helpful. Cache demand accesses and hits are typically used as a better gauge of locality than writebacks, hence my curiosity about your research.

On a different but related note, you should explore the offcore events available on Westmere. They let you characterize the traffic between a given core and the uncore with a decent level of thoroughness.
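
For example (a sketch on my part, not an official recipe): on a kernel with offcore-response support, OFFCORE_RESPONSE_0 (event 0xB7, umask 0x01 on this generation) can be opened through perf_event_open with the request/response selection passed in attr.config1. The exact selection bits must come from the SDM's offcore response tables, so they are left as a parameter here.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>

/* Open OFFCORE_RESPONSE_0 (raw event 0xB7, umask 0x01).  rsp_bits selects
 * the request/response combination and is forwarded by the kernel to the
 * OFFCORE_RSP_0 MSR; fill it in from the SDM encoding tables. */
static int open_offcore_response0(uint64_t rsp_bits)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = (0x01ULL << 8) | 0xB7;  /* umask 0x01, event 0xB7 */
    attr.config1 = rsp_bits;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}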

-Hussam