Memory load performance counter on Haswell

BGoel
Beginner

I have noticed that on the Haswell microarchitecture

MEM_LOAD_UOPS_RETIRED:L1_HIT + MEM_LOAD_UOPS_RETIRED:L1_MISS != MEM_UOPS_RETIRED:ALL_LOADS

and I was wondering why.

If I add MEM_LOAD_UOPS_RETIRED:HIT_LFB to the left side of the equation above, I get closer to the count of MEM_UOPS_RETIRED:ALL_LOADS, but I don't understand why I need to do that, since a load will hit a line fill buffer (LFB) only after it misses L1. So LFB hits should already be counted as L1 misses. Am I misinterpreting the counters here?

1 Solution
McCalpinJohn
Honored Contributor III

I believe that the counters are defined so that L1 misses that hit an existing LFB are not counted by the L1 miss event.

In other words, the event MEM_LOAD_UOPS_RETIRED.L1_MISS should be interpreted as "loads that miss the L1 data cache and do not match a currently allocated LFB".

So the L1_HIT + L1_MISS + HIT_LFB sub-events should add up to a number that is very close to the ALL_LOADS count.
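As a minimal sketch of that consistency check (the function name and the counts below are hypothetical, made up purely for illustration, and are not measurements from this thread), the relative gap between the three sub-events and ALL_LOADS could be computed like this:

```python
def load_breakdown_gap(l1_hit, l1_miss, hit_lfb, all_loads):
    """Relative gap between (L1_HIT + L1_MISS + HIT_LFB) and ALL_LOADS.

    A value near zero means the three sub-events account for
    essentially every retired load.
    """
    if all_loads <= 0:
        raise ValueError("all_loads must be positive")
    total = l1_hit + l1_miss + hit_lfb
    return (total - all_loads) / all_loads


# Hypothetical counter readings, invented for illustration only.
gap = load_breakdown_gap(
    l1_hit=900_000,
    l1_miss=40_000,
    hit_lfb=59_990,
    all_loads=1_000_000,
)
print(f"relative gap: {gap:.2e}")
```

A gap of a few parts per million would suggest the breakdown is complete; a gap of parts per thousand or more would suggest that some loads are being dropped or double-counted.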

Note that erratum HSD29 for the 4th-generation Core processors (i.e., Haswell) says that these performance counter events can be incorrect when HyperThreading is enabled -- counts can be dropped, or counts can be incorrectly included when they are actually due to instructions issued by the other logical processor. This erratum does not apply if HyperThreading is disabled in the BIOS.

There are a number of other errata relating to these events in this (and other) processor generation(s), and there is no guarantee that the published errata include all the bugs that are actually present.

On my Sandy Bridge EP systems (Xeon E5-2680), the MEM_LOAD_UOPS_RETIRED.L1_MISS sub-event is not listed in the documentation, but I decided to try that Umask (0x08) anyway to see if it works. Using six variants of an instrumented version of the STREAM benchmark, I found that the sum of the Event 0xD1 L1_HIT, L1_MISS, and HIT_LFB sub-events matched MEM_UOPS_RETIRED.ALL_LOADS (Event 0xD0, Umask 0x81) to a few parts per million for most cases.
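For readers who want to program these events by number, here is a small sketch of how an event select and unit mask combine into a raw event code. The packing (event in bits 0-7, umask in bits 8-15) is the standard IA32_PERFEVTSELx layout. The L1_MISS and ALL_LOADS encodings come from the paragraph above; the L1_HIT (0x01) and HIT_LFB (0x40) umasks are assumptions taken from Intel's published event tables and should be checked against the documentation for your processor. The `rUUEE` spelling is the raw-event form accepted by tools such as Linux perf, which is also an assumption, since the tool used for these measurements is not stated.

```python
def raw_event_code(event, umask):
    """Pack an event select (bits 0-7) and a unit mask (bits 8-15)."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit in 8 bits")
    return (umask << 8) | event


EVENTS = {
    # Encodings quoted in the post.
    "MEM_UOPS_RETIRED.ALL_LOADS": (0xD0, 0x81),
    "MEM_LOAD_UOPS_RETIRED.L1_MISS": (0xD1, 0x08),
    # Assumed umasks; verify against Intel's event tables.
    "MEM_LOAD_UOPS_RETIRED.L1_HIT": (0xD1, 0x01),
    "MEM_LOAD_UOPS_RETIRED.HIT_LFB": (0xD1, 0x40),
}

for name, (event, umask) in EVENTS.items():
    print(f"{name}: r{raw_event_code(event, umask):04x}")
```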

The cases that showed extremely good agreement (5-6 digits) were SSE scalar code (compiled with -xSSE4.2 -no-vec), SSE streaming code (compiled with -xSSE4.2), SSE vector code with allocating stores (compiled with -xSSE4.2 -opt-streaming-stores never), and AVX scalar code (compiled with -xAVX -no-vec).

The other two cases showed deviations of 1-2 parts per thousand for this formula.  These were AVX streaming code (compiled with -xAVX) and AVX vector code with allocating stores (compiled with -xAVX -opt-streaming-stores never).

Disabling all the hardware prefetchers helped reveal a bit more detail.   The overall pattern was the same -- the same 4 cases had ~6 digit agreement and the same two cases had ~3 digit agreement, but there is clearly something broken (relative to expectations) on the AVX side.

With the HW prefetchers disabled, scalar code should have ~0 L1_HITs,  ~1/8 L1_MISS events and ~7/8 HIT_LFB.  Both the SSE and AVX scalar codes matched this pattern.

With the HW prefetchers disabled, SSE vector code (with either streaming stores or allocating stores) should have ~0 L1_HITs, ~1/4 L1_MISSes, and ~3/4 HIT_LFBs.   Both of these cases also matched these ratios to within 1-2%.

With the HW prefetchers disabled, AVX vector code (with either streaming stores or allocating stores) should have ~0 L1_HITs, ~1/2 L1_MISSes, and ~1/2 HIT_LFBs. This was not the case. For these two cases the counters returned 2-3% L1_HIT events, about 0.1% L1_MISS events, and 96%-97% HIT_LFB events. Some of the L1_HIT events are due to auxiliary loads (not required by the STREAM kernel, but part of the loop overhead and/or measurement overhead), but the total loads only amount to ~1% more than those required by the STREAM kernels, so 2%-3% L1 hits must indicate something wrong with the counter. The L1_MISS event returns non-zero values, but these are very small and may be entirely due to the measurement infrastructure and not related to AVX loads at all. The HIT_LFB event seems to capture both the 50% that are expected to hit in the LFB and the remaining 47%-48% that should have been counted as L1_MISS events.
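The expected ratios in the three cases above follow from simple arithmetic, sketched below under stated assumptions: 64-byte cache lines, 8-byte (double-precision) STREAM elements, contiguous aligned loads, and every later load to a line arriving while that line's fill is still in flight. The first load to each line misses L1 and allocates an LFB; the remaining loads to that line hit the LFB. The function name is hypothetical.

```python
from fractions import Fraction


def expected_split(load_bytes, line_bytes=64):
    """Expected (L1_MISS, HIT_LFB) fractions for streaming loads.

    Assumes prefetchers are off, so the first load to each cache line
    misses L1 and every later load to that line hits the pending LFB.
    """
    if load_bytes <= 0 or line_bytes % load_bytes != 0:
        raise ValueError("load width must evenly divide the line size")
    loads_per_line = line_bytes // load_bytes
    l1_miss = Fraction(1, loads_per_line)
    return l1_miss, 1 - l1_miss


for label, width in [
    ("scalar, 8-byte loads", 8),
    ("SSE, 16-byte loads", 16),
    ("AVX, 32-byte loads", 32),
]:
    miss, lfb = expected_split(width)
    print(f"{label}: L1_MISS ~{miss}, HIT_LFB ~{lfb}")
```

This reproduces the 1/8 and 7/8, 1/4 and 3/4, and 1/2 and 1/2 expectations quoted above; the measured AVX results are the ones that depart from it.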

I will try this again on an Ivy Bridge system to see if there is any change in the L1_MISS event for AVX loads.  The behavior on Sandy Bridge may be the reason why that sub-event was left out of the documentation -- it works on scalar code and SSE vector code, but not AVX vector code.

McCalpinJohn
Honored Contributor III

Following up -- I don't see any change in the Ivy Bridge EP implementation of the MEM_LOAD_UOPS_RETIRED.L1_MISS event relative to Sandy Bridge EP. It still shows essentially no counts for 32-byte AVX reads, and instead lumps all of the loads that miss the L1 into the MEM_LOAD_UOPS_RETIRED.HIT_LFB sub-event.

BGoel
Beginner

Thanks John. That was very helpful. It's frustrating sometimes to work with counters that have misleading names. :)
