Understanding L2 Miss performance counters

BGoel · ‎08-13-2014

I am trying to understand the performance counters related to L2 misses on the Haswell microarchitecture. Can someone tell me why the L2_RQSTS:MISS counter value is greater than OFFCORE_RESPONSE_1:ANY_REQUEST:ANY_RESPONSE? Sometimes these two counter values are very close, but for some benchmarks L2_RQSTS:MISS is around 20-30% higher than OFFCORE_RESPONSE_1:ANY_REQUEST:ANY_RESPONSE. Is that because L2 misses to a cache line that is already being serviced do not generate offcore responses? Or is there some other reason? Thanks in advance.

McCalpinJohn · ‎08-13-2014

I see the L2_RQSTS:MISS event described in the VTune configuration files, but I don't see anything that maps exactly to OFFCORE_RESPONSE_1:ANY_REQUEST:ANY_RESPONSE. (On the other hand, the most recent version of VTune uses a fairly complex syntax in its configuration files and it is certainly possible that I am not understanding it correctly.)

Is there any way to find out exactly what is programmed into MSR 0x1A7 for this case?

Sukruth_H_Intel · ‎08-13-2014

Hi John,

MSR 1A7H is used for "Offcore Response Performance Monitoring". You can find the complete details about the event codes and counters here: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

futureishere,

Could you please send me the result file for the above observed behavior?

Regards,

Sukruth H V

BGoel · ‎08-14-2014

Hi Sukruth,

I am using libpfm4.4 with perf_events to get the counter values. This is what I get when I run the yada benchmark from the STAMP benchmark suite, using locks on 4 threads, with the L2 prefetcher turned off in the BIOS:

task -i -e L2_RQSTS:MISS,OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_RESPONSE ./yada -a15 -i inputs/ttimeu1000000.2 -t 4

264442632 L2_RQSTS:MISS
218706989 OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_RESPONSE
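
To put the gap between the two counts above in relative terms, here is a minimal Python sketch. The two numbers are copied from the output above; the helper name `relative_excess` is made up for illustration and is not part of libpfm4 or perf_events.

```python
def relative_excess(larger: int, smaller: int) -> float:
    """Return how much `larger` exceeds `smaller`, as a fraction of `smaller`."""
    return (larger - smaller) / smaller


# Counts copied from the run above.
l2_rqsts_miss = 264_442_632
offcore_any = 218_706_989

print(f"absolute gap: {l2_rqsts_miss - offcore_any}")
print(f"relative gap: {relative_excess(l2_rqsts_miss, offcore_any):.1%}")
```

For this run the gap works out to roughly 21% of the offcore response count, which is at the low end of the 20-30% range described in the first post.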

McCalpinJohn · ‎08-14-2014

I am aware of the use of the 0x1A7 MSR -- the question was determining exactly which bits it contained. There is no point in trying to understand a difference between two performance counter events unless the exact programming of the events is clear.

I will look up the results from the libpfm4 translation after lunch....

BGoel · ‎08-14-2014

Hi John,

ANY_REQUEST is an alias for DMND_DATA_RD:DMND_RFO:DMND_IFETCH:WB:PF_DATA_RD:PF_RFO:PF_IFETCH:PF_LLC_DATA_RD:PF_LLC_RFO:PF_LLC_IFETCH:BUS_LOCKS:STRM_ST:OTHER

and ANY_RESPONSE is used to set the bit 'Any' (Offset 16) in MSR_OFFCORE_RSP_x.

So the event OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_RESPONSE would set bits 16-15,11-0 in MSR_OFFCORE_RSP_0 register as per my understanding.
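
As a sanity check on the bit list above, here is a small Python sketch that turns a bit-range description such as "16-15,11-0" into the value that would end up in the MSR. The helper `mask_from_bit_ranges` is hypothetical (it is not a libpfm4 function), and the result only reflects the bit positions stated in this post, not anything read back from hardware.

```python
def mask_from_bit_ranges(spec: str) -> int:
    """Build an integer mask from a spec such as "16-15,11-0".

    Each comma-separated part is either a single bit ("15") or an
    inclusive range ("11-0"); the order within a range does not matter.
    """
    mask = 0
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, second = (int(x) for x in part.split("-"))
        else:
            first = second = int(part)
        low, high = min(first, second), max(first, second)
        for bit in range(low, high + 1):
            mask |= 1 << bit
    return mask


# Bits 16-15 and 11-0, as described for ANY_REQUEST:ANY_RESPONSE above.
print(hex(mask_from_bit_ranges("16-15,11-0")))
```

Under that reading the register value would be 0x18fff. Comparing this with what is actually written to the MSR is the only way to be sure how the tool translates the event.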

Krishnaswa_V_Intel · ‎08-14-2014

Xeon processors support two forms of L2 streaming prefetch. In the first case, the data is fetched into L2. In the second case, the data is fetched only into L3. This second case is also known as an LLC prefetch (or L3 prefetch), although it is still initiated by L2.

The Haswell PMU has a bug: it cannot count whether LLC prefetches hit or miss in the LLC. However, L2_RQSTS.MISS will count those. That is why you are seeing the difference. If you disable the L2 prefetcher, your numbers should match.

McCalpinJohn · ‎08-14-2014

According to section 18.11.4 of the latest revision of Vol 3 of the Intel Arch SW Developer's Guide (325384-051), bits 3 and 7-11 of the offcore response MSRs are reserved in Haswell microarchitecture. It is hard to tell exactly what this means, but it is certainly possible that the implementation has changed enough that setting these bits causes inaccurate readings.

BGoel · ‎08-15-2014

Thanks for the reply, Vish. The results I posted before were with the L2 prefetcher already turned off.

Thanks for pointing that out, John. I ran the benchmark again with just bits 15,6-4,2-0 set and got the same result.
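
For reference, a short Python sketch that checks the two request bit sets discussed in this thread against the bits John lists as reserved on Haswell (3 and 7-11). The helper `bits` and the variable names are made up for illustration, and the reduced set contains exactly the bits listed in this post.

```python
def bits(*positions: int) -> int:
    """Return an integer with the given bit positions set."""
    mask = 0
    for position in positions:
        mask |= 1 << position
    return mask


# Bits reported as reserved on Haswell earlier in the thread.
RESERVED = bits(3, 7, 8, 9, 10, 11)

# Original programming as described earlier: bits 16-15 and 11-0.
original = bits(*range(0, 12), 15, 16)

# Reduced programming from this post: bits 15, 6-4, 2-0.
reduced = bits(0, 1, 2, 4, 5, 6, 15)

print("original touches reserved bits:", hex(original & RESERVED))
print("reduced touches reserved bits: ", hex(reduced & RESERVED))
```

The first overlap is nonzero and the second is zero, so the reduced programming does avoid the reserved bits, which makes them an unlikely explanation for the mismatch.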

McCalpinJohn · ‎08-15-2014

Some ideas for further analysis:

1. Have you tried setting up a test case for which the number of L2 misses is known in advance? (Since you already know how to turn the L2 prefetcher off this should not be too tricky.) Then you could see which counter is closer to the expected value.

2. Have you tried this test on an earlier processor? Sometimes the counter functionality is effectively identical across processor generations, sometimes changes in implementation cause a legacy event to misbehave, and sometimes the new implementation brings in new bugs.

McCalpinJohn · ‎08-15-2014

I looked at some cases where I knew what answers to expect on my Sandy Bridge EP (Xeon E5-2690) systems....

Using the STREAM benchmark (one thread) with inline performance counter reads and hardware prefetch disabled, I get exact matches between L2_RQSTS.DEMAND_DATA_RD_MISS (Event 0x24, Umask 0x02 -- note that the encodings are very different on Sandy Bridge and Haswell !) and OFFCORE_RESPONSE_0 with Request type "DMND_DATA_RD" (bit 0) and Response type "Any" (bit 16) as the only two bits set. Here "exact match" means that the counts differ by at most one increment in the last decimal place when counting 5 million events (for the COPY kernel) or 10 million events (for the TRIAD kernel).

For the same benchmark, I also get exact matches between L2_RQSTS.RFO_MISS (Event 0x24, Umask 0x08) and OFFCORE_RESPONSE_0 with Request type "DMND_RFO" (bit 1) and Response type "Any" (bit 16) as the only two bits set.

When I enable the L2 prefetchers the results no longer match. The Offcore Response counts make sense -- the sum of Demand Read responses and L2 HW Prefetch Read responses is about 2% higher than the expected number of read responses. Similarly the sum of Demand RFO responses and L2 HW PF RFO responses is about 2% higher than the expected number of RFO responses. An overcount of 2% is more than I expect (since STREAM uses all the data in a 40 million element vector -- prefetches should all be used), but it is close enough for most performance work.
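
The "expected" counts used above can be reproduced with a back-of-the-envelope calculation. This is only a sketch: it assumes 8-byte elements and 64-byte cache lines (neither is stated explicitly in this thread), it assumes stores are counted as RFOs rather than streaming stores, and the names `KERNELS` and `expected_lines` are made up for illustration.

```python
ELEMENTS = 40_000_000   # vector length mentioned above
ELEMENT_BYTES = 8       # assumption: double-precision elements
LINE_BYTES = 64         # assumption: 64-byte cache lines

# Cache lines touched when one array is streamed through once.
lines_per_array = ELEMENTS * ELEMENT_BYTES // LINE_BYTES

# (arrays read, arrays written) for one pass of each kernel.
KERNELS = {
    "COPY": (1, 1),    # c[j] = a[j]
    "TRIAD": (2, 1),   # a[j] = b[j] + scalar * c[j]
}


def expected_lines(kernel: str) -> tuple[int, int]:
    """Return (expected read lines, expected RFO lines) for one pass."""
    reads, writes = KERNELS[kernel]
    return reads * lines_per_array, writes * lines_per_array


for name in KERNELS:
    reads, rfos = expected_lines(name)
    print(f"{name}: {reads} read lines, {rfos} RFO lines")
```

Under these assumptions each array covers 5 million cache lines, which lines up with the 5 million (COPY) and 10 million (TRIAD) read events quoted above.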

On the other hand, with the HW prefetchers enabled I have been unable to come up with an interpretation of the L2 counters that makes much sense. Some L2 demand read misses are converted to L2 demand read hits, but the sum of demand read hits and misses is 30% lower than the expected value for the COPY kernel and 17% lower than the expected value for the TRIAD kernel.

BGoel · ‎08-22-2014

Hi John,

Sorry for the delayed response. I was trying to get access to an Ivy Bridge machine to confirm your observations. And yes, I see the same results as you do on an Ivy Bridge Core i7-3770 machine. The L2_RQSTS and OFFCORE_RESPONSE_0 results do match with the L2 prefetcher turned off.

So I don't understand why I can't get these counters to match on my Haswell machine (Core i7-4770), unless there's a bug in the Haswell PMU.

BGoel · ‎08-22-2014

I am observing a few more strange things on Haswell. With the L2 prefetcher turned off, for the yada benchmark, I still see a nonzero count for L2_RQSTS:ALL_PF, while L2_RQSTS:L2_PF_HIT and L2_RQSTS:L2_PF_MISS are zero as expected:

task -i -e L2_RQSTS:ALL_PF,L2_RQSTS:L2_PF_HIT,L2_RQSTS:L2_PF_MISS ./cmd 4

52220075 L2_RQSTS:ALL_PF
0 L2_RQSTS:L2_PF_HIT
0 L2_RQSTS:L2_PF_MISS

That can't be expected behavior, can it?

Krishnaswa_V_Intel · ‎08-22-2014

I am not sure the prefetchers are actually getting disabled properly on your system. Can you please read MSR 0x1A4 and report the value?

BGoel · ‎08-22-2014

Value of 0x1A4: 3

I am not sure which MSR this is, as this address is not mentioned in the manual. In case you meant the IA32_MISC_ENABLE MSR, its value is 0x1A0: 4000850089.

Krishnaswa_V_Intel · ‎08-22-2014

I meant to write 0x1A0 :). Does your BIOS expose an option for disabling the L1 prefetcher? If so, can you give that a try?

BGoel · ‎08-22-2014

No, it doesn't. :(

Krishnaswa_V_Intel · ‎08-22-2014

btw, how are you measuring these events? Is that through Linux Perf or your own tool programming the PMU?

BGoel · ‎08-22-2014

I am using libpfm 4.5.0 that uses Linux perf_events underneath.

BGoel · ‎09-04-2014

Hi Vish,

Any updates on this? I tried to verify the results on a system with an Intel motherboard (DQ87PG), but its BIOS doesn't even provide an option to disable the hardware prefetcher! :(

Krishnaswa_V_Intel · ‎09-24-2014

We have just publicly disclosed how to enable/disable the h/w prefetchers on Intel processors code-named Nehalem, Westmere, Sandy Bridge, Ivy Bridge and Haswell. Please refer to the article I just posted: https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
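
For readers who want to interpret the 0x1A4 value reported earlier in this thread, here is a small Python sketch of a decoder. The bit assignments (bit 0: L2 hardware prefetcher, bit 1: L2 adjacent cache line prefetcher, bit 2: DCU prefetcher, bit 3: DCU IP prefetcher, with a set bit meaning "disabled") are my reading of the linked article and should be checked against it; the function name `decode_prefetch_control` is made up for illustration.

```python
# Assumed layout of MSR 0x1A4; a set bit disables that prefetcher.
PREFETCHER_DISABLE_BITS = {
    0: "L2 hardware prefetcher",
    1: "L2 adjacent cache line prefetcher",
    2: "DCU prefetcher",
    3: "DCU IP prefetcher",
}


def decode_prefetch_control(value: int) -> dict[str, str]:
    """Map each prefetcher name to "disabled" or "enabled"."""
    return {
        name: "disabled" if (value >> bit) & 1 else "enabled"
        for bit, name in PREFETCHER_DISABLE_BITS.items()
    }


# Value of 0x1A4 reported earlier in the thread.
for name, state in decode_prefetch_control(3).items():
    print(f"{name}: {state}")
```

If that layout is right, a value of 3 means both L2 prefetchers were disabled while the two DCU (L1 data) prefetchers were still enabled.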