Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

4th gen (Haswell) undocumented events?

nsmeds
Novice
765 Views

Hi,

I think this may be the right forum to find people with some insight into this.

Reading "Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 3A, 3B and 3C" from June 2013, I got confused by section 19-2, "PERFORMANCE MONITORING EVENTS FOR THE 4TH GENERATION INTEL® CORE™ PROCESSORS". On p. 2530, event number D0H (MEM_UOPS_RETIRED) is described. However, umask values 02H and 20H are not described, even though they are referenced as values that can be combined with the umask values that are documented. A not too far-fetched guess is that 02H is STORES; 20H is harder to guess.

Does anyone on the forum know what should have been written in the docs?

/Nils

13 Replies
Patrick_F_Intel1
Employee

Hello Nils,

This is the correct forum for questions like this. I will check if these umasks are supposed to be public and get back to you.

Pat

perfwise
Beginner

Patrick,

    Following up with a similar question: PMC event 24H has changed from IB (Ivy Bridge) to HW (Haswell). I am referring to the System Programming Guide (inclusive of parts 1 and 2), page 19-3, for the HW PMC events. The unit mask allows a lot of very useful data collection on requests coming to the L2, but none of the bits are documented individually. Predefined masks are provided, as alluded to above, but one may want to collect more data than they cover. Bit 3 in the unit mask, 0b08: what does it do?

   I ask because I ran a test I built which just hits in the L2 with RD requests. There are 666 requests per 1000 instructions, and each RIP strides by 64B. I can provide the test to you if you want. There are 4 loads, 1 cmp and 1 jump in every loop iteration. I see those in the core via a different PMC event which counts LD/ST/etc. requests. There are 666 allocations into the L1D. Using unit mask 0x41 (bit 0 selects RD requests and bit 6 selects hits in the L2), I see that there are no "demand" requests from the core, i.e. requests not related to the L1 HW prefetcher. So all the lines are brought in by the L1D HW prefetcher. Unit mask 0x81 measures L1 HW prefetch requests to the L2 (no breakout into hit and miss is possible; I already tried that), and with it I get a count of 333 L1D HW prefetches. This confounds me, since only half of the allocations are accounted for by the L1D HW prefetcher. HOWEVER, if I use unit mask 0x88, I get the spurious extra requests to the L2 from the L1D HW prefetcher. Bit 4 doesn't allow a breakout of RD/RFO or IC activity; it's just inclusive of all L1D HW prefetches. So, can you tell me what this measures? Can L1D HW prefetches generate extra transactions, not tied to the LFB, which fill data speculatively into the L1D? Any assistance is greatly appreciated, and have a great day.

Perfwise

perfwise
Beginner

Any update Pat on the questions we had above?  Any help is greatly appreciated.. 

Perfwise

perfwise
Beginner

Just Pinging this thread.. I've not seen a response yet to the questions both Nils and I had.  Any help is very much appreciated where possible..

Perfwise

Patrick_F_Intel1
Employee

Hello Perfwise,

I'm not sure what you mean by the statement: "Bit 3, in the unit mask, 0b08, what does it do". The umasks are usually 1 byte.

Pat

Patrick_F_Intel1
Employee

Hello Nils,

Below is what the SDM should say. Let me know if this doesn't clear up your questions.

Pat

D0H 01H MEM_UOPS_RETIRED.LOADS Qualify retired memory uops that are loads. Combine with umask 10H, 20H, 40H, 80H. Supports PEBS and DataLA

D0H 02H MEM_UOPS_RETIRED.STORES Qualify retired memory uops that are stores. Combine with umask 10H, 40H, 80H. Supports PEBS and DataLA

D0H 10H MEM_UOPS_RETIRED.STLB_MISS Qualify retired memory uops with STLB miss. Must combine with umask 01H, 02H, to produce counts. Supports PEBS and DataLA

D0H 20H MEM_UOPS_RETIRED.LOCK Qualify retired memory uops with Lock. Must combine with umask 01H, to produce counts. Supports PEBS and DataLA

D0H 40H MEM_UOPS_RETIRED.SPLIT Qualify retired memory uops with line split. Must combine with umask 01H, 02H, to produce counts. Supports PEBS and DataLA

D0H 80H MEM_UOPS_RETIRED.ALL Qualify any retired memory uops. Must combine with umask 01H, 02H, to produce counts. Supports PEBS and DataLA
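
As a rough illustration of how the sub-event umasks above combine, here is a small Python sketch. The helper names (`combined_umask`, `perf_raw_event`) are made up for this example and are not Intel's; the `rUUEE` raw-event string assumes Linux perf is the tool being used to program the counters.

```python
# Umask bits for event D0H (MEM_UOPS_RETIRED), copied from the list above.
MEM_UOPS_RETIRED_EVENT = 0xD0
UMASK = {
    "LOADS": 0x01,
    "STORES": 0x02,
    "STLB_MISS": 0x10,
    "LOCK": 0x20,
    "SPLIT": 0x40,
    "ALL": 0x80,
}


def combined_umask(*names):
    """OR together the named umask bits (hypothetical helper)."""
    value = 0
    for name in names:
        value |= UMASK[name]
    return value


def perf_raw_event(event, umask):
    """Format an event/umask pair as a Linux perf raw event string (rUUEE)."""
    return f"r{umask:02x}{event:02x}"


# All retired load uops: LOADS qualified by ALL.
all_loads = combined_umask("LOADS", "ALL")
# Retired load uops with a lock: LOADS qualified by LOCK.
lock_loads = combined_umask("LOADS", "LOCK")

print(hex(all_loads), perf_raw_event(MEM_UOPS_RETIRED_EVENT, all_loads))
print(hex(lock_loads), perf_raw_event(MEM_UOPS_RETIRED_EVENT, lock_loads))
```

The qualifier bits (10H, 20H, 40H, 80H) produce no counts on their own, which is why each combination includes 01H or 02H.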

perfwise
Beginner

Pat,

    The unit mask is a byte, which is 8 bits; in hex that's 0xHH, where each H is 4 bits. I asked what bit 3 handles in the unit mask for PMC event 24H. So if you use unit mask 0x88, what is that measuring? It appears to be needed in order to get "all the L1 HW prefetch" activity on Haswell, and I know that setting bit 7, i.e. 0x80, selects the L1D HW prefetcher. I'm wondering what I get when I select bit 3 as well as bit 7: what is this measuring? I know from my directed test that it's something related to L1D HW prefetch activity, because I see no DEMAND REQ from the L1D (specified by a mask of 0x41) in a test which sequentially reads 128KB from the L2. If the allocations into the L1D are not demand oriented, then the Line Fill Buffers must be allocated by the L1D HW prefetcher. With unit mask 0x81, which measures all L1D HW prefetches for RD requests, I can only account for half of the L1D allocations. If I set 0x88, then I get all the L1D HW prefetch activity and can account for all the L1D allocations, in a very prefetchable, directed test I use to measure whether the L1D HW prefetcher is working properly.

Perfwise

Patrick_F_Intel1
Employee

Hello Perfwise,

Looking at the Haswell L2_RQSTS events (event select 0x24), copied from the SDM:

Event  Umask  Name                            Description
24H    21H    L2_RQSTS.DEMAND_DATA_RD_MISS    Demand Data Read requests that missed L2, no rejects.
24H    41H    L2_RQSTS.DEMAND_DATA_RD_HIT     Demand Data Read requests that hit L2 cache.
24H    E1H    L2_RQSTS.ALL_DEMAND_DATA_RD     Counts any demand and L1 HW prefetch data load requests to L2.
24H    42H    L2_RQSTS.RFO_HIT                Counts the number of store RFO requests that hit the L2 cache.
24H    22H    L2_RQSTS.RFO_MISS               Counts the number of store RFO requests that miss the L2 cache.
24H    E2H    L2_RQSTS.ALL_RFO                Counts all L2 store RFO requests.
24H    44H    L2_RQSTS.CODE_RD_HIT            Number of instruction fetches that hit the L2 cache.
24H    24H    L2_RQSTS.CODE_RD_MISS           Number of instruction fetches that missed the L2 cache.
24H    27H    L2_RQSTS.ALL_DEMAND_MISS        Demand requests that miss L2 cache.
24H    E7H    L2_RQSTS.ALL_DEMAND_REFERENCES  Demand requests to L2 cache.
24H    E4H    L2_RQSTS.ALL_CODE_RD            Counts all L2 code requests.
24H    50H    L2_RQSTS.L2_PF_HIT              Counts all L2 HW prefetcher requests that hit L2.
24H    30H    L2_RQSTS.L2_PF_MISS             Counts all L2 HW prefetcher requests that missed L2.
24H    F8H    L2_RQSTS.ALL_PF                 Counts all L2 HW prefetcher requests.
24H    3FH    L2_RQSTS.MISS                   All requests that missed L2.
24H    FFH    L2_RQSTS.REFERENCES             All requests to L2 cache.

So we can see that umask 0x01 is demand loads, bit 0x02 is RFO, 0x04 is instruction fetch, 0x08 is a L2 hw prefetch request.
For the upper 4 bits, we can see that 0x20 is a miss, 0x40 is hit and that 0x80 another kind of hit. I wonder why one would put 0x80 on data hits and not on instruction hits... hmmm what kind of mesi hit state is not likely to occur for instructions but very likely to occur for data... hmmm.
That leaves 1 bit unaccounted for: 0x10. Clearly it is a prefetch request from the above table. So we have bits 0x18 having to do with prefetch requests...
Now... I can try and go get approval to tell you exactly what each of these bits mean but that will take a while. Have you tried just enabling/disabling the prefetchers and seeing which count changes when you use 0x10 vs 0x80 ?
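
The kind of deduction described above can be sketched mechanically. The following Python snippet is only an illustration: the umask values are copied from the SDM table quoted in this post, while the helper `common_bits` and the variable names are invented for the example. It says nothing about what the hardware actually counts; it only shows which bits the documented masks share.

```python
# Documented Haswell L2_RQSTS (event 24H) umasks, copied from the table above.
L2_RQSTS = {
    "DEMAND_DATA_RD_MISS": 0x21,
    "DEMAND_DATA_RD_HIT": 0x41,
    "ALL_DEMAND_DATA_RD": 0xE1,
    "RFO_HIT": 0x42,
    "RFO_MISS": 0x22,
    "ALL_RFO": 0xE2,
    "CODE_RD_HIT": 0x44,
    "CODE_RD_MISS": 0x24,
    "ALL_DEMAND_MISS": 0x27,
    "ALL_DEMAND_REFERENCES": 0xE7,
    "ALL_CODE_RD": 0xE4,
    "L2_PF_HIT": 0x50,
    "L2_PF_MISS": 0x30,
    "ALL_PF": 0xF8,
    "MISS": 0x3F,
    "REFERENCES": 0xFF,
}


def common_bits(names):
    """Return the bits that are set in every one of the named umasks."""
    result = 0xFF
    for name in names:
        result &= L2_RQSTS[name]
    return result


# Bit shared by all "..._MISS" masks and by all "..._HIT" masks.
miss_bit = common_bits([n for n in L2_RQSTS if n.endswith("_MISS")])
hit_bit = common_bits([n for n in L2_RQSTS if n.endswith("_HIT")])

# Bits in the "ALL_*" masks that the matching hit/miss pair does not explain.
extra_in_all_rd = L2_RQSTS["ALL_DEMAND_DATA_RD"] & ~(
    L2_RQSTS["DEMAND_DATA_RD_HIT"] | L2_RQSTS["DEMAND_DATA_RD_MISS"]
)
extra_in_all_pf = L2_RQSTS["ALL_PF"] & ~(
    L2_RQSTS["L2_PF_HIT"] | L2_RQSTS["L2_PF_MISS"]
)

print(f"miss bit: {miss_bit:#04x}, hit bit: {hit_bit:#04x}")
print(f"unexplained in ALL_DEMAND_DATA_RD: {extra_in_all_rd:#04x}")
print(f"unexplained in ALL_PF: {extra_in_all_pf:#04x}")
```

The bits left over in `ALL_PF` are exactly the pair that the 0x88 question is about, so experiments that toggle those two bits separately are a natural next step.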

I'm not trying to be cute but I think all the info can be deduced from the table and a few experiments. Hopefully this is enough info to get you going. Sorry for not replying sooner,

Pat

perfwise
Beginner

Pat,

    Any help is greatly appreciated.  So I'm happy to have any help and you're not being "cute".  Rest assured. BTW.. my motherboard doesn't provide control over the hw prefetchers in the bios.  So I have no way of trying an experiment to discern any behavior when I set these bits in the unit mask via turning them off in the bios.

    One point, on your post you say 0x80 isn't used for INSTR requests but it is.  The unit mask 0xE4, which is mentioned above, sets bit 0x80 for instruction requests.  Looking at:

0xE1 and subtracting out the bits set by 0x21 and 0x41 above, the remaining bit set is 0x80, and the definition of that appears to be L1 HW prefetch requests.

This appears to be used for RD, RFO and IC requests, to capture "all" requests coming to the L2.  So.. seems to imply that the bit 0x80 selects the L1 HW pref requests for DATA and INSTR, correct?

As for 0x10, that seems to obviously select the L2 HW prefetcher.  Bit selected by 0x08 seems to be required to select "all" requests that missed the L2, and it seems to have to be selected with either bit 0x80 or 0x10 to have an effect.  

I'll continue my studies.. but look at my comments.  I couldn't discern from your reply any "M" state you were alluding to in bit 0x80, which is what I think you were referring to along the lines of self modifying code.

Perfwise

perfwise
Beginner

Pat,

    I have been looking more closely at event 0x24 (L2 RQSTS), 0xD0 (which measures loads/stores/etc. retired) and 0xD1 (which measures retired load hits/misses in the cache hierarchy). A couple of observations, and I don't know how well this has been looked at internally there. In a simple test which 1) strides by 64B and 2) accesses 128KB, the number of LDs per 1000 instructions observed from the L1D in event 0x24 using unit mask 0xE1 does not match that observed using event 0xD1. Also, events 0xD0 and 0xD1 do not agree on the number of loads. Suppose for the sake of illustration there are 666 LDs pti (per 1000 instructions; 4 of every 6 instructions in the loop are loads); then I'm observing 160 hits in the L1D and 506 hits in the L2 using event 0xD1. Event 0xD0 doesn't agree: it undercounts the number of LDs pti as ~560 to 600, depending upon the run of the test. Event 0x24 with unit mask 0xE1 doesn't report 666 either, but rather 450 pti. Now, the lines which remained in the L1D and are hit there (the 160 I mentioned above) were prefetched to get there, but they are not included in event 0x24 with unit mask 0xE1. So it seems there's some other mechanism which needs to be counted. Surprisingly, if I use event 0x24 with unit mask 0x68, I get a count for "something". It's not 160 but approximately 130, and I don't know what it's measuring. Are these downward fills into the L1D from prefetch requests I can't measure? I don't know. But on my SB and IB parts, 0xD0 and 0x24 were in agreement on the "demand" requests from the L1D in event 0x24 (which included HW prefetch requests and true demand requests from uops executed). I now know they don't agree on HW, and that's somewhat troubling because I can't really get a feel for my L2 traffic on HW. I just wanted to let you know that, and this weekend I'll give you a test to illustrate this.
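
For readers following the arithmetic, the "pti" normalization used above can be sketched as follows. This is only an illustration in Python: the function name `per_thousand_instructions` is made up, and the PTI figures are the ones quoted in this post rather than new measurements.

```python
def per_thousand_instructions(event_count, instruction_count):
    """Normalize a raw event count to events per 1000 retired instructions (PTI)."""
    if instruction_count <= 0:
        raise ValueError("instruction_count must be positive")
    return 1000.0 * event_count / instruction_count


# One loop iteration of the test described above: 4 loads, 1 cmp, 1 jump.
loads_per_iter = 4
instructions_per_iter = 6
expected_loads_pti = per_thousand_instructions(loads_per_iter, instructions_per_iter)

# Figures quoted in the post, already expressed per 1000 instructions.
l1d_hits_pti = 160      # event 0xD1, loads that hit the L1D
l2_hits_pti = 506       # event 0xD1, loads that hit the L2
l2_rqsts_e1_pti = 450   # event 0x24, unit mask 0xE1

print(f"expected loads PTI: {expected_loads_pti:.1f}")
print(f"0xD1 L1D hits + L2 hits: {l1d_hits_pti + l2_hits_pti}")
print(f"gap between expected loads and 0x24/0xE1: {expected_loads_pti - l2_rqsts_e1_pti:.1f}")
```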

One other question.. I've been unable to measure any L1D hw pref traffic so far, though from event 0xD1 I know lines are being fetched from the L2 to the L1D in my test.  Is the bit set by unit mask 0x80 in event 0x24 supposed to measure L1D HW pref?  Is it possible that using unit mask 0x61 I'm measuring, as it was on SB and IB, "true demand" and "hw pref" requests from the L1D? 

Thanks for any help.. perfwise

nsmeds
Novice

Thanks Patrick,

I think this is useful information to me. If I get confused further I will come back to the forum. But I think I have a lot of good stuff  to digest now.

/Nils

McCalpinJohn
Honored Contributor III

I don't know if these notes will be helpful, but I recently did a study of the behavior of the L2 cache access counters on the Sandy Bridge generation of products.  I was able to disable the L1 and L2 hardware prefetchers independently for these tests, and found no evidence that any of the counters measure the L1 prefetch accesses to the L2 cache (but see the detailed notes for caveats).   An important conclusion from the study was that the count of L2 prefetch accesses to the L2 is not a reliable indicator of L2 prefetch data movement into the L2 -- perhaps because of overcounting or perhaps L2 prefetches sometimes put the data only in the L3 and not in the L2.

----------------------------------------
Notes from 2013-08-09 in response to a query about measuring L2 miss rate:
----------------------------------------
Event 0x24 is one set of events that measures L2 activity.  
         Umask 0x01 returns "demand read requests that hit L2 cache"
         Umask 0x03 returns "demand read requests to the L2 cache"   (the comment about counting L1 prefetch load requests is wrong — they are not counted by this event)
Since 0x03 is the combination of 0x01 and 0x02, this suggests that a Umask of 0x02 would count "demand read requests that miss L2".  
I don't have an Ivy Bridge Xeon to test this on, but on my Sandy Bridge Xeon E3-1270, I find that the sum of events 0x24/0x01 and 0x24/0x02 matches 0x24/0x03 exactly…

Of course, these are just the L2 misses associated with demand misses from the L1.  You will probably also want to include L1 store misses.  In this case, Intel documents hits, misses, and total:
Event 0x24, Umask 0x04 counts L1 store misses that hit in the L2
Event 0x24, Umask 0x08 counts L1 store misses that miss in the L2
Event 0x24, Umask 0x0C counts all L1 store misses that reference the L2

If you care about L1 instruction cache misses that hit or miss in the L2, Event 0x24 Umask 0x10/0x20/0x30 provide hits, misses, and total accesses.

The final category in Event 0x24 is related to L2 prefetch events, with Umasks 0x40/0x80/0xC0 providing hits, misses, and total accesses.

For all of these categories, my Xeon E3-1270 (Sandy Bridge core) shows reasonable results, with the hits + misses exactly matching the combined event counts.
For the prefetch category, I disabled the L2 prefetchers and got counts of zero for those Umasks (independent of whether the L1 hardware prefetchers were enabled).

So Event 0x24 has masks to count all L2 *load* accesses *except* for L1 hardware prefetches.

As a final check, I combined the masks for all four categories of hits, the masks for all four categories of misses, and the masks for all four categories of total accesses, and the results matched sums of the results of the separate tests to within 1%.
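
The hit + miss = total consistency check described above can be sketched like this. It is a minimal Python illustration, not the script used for these notes: the umask table is taken from the text, `check_category` is a hypothetical helper, and the counts in `example_counts` are made-up placeholders, NOT measurements.

```python
# Sandy Bridge event 0x24 umasks as described in the notes above:
# (hit, miss, total) for each request category.
SNB_L2_RQSTS = {
    "demand_read": (0x01, 0x02, 0x03),
    "store_rfo": (0x04, 0x08, 0x0C),
    "code_read": (0x10, 0x20, 0x30),
    "l2_prefetch": (0x40, 0x80, 0xC0),
}


def check_category(counts, category, tolerance=0.01):
    """Return True if hits + misses agree with the combined count within tolerance."""
    hit, miss, total = SNB_L2_RQSTS[category]
    # The combined umask is simply the OR of the hit and miss umasks.
    assert hit | miss == total
    expected = counts[hit] + counts[miss]
    measured = counts[total]
    if measured == 0:
        return expected == 0
    return abs(expected - measured) / measured <= tolerance


# Placeholder counts keyed by umask. These are invented numbers for the example.
example_counts = {
    0x01: 900, 0x02: 100, 0x03: 1000,
    0x04: 40, 0x08: 10, 0x0C: 50,
    0x10: 5, 0x20: 1, 0x30: 6,
    0x40: 0, 0x80: 0, 0xC0: 0,  # e.g. with the L2 prefetchers disabled
}

for category in SNB_L2_RQSTS:
    print(category, check_category(example_counts, category))
```

In practice each umask would be read in a separate run or counter, so a small tolerance is more realistic than demanding an exact match.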
----------------------------------------
Event 0x28 measures hits and misses for L1 writebacks to the L2.
    Umask 0x01 counts L1 Data Cache Writeback requests that miss the L2.   Section 19.3 says "LLC" instead of "L2", but I think this is incorrect.  See the descriptions of the same event in sections 19.4 and 19.5.
         The counts for this event should usually be small.  The only time this event occurs is when a line is chosen as victim in the L2 before it is chosen as a victim from the L1.
         The Sandy Bridge and Ivy Bridge processors do not force the L2 to be inclusive, so this can happen, though usually data is evicted from the L1 before it is evicted from the L2.
         In my sample problem on my Xeon E3-1270, this accounted for just under 2.5% of the L1 writeback requests to L2.
   Umask 0x04 counts L1 Data Cache Writeback requests that hit in the L2 cache, but where the copy of the line in the L2 cache is in the "E" (Exclusive) state.
         I think that this occurs when a process reads a cache line that is not currently shared by another process (to get it in E state), then later writes to that line (to make it dirty in L1).
   Umask 0x08 counts L1 Data Cache Writeback requests that hit in the L2 cache, where the copy of the line in the L2 cache is in the "M" (Modified) state.
         I think that this occurs when the process brings the line into the L2 cache with a store instruction, rather than reading it then later writing to it.
   Umask 0x0F counts all L1 Data Cache Writeback requests to the L2.
         The astute reader will note that this includes Umask 0x02, but at least on my system Umask 0x02 does not count anything, so Umask 0x0D returns the same value as Umask 0x0F.

It is nice that this event is annotated with "not rejected", meaning that the counts are not contaminated by L1 Data Cache Writeback attempts that are rejected by the L2 cache and later retried.
The values I got for total L1 Data Cache Writebacks were similar to the values I got for L1 Data Cache Store Misses ("Store RFO requests") , which is reasonable, and they matched exactly when the hardware prefetchers were disabled.  I think this is reasonable too, but I would have to look at the access patterns in more detail to say more.
-------------------------------------------
Event 0xF0 measures accesses to the L2.
    My interpretation is that events 0x24 and 0x28 measure access from the L1 side, while Event 0xF0 measures the transactions from the L2 side.
    Umask 0x01 counts demand read requests from the L1 Data cache that access the L2.  
        For my test code, this value is almost 2% higher than the corresponding Event 0x24/Umask 0x03, unless hardware prefetch is disabled, in which case the numbers match to 6 digits.
    Umask 0x02 counts L1 Data Cache Store misses (RFO's) that access the L2.
        For my test code, this value is almost 9% higher than the corresponding Event 0x24/Umask 0x0C, unless hardware prefetch is disabled, in which case the numbers match to 7 digits.
    Umask 0x10 counts L1 Data Cache Writebacks to L2.  For my test code this matches Event 0x28/Umask 0x0D to 8 digits, with or without prefetch enabled.
-------------------------------------------
Event 0xF1 measures fills to the L2.  Since these arise from L2 misses, it should be possible to combine events 0xF1 and 0xF0 to get information analogous to the 0x24 and 0x28 events.
      If only it were so easy.
      With L2 hardware prefetch disabled, the sum of the L2 fills (Event 0xF1, Umasks 0x01, 0x02, 0x04) matches the sum of L2 misses (Event 0x24, Umasks 0x02, 0x08, 0x20).
      But with the L2 hardware prefetchers enabled, the numbers don't match any more.  
      In particular, the sum of L2 misses (Event 0x24, Umasks 0x02, 0x08, 0x20, and 0x80) greatly exceeds the number of L2 fills (Event 0xF1, Umasks 0x01, 0x02, 0x04).
                For my test code the ratio was about 1.7:1
      Why?  One possible reason:
              According to the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966):
     "The streamer and spatial prefetcher prefetch the data to the last level cache. Typically data is brought also to the L2 unless the L2 cache is heavily loaded with missing demand requests."

It would take additional analysis with the Uncore LLC performance counters to see whether the number of L2 prefetches that miss in the L2 can be matched to the counters out there.
      But this discrepancy does make it clear that the count of L2 prefetches missing the L2 (Event 0x24, Umask 0x80) cannot be taken as an indication of actual traffic into the L2.
------------------------------------
Event 0xB0 measures "Offcore requests".  Since "Offcore" (or "uncore") means anything beyond the L2 cache, these should arise from L2 misses and may be useful.
     Umask 0x01 measures demand read requests sent to the uncore.  It should match demand read requests that miss in the L2 (0x24/0x02), and it does match exactly in my tests.
     Umask 0x02 measures code requests sent to the uncore. It should match code requests that miss in the L2 (0x24/0x20) , and it does match exactly in my tests.
     Umask 0x04 measures offcore store miss requests (plus locks and a few other transactions).  
            Since locks should be relatively rare, this event should be very close to L1 data cache store misses that miss in the L2 (0x24/0x08), and I get an exact match in my tests.
     Umask 0x08 measures all data read requests sent to the uncore, both demand and prefetch.  It is not clear if this is intended to include RFOs.
            Running with the L2 prefetchers disabled, the counts from Umask 0x01 and 0x08 match, indicating that the combined event does not include RFOs.
            I was surprised to see that the numbers were the same whether I had the L1 prefetchers enabled or disabled.  
        Re-reading the SW Optimization Guide discussion on the prefetchers, I now see that the L1 prefetchers only bring data into the L1 — they do not put the data in the L2 or L3.
                So it is no surprise that the presence or absence of L1 prefetches does not change the number of L2 fills.  
                L1 prefetches *do* change the performance, however, with the following timings:
                   2.576 seconds with all prefetchers enabled
                   2.956 seconds with only the L2 prefetchers enabled
                   4.305 seconds with only the L1 prefetchers enabled
                 10.58 seconds with no prefetchers enabled
    Umask 0x80 counts all L2 pipeline transactions.  Presumably this includes L1 prefetches, but it is hard to tell, since the number of transactions is going to be the same whether I have the L1 prefetchers enabled or disabled. (This is a unit stride code, so effectively all L1 prefetches are going to be useful, replacing demand loads on a 1:1 basis.)
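
To put the four timings listed above in relative terms, a quick Python calculation is enough. The configuration labels are shortened by me; the times in seconds are the ones reported in these notes.

```python
# Run times in seconds for the test code, from the list above.
timings = {
    "all prefetchers": 2.576,
    "L2 prefetchers only": 2.956,
    "L1 prefetchers only": 4.305,
    "no prefetchers": 10.58,
}

baseline = timings["no prefetchers"]

# Speedup of each configuration relative to running with no prefetchers.
speedups = {config: baseline / seconds for config, seconds in timings.items()}

# Extra benefit of enabling the L1 prefetchers on top of the L2 prefetchers.
l1_on_top_of_l2 = timings["L2 prefetchers only"] / timings["all prefetchers"]

for config, speedup in speedups.items():
    print(f"{config}: {speedup:.2f}x faster than no prefetchers")
print(f"L1 prefetchers on top of L2 prefetchers: {l1_on_top_of_l2:.2f}x")
```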
--------------------------------------
Summary:
      (1) It is easy to count some classes of accesses that miss in the L2:  
              L1 data cache demand read misses,
              L1 data cache demand store misses,
              L1 instruction cache misses, and
              L1 data cache writebacks.
      (2) L2 prefetch misses in the L2 are not a direct indicator of traffic into the L2.
      (3) It is not clear whether Event 0xB0 allows direct or indirect counting of L1 hardware prefetch accesses or misses at the L2 cache.
------------------------------------

Travis_D_
New Contributor II

I'm not sure if it comes too late for "perfwise" but I recently looked at the event=0x24 L2_RQSTS counter on Skylake client, and was able to fully decode it mostly via exhaustive testing. There are errors and omissions in the SDM, and also an inconsistency between the manual and the event files that Intel makes available to download via 01.org.

So for this event the umask is logically divided into a 3-bit field and a 5-bit field:

XYZa bcde

Where X is the MSB and a the LSB. The XYZ part defines what result type for an access you are interested in: X is "hit in the M state", Y is "hit in the E state" and Z is "miss". Any combination of XYZ can be set at once, so setting all of them (events with 0xE or 0xF as their high nibble) gets all result types.

The 'abcde' field filters on request type or "origin". e is demand read requests. d is RFO requests - definitely RFOs originating say from a demand store without a prior, and possibly also some types of prefetching. c is instruction requests, i.e., requests originating from L1I misses. b is prefetch requests originating from the L1D. This includes both L1 HW prefetches as well as any software prefetching, and maybe the next page prefetcher (NPP). Finally, a is L2-initiated data prefetches, i.e., accesses related to the activity of the L2 prefetchers. As with the "result" field, you can combine the bits in any way you please.

So there are really 256 possible umask values, although some don't make much sense: e.g., you will probably always get 0 for instruction "M" state hits, umask=0x84 - but most of the other events all make sense.
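
The layout described above can be turned into a small decoder. This is a sketch in Python based only on the bit assignments given in this post (Skylake client, event 0x24); the function name and the short labels are my own choices, not names from the SDM.

```python
# "XYZ" result-type bits (high 3 bits of the umask).
RESULT_BITS = {
    0x80: "hit_M",   # X: hit in the M state
    0x40: "hit_E",   # Y: hit in the E state
    0x20: "miss",    # Z: miss
}

# "abcde" request-type bits (low 5 bits of the umask).
REQUEST_BITS = {
    0x10: "l2_prefetch",   # a: L2-initiated data prefetches
    0x08: "l1d_prefetch",  # b: prefetches originating from the L1D (HW and SW)
    0x04: "code_read",     # c: instruction requests from L1I misses
    0x02: "rfo",           # d: RFO requests
    0x01: "demand_read",   # e: demand read requests
}


def decode_l2_rqsts_umask(umask):
    """Split an L2_RQSTS umask into (result types, request types)."""
    if not 0 <= umask <= 0xFF:
        raise ValueError("umask must fit in one byte")
    results = [name for bit, name in RESULT_BITS.items() if umask & bit]
    requests = [name for bit, name in REQUEST_BITS.items() if umask & bit]
    return results, requests


for umask in (0x84, 0xE1, 0x38):
    results, requests = decode_l2_rqsts_umask(umask)
    print(f"{umask:#04x}: results={results} requests={requests}")
```

For example, 0x84 decodes to M-state hits for instruction requests, which is the combination mentioned above as one that will probably always count zero.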

 

 

 
