Trying to make sense of L1D (0x51), L2_RQSTS (0x24) and OFFCORE_REQUESTS (0xB0)

perfwise · 12-06-2011

First, I can generally make excellent sense of the evictions/allocations on L1D and correlate them with the requests counted by L2_RQSTS. Likewise, misses from L2_RQSTS can be correlated with both LONGEST_LAT_CACHE (which doesn't include any L2 HW PF) and OFFCORE_REQUESTS.

However, for long vector operations whose data is larger than the L3, and which thus miss all levels of the cache, I observe discrepancies between the L1D allocations and L2_RQSTS:Demand_RD, and between the L2 HW PF misses reported by L2_RQSTS and the L2 HW PF requests arriving at OFFCORE_REQUESTS.

As an illustration, take a sequential read loop that reads 64B per iteration, issues 1 SW prefetch (512B ahead), and then loops, copying 16MB of data. Via the PMCs on Sandy Bridge I observe that:

There are 124 L1D Alloc/Evict

The L2_RQSTS:DEMAND_REQ = 82 (not 124!)

The L2_RQSTS:ALL_PF = 156 (23 hit and 133 miss)

The OFFCORE_REQUESTS: (ALL_DATA_RD - DEMAND_DATA_RD - DEMAND_CODE_RD) = 47 (this count includes L2 HW PF but not LLC HW PF)
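
For readers who want to reproduce the setup, the read loop described above might look roughly like the C sketch below. This is not the original test code, which was not posted. The function name `stream_read`, the use of the GCC/Clang `__builtin_prefetch` builtin, the locality hint, and the choice to sum the data are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES  64   /* bytes read per iteration (one cache line) */
#define PF_DISTANCE 512  /* software prefetch distance quoted in the post */

/* Hypothetical helper: sums every 8-byte word of buf so the loads cannot be
 * optimized away. bytes must be a multiple of LINE_BYTES. */
uint64_t stream_read(const unsigned char *buf, size_t bytes)
{
    uint64_t sum = 0;

    assert(bytes % LINE_BYTES == 0);
    for (size_t off = 0; off < bytes; off += LINE_BYTES) {
        /* One software prefetch per iteration, 512B ahead. Locality 3 is an
         * assumption; locality 0 would request a non-temporal prefetch. */
        if (off + PF_DISTANCE < bytes)
            __builtin_prefetch(buf + off + PF_DISTANCE, 0, 3);

        /* Read the whole 64B line as eight 8-byte words. */
        for (size_t i = 0; i < LINE_BYTES; i += sizeof(uint64_t)) {
            uint64_t word;
            memcpy(&word, buf + off + i, sizeof word);
            sum += word;
        }
    }
    return sum;
}
```

Calling `stream_read` on a 16MB buffer (16 * 1024 * 1024 bytes) while the counters are running would roughly correspond to the test described. Which prefetch instruction the compiler emits depends on the locality argument and the target, so the generated assembly is worth checking before comparing counts.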

Questions:

The allocations/evictions associated with L1D do not match the DC requests/RFO requests to L2_RQSTS. I'm using a simple loop that either loads or stores, with software prefetch.
- Can someone please explain this observation? The description of L2_RQSTS states that the count is inclusive of L1 HW prefetches as well as demand requests. Is the L1D able to read or write directly from the L3 when it detects that a streaming operation is being performed?
- Please provide some explanation.
The L2 HW PF misses in the L2 reported by L2_RQSTS:PF_MISS are much higher than what can be derived above from OFFCORE_REQUESTS. (L2_RQSTS reports 133 misses, but OFFCORE_REQUESTS shows only 47 getting to the L3.)
- Why is this? I only observe this when memory-intensive tests are run. Can someone please provide some explanation?
OFFCORE_REQUESTS is very close to the UNC_CBO_CACHE_LOOKUP values, which I'm also measuring. I'm measuring across all CBOs (0-3), and the numbers by MESI state make sense: I requests are misses, while M, E, and S are hits. By comparing the OFFCORE_REQUESTS counts for demand reads and L2 PF with the hits and misses on UNC_CBO_CACHE_LOOKUP, I believe I'm observing that the L2 PF requests (the 47 reported by OFFCORE_REQUESTS) are missing the L3, and that approximately 82 demand requests are hitting in the L3, likely on the data brought in by the L2 HW PF.
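
To make the two gaps explicit, the arithmetic behind the questions can be written out as a small C sketch. The function names are made up for illustration, and the inputs are simply the counts quoted earlier in the post, in whatever unit the measurement tool reported them.

```c
#include <stdint.h>

/* L2 HW prefetch reads seen off-core, derived as in the post:
 * ALL_DATA_RD - DEMAND_DATA_RD - DEMAND_CODE_RD. */
int64_t offcore_l2_pf(int64_t all_data_rd, int64_t demand_data_rd,
                      int64_t demand_code_rd)
{
    return all_data_rd - demand_data_rd - demand_code_rd;
}

/* Requests counted at one level that are not accounted for at the next. */
int64_t counter_gap(int64_t upstream, int64_t downstream)
{
    return upstream - downstream;
}
```

With the numbers above, `counter_gap(124, 82)` gives 42 L1D allocations with no matching L2 demand request, and `counter_gap(133, 47)` gives 86 L2 prefetch misses that never show up in OFFCORE_REQUESTS.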

If someone can explain the behavior and the discrepancies between the L1D request counts and the L2_RQSTS/OFFCORE_REQUESTS HW PF counts, it would be very helpful.

Thanks in advance

perfwise

Patrick_F_Intel1 · 12-09-2011

Hello perfwise,
I'm researching this. It might take a while to dig through it.
I have a few questions.
Are you using sw prefetch instruction prefetchnta?
On SNB, prefetchnta can bring lines directly from L3 to L1 (avoiding polluting L2, like the opt. guide talks about on Pentium M).
Do you have the hw prefetchers enabled in the bios?
I assume so.
Pat

Patrick_F_Intel1 · 12-20-2011

Hello perfwise,
To answer these questions I need to know more completely what you are actually doing.
If you have a compilable code example, can you post it?
If you are using prefetchnta then data fetches can bypass L2. This might explain some of the differences you are seeing but it is hard to say.
Also, do you have prefetchers enabled or disabled in the bios?
Counting hits and misses is complicated because the counts can vary with timing: has a hw/sw prefetch had enough time to bring the data in so that it shows up as a hit, is it still in flight (a miss), or does it hit a buffer, etc.
Note that, while you may be aware of all this, sometimes I add more background info for other folks.
Thanks
Pat
