Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Cache associativity - Observations from a random pointer-chase test

perfwise
Beginner
Hi,
I have been looking at the cache latencies of my Intel SB part. I built a pointer chase that takes a given size of memory and breaks it into granular chunks, using the first array entry of each granular element to point to the next.
For instance, with a span of 128KB and a granularity of 32KB, I create a walk of:
a[0] -> a[32768 * 2]
a[32768 * 2] -> a[32768 * 1]
a[32768 * 1] -> a[32768 * 3]
a[32768 * 3] -> a[0]
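A minimal sketch of how such a chase might be built (the helper name and layout are my own illustration, not the original code):

#include <stddef.h>

/* Link nsteps chunks of `gran` bytes each into a cycle: the first 8 bytes
 * of each chunk hold the address of the next chunk to visit. `order` is
 * the visiting permutation, e.g. {0, 2, 1, 3} for the 128KB/32KB walk above. */
static void build_chase(char *buf, size_t gran, const size_t *order, size_t nsteps)
{
    for (size_t i = 0; i < nsteps; i++)
        *(void **)(buf + order[i] * gran) = buf + order[(i + 1) % nsteps] * gran;
}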
The L1 cache is 32KB with an associativity of 8, so one might expect that the index to a specific way is set via bits 14:12. I do a walk with granular steps of 4KB spanning (8 + n) * 4KB for various n, and I observe:
* for n=0, I get 4 cycles of latency, the expected L1 latency
* for n=1, I am stressing the associativity of the L1, but I don't get the L2 latency; I'm in between, say 5-7 cycles sometimes and other times very close to the L2 latency
* for n=2, I'm getting the L2 latency, which makes sense
(question: why for n=1 am I sometimes getting into a mode where I see numbers close to the L1 latency, when for the given cache size and associativity I should be in the L2? Is there some buffering for victims to the L2 that my test is hitting in some state with this trivial test?)
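For reference on why the 4KB steps collide: with 64-byte lines, a 32KB 8-way cache has 32768 / (8 * 64) = 64 sets, indexed by address bits 11:6, so every 4KB-aligned step maps to the same set and at most 8 of them can coexist. A quick check (my own illustration, under those standard assumptions):

#include <stdio.h>

int main(void)
{
    const unsigned long sets = 32768 / (8 * 64);   /* 64 sets in a 32KB, 8-way, 64B-line cache */
    for (int n = 0; n < 4; n++) {
        unsigned long addr = n * 4096UL;           /* 4KB-granular steps */
        printf("step %d -> set %lu\n", n, (addr >> 6) % sets);
    }
    return 0;                                      /* every step prints set 0 */
}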
Then I thought I'd measure the transition between the L2 and L3. The L2 is reported as 256KB and non-inclusive of the L1. The associativity is 8, so one might expect the index to the way of the L2 to be determined via bits 17:15. I do a walk with granular steps of 32KB spanning (8 + n) * 32KB for various n, and I observe:
* for n=0, I get the L1 latency, which is expected: the L1 is 8-way associative, and we found earlier that this test maps each of the 8 steps of 32KB to a different way; since there are 8 steps, we fit within the associativity of the L1.
* for n=1, I get 5.2 cycles, which isn't the L2 latency, why?
* for n=2 and greater, I observe the L3 latency, and not the L2 latency, why!?
(question: It appears that there's some interesting behavior going on between the L1 and L2.
* is the L1 a write-back cache to the L2? Apparently not; the behavior looks a lot like the L2 doesn't exist.
* if I do a walk with steps of 4096, once I reach 64 of them (which matches the associativity of the L2, but not that of the L1, which would take 72 such steps) I observe the latency increase from 12 cycles (the L2) at 64 steps to 14 at 65 steps, 17 at 66 steps, and so on. I'm observing behavior as if either the L1 doesn't exist or a way of the L2 isn't there.
)
This test uses large pages (2MB), so TLB misses are not occurring, and the start of the array is aligned to the start of the 2MB page. Lastly, the pointer chase is 8x unrolled. See below for an illustration of what the code looks like; maybe the HW prefetcher is catching, for walks with only a few steps (say 8), that the address loaded via a given RIP is the same each iteration. For a walk with 16 steps, every 2 loop iterations you would get the same address touched via that RIP:
1002420: 48 8b 3f mov (%rdi),%rdi
1002423: 48 8b 3f mov (%rdi),%rdi
1002426: 48 8b 3f mov (%rdi),%rdi
1002429: 48 8b 3f mov (%rdi),%rdi
100242c: 48 8b 3f mov (%rdi),%rdi
100242f: 48 8b 3f mov (%rdi),%rdi
1002432: 48 8b 3f mov (%rdi),%rdi
1002435: 48 8b 3f mov (%rdi),%rdi
1002438: 49 ff cd dec %r13
100243b: 75 e3 jne 1002420
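In C, the kernel above is just a chain of dependent loads; a sketch of the measurement (without the 8x unroll, and with a __rdtsc timing wrapper that is my addition, not the original harness):

#include <x86intrin.h>   /* __rdtsc */
#include <stdint.h>

/* Chase `iters` pointers and return the average cycles per load.
 * Each load depends on the previous one, exactly like the
 * mov (%rdi),%rdi chain in the disassembly above. */
static double chase_latency(void **start, uint64_t iters)
{
    void **p = start;
    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        p = (void **)*p;
    uint64_t t1 = __rdtsc();
    __asm__ volatile("" :: "r"(p));   /* keep the chase from being optimized away */
    return (double)(t1 - t0) / (double)iters;
}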
I can play with removing the unrolling in this test, but maybe the HW prefetcher is getting in the way. Still, for a walk spanning 64 * 4KB you fit in the L2; for 65 * 4KB you don't. Yet I'm observing that we actually go to the L3. Is the hardware prefetcher speculating and evicting data from the L1 that I'm not requesting?
Any help in understanding this strange behavior is very much appreciated. As stated before, the steps are randomized, but there is a periodicity that the HW prefetcher might catch, depending upon its use of the RIP and the address history for that RIP. A sketch of the setup follows.
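The 2MB backing and randomized step order might be set up along these lines (MAP_HUGETLB is Linux-specific, and the plain Fisher-Yates shuffle is my assumption about how the randomization was done):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <stddef.h>

/* Allocate `len` bytes backed by 2MB pages (requires hugepages to be
 * reserved on the system); the mapping starts 2MB-aligned. */
static void *alloc_huge(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Fisher-Yates shuffle of the visiting order, to defeat the HW prefetcher. */
static void shuffle(size_t *order, size_t n)
{
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
}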
Perfwise
perfwise
Beginner
An update on my progress: I'm able to reconcile my observations if the L2 is inclusive of the L1D. I've measured the pointer chase both after flushing it to memory and without, making those lines Modified or Exclusively owned by the core in question.
My observation is that when you do a pointer chase through 64 steps, where the steps reside in the first 8B of each 4KB page, I observe a latency of about 12 cycles. This is expected for an L2 of 256KB with 8-way associativity.
However, if you now sweep through 65 distinct 4K blocks (chosen randomly to defeat the HW prefetcher), I observe the latency increase to 14 cycles per step, which is consistent with 9 accesses to the L3 and 56 to the L2.
If you keep increasing the number of random steps by one 4K block at a time, you see that you're unable to fit more than 8 distinct 4K indexes into a 32KB block; the extras are always evicted to the L3.
So, does this imply the L2 is inclusive of data in the L1? It seems so.
Can you confirm that's the case? The flushing variant is sketched below.
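The flush presumably looks something like this (using the standard _mm_clflush intrinsic; the helper is my own sketch):

#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#include <stddef.h>

/* Flush every line of the chase back to memory so the next walk starts
 * with no copy of the chain cached at any level. */
static void flush_chase(char *buf, size_t gran, size_t nsteps)
{
    for (size_t i = 0; i < nsteps; i++)
        _mm_clflush(buf + i * gran);
    _mm_mfence();   /* make sure the flushes complete before re-timing */
}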
perfwise
TimP
Honored Contributor III
Pardon me for asking for clarification, but I think you're stating that your experiments indicate you can't get past the L2 associativity rules by trying to put additional blocks of data in the L1D. That isn't the same as demonstrating that the documents about the exclusivity of L1 and L2 data are wrong, even if you have demonstrated that exclusivity doesn't help you exceed the associativity limit.
perfwise
Beginner
My interest is not in determining whether the documentation is wrong; the cache behavior is largely undocumented. I'm attenuating the associativity of the L2 and, given the "cpuid" output for the L1, observing what happens. If you sweep through 256KB with a 32KB step, you get 4 cycles: OK, an L1 hit. If you do 288KB, you observe 8 L1 hits and 1 L2 hit; likely you're coming back to the evicted index in the L2 just fast enough to catch it before it's written to the L3. If you conversely sweep 256KB with random 4KB steps, you observe the L2 latency, 12 cycles. If you do 260KB with a 4KB step, you attenuate 1 index of the L2 with 9 accesses and don't come back in time to catch the line before it's written to the L3. So the latency there is ( 9 * 28 + 56 * 12 ) / 65 ≈ 14. So I'm rather confident that the L2 is inclusive of the L1D, at least from these observations.
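As a quick sanity check on that blend (assuming ~28 cycles for the L3 and 12 for the L2, as above):

#include <stdio.h>

int main(void)
{
    /* 9 of 65 steps hit the L3 (~28 cycles), the other 56 hit the L2 (12 cycles). */
    double avg = (9 * 28.0 + 56 * 12.0) / 65.0;
    printf("expected latency: %.1f cycles\n", avg);   /* prints 14.2 */
    return 0;
}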
I'm just wondering if that is something you can confirm. My tests say yes, and they're pretty easy to write. I just didn't find any documentation of this, and thought I'd ask or at least bring it to your attention. Thanks for any help.
perfwise
Patrick_F_Intel1
Employee
Hello perfwise,
For "how to characterise the L2"... from one of our previous forum entries: http://software.intel.com/en-us/forums/showpost.php?p=158013 we see:

On Sandy Bridge, the L2 can be characterized as 'non-inclusive, non-exclusive'.
See Table 2-5 of the Optimization guide (URL in previous reply) for the cache policies and characteristics by cache level.
* The cache line can be in L1, L2, and L3, or
* the line can be in L1 and not in L2, but always in L3, or
* the line can be in L2 and not in L1, but always in L3, or
* the line can be only in L3 (in neither L1 nor L2).
A modified line in the L1 will be written back to the L2 if the L2 has a copy of the line; if the line isn't in the L2, it can be written back directly to the L3.
If the modified line is written back to the L2, then the line won't be written back to the L3 unless the line is evicted from the L2 or the line is requested by another core.
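The four presence cases above amount to a single invariant: L1 and L2 membership are independent of each other, but anything cached in L1 or L2 must also be in the inclusive L3. A toy encoding of those rules (my own illustration, not Intel documentation):

#include <stdbool.h>
#include <stdio.h>

/* Presence of one line in each level of a single core's hierarchy. */
struct presence { bool l1, l2, l3; };

/* Allowed iff the line is in the L3 whenever it is in L1 or L2. */
static bool valid(struct presence p)
{
    return p.l3 || (!p.l1 && !p.l2);
}

int main(void)
{
    struct presence cases[] = {
        { true,  true,  true  },   /* in L1, L2, and L3            */
        { true,  false, true  },   /* in L1, not L2, still in L3   */
        { false, true,  true  },   /* in L2, not L1, still in L3   */
        { false, false, true  },   /* only in L3                   */
        { true,  false, false },   /* disallowed: L1 copy, no L3   */
    };
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++)
        printf("case %zu: %s\n", i, valid(cases[i]) ? "allowed" : "not allowed");
    return 0;
}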

As for your timings, these will probably require more time than I have right now to check the numbers.
An added complication is that the pseudo least-recently-used (LRU) replacement algorithm is not perfect, so you will sometimes get L1 misses even when, technically, with a 4KB stride, you should be able to fit 8 cachelines into the L1D's 8 ways.
Pat

perfwise
Beginner
Pat,
Yes, I remember seeing this documented, but what I'm specifically asking is: when a line is brought into the core, from a memory request or from the L3, is it installed in BOTH the L1D and the L2? That's the point I don't see stated, and that's what my test is telling me is happening.

You don't need to look at the timings; if the above is true, it explains everything.

perfwise