On "how to characterise the L2": one of our previous forum entries (http://software.intel.com/en-us/forums/showpost.php?p=158013) says:
On Sandy Bridge, the L2 can be characterized as 'non-inclusive, non-exclusive'.
See Table 2-5 of the Optimization guide (URL in previous reply) for the cache policies and characteristics by cache level.
The cacheline can be in L1 and L2 and L3 or
The line can be in L1 and not in L2 but always in L3 or
The line can be in L2 and not in L1 but always in L3 or
The line can be only in L3 (not in L1 nor in L2).
A modified line in the L1 will be written back to the L2 if the L2 has a copy of the line or, if the line isn't in the L2, the line can be written back directly to the L3.
If the modified line is written back to the L2, the line won't be written back to the L3 unless the line is evicted from the L2 or the line is requested by another core.
As for your timings, checking the numbers will probably require more time than I have right now.
An added complication is that the pseudo least-recently-used (LRU) replacement algorithm is not perfect, so you will sometimes get L1 misses even when, technically, with a 4KB stride, you should be able to fit 8 cachelines into the L1D's 8 ways.
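To make the rules above concrete, here is a minimal sketch in C of the presence states and the write-back path as described in this post. The type and function names (line_presence, presence_is_valid, l1_writeback_target, l2_must_write_to_l3) are made up for illustration; this is a model of the description, not something read from the hardware.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: which cache levels currently hold a given line. */
typedef struct {
    bool in_l1;
    bool in_l2;
    bool in_l3;
} line_presence;

typedef enum { WB_TO_L2, WB_TO_L3 } wb_target;

/* The four listed states all have the line in L3: a line held in L1 or L2
   must also be in L3. L1 and L2 are independent of each other. */
bool presence_is_valid(line_presence p) {
    return p.in_l3 || (!p.in_l1 && !p.in_l2);
}

/* A modified L1 line goes to the L2 if the L2 has a copy, otherwise
   it can go directly to the L3. */
wb_target l1_writeback_target(line_presence p) {
    assert(p.in_l1 && presence_is_valid(p));
    return p.in_l2 ? WB_TO_L2 : WB_TO_L3;
}

/* Once the modified line sits in the L2, it reaches the L3 only on an
   L2 eviction or when another core requests the line. */
bool l2_must_write_to_l3(bool evicted_from_l2, bool requested_by_other_core) {
    return evicted_from_l2 || requested_by_other_core;
}
```

For a line held in L1 and L3 but not in L2, l1_writeback_target returns WB_TO_L3; for a line held at all three levels it returns WB_TO_L2.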
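To show why a 4KB stride puts every access into the same L1D set, here is a small sketch of the set-index arithmetic. It assumes the Sandy Bridge L1D geometry of 32KB, 8 ways and 64-byte lines (only the 8 ways are stated above); the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed L1D geometry: 32KB, 8-way, 64-byte lines => 64 sets. */
enum {
    LINE_BYTES = 64,
    L1D_WAYS   = 8,
    L1D_BYTES  = 32 * 1024,
    L1D_SETS   = L1D_BYTES / (LINE_BYTES * L1D_WAYS)
};

/* Set index = line number modulo the number of sets (address bits 6..11). */
unsigned l1d_set_index(uintptr_t addr) {
    return (unsigned)((addr / LINE_BYTES) % L1D_SETS);
}

/* Count how many of the addresses base, base+stride, ..., base+(count-1)*stride
   map to the same L1D set as base. */
unsigned lines_in_same_set(uintptr_t base, size_t stride, unsigned count) {
    assert(stride % LINE_BYTES == 0);  /* model assumes line-aligned strides */
    unsigned same = 0;
    for (unsigned i = 0; i < count; i++) {
        if (l1d_set_index(base + (uintptr_t)i * stride) == l1d_set_index(base)) {
            same++;
        }
    }
    return same;
}
```

With these assumptions, 64 sets times 64 bytes is exactly 4KB, so addresses 4KB apart always share a set. Eight such lines exactly fill the 8 ways of that set, which is why a replacement policy that is only approximately LRU can still produce an occasional miss.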
Yes, I remember seeing this documented, but what I'm specifically asking is: when a line is brought into the core, from a memory request or from the L3, is it installed in BOTH the L1D and L2? That's the point I don't see stated, and it is what my test tells me is happening.
You don't need to look at the timings; if the above is true, it explains everything.