Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Why does the behavior of the L2 adjacent line prefetcher differ between Core2Duo and Core-i7?

Roy__Bholanath
Beginner
2,688 Views

Dear experts,
I am trying to understand the following phenomena.

I have an integer array of size 4KB that spans 64 cache lines. I access the 64 cache lines, covering the whole 4KB, as follows:

Step 0: Flush the whole 4KB array.

Step 1: Access all 64 cache lines in random order, as follows:

(a) access some random cache line i of my 4KB integer array

(b) access 16 or more random 4KB pages from a pool of 160 random 4KB pages

(c) access some random cache line j of my 4KB integer array

 

I have performed the above experiment on both an Intel(R) Core2Duo @ 2.20GHz and an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, with ONLY the L2 adjacent line prefetcher enabled (all other prefetchers disabled).
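
The steps above can be sketched in C roughly as follows. This is a minimal, illustrative sketch, not the original code: the array names, the Fisher-Yates shuffle, the rdtsc timing, and the idea of a calibrated hit/miss threshold are my own assumptions.

```c
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtsc (GCC/Clang, x86) */

#define LINE   64
#define NLINES 64                 /* 64 lines x 64 B = the 4KB target array */
#define NPOOL  160                /* pool of random 4KB pages for step 1b   */

static char target[NLINES * LINE] __attribute__((aligned(4096)));
static char pool[NPOOL * 4096]    __attribute__((aligned(4096)));

/* Fisher-Yates shuffle: the random visiting order for step 1 */
static void shuffle(int *idx, int n) {
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
}

/* Time one load; a "fast" load of a previously flushed line suggests
   it was already brought into the cache (e.g., by a prefetcher). */
static uint64_t timed_load(const volatile char *p) {
    _mm_mfence();
    uint64_t t0 = __rdtsc();
    (void)*p;
    _mm_mfence();
    return __rdtsc() - t0;
}

static void run_experiment(int extra_pages) {
    int order[NLINES];
    for (int i = 0; i < NLINES; i++) order[i] = i;
    shuffle(order, NLINES);

    /* Step 0: flush the whole 4KB array */
    for (int i = 0; i < NLINES; i++) _mm_clflush(&target[i * LINE]);
    _mm_mfence();

    /* Step 1: visit all 64 lines in random order */
    for (int i = 0; i < NLINES; i++) {
        /* step 1a/1c: the load time classifies hit vs. miss */
        uint64_t dt = timed_load(&target[order[i] * LINE]);
        (void)dt;  /* compare against a calibrated threshold */

        /* step 1b: touch random 4KB pages from the pool */
        for (int k = 0; k < extra_pages; k++)
            (void)*(volatile char *)&pool[(rand() % NPOOL) * 4096];
    }
}
```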

Here are my results:

On the Core2Duo system, 32 cache lines (the i+1 or i-1 of step 1a) are prefetched, which indicates the adjacent cache line is always prefetched.

On the Core-i7, no cache line is prefetched.

 

If I assume the L2 adjacent line prefetcher loads 128 bytes of data, as per Intel's software documentation, that explains the Core2Duo results but fails to explain the Core-i7 results.

If I instead assume that, although the L2 adjacent line prefetcher loads 128 bytes of data, the accesses to the extra 4KB pages evict the preloaded data from the cache, that explains the Core-i7 results but fails to explain the Core2Duo results.

How can these two behaviors be explained logically?
11 Replies
Travis_D_
New Contributor II

It doesn't seem weird or inconsistent to me that two completely different uarches will have different results on such a test.

You could eliminate some of the possibilities by removing step (b) from your test. Then you would know whether the i7 wasn't doing adjacent line prefetch at all, or whether the line was being evicted.

Roy__Bholanath
Beginner

Thank you Travis D.

I have done the experiment with different numbers of extra 4KB pages accessed in step 1b.

Here are my results on the Core-i7:

# Extra 4KB pages | Prefetched cache lines
        0         |           32
        2         |           27
        4         |           25
        8         |           24
       16         |            1
       20         |            1

I have cross-verified that, with 16 extra random pages accessed, the L2 adjacent line prefetcher really didn't prefetch the adjacent cache line (rather than the line being prefetched and then evicted): I accessed a few cache lines between step 0 and step 1, and later all of those cache lines generate cache hits, which indicates those lines are not evicted by the accesses to the random 4KB pages.

Also, the accesses to the extra 4KB pages leave only a tiny possibility of a prefetched cache line being replaced, since that is only possible if they map to the same cache set and the line is evicted by the pseudo-LRU policy.

Travis_D_
New Contributor II

My feeling is that accessing the extra pages leaves parallel requests outstanding to memory at the time the final (check) accesses are performed, and this triggers a heuristic that stops prefetching when too many outstanding requests are in progress.

You could test this theory by putting an lfence between the "extra" page accesses and the final access, or by making the extra accesses serially dependent (i.e., making the address of each extra access depend on the previous one). Both would remove all or all-but-one outstanding accesses when you make the final checks.
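
The serially dependent variant can be sketched as a pointer chain through the extra pages, so that the address of each access is known only after the previous load completes and no two loads can be outstanding in parallel. The names and sizes here are illustrative, not from the original code:

```c
#include <stdlib.h>

#define PAGESZ 4096

/* One chain node placed at the start of each 4KB page */
typedef struct node { struct node *next; } node;

/* Link the pages in the order given by perm[], so each load's
   address depends on the previous load's result (no parallelism). */
static node *build_chain(char *pages, const int *perm, int n) {
    for (int i = 0; i < n - 1; i++) {
        node *cur = (node *)(pages + (size_t)perm[i] * PAGESZ);
        cur->next = (node *)(pages + (size_t)perm[i + 1] * PAGESZ);
    }
    ((node *)(pages + (size_t)perm[n - 1] * PAGESZ))->next = NULL;
    return (node *)(pages + (size_t)perm[0] * PAGESZ);
}

/* Walk the chain: every access is serially dependent on the last */
static int walk(node *p) {
    int count = 0;
    while (p) { count++; p = p->next; }
    return count;
}
```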

Roy__Bholanath
Beginner

I have added __asm__("lfence"); before the access to each extra random 4KB page, but I still get the same results as before on both the Core-i7 and the Core2Duo.

Travis_D_
New Contributor II

You could share your code - it's hard to guess at what might be going wrong from the description alone.

One possibility is that the adjacent line prefetcher tracks its own effectiveness and uses that heuristic to decide whether to keep fetching the adjacent line. For example, it might track how often a prefetched adjacent line is subsequently accessed, and this tracking may have a limited "horizon", i.e., it will only notice the subsequent access when it occurs within N accesses of the original.

Something like N=8 would align with your results above: with 8 or fewer intervening accesses, the prefetcher sees the subsequent access, judges adjacent line prefetching useful, and keeps doing it. Beyond 8, it may not see the subsequent access, and hence judges adjacent line PF not useful and stops issuing it.

This would be a reasonable behavior for the adjacent line prefetcher, since if it were simply left on indiscriminately it would negatively affect workloads that don't exhibit that type of locality.

HadiBrais
New Contributor III

On the Core2Duo system, 32 cache lines (the i+1 or i-1 of step 1a) are prefetched, which indicates the adjacent cache line is always prefetched.

Isn't it also possible that step 1c accesses a line prefetched by the L2 prefetcher? What is the purpose of steps 1a and 1c? It seems to me that the three steps are executed in a loop until all 64 lines have been accessed, no matter how much time it takes, so steps 1c and 1a are back to back.

Note that the L2 in the Core 2 Duo processor is at least 2MB in size and is at least 8-way associative. In contrast, the i7-3770 has an 8-way 256KB non-inclusive L2 (and an inclusive L3) and therefore L2 conflict misses are more likely. The L2 replacement policies are likely to be very different as well.

Are all the accesses (which I think are loads) serialized (i.e., no load is issued until the previous load completes)? The latency to fetch a line from main memory or the L3 into the L2 may be smaller on the Core 2 Duo processor than on the i7-3770. So you might want to wait for a larger amount of time before issuing the memory access to the adjacent line to give sufficient time for the line, in case it's being prefetched, to be filled in the L2. Putting an LFENCE may not be enough to achieve this because the LFENCE will only wait for the demand load to complete and not for the possible line being prefetched to reach the L2.

It's not clear to me what you are counting and how you are counting it. Are you counting the number of times the L2 adjacent line prefetcher has prefetched lines into the L2 or are you counting the number of unique L2 hits on lines prefetched by the L2 adjacent line prefetcher? It would be useful to know whether the adjacent line is being prefetched into the L2 or the L3, so you might want to recognize both cases.

Are you using performance counters to count whatever that is you're counting or are you using precise memory latency measurements? The details matter here.

On the Core 2 Duo, you can try with a much larger (8X or 16X) number of accesses in step 1b.

It's not clear to me what you're trying to achieve by choosing cache lines randomly. I think the L2 adjacent line prefetcher would only track accesses to the same 128-byte aligned chunk. If the prefetcher is designed to evaluate itself (as Travis suggested) then there would be an upper limit on the number of chunks being tracked, which needs to be considered. Otherwise, the prefetcher would always issue a prefetch request for the adjacent line if a superqueue entry is available. The L2 cache in the Core 2 Duo might have a larger superqueue than the i7-3770.

Travis_D_
New Contributor II

I have done some recent testing on the adjacent line PF, and at least very preliminary results indicate that (a) the adjacent line PF doesn't track its effectiveness (at least in a scenario where every other line is accessed, I observed the L2 adjacent PF filling in every missing line even though those lines were never accessed), and (b) the adjacent line PF is quite conservative, in the sense that it only issues prefetches when there is not much total traffic out of the L2. So benchmarks that, say, access every other line will not usually trigger the adjacent line PF, because the demand traffic shuts it off. Only if you organize your benchmark so that the accesses are relatively sparse (e.g., with lfence, by making them serially dependent, or just by doing a lot of other work) will you see the adjacent line PF kick in.

Roy__Bholanath
Beginner

Thank you Hadi Brais and Travis D for your comments.

 

@Hadi Brais

Isn't it also possible that step 1c accesses a line prefetched by the L2 prefetcher? What is the purpose of steps 1a and 1c? It seems to me that the three steps are executed in a loop until all 64 lines have been accessed, no matter how much time it takes, so steps 1c and 1a are back to back.

 

Actually, steps 1a and 1c are the same; they are not back-to-back accesses. In my approach, between the accesses to two random cache lines i and j of the 4KB array, I access 16 or more cache lines of RANDOM 4KB pages.

 

So, the corrected sequence of my code is:

Step 0: Flush the whole 4KB array.

Step 1: For each of the 64 cache lines, in random order:

(a) access some random cache line i of my 4KB integer array

(b) access 16 or more random 4KB pages from a pool of 160 random 4KB pages

 

It's not clear to me what you are counting and how you are counting it. Are you counting the number of times the L2 adjacent line prefetcher has prefetched lines into the L2 or are you counting the number of unique L2 hits on lines prefetched by the L2 adjacent line prefetcher? It would be useful to know whether the adjacent line is being prefetched into the L2 or the L3, so you might want to recognize both cases.

Are you using performance counters to count whatever that is you're counting or are you using precise memory latency measurements? The details matter here.

 

I am not using any performance counters to count the number of prefetched cache lines. I am determining whether an already-flushed cache line has been loaded into the cache (at any level) from memory as a result of my own access in step 1a. I am only interested in whether cache line (i-1) or (i+1) is prefetched due to my own access of cache line i.

That means: after flushing all 64 cache lines of the 4KB integer array (step 0), when I access cache line 3 (step 1a), are cache line 2 or cache line 4 also prefetched, so that they hit when I access them later?
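
Assuming the hit/miss decision comes from timing each load (the thread does not show the actual code, so this is my own guess at the mechanism), such a classifier could be sketched as follows. The threshold is machine-specific and must be calibrated against known hits and misses:

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_mfence (GCC/Clang, x86) */

/* Measure the cost of one load in TSC cycles */
static uint64_t load_cycles(const volatile char *p) {
    _mm_mfence();
    uint64_t t0 = __rdtsc();
    (void)*p;
    _mm_mfence();
    return __rdtsc() - t0;
}

/* A previously flushed line whose load is faster than the calibrated
   threshold is assumed to have been brought into some cache level
   (e.g., by the adjacent line prefetcher) rather than fetched from DRAM. */
static int was_cached(uint64_t cycles, uint64_t threshold) {
    return cycles < threshold;
}
```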

 

 

On the Core 2 Duo, you can try with a much larger (8X or 16X) number of accesses in step 1b.

My experimental results show that, on the Core2Duo, the number of random pages accessed between two accesses to the 4KB page has no impact.

HadiBrais
New Contributor III

The randomization in step 1a makes it hard to interpret the results shown in comment #3. We don't know yet whether the behavior of the adjacent line prefetcher depends on the distance between the accesses to the two adjacent lines. For example, let's say that the number of extra 4KB pages accessed between two executions of step 1a is two. So it may go like this:

access line 3 in the target 4KB page 
access two random lines from two random 4KB pages 
access line 2 in the target 4KB page 
access two random lines from two random 4KB pages 
access line 10 in the target 4KB page 
access two random lines from two random 4KB pages 
access line 20 in the target 4KB page 
access two random lines from two random 4KB pages 
access line 11 in the target 4KB page

Notice how line 2, the adjacent line of line 3, is accessed two accesses after line 3. However, line 11, the adjacent line of line 10, is accessed five accesses after line 10. What happens if the distance is always the same?
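
The fixed-distance variant asked about here could be generated with something like this sketch. The function name and the -1 encoding for "access a random 4KB page" are purely illustrative:

```c
/* Build an access schedule in which the adjacent line of each pair
   (2p, 2p+1) is always accessed exactly `d` steps after its partner.
   out[] receives target-line indices; -1 marks a random-page access.
   Returns the number of entries written. */
static int fixed_distance_seq(int *out, int npairs, int d) {
    int n = 0;
    for (int p = 0; p < npairs; p++) {
        out[n++] = 2 * p;                 /* even line of the pair */
        for (int k = 0; k < d; k++)
            out[n++] = -1;                /* d random 4KB-page accesses */
        out[n++] = 2 * p + 1;             /* adjacent line, at fixed distance d */
        for (int k = 0; k < d; k++)
            out[n++] = -1;                /* keep spacing uniform between pairs */
    }
    return n;
}
```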

Another issue is that it seems to me that you need to remember which lines were accessed to be able to determine whether the line being currently accessed is adjacent to some line that has already been accessed. I don't know how this may impact your test because you've not shown the code.

Another parameter is the number of accesses in step 1b with respect to the number of 4KB pages. Does it matter whether the pool of 4KB pages is very small or very large compared to the number of accesses in step 1b?

It's possible that the behavior of the adjacent line prefetcher has changed between two different microarchitectures.

By the way, how did you disable all hardware prefetchers except the adjacent one on the Core2 processor? As far as I know, you can use bit 9 of IA32_MISC_ENABLE to disable or enable all of the prefetchers. Perhaps the BIOS allows you to selectively disable any of the hardware prefetchers?

There is some discussion of the hardware prefetchers in this thread: https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/341769. Unfortunately, there are a bunch of unanswered questions there. However, there is a relevant comment written by John McCalpin:

I have not done exhaustive testing of this particular issue, but in my latency testing experiments on Westmere (Xeon 5600) and Sandy Bridge (Xeon E5) I noticed that if I *only* load odd-numbered cache lines or even-numbered cache lines, the adjacent-line prefetcher is not activated. On the other hand, if I load some odd-numbered lines and some even-numbered lines, the adjacent-line prefetcher is activated. I think this is true even if I never explicitly load adjacent lines in an odd-even pair (though I could be misremembering this detail).

No other details related to the adjacent line prefetcher are available in that thread.

McCalpinJohn
Honored Contributor III

For most Intel processors, MSR 0x1a4 (MSR_MISC_FEATURE_CONTROL) allows control of the various prefetchers independently, as described at https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors, and now included in the MSR descriptions in Volume 4 of the SWDM.

Starting with Ivy Bridge, there is an additional prefetcher called the "next page prefetcher".  I have not seen any documented control bits for this one, but since it only prefetches one cache line in the next 4KiB page (by *virtual address*), it does not contaminate results too badly.  The next page prefetcher does cause TLB walks to occur early, so there are extremely few TLB misses for contiguous addresses, even when the data being accessed far exceeds the range of the TLBs.
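
Per the linked article, bits 0-3 of MSR 0x1a4 disable, respectively, the L2 hardware (streamer) prefetcher, the L2 adjacent cache line prefetcher, the DCU (L1 streaming) prefetcher, and the DCU IP prefetcher, with a set bit meaning disabled. The value for leaving only the adjacent line prefetcher enabled can be derived as follows (a sketch; the helper name is mine):

```c
#include <stdint.h>

/* MSR 0x1a4 (MSR_MISC_FEATURE_CONTROL) disable bits: 1 = disabled */
enum {
    L2_HW_BIT  = 1 << 0,  /* L2 hardware (streamer) prefetcher  */
    L2_ADJ_BIT = 1 << 1,  /* L2 adjacent cache line prefetcher  */
    DCU_BIT    = 1 << 2,  /* DCU (L1 streaming) prefetcher      */
    DCU_IP_BIT = 1 << 3,  /* DCU IP prefetcher                  */
};

/* Value to write so that ONLY the L2 adjacent line prefetcher
   remains enabled: set every disable bit except L2_ADJ_BIT. */
static uint64_t only_adjacent_enabled(void) {
    return L2_HW_BIT | DCU_BIT | DCU_IP_BIT;   /* = 0xd */
}
```

The resulting value would then be written on each core with msr-tools, e.g. `sudo wrmsr -a 0x1a4 0xd`.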

Roy__Bholanath
Beginner

@Hadi Brais,

 

In Core2Duo,

Bit 9:  Hardware Prefetcher          (L2 H/W)
Bit 19: Adjacent Cache Line Prefetch (L2 adjacent line)
Bit 37: DCU Prefetcher               (L1 streaming)
Bit 39: IP Prefetcher                (L1 IP)

0 = enable, 1 = disable

 

I used msr-tools to read and write the MSR bits at 0x1a0.

 

With all prefetchers enabled:

sudo ./rdmsr -p1 0x1a0

0x4062972489

 

To disable all prefetchers:

sudo ./wrmsr -p1 0x1a0 0xe0629f2689
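
As a sanity check, the value written above can be derived from the value read by setting the four disable bits (a small illustrative helper, not from the original post):

```c
#include <stdint.h>

/* IA32_MISC_ENABLE (MSR 0x1a0) prefetcher disable bits on Core2 */
static uint64_t disable_all_prefetchers(uint64_t current) {
    return current | (1ULL << 9)    /* hardware prefetcher          */
                   | (1ULL << 19)   /* adjacent cache line prefetch */
                   | (1ULL << 37)   /* DCU prefetcher               */
                   | (1ULL << 39);  /* IP prefetcher                */
}
```

Applied to the value read above, 0x4062972489, this yields 0xe0629f2689, matching the wrmsr command.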

 