How to defeat H/W prefetcher in Intel Core i3/i7

Bholanath_R_ · ‎08-10-2015

Hello everyone,

I am trying to find a way to defeat the H/w prefetcher to detect the stream pattern and access 4KB data in a random order
so that it is not detected and prefetched by H/w prefetcher.

Initially I was thinking to access all even index data in a random pattern as H/w prefetcher prefetch the next cache lines
always (so when I access even index, next odd index data is already prefetched).

I wrote the code to access all even index data in a random pattern, however the results indicate that the prefetcher detected the pattern
(don't know how ? There is no fixed stride, all are random stride )

I was investigating the reason-why this happened, then I found this article in Intel ; https://software.intel.com/en-us/forums/topic/473493

According to John D. McCalpin, PhD, "Dr. Bandwidth,

In section 2.2.5.4 of "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966-028, July 2013),it states that,

streamer prefetcher "etects and maintains up to 32 streams of data accesses. For each 4K byte page, you can maintain one forward and one backward stream can be maintained.

This implies that the L2 hardware prefetcher tracks the 16 4KiB pages most recently accessed and remembers enough of the access patterns for those pages to track one forward stream and one backward stream. So to defeat the L2 streamer prefetcher with "random" fetches, simply ensure that you access more than 15 other 4 KiB pages before you make a second reference to a previously referenced page. So a "random" sequences of fetches might be composed of a random permutation of more than 16 4 KiB page numbers with a random offset within each page. (I typically use at least 32 pages in my permutation list.)

So it means in between accesses of two different random indexes of same 4KB pages we need to access atleast 16 4KB pages to defeat H/w prefetcher.

I have implemented the concept suggested by John D. McCalpin , however the results again show the h/w prefetcher is not defeated. It is able to detect some pattern and prefetch data (see attached output.txt) . I have varied number of accessed pages from 20-40 4KB pages , but no improvement/change in result.

Here is my code :

#define _GNU_SOURCE             /* See feature_test_macros(7) */
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sched.h>

#ifndef _POSIX_THREAD_PROCESS_SHARED
#error This system does not support process shared mutex
#endif

#define MAX_COUNT 3000
#define INDEX (40*1024) // size of DUMMY 40 4KB pages

inline void clflush(volatile void *p)
{
    asm volatile ("clflush (%0)" :: "r"(p));
}

unsigned long probe(char *adrs) {
  volatile unsigned long time;
  asm __volatile__ (
    " mfence              \n"
    " lfence              \n"
    " rdtsc               \n"
    " lfence              \n"
    " movl %%eax, %%esi \n"
    " movl (%1), %%eax     \n"
    " lfence              \n"
    " rdtsc               \n"
    " subl %%esi, %%eax \n"
    " clflush 0(%1)       \n"
    : "=a" (time)
    : "c" (adrs)
    : "%esi", "%edx");
  return time;
}

void shuffle(int *arr, size_t n)
{
    if (n > 1) 
    {
        size_t i;
        srand(time(NULL));
        for (i = 0; i < n - 1; i++) 
        {
          size_t j = i + rand() / (RAND_MAX / (n - i) + 1);
          int t = arr;
          arr = arr;
          arr = t;
        }
    }
}


static const int DATA[1024]={0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799,800,801,802,803,804,805,806,807,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899,900,901,902,903,904,905,906,907,908,909,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951,952,953,954,955,956,957,958,959,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999,1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023};

int main(int argc, char *argv[])
{

	int counter=0,k=0;
	unsigned long Access_Time[MAX_COUNT][64]={0};	
	int DUMMY[INDEX];// dummy array of 40 * 4KB ;  
	
	//Initialize
	for(k=0;k<INDEX;k++)
		DUMMY=k;
	
	//access it to check segmentation fault is happening or not
	for(k=0;k<INDEX;k++)
		DUMMY+=k;
		
	// even index in random order
	int index[32]={4,8,16,32,54,34,62,50,26,52,30,60,46,18,36,58,42,10,20,40,6,12,24,48,22,44,14,28,56,38,2,0};

	int TOTAL_RANDOM_PAGE=40;
	
	int i,PAGE[TOTAL_RANDOM_PAGE]; // PAGE will contain page no of 40 pages which will be accessed in random order to defeat prefetcher
        for (i=0; i<TOTAL_RANDOM_PAGE; i++)
	{
    	    PAGE = i;
        }
	
	shuffle(PAGE, TOTAL_RANDOM_PAGE); // PAGE now have page no in random order
		
	FILE *fp2;
 	int s,s1;
 	int random_index=0,sum=0;
 
 	const int *p0=&DATA[0];
	for (s=0;s<64;s++)
	{
		clflush((void *)(p0+s*16));
	}

	while(counter<MAX_COUNT)
	{				
		// Find Access time for Even Index
		for (s=0;s<32;s++)
		{
			
			// Access a random index
		        Access_Time[counter][index]=probe((char *)(p0+16*index));
		
			//Now, access 40 other indexes belong to other 40 4KB page 		
			shuffle(PAGE, TOTAL_RANDOM_PAGE); // random orderpage
			for(random_index=0;random_index<TOTAL_RANDOM_PAGE;random_index++)
			{
			DUMMY[1024*PAGE[random_index]+16*PAGE[random_index]]=2*DUMMY[1024*PAGE[random_index]+16*PAGE[random_index]];
			}

		}// end of for loop		
				
		// Flush all DATA from cache	 	
		for (s1=0;s1<64;s1++)
		{
			clflush((void *)(p0+s1*16));
		}
	 counter++;

	}// end of while loop

	fp2=fopen("All_access_time.txt","a");

	int index4;
	for(counter=0;counter<MAX_COUNT;counter++)
	{
		for (index4=0;index4<64;index4++)
		{
			if(Access_Time[counter][index4]>0 && Access_Time[counter][index4]<200)
			fprintf(fp2,"%d,%d,%lu\n",counter,index4,Access_Time[counter][index4]);				
		}
	}
	
return 1;
}

Another interesting observation is , the access time of random indexes which were prefetched

has access time around 35-70 ticks. (see attached output.txt file)

In my system, the L1 access time 36-44 ticks, L2 access time 50-70 ticks, L3 access time = 90-120 ticks.

Can you please help me to understand why H/W prefether able to detect my random pattern ? Where am I making mistakes?

How to do the coding so that I can defeat the prefetcher and h/w prefetcher unable to prefetch my data ?

I have performed the experiment on both Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz and Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz.

L1D=32KB, ways_of_associative=8

L1-I = 32KB, ways_of_associative=8

L2 =256KB, ways_of_associative=8

L3= 3072KB (Core-i3), ways_of_associative=12

L3=8192KB(Core-i7), ways_of_associative=16
Cache line size=64Bytes

NOTE : I have disabled s/w prefetcher optimization while compiling using -O0 option with gcc.

Thanking you in advance .

McCalpinJohn · ‎08-10-2015

The easiest way to "defeat" the hardware prefetchers on Core i3/i5/i7 processors is to turn them off. This is documented at https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors

I am not sure I understand the code, but I don't think it makes sense to try to time individual loads on any Intel processors except the Xeon Phi (where RDTSC is extremely fast). The FENCE and FLUSH instructions in the "probe" function may not be doing what you expect, and they certainly make it harder to understand what is happening.

When I used this approach to try to avoid prefetching, I measured the time required to perform 10 million dependent loads from a circular pointer chain created in a 256 MiB array. I subdivided the array into "blocks" of "M" 4KiB pages, then set up the pointer chain to access one cache line in each of the "M" pages before returning to load a different cache line in the first 4KiB page in the block. After loading all of the even-numbered lines in the block, the pointer would jump into the next block of "M" 4KiB pages and the process would repeat until I reached the end of the array, where I wrapped the pointer chain back to the beginning. I did this with M=1,2,4,8,16,32,64.

In addition to measuring elapsed time, I used various hardware performance counters to try to understand what was happening, and I ran each case with and without the hardware prefetchers enabled. I am not sure exactly which of my ~1500 spreadsheets has the most recent version of these results, but I recall that I saw a fairly clear signal, with the hardware prefetcher ceasing to generate transactions or impact performance once I got to blocks of 32 pages.

All of this depends on the processor implementing cache replacement in a traditional way, which is not guaranteed for the more recent Intel processors. An interesting set of speculations is at http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/

Bholanath_R_ · ‎08-11-2015

Thank you Dr. Bandwidth for your quick and helpful comment. I will investigate and let you know the result.

Bholanath_R_ · ‎08-13-2015

Hi, Dr. Bandwidth ,

The above implementation is defeating h/w prefetcher in Core2duo whereas it is not defeating in Core-i3.

Can you suggest some way to design the algo to defeat the h/w prefetcher ? I was trying to understand from output, however the result not reflecting any information or relationship when it will prefetch a cache line or not .

From my experimental result, what I can see initially h/w prefetcher load the detected sequence into LLC ( sometimes for the initial few rounds I am getting access time in range of 110 and later the same sequence is loaded into L1/L2 .

I am repeating this following for multiple times (let us say for 300 times)

let Ei represent random index which I am interested to access and belongs to Page 1.

Ri represent random index belong to other 4KB page.

I am accessing the sequence Ei, Ri for 32 other 4KB page, Ej, Ri for 32 other 4KB page,Ek,Ri for 32 other 4KB page, Er, Ri for 32 other 4KB page, Em.......

and this sequence I am repeating for 300 times to check whether h/w prefetcher is defeated or not. What I have seen it is able to detect from the first run of sequence ( I am not able to understand how this is possible although I am accessing 32 different 4KB page before next even index and that too in random order).

The same sequence is loaded for 300 times and there are very few occasion where one or two cache lines are different (most of the cache lines are same).

My question is how H/w prefetcher is able to detect and able to prefetch the random index before it is accessed for the first time and even before a fixed stride is detected ( means how it is able to prefetch a random index before its first access in a random order).

Thanks.

McCalpinJohn · ‎08-14-2015

Although the description of the L2 streaming prefetcher in the Intel Optimization Reference Manual is fairly vague, some experiments that I have done suggest that the L2 streaming prefetcher will assume that *any* two accesses to a 4KiB page are a "stream". It assumes that the difference between the two cache line addresses is the stride, and it will issue prefetches for 1-2 cache lines using that stride from the second address -- provided that the resulting computed addresses are within the initial 4 KiB page.

If this model is correct, you can avoid prefetches when executing 2 loads to a page only if the projected 3rd address is outside of the 4KiB page. It is not clear what happens when a 3rd load arrives at the page, since we don't know which combinations of addresses are used to define "forward" and "backward" streams. The prefetcher has to compare the each address to at least two prior addresses in order to identify "forward" and "backward", but it might look at more prior addresses if they are available.

If you have a good test environment (low system noise/jitter, low-overhead performance counters, large pages) you can start making assumptions and see what happens. One sequence of tests might be:

For each 4KiB page, perform 2 loads (a[0] < a[1]), such that the implied stride (&a[1]-&a[0]) takes the computed prefetch target outside of the 4KiB page.
The 3rd load (a[2]) can be in one of 3 categories:
- Case 1: a[2] > a[1] -- this is in the same direction as the original stride. One can compute two implied strides: (&a[2]-&a[1]) and (&a[2]-a[0]). Since we choose a[0] and a[1] so that the stride (&a[1]-&a[0]) from a[1] takes you out of the page, you are guaranteed that the stride (&a[2]-&a[0]) will also be out of the page. The implied stride (&a[2]-&a[1]) might be in the page and might be out of the page. If a[2] is chosen so that this shorter implied stride is also out of the page, then there should not be a prefetch.
- Case 2: a[0] < a[2] < a[1] -- the 3rd address is in the middle. This could be interpreted as a forward stream from a[0] or as a backward stream from a[1] -- or both! It might be impossible to prevent both implied strides from falling outside the 4 KiB page in this case.
- Case 3: a[2] < a[0] -- the third address is below the first. This could be interpreted as a backwards stride from a[1] or as a backwards stride from a[0]. The shorter of the two implied strides will be &a[2]-&[0], and you can choose a[2] so that this will either be inside the 4 KiB page or outside the page. If it is outside the page, there should not be a prefetch.

Once you get to 4 loads in a page, the number of possible combinations of implied streams starts getting unpleasant to work with, and it will be increasingly difficult to come up with a sequence of addresses such that all combinations of forward and backward differences projects to addresses outside the 4 KiB page.