Hi
With _mm_clflush(), I flushed an array from all cache levels. Next, I tried to measure two accesses with __rdtsc(). Although the distance between the two accesses is larger than the cache line size (e.g., 80 bytes apart), the timing for the first access looks like a miss (which is correct), while the timing for the second element looks like a hit (which is wrong).
It seems that the HW stride prefetcher brings in the second element. Is there any way to keep the processor from prefetching?
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Hello guys
It seems that using lfence() and rdtsc() is fine, and since I am not using stores, mfence() is not applicable here. I also modified my code and tried with a volatile array.
However, no matter how big the array is (among other things), the problem with the HW prefetcher still exists. The thing is, I flush two lines (array[30] and array[70], separated by more than a cache line) and then try three accesses:
1) access to array[30] => definitely a miss
2) access to array[70] => a hit with the prefetcher enabled, a miss with it disabled
3) access to array[33] => definitely a hit
The code is:

/* create array */
int array[ 100 ];
int i;
for ( i = 0; i < 100; i++ )
    array[ i ] = i;                     // bring array to the cache

for ( i = 0; i < 100000000; i++ ) ;    // warmup delay loop

uint64_t t1, t2, ov, diff1, diff2, diff3;

/* flush the two cache lines */
_mm_lfence();
_mm_clflush( &array[ 30 ] );
_mm_clflush( &array[ 70 ] );
_mm_lfence();

/* READ MISS 1 */
_mm_lfence();                           // fence to keep load order
t1 = __rdtsc();                         // set start time
_mm_lfence();
(void) *((volatile int*)array + 30);    // read the first element => cache miss
_mm_lfence();
t2 = __rdtsc();                         // set stop time
_mm_lfence();
diff1 = t2 - t1;                        // two fence statements are overhead

/* READ MISS 2 */
_mm_lfence();                           // fence to keep load order
t1 = __rdtsc();                         // set start time
_mm_lfence();
(void) *((volatile int*)array + 70);    // read the second element => cache miss (or hit due to prefetching?!)
_mm_lfence();
t2 = __rdtsc();                         // set stop time
_mm_lfence();
diff2 = t2 - t1;                        // two fence statements are overhead

/* READ HIT */
_mm_lfence();                           // fence to keep load order
t1 = __rdtsc();                         // set start time
_mm_lfence();
(void) *((volatile int*)array + 33);    // read the third element => cache hit
_mm_lfence();
t2 = __rdtsc();                         // set stop time
_mm_lfence();
diff3 = t2 - t1;                        // two fence statements are overhead

/* measure the fence overhead */
_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
ov = t2 - t1;

printf( "lfence overhead is %lu\n", ov );
printf( "cache miss1 TSC is %lu\n", diff1-ov );
printf( "cache miss2 (or hit due to prefetching) TSC is %lu\n", diff2-ov );
printf( "cache hit TSC is %lu\n", diff3-ov );
I have also disabled the HW prefetchers with the wrmsr command as below:
# ./msr-tools-master/wrmsr 0x1a4 15
# ./msr-tools-master/rdmsr 0x1a4
f
#
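One thing to double-check: MSR 0x1A4 is a per-core register, and the msr-tools utilities act on CPU 0 unless given -p or -a, so if the test is pinned to core 30, the write presumably needs to target that core as well, e.g.:
# ./msr-tools-master/wrmsr -p 30 0x1a4 15
# ./msr-tools-master/rdmsr -p 30 0x1a4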
Now when I compile and run (pinned to a processor), I get the following results:
# gcc -Wall -O3 -o simple_flush simple_flush.c
# taskset -c 30 ./simple_flush
lfence overhead is 29
cache miss1 TSC is 279
cache miss2 (or hit due to prefetching) TSC is 209
cache hit TSC is 11
# taskset -c 30 ./simple_flush
lfence overhead is 29
cache miss1 TSC is 362
cache miss2 (or hit due to prefetching) TSC is 175
cache hit TSC is 6
# taskset -c 30 ./simple_flush
lfence overhead is 29
cache miss1 TSC is 466
cache miss2 (or hit due to prefetching) TSC is 166
cache hit TSC is 8
As you can see, the miss numbers (more than 250 TSC) are reasonable for a miss, and 8 TSC is reasonable for a hit. However, what is 175?! It looks like array[70] was prefetched into L3, yet the documentation for MSR 0x1A4 says nothing about L3.
Any idea?
There are a lot of things that can go wrong with this kind of testing....
I would start by adding calls to at least one fixed-function performance counter (counter 0, CPU_CYCLES_UNHALTED) alongside each RDTSC call so that you can compute the frequency that the processor is running at in each interval. The dummy loop at the top might be enough to get the processor to full speed, but it is always a good idea to check.
There is at least one prefetcher in the Xeon E5 v2 and later processors that cannot be disabled using MSR 0x1A4. It is called the "next page prefetcher", and is almost completely undocumented. It looks like it exists primarily to touch one cache line in the next (virtual) 4KiB page and start the Page Miss Handler early if the access does not hit in the TLB. This is probably not an issue here, but you should print out the virtual addresses of the array locations you are accessing to see whether they fall in the same 4KiB page or in different pages.
Hardware prefetching to the L3 cache is done by the L2 hardware prefetcher. The behavior is dynamic (and so difficult to predict or control), but when the L2 is not very busy, the L2 streamer prefetcher will issue prefetches into the L2. (On Xeon processors before Skylake, the cache line must be placed in the L3 as well, since the L3 requires inclusion of the L1 and L2 caches.) When the L2 is busier, the L2 hardware prefetcher will change the prefetch type to "prefetch to L3". I don't know if the L2 adjacent line HW prefetcher has the same policy -- all my testing was with access patterns that are dominated by L2 streamer prefetcher accesses. In any case, if the L2 HW prefetchers are disabled using MSR 0x1A4, there will be no L2 hardware prefetches to either the L2 or L3. (This can be confirmed using the programmable performance counter event L2_RQSTS.ALL_PF, Event 0x24, Umask 0xF8.)
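For example, assuming a Linux system with perf available (not mentioned in the thread), the raw-event form of that check would be something like:
# umask 0xF8, event 0x24 -> raw event rf824 (L2_RQSTS.ALL_PF)
perf stat -e rf824 taskset -c 30 ./simple_flush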
In addition to the core frequency, Xeon E5 v3 and newer processors support independent dynamic uncore frequency. With typical BIOS settings, the uncore frequency will throttle down to the minimum value if there are not many memory requests outstanding, and ramp up to the maximum frequency under heavy loads. I am not aware of any documentation for the algorithms used, but the dynamic behavior can be disabled by the BIOS or the limits can be modified by changing the values in MSR 0x620 (MSR_UNCORE_RATIO_LIMIT). It is a very good idea to record the initial values in this register before you modify it, because there is no other MSR that contains the system defaults. If you forget the correct limits, you need to reboot the system to recover the values. (I can't recall if the hardware obeys requests for uncore frequencies lower than the default minimum -- it will certainly ignore requests for uncore frequencies higher than the default maximum.) For short benchmarks and for benchmarks that will generate a low rate of memory accesses (e.g., a pointer-chasing benchmark that has only one outstanding request at any time), I usually just read the register and set the minimum and maximum ratios to match the default maximum. The Uncore frequency can be monitored by enabling the UBox fixed counter (write bit 22 to MSR 0x703 U_MSR_FIXED_PMON_CTL), then reading the 48-bit Uncore clock count from MSR 0x704 (U_MSR_FIXED_PMON_CTR). (This is described in Section 2.2.2 of the Xeon E5 v4 Uncore Performance Monitoring Guide, document 334291).
There is at least one prefetcher in the Xeon E5 v2 and later processors that cannot be disabled using MSR 0x1A4. It is called the "next page prefetcher", and is almost completely undocumented. It looks like it exists primarily to touch one cache line in the next (virtual) 4KiB page and start the Page Miss Handler early if the access does not hit in the TLB. This is probably not an issue here, but you should print out the virtual addresses of the array locations you are accessing to see whether they fall in the same 4KiB page or in different pages.
Focusing on this part... Here is the output while 0x1A4 is F:
vAddr array[30] = 0x7ffca2d495c8 TSC = 233
vAddr array[33] = 0x7ffca2d49668 TSC = 5
vAddr array[70] = 0x7ffca2d495d4 TSC = 158
Address numbers in binary are
01111111111111100101110101110101101111111 0111000
01111111111111100101110101110101101111111 1000100
011111111111111001011101011101011 100000001011000
It seems that array[30] and array[70] are not in the same page! Does that confirm your statement?
In any case, if the L2 HW prefetchers are disabled using MSR 0x1A4, there will be no L2 hardware prefetches to either the L2 or L3.
So, my thought about prefetching array[70] to L3 was wrong.
Is it possible to tell the compiler to put the array elements in one page? Since I am creating the array, I know the sizes.
These numbers don't make any sense.... Looks like you reversed the addresses of array[33] and array[70]
Ignoring the common high-order symbols, the addresses shown are:
- array[30] = 0x5c8 = 1480 (decimal)
- array[33] = 0x668 = 1640 (decimal)
- array[70] = 0x5d4 = 1492 (decimal)
Swapping the labels of the last two lines gives 12 bytes from array[30] to array[33] and 160 bytes from array[30] to array[70].
There are lots of ways to control alignment, but you don't even need to do that -- just look at the virtual addresses. Divide by 4096 to get the virtual page number and compute virtual_address % 4096 to find the offset within the 4KiB page. Then you can pick array indices with whatever relationship you want.
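A minimal sketch of that arithmetic (not from the original posts; 4096 assumes 4KiB pages):

#include <stdio.h>
#include <stdint.h>

int array[ 100 ];   /* _Alignas(4096) here would be one way to force page alignment */

int main(void)
{
    uintptr_t va30 = (uintptr_t)&array[ 30 ];
    uintptr_t va70 = (uintptr_t)&array[ 70 ];
    printf( "array[30]: page 0x%lx, offset %lu\n",
            (unsigned long)(va30 / 4096), (unsigned long)(va30 % 4096) );
    printf( "array[70]: page 0x%lx, offset %lu\n",
            (unsigned long)(va70 / 4096), (unsigned long)(va70 % 4096) );
    return 0;
}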
Yes, that was my fault.
All three reside in the same page, since the bits above index 12 are the same.
So the question comes up again: why is the TSC for array[70] neither an L1 hit nor a memory miss? It seems that something prefetches array[70] upon an access to array[30].
I was able to reproduce your result, but it went away when I fixed the warmup loop in your code (it is optimized away), changed my CPU governor to "performance", and ran the test many times back to back.
In that case, I get results like:
lfence overhead is 28
cache miss1 TSC is 214
cache miss2 (or hit due to prefetching) TSC is 164
cache hit TSC is 194
lfence overhead is 28
cache miss1 TSC is 194
cache miss2 (or hit due to prefetching) TSC is 162
cache hit TSC is 8
lfence overhead is 30
cache miss1 TSC is 186
cache miss2 (or hit due to prefetching) TSC is 166
cache hit TSC is 6
lfence overhead is 26
cache miss1 TSC is 182
cache miss2 (or hit due to prefetching) TSC is 176
cache hit TSC is 8
lfence overhead is 28
cache miss1 TSC is 168
cache miss2 (or hit due to prefetching) TSC is 158
cache hit TSC is 6
lfence overhead is 30
cache miss1 TSC is 176
cache miss2 (or hit due to prefetching) TSC is 160
cache hit TSC is 6
lfence overhead is 28
cache miss1 TSC is 206
cache miss2 (or hit due to prefetching) TSC is 160
cache hit TSC is 8
lfence overhead is 30
cache miss1 TSC is 182
cache miss2 (or hit due to prefetching) TSC is 160
cache hit TSC is 8
lfence overhead is 28
cache miss1 TSC is 190
cache miss2 (or hit due to prefetching) TSC is 174
cache hit TSC is 6
As you can see, the miss1 and miss2 timings are usually very close now, with miss2 usually a few cycles faster, which makes sense: a "page open" type hit will be faster than the first access, which has to open the DRAM page. The TSC ticks at 2.6 GHz on my system, so these ~165-cycle measurements are about 60 ns, which is in line with the memory latency on this system plus a bit of lfence overhead (I've measured the latency in the low 50s of ns on this box).
I had to fix your warmup loop:
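Something along these lines -- a sketch with a volatile sink, not necessarily the exact code:

volatile uint64_t sink = 0;                     /* volatile: the loop cannot be optimized away */
uint64_t i;
for ( i = 0; i < 1000000000ull; i++ )           /* 1e9 iterations */
    sink += i;
printf( "sink = %lu\n", (unsigned long)sink );  /* printing the sink also keeps it live */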
The original version is trivially removed by the compiler. I also upped the iterations to 1e9 from 1e8. This gave the results above; before that, the results were all over the place, with the first miss often taking much longer, presumably because the uncore was running at a lower frequency.
Try running simple_flush in a loop like
for i in {1..99}; do taskset -c 31 ./simple_flush; done
Note that the frequency that I'm talking about is the uncore frequency, which you can't easily see using lscpu and friends. In fact, I'm not sure if there is any easily installable tool that shows it (maybe Intel's PCM tool).
That's how I got fairly stable results. In your most recent results, did you disable all prefetchers? It may not apply to your machine, but on my laptop all prefetchers get turned back on every time it wakes from sleep, so it might be worth checking that the prefetchers are disabled as you expect.
Yes, the prefetchers are disabled. However, I don't see what you saw.
[root@compute-0-6 ~]# cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
...
Setting cpu: 55
[root@compute-0-6 ~]# ./msr-tools-master/rdmsr 0x1a4
f
[root@compute-0-6 ~]# for i in {1..99}; do taskset -c 31 ./simple_flush; done
--------
vAddr array[30] = 0x7ffd21663488
TSC = 241
vAddr array[33] = 0x7ffd21663528
TSC = 11
vAddr array[70] = 0x7ffd21663494
TSC = 20
lfence overhead is 29
sink = 9999999999
--------
vAddr array[30] = 0x7ffe4aaa2a08
TSC = 222
vAddr array[33] = 0x7ffe4aaa2aa8
TSC = 1
vAddr array[70] = 0x7ffe4aaa2a14
TSC = 21
lfence overhead is 34
sink = 9999999999
--------
vAddr array[30] = 0x7fff9a9ebb08
TSC = 362
vAddr array[33] = 0x7fff9a9ebba8
TSC = 3
vAddr array[70] = 0x7fff9a9ebb14
TSC = 69
lfence overhead is 32
sink = 9999999999
--------
vAddr array[30] = 0x7fff7ad0b838
TSC = 308
vAddr array[33] = 0x7fff7ad0b8d8
TSC = 9
vAddr array[70] = 0x7fff7ad0b844
TSC = 153
lfence overhead is 34
sink = 9999999999
--------
vAddr array[30] = 0x7ffee98e7d98
TSC = 204
vAddr array[33] = 0x7ffee98e7e38
TSC = 6
vAddr array[70] = 0x7ffee98e7da4
TSC = 189
lfence overhead is 29
sink = 9999999999
^C
[root@compute-0-6 ~]#
Question: should I run the program on an idle core? The chassis has two CPUs with 56 logical cores in total. Some jobs are running on the chassis, but it is not oversubscribed.
I still haven't resolved the issue. It is really strange that this happens while Travis got smooth numbers.
Can anyone from Intel help?
My recommendations above (https://software.intel.com/en-us/forums/intel-isa-extensions/topic/785240#comment-1926764) for monitoring core and uncore frequencies may be relevant....
By the way, in the code above (https://software.intel.com/en-us/forums/intel-isa-extensions/topic/785240#comment-1926760), the LFENCE operations after the RDTSC calls are unnecessary. The arithmetic calculation (diff = t2-t1;) can't occur until after the RDTSC has completed and returned t2, and the next instruction after the diff calculation is another LFENCE. (Except for the last one, where it probably does not matter since you are finished with the timing at that point.)
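In other words, each measurement block can be trimmed to something like:

_mm_lfence();                          // prior instructions complete
t1 = __rdtsc();                        // set start time
_mm_lfence();                          // keep the load after the first RDTSC
(void) *((volatile int*)array + 30);
_mm_lfence();                          // load completes before the second RDTSC
t2 = __rdtsc();                        // set stop time
diff1 = t2 - t1;                       // trailing LFENCE dropped: the subtraction already
                                       // depends on the RDTSC result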
John,
Reading 0x620 shows
[root@compute-0-6 ~]# ./msr-tools-master/rdmsr 0x620
c1e
In the developer's manual, I didn't see any description of the meaning of these values. The document is huge, so maybe I missed something. I also read Section 2.2.2 of https://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-e7-v4-uncore-performance-monitoring.html but didn't understand it.
The Uncore frequency can be monitored by enabling the UBox fixed counter (write bit 22 to MSR 0x703 U_MSR_FIXED_PMON_CTL), then reading the 48-bit Uncore clock count from MSR 0x704 (U_MSR_FIXED_PMON_CTR).
Do you mean that prior to reading 0x620, I have to write 0000_0000_0000_0000_0000_0000_0000_0000__0000_0000_0100_0000_0000_0000_0000_0000 to 0x703?
Sorry for the confusion -- there are two topics here:
1. Controlling Uncore Frequency: MSR 0x620 is documented for several generations of Intel processors in the various tables of Volume 4 of the Intel Architecture Software Developer's Manual (document 335592-067, May 2018). In the PDF, search for "620H".
Your processor has a minimum uncore frequency ratio of 0xC (12 decimal), corresponding to 1.2 GHz, and a maximum uncore frequency ratio of 0x1e (30 decimal), corresponding to 3.0 GHz. Setting this register to 0x1e1e will force the uncore frequency to stay at 3.0 GHz (unless the processor hits a power or thermal limitation). The uncore frequency has a non-negligible impact on memory latency, so it is something that you want to control in any latency tests. After testing, writing 0x0c1e to the register will return the system to its default state and reduce the idle power consumption.
2. Monitoring Uncore Frequency: Write 0x00400000 to MSR 0x703 to enable the uncore cycle counter. Once you have done this, MSR 0x704 will start incrementing once per uncore cycle. You can use these counts to determine the actual average uncore frequency during an interval. The overhead of going into the kernel to read these counters is many thousands of cycles, so you can't use it directly in your test, but you can use it in longer-running tests to convince yourself that MSR 0x620 actually controls the uncore frequency. If I recall correctly, on at least some processors this counter does not increment while in deep package C states, so you will want to ensure that at least one core remains active during the measurement interval (i.e., use a spin loop or something doing actual work rather than a call to sleep() during the interval between the two reads of MSR 0x704.)
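With the msr-tools commands used earlier in the thread, that sequence would look something like this (values taken from the discussion above; verify them on your own system):
# ./msr-tools-master/wrmsr 0x620 0x1e1e      # pin the uncore ratio: min = max = 0x1e (3.0 GHz)
# ./msr-tools-master/wrmsr 0x703 0x400000    # set bit 22: enable the UBox fixed cycle counter
# ./msr-tools-master/rdmsr 0x704             # read the 48-bit uncore cycle count
# ... keep a core busy for a known wall-clock interval ...
# ./msr-tools-master/rdmsr 0x704             # read again; delta / interval = average uncore GHz
# ./msr-tools-master/wrmsr 0x620 0xc1e       # restore the default min/max ratios afterwards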
----------
Monitoring the core frequency is probably more important, and it can be done using low-overhead RDPMC calls that can be included inline. The fixed-function performance counters can be read using the RDPMC instruction by way of a special counter number. The compiler macro "_rdpmc(int p)" is similar to the "_rdtsc()" macro you are using, but takes an argument for the counter number. The programmable counters are numbered starting from 0, while the fixed-function counters are numbered starting from (1<<30). The three fixed-function counters supported on all recent Intel processors are:
- Fixed-Function Counter 0 (counter number (1<<30)): Instructions Retired
- Fixed-Function Counter 1 (counter number ((1<<30)+1)): Actual Cycles Not Halted
- Fixed-Function Counter 2 (counter number ((1<<30)+2)): Reference Cycles Not Halted
NOTE: Be sure to use parentheses around the (1<<30)! I have frequently forgotten that the "<<" operator has low precedence in C, so (1<<30+1) is very much not the same as ((1<<30)+1). (The default C compiler on my Mac (based on LLVM) gives a warning on the first version by default, while gcc requires an explicit request for an elevated warning level to emit the same warning.)
The overhead of the "_rdpmc(int p)" macro should be similar to the overhead of the "_rdtsc()" macro -- the details depend a bit on the processor generation.
Given before and after values for the TSC and fixed-function counters 1 and 2, you can compute:
- utilization = (double) (ref_cycles_unhalted_after - ref_cycles_unhalted_before) / (double)(tsc_after - tsc_before);
- average_ghz = (double)(actual_cycles_unhalted_after - actual_cycles_unhalted_before) / (double) (ref_cycles_unhalted_after - ref_cycles_unhalted_before) * nominal_ghz; // nominal_ghz is 2.1 on the Xeon E5-2620 v4
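A minimal sketch of those computations, using the __rdtsc()/__rdpmc() intrinsics from gcc's x86intrin.h (this assumes the fixed-function counters are already enabled and that RDPMC is permitted in user space -- CR4.PCE, /sys/devices/cpu/rdpmc on Linux -- otherwise the __rdpmc() calls will fault):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void)
{
    const double nominal_ghz = 2.1;             /* Xeon E5-2620 v4 */

    uint64_t tsc0 = __rdtsc();
    uint64_t act0 = __rdpmc( (1 << 30) + 1 );   /* fixed counter 1: actual cycles not halted */
    uint64_t ref0 = __rdpmc( (1 << 30) + 2 );   /* fixed counter 2: reference cycles not halted */

    /* ... the interval being measured ... */

    uint64_t ref1 = __rdpmc( (1 << 30) + 2 );
    uint64_t act1 = __rdpmc( (1 << 30) + 1 );
    uint64_t tsc1 = __rdtsc();

    double utilization = (double)(ref1 - ref0) / (double)(tsc1 - tsc0);
    double average_ghz = (double)(act1 - act0) / (double)(ref1 - ref0) * nominal_ghz;
    printf( "utilization %.3f, average frequency %.2f GHz\n", utilization, average_ghz );
    return 0;
}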
Utilization should be very close to 1.0 for your tests -- I only include it for reference.
The average frequency is more important. When running a single thread, you want this to be consistently very close to the max Turbo frequency (assuming you have not disabled Turbo).
For both of these computations, the numbers will be "fuzzy" over short intervals because the fixed-function "Reference Cycles Not Halted" counter does not increment continuously -- on your processor it will increment by 21 every 10 ns. This means that differences between reads of this counter will always be divisible by 21, while differences between TSC reads will be closer to a continuous distribution.
All of these performance counter tests make more sense with longer test intervals, but I have found that the behavior of the processors is much more predictable and stable when the core and uncore frequencies are properly controlled.
I included references to the documentation on purpose -- the answer to your first question is there.
Your results are certainly a bit odd, but on an aggressive out-of-order processor, measuring the latency of a single load instruction is not what the timers are designed for.
In any case, your results are already more than good enough to confirm that array[70] is not in the same cache line as array[30], which I thought was the original goal of the testing?
Your results are certainly a bit odd, but on an aggressive out-of-order processor, measuring the latency of a single load instruction is not what the timers are designed for.
I know. But what confuses me is that Travis (post #27) showed that he got more reasonable numbers.
In any case, your results are already more than good enough to confirm that array[70] is not in the same cache line as array[30], which I thought was the original goal of the testing?
Actually yes and no!
The purpose was to disable the prefetcher and see the impact. Actually, writing 0 or 15 to 0x1A4 makes no meaningful difference for array[70].
I think this thread is getting boring. I will test some cases and post a new thread with more specific issues in the upcoming days.
The L1 hardware prefetcher might generate a prefetch for the next line after seeing loads to array[30] and array[33] (note 1), but the L2 hardware prefetcher won't generate prefetches until it sees two fetches to the same 4KiB page. The load of array[70] is the second access to the page, so I expect any L2 HW prefetches to be generated after array[70] is loaded.
Note 1: There is not a lot of documentation for the L1 Hardware Prefetchers, which may have slightly different properties on different processor generations. The L1 streaming prefetcher is not very aggressive (since it is only trying to tolerate the latency of an L1 Miss/L2 Hit), but it is hard to study because most of the hardware performance counters don't distinguish between demand accesses and L1 HW prefetch accesses. For the workloads that I study, the L2 HW prefetchers are much more important, so the studies I have done are focused there....
I have read some papers about time-based cache attacks, and they all rely on hit/miss timing. I know that most of these details are hidden for business reasons. I wonder how they achieve their results!
For example, they usually operate on shared caches. Since the last level is L3, and there is no documentation about disabling/enabling a prefetcher at L3, if someone measures the access time, how on earth can they decide whether the block was accessed by the victim or brought in by an L3 prefetcher or a page prefetcher or ...?
That really bothers me! Any idea?
Intel C. wrote: I have read some papers about time-based cache attacks, and they all rely on hit/miss timing. I know that most of these details are hidden for business reasons. I wonder how they achieve their results!
For example, they usually operate on shared caches. Since the last level is L3, and there is no documentation about disabling/enabling a prefetcher at L3, if someone measures the access time, how on earth can they decide whether the block was accessed by the victim or brought in by an L3 prefetcher or a page prefetcher or ...?
That really bothers me! Any idea?
For one thing, there is no L3 prefetcher: only the L2 and L1 prefetchers (of which only the L2 prefetcher is relevant here). The L2 prefetcher may fetch lines all the way to the L2 or only to the L3 if some conditions are met (a high number of outstanding requests from L2 or something like that).
