Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.

Disabling HW prefetcher

morca
Beginner

Hi

With _mm_clflush(), I flushed an array from all cache levels. Next, I tried to measure two accesses with __rdtsc(). Although the distance between the two accesses is larger than the cache line size (e.g. 80 bytes), the TSC delta for the first access looks like a miss (which is correct), while the TSC delta for the second element looks like a hit (which is wrong).

It seems that the HW stride prefetcher brings in the second element. Is there any way to force the processor not to prefetch?

 

39 Replies
morca
Beginner

Hello guys

It seems that using lfence() and rdtsc() is fine, and since I am not using stores, mfence() is not applicable here. I also modified my code and tried a volatile array.

However, no matter how big the array is, the problem with the HW prefetcher still exists. I flush two lines (those holding array[30] and array[70], which are farther apart than one cache line) and then perform three accesses.

1) access to array[30] => definitely miss

2) access to array[70] => prefetcher enabled => hit and prefetcher disabled => miss

3) access to array[33] => definitely hit

 

The code is
 

    /* create array */
    int array[ 100 ];
    int i;
    for ( i = 0; i < 100; i++ )
        array[ i ] = i;   // bring array to the cache

    for ( i = 0; i < 100000000; i++ ) ;

    uint64_t t1, t2, ov, diff1, diff2, diff3;

    /* flush the two cache lines holding array[30] and array[70] */
    _mm_lfence();
    _mm_clflush( &array[ 30 ] );
    _mm_clflush( &array[ 70 ] );
    _mm_lfence();

    /* READ MISS 1 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    (void) *((volatile int*)array + 30);   // read array[30] => cache miss
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();
    diff1 = t2 - t1;        // two fence statements are overhead

    /* READ MISS 2 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    (void) *((volatile int*)array + 70);      // read array[70] => cache miss (or hit due to prefetching?!)
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();
    diff2 = t2 - t1;        // two fence statements are overhead


    /* READ HIT*/
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    (void) *((volatile int*)array + 33);   // read array[33] => cache hit
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();
    diff3 = t2 - t1;        // two fence statements are overhead

    /* measuring fence overhead */
    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    ov = t2 - t1;

    printf( "lfence overhead is %lu\n", ov );
    printf( "cache miss1 TSC is %lu\n", diff1-ov );
    printf( "cache miss2 (or hit due to prefetching) TSC is %lu\n", diff2-ov );
    printf( "cache hit TSC is %lu\n", diff3-ov );

 

I have also disabled the HW prefetchers with the wrmsr command, as shown below:

# ./msr-tools-master/wrmsr 0x1a4 15
# ./msr-tools-master/rdmsr 0x1a4
f
#
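
For reference, the value written above can be decoded bit by bit. This is only a sketch: it assumes the bit layout Intel has published for MSR 0x1A4 on this class of core (bit 0: L2 streamer, bit 1: L2 adjacent line, bit 2: L1 DCU streamer, bit 3: L1 DCU IP prefetcher, where a set bit disables that prefetcher). The function names are made up for illustration, and the layout should be checked against the documentation for the specific processor.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed layout of MSR 0x1A4: bits 0..3, a SET bit DISABLES the prefetcher. */
#define PF_ALL_MASK 0xFu

/* Returns 1 if all four prefetchers controlled by this MSR are disabled. */
int all_prefetchers_disabled(uint64_t msr_1a4)
{
    return (msr_1a4 & PF_ALL_MASK) == PF_ALL_MASK;
}

/* Print one line per prefetcher for a value obtained with "rdmsr 0x1a4". */
void print_prefetcher_state(uint64_t msr_1a4)
{
    static const char *const names[4] = {
        "L2 streamer", "L2 adjacent line", "L1 DCU streamer", "L1 DCU IP"
    };
    for (int bit = 0; bit < 4; bit++)
        printf("%-18s %s\n", names[bit],
               ((msr_1a4 >> bit) & 1u) ? "disabled" : "enabled");
}
```

Under that assumption, the value 15 (0xf) read back above means all four are disabled. As discussed later in the thread, this register does not cover every prefetch mechanism in the processor.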

Now, when I compile and run it (pinned to one core), I get the following results:

# gcc -Wall -O3 -o simple_flush simple_flush.c
# taskset -c 30 ./simple_flush
lfence overhead is 29
cache miss1 TSC is 279
cache miss2 (or hit due to prefetching) TSC is 209
cache hit TSC is 11
# taskset -c 30 ./simple_flush
lfence overhead is 29
cache miss1 TSC is 362
cache miss2 (or hit due to prefetching) TSC is 175
cache hit TSC is 6
# taskset -c 30 ./simple_flush
lfence overhead is 29
cache miss1 TSC is 466
cache miss2 (or hit due to prefetching) TSC is 166
cache hit TSC is 8

 

As you can see, the first miss (more than 250 TSC ticks) is reasonable for a miss, and 8 ticks is reasonable for a hit. But what is 175? It looks as if array[70] was prefetched into L3. The documentation for MSR 0x1A4 says nothing about L3.

Any idea?

McCalpinJohn
Honored Contributor III

There are a lot of things that can go wrong with this kind of testing....

I would start by adding reads of at least one fixed-function performance counter (fixed-function counter 1, actual unhalted core cycles) alongside each RDTSC call so that you can compute the frequency that the processor is running at in each interval.  The dummy loop at the top might be enough to get the processor to full speed, but it is always a good idea to check.

There is at least one prefetcher in the Xeon E5 v2 and later processors that cannot be disabled using MSR 0x1A4.  It is called the "next page prefetcher", and is almost completely undocumented.  It looks like it exists primarily to touch one cache line in the next (virtual) 4 KiB page and start the Page Miss Handler early if the access does not hit in the TLB.  This is probably not an issue here, but you should print out the virtual addresses of the array locations you are accessing to see if they are falling in the same 4KiB page or different pages.

Hardware prefetching to the L3 cache is done by the L2 hardware prefetcher.  The behavior is dynamic (and so difficult to predict or control), but when the L2 is not very busy, the L2 streamer prefetcher will issue prefetches to L2.  (On Xeon processors before Skylake, the cache line must be placed in the L3 as well, since the L3 requires inclusion of the L1 and L2 caches.)   When the L2 is busier, the L2 hardware prefetcher will change the prefetch type to "prefetch to L3".  I don't know if the L2 adjacent line HW prefetcher has the same policy -- all my testing was with access patterns that are dominated by L2 streamer prefetcher accesses.  In any case, if the L2 HW prefetchers are disabled using MSR 0x1A4, there will be no L2 hardware prefetches to either the L2 or L3.  (This can be confirmed using the programmable performance counter event L2_RQSTS.ALL_PF, Event 0x24, Umask 0xF8.)

In addition to the core frequency, Xeon E5 v3 and newer processors support independent dynamic uncore frequency.   With typical BIOS settings, the uncore frequency will throttle down to the minimum value if there are not many memory requests outstanding, and ramp up to the maximum frequency under heavy loads.  I am not aware of any documentation for the algorithms used, but the dynamic behavior can be disabled by the BIOS or the limits can be modified by changing the values in MSR 0x620 (MSR_UNCORE_RATIO_LIMIT).   It is a very good idea to record the initial values in this register before you modify it, because there is no other MSR that contains the system defaults.  If you forget the correct limits, you need to reboot the system to recover the values.  (I can't recall if the hardware obeys requests for uncore frequencies lower than the default minimum -- it will certainly ignore requests for uncore frequencies higher than the default maximum.)  For short benchmarks and for benchmarks that will generate a low rate of memory accesses (e.g., a pointer-chasing benchmark that has only one outstanding request at any time), I usually just read the register and set the minimum and maximum ratios to match the default maximum.  The Uncore frequency can be monitored by enabling the UBox fixed counter (write bit 22 to MSR 0x703 U_MSR_FIXED_PMON_CTL), then reading the 48-bit Uncore clock count from MSR 0x704 (U_MSR_FIXED_PMON_CTR).  (This is described in Section 2.2.2 of the Xeon E5 v4 Uncore Performance Monitoring Guide, document 334291).

morca
Beginner

There is at least one prefetcher in the Xeon E5 v2 and later processors that cannot be disabled using MSR 0x1A4.  It is called the "next page prefetcher", and is almost completely undocumented.  It looks like it exists primarily to touch one cache line in the next (virtual) 4 KiB page and start the Page Miss Handler early if the access does not hit in the TLB.  This is probably not an issue here, but you should print out the virtual addresses of the array locations you are accessing to see if they are falling in the same 4KiB page or different pages.

 

Focusing on this part... Here is the output with MSR 0x1A4 set to 0xF:

vAddr array[30] = 0x7ffca2d495c8
TSC = 233

vAddr array[33] = 0x7ffca2d49668
TSC = 5

vAddr array[70] = 0x7ffca2d495d4
TSC = 158

Address numbers in binary are

‭01111111111111100101110101110101101111111 0111000‬

‭01111111111111100101110101110101101111111 1000100‬

‭011111111111111001011101011101011 100000001011000‬

It seems that array[30] and array[70] are not in the same page! Is that a confirmation for your statement?

In any case, if the L2 HW prefetchers are disabled using MSR 0x1A4, there will be no L2 hardware prefetches to either the L2 or L3. 

So, my thought about prefetching array[70] to L3 was wrong.

Is it possible to tell the compiler to put the array elements in one page? Since I am creating the array, I know about the sizes.

McCalpinJohn
Honored Contributor III

These numbers don't make any sense.... Looks like you reversed the addresses of array[33] and array[70]

Ignoring the common high-order hex digits, the addresses shown are:

  • array[30] = 0x5c8 = 1480 (decimal)
  • array[33] = 0x668 = 1640 (decimal)
  • array[70] = 0x5d4 = 1492 (decimal)

Swapping the labels of the last two lines gives 12 bytes from array[30] to array[33] and 160 bytes from array[30] to array[70].

There are lots of ways to control alignment, but you don't even need to do that -- just look at the virtual addresses.  Divide by 4096 to get the virtual page number and compute virtual_address % 4096 to find the offset within the 4KiB page.  Then you can pick array indices with whatever relationship you want.
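
As a rough sketch of that arithmetic (assuming 4 KiB pages and 64-byte cache lines; the function names are illustrative, not from any library):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE_BYTES 4096u  /* 4 KiB pages */
#define LINE_SIZE_BYTES 64u    /* assumed cache line size */

/* Virtual page number, offset within the page, and cache line number. */
uint64_t page_number(uint64_t vaddr) { return vaddr / PAGE_SIZE_BYTES; }
uint64_t page_offset(uint64_t vaddr) { return vaddr % PAGE_SIZE_BYTES; }
uint64_t line_number(uint64_t vaddr) { return vaddr / LINE_SIZE_BYTES; }

/* Print where an element lives; pass (uint64_t)(uintptr_t)&array[i]. */
void describe_address(const char *label, uint64_t vaddr)
{
    printf("%s: page 0x%llx, offset %llu, cache line 0x%llx\n",
           label,
           (unsigned long long)page_number(vaddr),
           (unsigned long long)page_offset(vaddr),
           (unsigned long long)line_number(vaddr));
}
```

Applied to the three addresses in the post, all three land in the same 4 KiB page; 0x...5c8 and 0x...5d4 share one 64-byte line, and 0x...668 sits two lines later.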

morca
Beginner

Yes. That was my fault.

All reside in the same page, since the address bits above the low 12 are the same.

So, that raises the question again: why is the TSC of array[70] neither an L1 hit nor a memory miss? It seems that something prefetches array[70] upon an access to array[30].

Travis_D_
New Contributor II

I was able to reproduce your result, but it went away when I fixed the warmup loop in your code (it is optimized away), changed my CPU governor to "performance", and ran the test many times back to back.

In that case, I get results like:

lfence overhead is 28
cache miss1 TSC is 214
cache miss2 (or hit due to prefetching) TSC is 164
cache hit TSC is 194
lfence overhead is 28
cache miss1 TSC is 194
cache miss2 (or hit due to prefetching) TSC is 162
cache hit TSC is 8
lfence overhead is 30
cache miss1 TSC is 186
cache miss2 (or hit due to prefetching) TSC is 166
cache hit TSC is 6
lfence overhead is 26
cache miss1 TSC is 182
cache miss2 (or hit due to prefetching) TSC is 176
cache hit TSC is 8
lfence overhead is 28
cache miss1 TSC is 168
cache miss2 (or hit due to prefetching) TSC is 158
cache hit TSC is 6
lfence overhead is 30
cache miss1 TSC is 176
cache miss2 (or hit due to prefetching) TSC is 160
cache hit TSC is 6
lfence overhead is 28
cache miss1 TSC is 206
cache miss2 (or hit due to prefetching) TSC is 160
cache hit TSC is 8
lfence overhead is 30
cache miss1 TSC is 182
cache miss2 (or hit due to prefetching) TSC is 160
cache hit TSC is 8
lfence overhead is 28
cache miss1 TSC is 190
cache miss2 (or hit due to prefetching) TSC is 174
cache hit TSC is 6

As you can see, the miss1 and miss2 timings are now usually very close, with miss2 usually a few cycles faster. That makes sense, since a DRAM "page open" hit will be faster than the first access, which needs to open the page. The TSC ticks at 2.6 GHz on my system, so these measurements are about 60 ns, which I know is in line with the memory latency on this system plus a bit of lfence overhead (I've measured the latency in the low 50s of ns on this box).
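
The tick-to-time conversion is just a division by the TSC rate. A minimal sketch, with a made-up helper name and the TSC rate passed in as a parameter because it differs between machines:

```c
#include <stdint.h>

/* Convert a TSC tick count to nanoseconds.
   tsc_ghz is the rate at which the TSC ticks, in GHz (2.6 on the box above). */
double tsc_ticks_to_ns(uint64_t ticks, double tsc_ghz)
{
    return (double)ticks / tsc_ghz;
}
```

For example, 160 ticks at 2.6 GHz comes out at roughly 61.5 ns.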

I had to fix your warmup loop:

    volatile int sink;
    for (i = 0; i < 1000000000; i++)
        sink = i;

 

The original version is trivially removed by the compiler. I also upped the iterations from 1e8 to 1e9. This gave the results above. Before that, the results were all over the place, with the first miss often taking much longer, presumably because the uncore was running at a lower frequency.

morca
Beginner

Hi Travis,

Unfortunately, I wasn't able to reproduce your results. I still see a gap between array[30] and array[70]. It seems that the performance governor acts like a balanced power plan.

# cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 4
Setting cpu: 5
Setting cpu: 6
Setting cpu: 7
Setting cpu: 8
Setting cpu: 9
Setting cpu: 10
Setting cpu: 11
Setting cpu: 12
Setting cpu: 13
Setting cpu: 14
Setting cpu: 15
Setting cpu: 16
Setting cpu: 17
Setting cpu: 18
Setting cpu: 19
Setting cpu: 20
Setting cpu: 21
Setting cpu: 22
Setting cpu: 23
Setting cpu: 24
Setting cpu: 25
Setting cpu: 26
Setting cpu: 27
Setting cpu: 28
Setting cpu: 29
Setting cpu: 30
Setting cpu: 31
Setting cpu: 32
Setting cpu: 33
Setting cpu: 34
Setting cpu: 35
Setting cpu: 36
Setting cpu: 37
Setting cpu: 38
Setting cpu: 39
Setting cpu: 40
Setting cpu: 41
Setting cpu: 42
Setting cpu: 43
Setting cpu: 44
Setting cpu: 45
Setting cpu: 46
Setting cpu: 47
Setting cpu: 48
Setting cpu: 49
Setting cpu: 50
Setting cpu: 51
Setting cpu: 52
Setting cpu: 53
Setting cpu: 54
Setting cpu: 55
[root@compute-0-6 ~]# lscpu | grep -E "(Model|CPU MHz)"
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU MHz: 1782.140
[root@compute-0-6 ~]# lscpu | grep -E "(Model|CPU MHz)"
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU MHz: 2461.718
[root@compute-0-6 ~]# lscpu | grep -E "(Model|CPU MHz)"
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU MHz: 2407.003

Though that is not important. I tried powersave to get a constant frequency.

[root@compute-0-6 ~]# lscpu | grep -E "(Model|CPU MHz)"
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU MHz: 1200.042
[root@compute-0-6 ~]# lscpu | grep -E "(Model|CPU MHz)"
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU MHz: 1200.042
[root@compute-0-6 ~]# lscpu | grep -E "(Model|CPU MHz)"
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU MHz: 1201.480
[root@compute-0-6 ~]# lscpu | grep -E "(Model|CPU MHz)"
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU MHz: 1224.300

I used the following code

    volatile unsigned long sink;
    unsigned long j;
    for ( j = 0; j < 10000000000; j++ )
        sink = j;

And it really warms up the cpu!

[root@compute-0-6 ~]# taskset -c 30 ./simple_flush
vAddr array[30] = 0x7ffe72c7a018
TSC = 245

vAddr array[33] = 0x7ffe72c7a0b8
TSC = 6

vAddr array[70] = 0x7ffe72c7a024
TSC = 55
lfence overhead is 34
sink = 9999999999
[root@compute-0-6 ~]# taskset -c 30 ./simple_flush
vAddr array[30] = 0x7ffcc2675a98
TSC = 262

vAddr array[33] = 0x7ffcc2675b38
TSC = 9

vAddr array[70] = 0x7ffcc2675aa4
TSC = 156
lfence overhead is 31
sink = 9999999999
[root@compute-0-6 ~]# taskset -c 30 ./simple_flush
vAddr array[30] = 0x7fff81e91688
TSC = 894

vAddr array[33] = 0x7fff81e91728
TSC = 8

vAddr array[70] = 0x7fff81e91694
TSC = 163
lfence overhead is 29
sink = 9999999999
[root@compute-0-6 ~]# taskset -c 30 ./simple_flush
vAddr array[30] = 0x7ffd6a48f498
TSC = 257

vAddr array[33] = 0x7ffd6a48f538
TSC = 6

vAddr array[70] = 0x7ffd6a48f4a4
TSC = 190
lfence overhead is 31
sink = 9999999999
[root@compute-0-6 ~]#

Travis_D_
New Contributor II

Try running simple_flush in a loop like

for i in {1..99}; do taskset -c 31 ./simple_flush; done

Note that the frequency that I'm talking about is the uncore frequency, which you can't easily see using lscpu and friends. In fact, I'm not sure if there is any easily installable tool that shows it (maybe Intel's PCM tool).

That's how I got fairly stable results. In your most recent results did you disable all prefetchers? It may not apply to your machine but on my laptop all prefetchers get turned back on every time it leaves sleep mode, so it might be worth checking the prefetchers are disabled as you expect.

morca
Beginner

Yes the prefetcher is disabled. However, I don't see what you saw.

 

[root@compute-0-6 ~]# cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 4
Setting cpu: 5
Setting cpu: 6
Setting cpu: 7
Setting cpu: 8
Setting cpu: 9
Setting cpu: 10
Setting cpu: 11
Setting cpu: 12
Setting cpu: 13
Setting cpu: 14
Setting cpu: 15
Setting cpu: 16
Setting cpu: 17
Setting cpu: 18
Setting cpu: 19
Setting cpu: 20
Setting cpu: 21
Setting cpu: 22
Setting cpu: 23
Setting cpu: 24
Setting cpu: 25
Setting cpu: 26
Setting cpu: 27
Setting cpu: 28
Setting cpu: 29
Setting cpu: 30
Setting cpu: 31
Setting cpu: 32
Setting cpu: 33
Setting cpu: 34
Setting cpu: 35
Setting cpu: 36
Setting cpu: 37
Setting cpu: 38
Setting cpu: 39
Setting cpu: 40
Setting cpu: 41
Setting cpu: 42
Setting cpu: 43
Setting cpu: 44
Setting cpu: 45
Setting cpu: 46
Setting cpu: 47
Setting cpu: 48
Setting cpu: 49
Setting cpu: 50
Setting cpu: 51
Setting cpu: 52
Setting cpu: 53
Setting cpu: 54
Setting cpu: 55
[root@compute-0-6 ~]# ./msr-tools-master/rdmsr 0x1a4
f
[root@compute-0-6 ~]# for i in {1..99}; do taskset -c 31 ./simple_flush; done
--------
vAddr array[30] = 0x7ffd21663488
TSC = 241

vAddr array[33] = 0x7ffd21663528
TSC = 11

vAddr array[70] = 0x7ffd21663494
TSC = 20
lfence overhead is 29
sink = 9999999999
--------
vAddr array[30] = 0x7ffe4aaa2a08
TSC = 222

vAddr array[33] = 0x7ffe4aaa2aa8
TSC = 1

vAddr array[70] = 0x7ffe4aaa2a14
TSC = 21
lfence overhead is 34
sink = 9999999999
--------
vAddr array[30] = 0x7fff9a9ebb08
TSC = 362

vAddr array[33] = 0x7fff9a9ebba8
TSC = 3

vAddr array[70] = 0x7fff9a9ebb14
TSC = 69
lfence overhead is 32
sink = 9999999999
--------
vAddr array[30] = 0x7fff7ad0b838
TSC = 308

vAddr array[33] = 0x7fff7ad0b8d8
TSC = 9

vAddr array[70] = 0x7fff7ad0b844
TSC = 153
lfence overhead is 34
sink = 9999999999
--------
vAddr array[30] = 0x7ffee98e7d98
TSC = 204

vAddr array[33] = 0x7ffee98e7e38
TSC = 6

vAddr array[70] = 0x7ffee98e7da4
TSC = 189
lfence overhead is 29
sink = 9999999999
^C
[root@compute-0-6 ~]#

 

 

 

 

Question: Should I run the program on an idle core? The chassis has two CPUs with a total of 56 logical cores. Some jobs are running on the chassis, but it is not oversubscribed.

morca
Beginner

I still haven't resolved the issue. It is really strange that this happens while Travis got smooth numbers.

Can anyone from Intel help?

McCalpinJohn
Honored Contributor III

My recommendations above (https://software.intel.com/en-us/forums/intel-isa-extensions/topic/785240#comment-1926764) for monitoring core and uncore frequencies may be relevant....

By the way, in the code above (https://software.intel.com/en-us/forums/intel-isa-extensions/topic/785240#comment-1926760), the LFENCE operations after the RDTSC calls are unnecessary.   The arithmetic calculation (diff = t2-t1;) can't occur until after the RDTSC has completed and returned t2, and the next instruction after the diff calculation is another LFENCE.   (Except for the last one, where it probably does not matter since you are finished with the timing at that point.)
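
One way to read that remark is sketched below: a hypothetical helper (not the original poster's code) that keeps the fences around the timed load but drops the one after the second RDTSC. It assumes GCC or Clang on x86, where `<x86intrin.h>` provides `__rdtsc` and `_mm_lfence`.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang, x86 only) */

/* Time a single load in TSC ticks. The fence overhead is included in the
   result and has to be measured and subtracted separately. */
uint64_t time_load(const volatile int *p)
{
    uint64_t t1, t2;

    _mm_lfence();      /* earlier instructions finish before the first RDTSC */
    t1 = __rdtsc();
    _mm_lfence();      /* the first RDTSC finishes before the load starts */
    (void)*p;          /* the load being timed (volatile, so it is not elided) */
    _mm_lfence();      /* the load finishes before the second RDTSC */
    t2 = __rdtsc();
    /* No LFENCE here: t2 - t1 cannot be computed before RDTSC returns t2,
       and the next call to this helper begins with an LFENCE anyway. */
    return t2 - t1;
}
```

A call such as `time_load((volatile int *)&array[30])` would replace each hand-written measurement sequence.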

morca
Beginner

John,

Reading 0x620 shows

 

[root@compute-0-6 ~]# ./msr-tools-master/rdmsr 0x620
c1e

In the developer's manual, I didn't see any description about the meaning of the values. The document is so huge and maybe I missed something. I also read section 2.2.2. of https://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-e7-v4-uncore-performance-monitoring.html but didn't understand that.

The Uncore frequency can be monitored by enabling the UBox fixed counter (write bit 22 to MSR 0x703 U_MSR_FIXED_PMON_CTL), then reading the 48-bit Uncore clock count from MSR 0x704 (U_MSR_FIXED_PMON_CTR).

Do you mean that prior to reading 0x620, I have to write 0000_0000_0000_0000_0000_0000_0000_0000__0000_0000_0100_0000_0000_0000_0000_0000 to 0x703?

 

McCalpinJohn
Honored Contributor III

Sorry for the confusion -- there are two topics here:

1. Controlling Uncore Frequency:   MSR 0x620 is documented for several generations of Intel processors in the various tables of Volume 4 of the Intel Architecture Software Developer's Manual (document 335592-067, May 2018).   In the PDF, search for "620H".

Your processor has a minimum uncore frequency ratio of 0xC (12 decimal), corresponding to 1.2 GHz, and a maximum uncore frequency ratio of 0x1e (30 decimal), corresponding to 3.0 GHz.  Setting this register to 0x1e1e will force the uncore frequency to stay at 3.0 GHz (unless the processor hits a power or thermal limitation).  The uncore frequency has a non-negligible impact on memory latency, so it is something that you want to control in any latency tests.   After testing, writing 0x0c1e to the register will return the system to its default state and reduce the idle power consumption.

2. Monitoring Uncore Frequency:  Write 0x00400000 to MSR 0x703 to enable the uncore cycle counter.   Once you have done this, MSR 0x704 will start incrementing once per uncore cycle.  You can use these counts to determine the actual average uncore frequency during an interval.  The overhead of going into the kernel to read these counters is many thousands of cycles, so you can't use it directly in your test, but you can use it in longer-running tests to convince yourself that MSR 0x620 actually controls the uncore frequency.  If I recall correctly, on at least some processors this counter does not increment while in deep package C states, so you will want to ensure that at least one core remains active during the measurement interval (i.e., use a spin loop or something doing actual work rather than a call to sleep() during the interval between the two reads of MSR 0x704.)
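
The arithmetic for the monitoring step can be sketched as follows. The function and macro names are hypothetical, reading the MSRs themselves (for example with rdmsr from msr-tools) is not shown, and the elapsed time is assumed to come from a separate wall-clock measurement.

```c
#include <stdint.h>

/* Enable value for MSR 0x703: bit 22 set, i.e. 0x00400000. */
#define UBOX_FIXED_CTR_ENABLE (1u << 22)

/* Average uncore frequency in GHz from two reads of the 48-bit counter in
   MSR 0x704 taken 'seconds' apart. Masking to 48 bits tolerates one wrap. */
double uncore_avg_ghz(uint64_t count_before, uint64_t count_after, double seconds)
{
    uint64_t delta = (count_after - count_before) & ((1ULL << 48) - 1);
    return (double)delta / seconds / 1e9;
}
```

With the uncore pinned at ratio 30, two reads taken one second apart should differ by about 3.0e9 counts.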

----------

Monitoring the core frequency is probably more important, and it can be done using low-overhead RDPMC calls that can be included inline.  The fixed-function performance counters can be read using the RDPMC instruction by way of a special counter number.     The compiler macro "_rdpmc(int p)" is similar to the "_rdtsc()" macro you are using, but takes an argument for the counter number.   The programmable counters are numbered starting from 0, while the fixed-function counters are numbered starting from (1<<30).   The three fixed-function counters supported on all recent Intel processors are:

  • Fixed-Function Counter 0 (counter number (1<<30)):  Instructions Retired
  • Fixed-Function Counter 1 (counter number ((1<<30)+1)):  Actual Cycles Not Halted
  • Fixed-Function Counter 2 (counter number ((1<<30)+2)):  Reference Cycles Not Halted

NOTE: Be sure to use parentheses around the (1<<30)!   I have frequently forgotten that the "<<" operator has a lower precedence than "+" in C, so (1<<30+1) is very much not the same as ((1<<30)+1).  (The default C compiler on my Mac (based on LLVM) gives a warning on the first version by default, while gcc requires an explicit request for an elevated level of warnings to produce the same warning.)
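
One way to avoid the trap is to hide the shift behind a macro. The macro names below are made up for illustration; only the counter numbers come from the list above.

```c
/* RDPMC counter numbers for the fixed-function counters. */
#define FIXED_CTR_BASE       (1u << 30)
#define FIXED_CTR(n)         (FIXED_CTR_BASE + (unsigned)(n))

#define CTR_INST_RETIRED     FIXED_CTR(0)   /* Instructions Retired */
#define CTR_CYCLES_UNHALTED  FIXED_CTR(1)   /* Actual Cycles Not Halted */
#define CTR_REF_CYCLES       FIXED_CTR(2)   /* Reference Cycles Not Halted */

/* What the unparenthesized 1u << 30 + 1 actually means: '+' binds tighter
   than '<<', so it parses as 1u << (30 + 1). The parentheses are written
   out here only so the wrong value can be shown without a compiler warning. */
#define WRONG_FIXED_CTR_1    (1u << (30 + 1))
```

Here CTR_CYCLES_UNHALTED is 0x40000001, while the mis-parenthesized form evaluates to 0x80000000.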

The overhead of the "_rdpmc(int p)" macro should be similar to the overhead of the "_rdtsc()" macro -- the details depend a bit on the processor generation.

Given before and after values for the TSC and fixed-function counters 1 and 2, you can compute:

  • utilization = (double) (ref_cycles_unhalted_after - ref_cycles_unhalted_before) / (double)(tsc_after - tsc_before);
  • average_ghz = (double)(actual_cycles_unhalted_after - actual_cycles_unhalted_before) / (double) (ref_cycles_unhalted_after - ref_cycles_unhalted_before) * nominal_ghz;    // nominal_ghz is 2.1 on the Xeon E5-2620 v4

Utilization should be very close to 1.0 for your tests -- I only include it for reference.

The average frequency is more important.  When running a single thread, you want this to be consistently very close to the max Turbo frequency (assuming you have not disabled Turbo).
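
The two formulas can be wrapped up as below. This is a sketch: the struct and function names are invented, and collecting the readings (RDTSC plus RDPMC on fixed-function counters 1 and 2, which requires that the kernel allows user-mode RDPMC) is left out.

```c
#include <stdint.h>

/* One set of readings: the TSC and fixed-function counters 1 and 2. */
struct sample {
    uint64_t tsc;
    uint64_t actual_cycles;   /* fixed counter 1: Actual Cycles Not Halted */
    uint64_t ref_cycles;      /* fixed counter 2: Reference Cycles Not Halted */
};

/* Fraction of the interval during which the core was not halted. */
double utilization(const struct sample *before, const struct sample *after)
{
    return (double)(after->ref_cycles - before->ref_cycles) /
           (double)(after->tsc - before->tsc);
}

/* Average core frequency in GHz while not halted;
   nominal_ghz is the processor's base frequency. */
double average_ghz(const struct sample *before, const struct sample *after,
                   double nominal_ghz)
{
    return (double)(after->actual_cycles - before->actual_cycles) /
           (double)(after->ref_cycles - before->ref_cycles) * nominal_ghz;
}
```

As a made-up example, an interval with 2,100,000 TSC ticks, 2,100,000 reference cycles and 3,000,000 actual cycles on a 2.1 GHz part gives a utilization of 1.0 and an average of 3.0 GHz.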

For both of these computations, the numbers will be "fuzzy" over short intervals because the fixed-function "Reference Cycles Not Halted" counter does not increment continuously -- on your processor it will increment by 21 every 10 ns.  This means that differences between reads of this counter will always be divisible by 21, while differences between TSC reads will be closer to a continuous distribution.

All of these performance counter tests make more sense with longer test intervals, but I have found that the behavior of the processors is much more predictable and stable when the core and uncore frequencies are properly controlled.

morca
Beginner

Let's start with the following part of your reply:

Your processor has a minimum uncore frequency ratio of 0xC (12 decimal), corresponding to 1.2 GHz, and a maximum uncore frequency ratio of 0x1e (30 decimal), corresponding to 3.0 GHz. Setting this register to 0x1e1e will force the uncore frequency to stay at 3.0 GHz (unless the processor hits a power or thermal limitation). The uncore frequency has a non-negligible impact on memory latency, so it is something that you want to control in any latency tests. After testing, writing 0x0c1e to the register will return the system to its default state and reduce the idle power consumption.

The first thing I want to know is why I should write 0x1e1e when 0x1e is 30. What is the purpose of the doubled 30? Is it because of the two CPU sockets?

I did that anyway, and I still get the same results as before.

[root@compute-0-6 ~]# ./msr-tools-master/rdmsr 0x620
c1e
[root@compute-0-6 ~]# ./msr-tools-master/wrmsr 0x620 0x1e1e
[root@compute-0-6 ~]# ./msr-tools-master/rdmsr 0x620
1e1e
[root@compute-0-6 ~]# lscpu | grep -E "(Model|CPU MHz)"
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU MHz: 2799.980
[root@compute-0-6 ~]# lscpu | grep -E "(Model|CPU MHz)"
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU MHz: 2799.980
[root@compute-0-6 ~]# lscpu | grep -E "(Model|CPU MHz)"
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU MHz: 2799.980
[root@compute-0-6 ~]# gcc -Wall -O3 -o simple_flush simple_flush.c
[root@compute-0-6 ~]# ./simple_flush^C
[root@compute-0-6 ~]# for i in {1..10}; do taskset -c 31 ./simple_flush; done
--------
vAddr array[30] = 0x7fffa8b30438
TSC = 368

vAddr array[33] = 0x7fffa8b304d8
TSC = 210

vAddr array[70] = 0x7fffa8b30444
TSC = 221
lfence overhead is 32
sink = 9999999999
--------
vAddr array[30] = 0x7ffc46d0c108
TSC = 327

vAddr array[33] = 0x7ffc46d0c1a8
TSC = 11

vAddr array[70] = 0x7ffc46d0c114
TSC = 192
lfence overhead is 35
sink = 9999999999
--------
vAddr array[30] = 0x7ffd3dbdf758
TSC = 293

vAddr array[33] = 0x7ffd3dbdf7f8
TSC = 14

vAddr array[70] = 0x7ffd3dbdf764
TSC = 161
lfence overhead is 35
sink = 9999999999
--------
vAddr array[30] = 0x7ffe2cf72e28
TSC = 319

vAddr array[33] = 0x7ffe2cf72ec8
TSC = 5

vAddr array[70] = 0x7ffe2cf72e34
TSC = 164
lfence overhead is 35
sink = 9999999999
--------
vAddr array[30] = 0x7fff2e24a0d8
TSC = 299

vAddr array[33] = 0x7fff2e24a178
TSC = 6

vAddr array[70] = 0x7fff2e24a0e4
TSC = 176
lfence overhead is 31
sink = 9999999999
^C
[root@compute-0-6 ~]#

As you can see, array[70] is still odd.

McCalpinJohn
Honored Contributor III

I included references to the documentation on purpose -- the answer to your first question is there.
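
For readers without the manual at hand, here is a sketch of how the value can be split. It assumes the layout given for MSR 620H in the tables for this processor generation (bits 6:0 hold the maximum ratio, bits 14:8 the minimum ratio) and 100 MHz per ratio step, which matches the 0xC to 1.2 GHz mapping quoted above. The function names are invented, and any other bits in the register are assumed to be zero.

```c
#include <stdint.h>

#define UNCORE_GHZ_PER_RATIO 0.1   /* assumed: 100 MHz per ratio step */

/* Assumed layout of MSR 0x620: bits 6:0 = max ratio, bits 14:8 = min ratio. */
unsigned uncore_max_ratio(uint64_t msr_620)
{
    return (unsigned)(msr_620 & 0x7f);
}

unsigned uncore_min_ratio(uint64_t msr_620)
{
    return (unsigned)((msr_620 >> 8) & 0x7f);
}

/* Value that sets the minimum ratio equal to the default maximum,
   pinning the uncore frequency (0x0c1e becomes 0x1e1e). */
uint64_t uncore_pin_to_max(uint64_t msr_620)
{
    uint64_t max = msr_620 & 0x7f;
    return (max << 8) | max;
}
```

Under these assumptions 0x0c1e decodes to a minimum ratio of 12 and a maximum of 30, so the two bytes of 0x1e1e are the two limits of one register, not one value per socket.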

Your results are certainly a bit odd, but on an aggressive out-of-order processor, measuring the latency of a single load instruction is not what the timers are designed for.

In any case, your results are already more than good enough to confirm that array[70] is not in the same cache line as array[30], which I thought was the original goal of the testing?

morca
Beginner

Your results are certainly a bit odd, but on an aggressive out-of-order processor, measuring the latency of a single load instruction is not what the timers are designed for.

I know. But what confuses me is that Travis (post #27) showed that he got more reasonable numbers.

In any case, your results are already more than good enough to confirm that array[70] is not in the same cache line as array[30], which I thought was the original goal of the testing?

Actually yes and no!

The purpose was to disable the prefetcher to see the impact. Actually, writing 0 or 15 to 0x1a4 makes no meaningful difference for array[70].

 

I think this thread is getting boring. I will test some cases and post a new thread with more specific issues in the upcoming days.

McCalpinJohn
Honored Contributor III

The L1 hardware prefetcher might generate a prefetch for the next line after seeing loads to array[30] and array[33] (note 1), but the L2 hardware prefetcher won't generate prefetches until it sees two fetches to the same 4KiB page.  The load to array[70] is the second access to the page, so I expect any L2 HW prefetches to be generated after array[70] is loaded.

Note 1: There is not a lot of documentation for the L1 Hardware Prefetchers, which may have slightly different properties on different processor generations.  The L1 streaming prefetcher is not very aggressive (since it is only trying to tolerate the latency of an L1 Miss/L2 Hit), but it is hard to study because most of the hardware performance counters don't distinguish between demand accesses and L1 HW prefetch accesses.  For the workloads that I study, the L2 HW prefetchers are much more important, so the studies I have done are focused there....

morca
Beginner

I have read some papers about timing-based cache attacks, and they all rely on hit/miss time. I know that most such features are undocumented for marketing reasons. I wonder how they achieve their results!

For example, they usually operate on shared caches. Given that the last level is L3, there is no documentation about disabling/enabling a prefetcher at L3. So, if someone measures the access time, how on earth can they decide whether the block was accessed by the victim or was brought in by an L3 prefetcher, a page prefetcher, or something else?

That really bothers me! Any idea?

Travis_D_
New Contributor II

Intel C. wrote:

I have read some papers about timing-based cache attacks, and they all rely on hit/miss time. I know that most such features are undocumented for marketing reasons. I wonder how they achieve their results!

For example, they usually operate on shared caches. Given that the last level is L3, there is no documentation about disabling/enabling a prefetcher at L3. So, if someone measures the access time, how on earth can they decide whether the block was accessed by the victim or was brought in by an L3 prefetcher, a page prefetcher, or something else?

That really bothers me! Any idea?

For one thing, there is no L3 prefetcher: only the L2 and L1 prefetchers (of which only the L2 prefetcher is relevant here). The L2 prefetcher may fetch lines all the way to the L2 or only to the L3 if some conditions are met (a high number of outstanding requests from L2 or something like that).

 
