Disabling HW prefetcher

morca · ‎08-25-2018

Hi

With _mm_clflush(), I flushed an array from all cache levels. Next, I tried to measure two accesses with __rdtsc(). Although I know the distance between the two accesses is larger than the cache line size (e.g. 80 bytes apart), the TSC delta for the first access looks like a miss (which is correct), while the TSC delta for the second element looks like a hit (which is wrong).

It seems that the HW stride prefetcher brings in the second element. Is there any way to force the processor not to prefetch?

TimP · ‎08-26-2018

If you can find hints about the use of MSR setting, you should be able (with full privilege) to control the various prefetchers independently. It sounds like for your purpose it may be sufficient to double the distance between memory access so as to be in separate cache line pairs.

morca · ‎08-26-2018

Yes, I can increase the distance. However, I don't want to do that.

I am curious to know more about MSR. What do you mean by privilege? root account in linux?

How can I control MSR?

morca · ‎08-27-2018

With msr-tools I want to control the Intel prefetchers' operation. The register according to [1] is 0x1a4. The problem is that wrmsr has no effect!

# modprobe msr
# rdmsr -p0 0x1a4
0
# wrmsr -p0 0x1a4 1
# rdmsr -p0 0x1a4
0
#

CPU is reported as


# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           2
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping:            1
CPU MHz:             2097.571
BogoMIPS:            4195.14
Hypervisor vendor:   VMware
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx hypervisor lahf_lm 3dnowprefetch epb pti dtherm ida arat pln pts

Any thought?

[1] https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors

McCalpinJohn · ‎08-27-2018

The Hypervisor is probably intercepting the MSR writes and preventing them from taking effect.

This should work as desired on "bare metal".

morca · ‎08-28-2018

Yes you are right. I verified that.

Moreover, the Intel document about the HW prefetchers [1] seems to be old, because it has no information about the L3 cache. Also, bit 3 is said to be reserved in the manual (volume 4, table 2-10), while in the document it controls the DCU IP prefetcher.

[1] https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
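For readers following the rdmsr/wrmsr output in this thread, here is a minimal sketch that decodes a value read from MSR 0x1a4, assuming the bit layout given in the Intel article [1] (a set bit disables the corresponding prefetcher). The function name describe_prefetch_msr is made up for illustration, and the layout should be checked against the documentation for the exact CPU model.

```c
#include <stdint.h>
#include <stdio.h>

/* Bits 0..3 of MSR 0x1a4 as described in the Intel article [1].
 * Assumption: a set bit DISABLES that prefetcher on the CPU in question. */
static const char *const prefetcher_names[4] = {
    "L2 hardware prefetcher",            /* bit 0 */
    "L2 adjacent cache line prefetcher", /* bit 1 */
    "DCU (L1 data) prefetcher",          /* bit 2 */
    "DCU IP prefetcher"                  /* bit 3 */
};

/* Hypothetical helper: prints the state of each prefetcher and
 * returns how many of the four are disabled. */
int describe_prefetch_msr(uint64_t value)
{
    int disabled = 0;
    for (int bit = 0; bit < 4; bit++) {
        int off = (int)((value >> bit) & 1u);
        printf("bit %d  %-34s %s\n", bit, prefetcher_names[bit],
               off ? "disabled" : "enabled");
        disabled += off;
    }
    return disabled;
}
```

Under that layout, a value of 0xf means all four prefetchers are disabled, and 0 means all are enabled.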

morca · ‎08-28-2018

I have written the following code in order to measure the line size. I create an array and then flush the first element from the cache. Then I measure the time to read the first element using __rdtsc(). Since each element is 4 bytes, the distance between array[0] and array[20] is 80 bytes. I am pretty sure that they don't reside in the same cache line.

int array[ 100 ];
int i;
for ( i = 0; i < 100; i++ )
  array[ i ] = i;                  // bring array to the cache
uint64_t t1, t2, ov, diff1, diff2;
_mm_lfence();
_mm_clflush( &array[ 0 ] );
_mm_lfence();
_mm_lfence();                      // fence to keep load order
t1 = __rdtsc();                    // set start time
_mm_lfence();
int tmp = array[ 0 ];              // read the first element => cache miss
_mm_lfence();
t2 = __rdtsc();                    // set stop time
_mm_lfence();
diff1 = t2 - t1;
printf( "tmp is %d\ndiff1 is %lu\n", tmp, diff1 );
_mm_lfence();
t1 = __rdtsc();
int tmp2 = array[ 20 ];
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
diff2 = t2 - t1;
printf( "tmp2 is %d\ndiff2 is %lu\n", tmp2, diff2 );
_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
ov = t2 - t1;
printf( "lfence overhead is %lu\n", ov );
printf( "TSC1 is %lu\n", diff1-ov );
printf( "TSC2 is %lu\n", diff2-ov );

Next, I disabled the prefetchers with wrmsr and then I saw some weird results.

[root@compute-0-6 ~]# ./msr-tools-master/rdmsr 0x1a4
0
[root@compute-0-6 ~]# ./msr-tools-master/rdmsr -p0 0x1a4
f
[root@compute-0-6 ~]# ./msr-tools-master/wrmsr -p0 0x1a4 15
[root@compute-0-6 ~]# ./msr-tools-master/rdmsr -p0 0x1a4
f
[root@compute-0-6 ~]# ./simple_flush1
tmp is 0
diff1 is 771
tmp2 is 20
diff2 is 64
lfence overhead is 64
TSC1 is 707
TSC2 is 0
[root@compute-0-6 ~]# ./simple_flush1
tmp is 0
diff1 is 760
tmp2 is 20
diff2 is 52
lfence overhead is 68
TSC1 is 692
TSC2 is 18446744073709551600
[root@compute-0-6 ~]# ./simple_flush1
tmp is 0
diff1 is 660
tmp2 is 20
diff2 is 62
lfence overhead is 69
TSC1 is 591
TSC2 is 18446744073709551609
[root@compute-0-6 ~]#

Any guess?
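The very large TSC2 values are consistent with plain unsigned wraparound: diff2 and ov are both uint64_t, so when the measured overhead (68) exceeds diff2 (52), the subtraction wraps modulo 2^64. A minimal sketch of the arithmetic follows; the helper names are hypothetical and not part of the posted program.

```c
#include <stdint.h>

/* What the posted printf computes: unsigned subtraction modulo 2^64. */
uint64_t delta_unsigned(uint64_t measured, uint64_t overhead)
{
    return measured - overhead;
}

/* Hypothetical alternative: a signed difference, so a measurement that is
 * smaller than the overhead shows up as a small negative number. */
int64_t delta_signed(uint64_t measured, uint64_t overhead)
{
    return (int64_t)measured - (int64_t)overhead;
}
```

With the numbers from the second run, delta_unsigned(52, 68) is 18446744073709551600 (that is, 2^64 - 16), while delta_signed(52, 68) is -16. Printing the signed value with %ld makes it easier to see that the second access is simply indistinguishable from the fence overhead at this resolution.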

James_C_Intel2 · ‎08-28-2018

>I have written the following code in order to measure the line size.

Why? It is documented in the fine manual, as is how to use CPUID (with EAX = 01H) to read it from the processor you are running on, if you are paranoid and think it will change. Alternatively, Google also knows if you ask it.
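As a rough illustration of the CPUID route, here is a sketch that assumes GCC or Clang on x86 (for the <cpuid.h> helper __get_cpuid). CPUID leaf 01H reports the CLFLUSH line size in EBX bits 15:8, in units of 8 bytes. The function name clflush_line_size is made up for this example.

```c
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

/* Hypothetical helper: returns the CLFLUSH line size in bytes,
 * or 0 if CPUID leaf 1 is not available. */
unsigned clflush_line_size(void)
{
    unsigned eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    /* EBX[15:8] holds the line size counted in 8-byte units. */
    return ((ebx >> 8) & 0xffu) * 8u;
}
```

On current Intel parts one would expect this to report 64, which already settles that array[0] and array[20] (80 bytes apart) cannot share a line.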

McCalpinJohn · ‎08-28-2018

FYI, the "lfence" operator will have no impact on the ordering of the execution of the RDTSC instruction. It typically executes in program order, but can execute before the completion of a preceding long-latency load or mispredicted branch.

As an alternative, the RDTSCP instruction will wait to execute until all prior instructions have executed. This means that RDTSCP will not execute until after preceding long-latency loads or mispredicted branches have executed. So it can't execute early, but there is no way to prevent subsequent instructions from executing before the RDTSCP. (They typically don't, but the architecture makes no guarantees.)

Another alternative approach to ordering is to use RDPMC instead (after programming one of the performance counters to measure either actual cycles not halted or reference cycles not halted). The RDPMC instruction has an input argument (the counter number), and this can be used to force a dependency between the execution of prior instructions and the execution of the RDPMC. For example, if you are loading a value from memory, you can use that value in a simple formula to create the counter number -- this will force the RDPMC instruction to wait until after the load has completed. Some cleverness is required to come up with a formula that does not depend on specific values of the data being loaded. One way that should work for all data is to pre-load a GPR with zero, then perform a logical AND of the data that you are waiting on with the zeroed GPR, then add whatever counter number you want. Don't use an immediate operand of zero for the AND operation -- the hardware may notice this idiom and eliminate the operation.

morca · ‎08-28-2018

>FYI, the "lfence" operator will have no impact on the ordering of the execution of the RDTSC instruction. It typically executes in program order, but can execute before the completion of a preceding long-latency load or mispredicted branch.

What about using mfence? It seems that replacing lfence with mfence in the code and leaving other parts intact, will do what you say.

>As an alternative, the RDTSCP instruction will wait to execute until all prior instructions have executed.

And that is not suitable for stores. Am I right?

Back to the question: I still want to finish the code for my own purposes, to learn something. I want to evaluate different prefetcher methods for data structures. So, for a simple case, I have an array and want to first check and measure the latencies of array[0] and array[20].

Also, my previous post has not been answered. I would appreciate some tips to help me understand.

McCalpinJohn · ‎08-29-2018

The RDTSC instruction cannot be ordered by anything short of a serializing instruction, and there are not many of those available in user mode. CPUID is the preferred serializing instruction in user mode, but it has a very high latency. (I seem to recall measuring an overhead of >200 cycles on one of my systems, while Agner Fog's "instruction_tables.pdf" reports an overhead of 100-250 cycles on most Intel processors.)

RDTSCP will not execute until all prior stores have executed, but if you want to defer execution until it is guaranteed that the results of the store have become visible, additional serialization is needed.

morca · ‎08-30-2018

John, I understand what you say, but I would like to know whether the previous code is technically wrong, or right but inefficient. Let me state it another way. Assume that I want to measure the latency of

int tmp = array[0];

What I wrote is

_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
int tmp = array[0];
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();

You say that it is possible that, in the pipeline, the execution of t2 may complete before the execution of int tmp... Am I right? Then that would technically be a wrong measurement. At the time I was writing the code, I thought that between t1 and t2 there are two lfences and a memory read, so I have to subtract the two lfences since they are overhead. It seems you are saying that the code should be

t1 = __rdtscp();
int tmp = array[0];
t2 = __rdtscp();

Is that right?

jimdempseyatthecove · ‎08-30-2018

>> Assume that I want to measure the latency of int tmp = array[0];

I assume from the effort you are putting in that you want the read-from-RAM latency, as opposed to the latency from some cache (or prefetch).

Suggestion:

Create two static arrays (iow not allocated from heap).
Each array size is to be much larger than total cache capacity (e.g. 2x).
(Note, total size must fit in physical memory)
Run a loop to initialize each array.
Run a loop a few times to read the first array (make sure the compiler does not optimize out the code)
Now then time the reading of specific cells of the second array...
... using constant values for array indexes...
... and with a separation of larger than page size

*** Additional note: to generate the worst-case latency, you will need to ensure that the array sizes are each large enough to exceed the capacity of the TLB (Translation Lookaside Buffer).

Repeat the test a few times, take worst case where it appears the O/S wasn't interfering with your test.

Bear in mind that the worst case test will incur the overhead of reading the page table entry(s) plus the overhead of the RAM read.

Jim Dempsey
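The procedure above could be sketched roughly as follows. This is an illustration under stated assumptions, not Jim's code: the 64 MiB array size is assumed to be comfortably more than 2x the 20 MiB L3 reported earlier in the thread, the 12 KiB stride assumes 4 KiB pages, the index is computed in a loop rather than written as literal constants, and the names evict, probe and worst_case_load_cycles are hypothetical. It uses lfence-fenced __rdtsc() reads around a volatile load.

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>

#define ARRAY_BYTES (64u * 1024u * 1024u)        /* assumed > 2x total cache */
#define ARRAY_INTS  (ARRAY_BYTES / sizeof(int))
#define STRIDE_INTS (3u * 4096u / sizeof(int))   /* 12 KiB: more than one 4 KiB page */

static int evict[ARRAY_INTS];   /* read repeatedly to push probe[] out of cache */
static int probe[ARRAY_INTS];   /* the array whose loads are timed */

/* Time a single load, with a fenced TSC read on each side. */
static uint64_t timed_load(const volatile int *p)
{
    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_lfence();
    int v = *p;                 /* volatile access, so a load is emitted */
    _mm_lfence();
    uint64_t stop = __rdtsc();
    _mm_lfence();
    (void)v;
    return stop - start;
}

/* Hypothetical driver: returns the largest cycle count seen over `samples`
 * loads from probe[], each more than a page away from the previous one. */
uint64_t worst_case_load_cycles(unsigned samples)
{
    volatile uint64_t sink = 0; /* keeps the eviction loop from being removed */
    uint64_t worst = 0;

    for (size_t i = 0; i < ARRAY_INTS; i++) {   /* initialize both arrays */
        evict[i] = (int)i;
        probe[i] = (int)i;
    }
    for (int pass = 0; pass < 3; pass++) {      /* read the first array a few times */
        uint64_t sum = 0;
        for (size_t i = 0; i < ARRAY_INTS; i++)
            sum += (uint64_t)evict[i];
        sink += sum;
    }
    for (unsigned s = 0; s < samples; s++) {    /* timed loads from the second array */
        size_t idx = ((size_t)s * STRIDE_INTS) % ARRAY_INTS;
        uint64_t t = timed_load(&probe[idx]);
        if (t > worst)
            worst = t;
    }
    return worst;
}
```

The returned maximum includes TLB-miss and page-walk cost as well as any interrupt that happened to land in a timed region, so it is worth repeating the call and looking at the distribution rather than trusting a single number.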

McCalpinJohn · ‎08-30-2018

Using RDTSCP instructions will provide ordering control that is closer to what you are looking for, and the LFENCE instructions only add overhead, not control.

There are still some fundamental problems here.

(1) A statement like

int tmp=array[0];

may not actually correspond to executable instructions (unless optimization is completely disabled).

A compiler with good aliasing analysis can move the assignment upstream or downstream, or may replace the next use of tmp with a reference to array[0] (which may already be in a register), or may replace the next use of tmp with a reference to whatever source was used in the most recent write to array[0] (which may have been a constant, which may allow the compiler to eliminate the assignment entirely), or which may allow the hardware to eliminate the instruction at the register allocation stage.

Careful inspection of the generated assembly code is required in this case. You may need to fiddle with optimization levels, the "volatile" keyword, or inline assembly code to get exactly what you want.
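One common way to apply the "volatile" suggestion, shown here only as a sketch: read through a volatile-qualified pointer so that the compiler must emit an actual load at that point instead of reusing a register or a known constant. The helper name force_load is hypothetical.

```c
/* Hypothetical helper: the volatile-qualified access obliges the compiler
 * to perform a real memory read here, even at -O2 or -O3. */
int force_load(const int *p)
{
    return *(const volatile int *)p;
}
```

For example, int tmp = force_load(&array[0]); in place of int tmp = array[0];. This only constrains the compiler; it says nothing about how the hardware orders the load relative to RDTSC, and the generated assembly still deserves a look.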

(2) Even if the statement is compiled as a load instruction from memory to a register, the overhead of the measurement is large compared to the execution time of the operation. In addition, all of the instructions that read the TSC or performance counters are microcoded, so they will interfere with the pipelining of the execution of surrounding instructions in ways that are difficult to predict or understand.

As a general rule, you probably don't want to try to measure the execution time of any piece of code whose expected minimum execution time is less than 20x the overhead of the measurement instructions. Anything under 200 cycles is definitely problematic, and requires extremely careful attention to detail and lots of experimentation with variations of the coding to develop any confidence that the results mean what you think they mean. If the code you want to understand takes such a short amount of time, you probably need to add another loop to repeat it (often requiring extra tricks to prevent the compiler from eliminating the redundant operations). For code that involves memory accesses, generating code that repeats a sequence of operations requires that you understand where the data is located in the cache hierarchy in the original case and that you figure out how to construct a test framework that ensures that each repetition obtains the data from the same place(s). This can be a significant exercise.

McCalpinJohn · ‎08-30-2018

This seems like a good opportunity to point to my recent discussion of some of the issues involved in timing short code sections on Intel processors: http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/

Travis_D_ · ‎09-01-2018

McCalpin, John wrote:

The RDTSC instruction cannot be ordered by anything short of a serializing instruction, and there are not many of those available in user mode. CPUID is the preferred serializing instruction in user mode, but it has a very high latency. (I seem to recall measuring an overhead of >200 cycles on one of my systems, while Agner Fog's "instruction_tables.pdf" reports an overhead of 100-250 cycles on most Intel processors.)

RDTSCP will not execute until all prior stores have executed, but if you want to defer execution until it is guaranteed that the results of the store have become visible, additional serialization is needed.

Are you sure about RDTSC? Everything that I have read and tried indicates that on Intel CPUs rdtsc will be ordered by lfence.

In particular, on Intel, lfence is an execution barrier: all earlier instructions complete before the lfence executes, and no later instruction starts until the lfence executes. So lfence neatly segregates instructions before and after it. The only thing that sneaks across the lfence is stores: when they retire, they still sit in the store buffer and the lfence doesn't have any effect there, so stores before an lfence may still be sitting in the store buffer, which may slow down stores you do in the timed region (but often not). You can throw in an mfence before the lfence if you want to avoid that (on current Intel CPUs with up-to-date microcode mfence is probably all you need, since it also serializes execution - but that's not guaranteed in the future).

Assuming this is how lfence works, it's hard to see how it wouldn't order rdtsc, which after all is "just another instruction" until it executes.

FWIW lfence is widely used to serialize execution exactly to make timing more reliable.

McCalpinJohn · ‎09-02-2018

It looks like I was wrong about LFENCE. Intel has combined memory access ordering and instruction execution ordering in this instruction in a way that is not obvious from some of the descriptions. There is a hint about this behavior of LFENCE in footnote 2 of Section 8.2.5, but it is not as clearly written as one might hope.

The description of the RDTSC instruction in Volume 2 of the Intel SW Developer's Manual is very clear:

If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads are globally visible, it can execute LFENCE immediately before RDTSC.
If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC.
If software requires RDTSC to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute the sequence LFENCE immediately after RDTSC.
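Put together, the three rules quoted above suggest a pair of timestamp helpers along the following lines. This is a sketch using the GCC/Clang intrinsics from <x86intrin.h>; the names tsc_start and tsc_stop are invented for illustration, and whether the leading MFENCE is wanted depends on whether earlier stores matter to the measurement.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Hypothetical start-of-region read: MFENCE;LFENCE before RDTSC so earlier
 * loads and stores are globally visible, LFENCE after so the timed code
 * cannot begin executing before the timestamp is taken. */
uint64_t tsc_start(void)
{
    _mm_mfence();
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Hypothetical end-of-region read: LFENCE before RDTSC so the timed
 * instructions (including loads) have executed, LFENCE after so later
 * code does not start ahead of the read. */
uint64_t tsc_stop(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}
```

Typical use would be uint64_t t0 = tsc_start(); followed by the code under test and then uint64_t t1 = tsc_stop();. The fences add their own latency, so the difference still needs an overhead estimate subtracted.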

Travis_D_ · ‎09-02-2018

Yes, the guarantees for lfence have changed over time. Originally, the implementation of lfence had, as a side effect, the effect of forming a barrier to out-of-order execution, since that's a simple way of fencing loads (they become observable basically at the moment they execute - there is no load equivalent of a store buffer to confuse things).

So I think lfence always worked as an execution-serializing instruction, but at some point Intel decided to document the behavior in the SDM, and now you have this text in the SDM instruction reference:

Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction. Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes. In particular, an instruction that loads from memory and that precedes an LFENCE receives data from memory prior to completion of the LFENCE. (An LFENCE that follows an instruction that stores to memory might complete before the data being stored have become globally visible.) Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute until the LFENCE completes.

In particular, the part starting with "Specifically, " was added to document the serializing behavior. When they say "completed locally", it is a hint that it doesn't imply store buffer flushing, and they are explicit about this part later on.

Note that AMD does not make the same guarantees, and indeed lfence doesn't serialize on some AMD chips (and it executes much faster). This kind of barrier (sometimes referred to as a speculation barrier) became important in the age of Spectre, so now even AMD forces lfence to be serializing if certain bits are set in an MSR.

McCalpinJohn · ‎09-03-2018

I am sure that I read that section many times, but it looked more like poor wording than a new feature! :-(

They start off talking about load fencing, then about instruction execution, then the "in particular" goes back to memory references again. It was not clear if "all prior instructions have completed locally" was intended to include non-memory instructions. I can see now that it was, but I would have included BIG BOLD WORDS to point to the new execution serialization functionality if I had been trying to describe this. The comments with the description of the RDTSC instruction remove all doubt.

(This does make me wonder if there is a semantic difference between "LFENCE; RDTSC" and "RDTSCP". The description of the RDTSCP instruction in Volume 2 makes it look like these are the same?)

Travis_D_ · ‎09-03-2018

Yes, although in Intel's defence the confusion seems to have arisen because the text wasn't written from scratch but edited from an earlier version. An earlier version said:

Performs a serializing operation on all load instructions that were issued prior the LFENCE instruction. This serializing operation guarantees that every load instruction that precedes in program order the LFENCE instruction is globally visible before any load instruction that follows the LFENCE instruction is globally visible. The LFENCE instruction is ordered with respect to load instructions, other LFENCE instructions, any MFENCE instructions, and any serializing instructions (such as the CPUID instruction). It is not ordered with respect to store instructions or the SFENCE instruction.

Here you can clearly see how when they decided to make it apply to all instructions, they just edited the second sentence to remove the reference to "load instruction" and change it to plain "instruction". Of course, the cohesion of the paragraph was left lacking as a result...

As far as I know the out-of-order semantics of lfence; rdtsc are essentially the same as rdtscp, although the latter might in principle be a bit faster if it integrates the lfence behavior. Of course, with rdtscp you get the MSR read at the same time!

I have seen reports that when comparing lfence; rdtsc; lfence to rdtscp; lfence (i.e., the two main "fully fenced tsc read" options), the former gives more stable results. That is, it might be slightly slower but has less run-to-run variation. Maybe something to consider for your low-level timers.
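To check that claim on a particular machine, one could compare back-to-back reads for both variants, along these lines. This is only a sketch: it assumes a CPU with RDTSCP, the function names are hypothetical, and it reports just the smallest and largest delta, where a real comparison would look at the whole distribution.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Variant A: lfence; rdtsc; lfence */
static uint64_t read_tsc_fenced(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Variant B: rdtscp; lfence (aux receives IA32_TSC_AUX and is ignored) */
static uint64_t read_tscp_fenced(void)
{
    unsigned int aux;
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Hypothetical harness: records the smallest and largest delta between
 * two back-to-back reads of the chosen variant over n tries. */
void overhead_range(int use_rdtscp, int n, uint64_t *lo, uint64_t *hi)
{
    *lo = UINT64_MAX;
    *hi = 0;
    for (int i = 0; i < n; i++) {
        uint64_t a = use_rdtscp ? read_tscp_fenced() : read_tsc_fenced();
        uint64_t b = use_rdtscp ? read_tscp_fenced() : read_tsc_fenced();
        uint64_t d = b - a;
        if (d < *lo) *lo = d;
        if (d > *hi) *hi = d;
    }
}
```

The largest delta is usually dominated by interrupts or other OS activity, so the smallest delta and the spread of the typical values say more about the two sequences than the maximum does.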

McCalpinJohn · ‎09-05-2018

A very interesting white paper on controlling speculative execution in AMD processors is available at https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf

Several of these techniques are probably the same as those used with Intel processors, but some look to be AMD-specific.

The "CMOV" approach is very clever. It provides an implicit dependency via the condition code flags, and (if I am not confused) a slightly different example could be used to prevent a non-zero value from actually appearing in the target address register unless the bounds check passes. This could mitigate some (future?) security nightmares due to value prediction in an OOO processor.

Back to the original topic....

When I have time, I will try adding LFENCE operations to my https://github.com/jdmccalpin/low-overhead-timers project. I can already see that I will need to expand the scope of the measurement harness to see how the timer overhead and accuracy change with fencing in cases where long-latency instructions (memory references or mispredicted branches) are in flight.