<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hello guys in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176711#M6703</link>
    <description>&lt;P&gt;Hello guys&lt;/P&gt;

&lt;P&gt;It seems that using lfence() and rdtsc() is fine, and since I am not using stores, mfence() is not applicable here. I also modified my code and tried it with a volatile array.&lt;/P&gt;

&lt;P&gt;However, no matter how big the array is, the problem with the HW prefetcher still exists. I flush two lines (array[30] and array[70], which are more than a cache line apart) and then time three accesses.&lt;/P&gt;

&lt;P&gt;1) access to array[30] =&amp;gt; definitely miss&lt;/P&gt;

&lt;P&gt;2) access to array[70] =&amp;gt; hit if the prefetcher is enabled, miss if it is disabled&lt;/P&gt;

&lt;P&gt;3) access to array[33] =&amp;gt; definitely hit&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;The code is:&lt;/P&gt;

&lt;PRE class="brush:cpp; class-name:dark;"&gt;    /* create array */
    int array[ 100 ];
    int i;
    for ( i = 0; i &amp;lt; 100; i++ )
        array[ i ] = i;   // bring array to the cache

    for ( i = 0; i &amp;lt; 100000000; i++ ) ;

    uint64_t t1, t2, ov, diff1, diff2, diff3;

    /* flush the two cache lines that will be timed */
    _mm_lfence();
    _mm_clflush( &amp;amp;array[ 30 ] );
    _mm_clflush( &amp;amp;array[ 70 ] );
    _mm_lfence();

    /* READ MISS 1 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    (void) *((volatile int*)array + 30);   // read array[30] =&amp;gt; cache miss
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();
    diff1 = t2 - t1;        // two fence statements are overhead

    /* READ MISS 2 */
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    (void) *((volatile int*)array + 70);      // read array[70] =&amp;gt; cache miss (or hit due to prefetching?!)
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();
    diff2 = t2 - t1;        // two fence statements are overhead


    /* READ HIT*/
    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    (void) *((volatile int*)array + 33);   // read array[33] =&amp;gt; cache hit
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();
    diff3 = t2 - t1;        // two fence statements are overhead

    /* measuring fence overhead */
    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    ov = t2 - t1;

    printf( "lfence overhead is %lu\n", ov );
    printf( "cache miss1 TSC is %lu\n", diff1-ov );
    printf( "cache miss2 (or hit due to prefetching) TSC is %lu\n", diff2-ov );
    printf( "cache hit TSC is %lu\n", diff3-ov );
&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I have also disabled the HW prefetchers with the wrmsr command, as shown here:&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;# ./msr-tools-master/wrmsr 0x1a4 15
# ./msr-tools-master/rdmsr 0x1a4
f
#&lt;/PRE&gt;

&lt;P&gt;Now when I compile and run (pinned to one processor), I get the following results:&lt;/P&gt;

&lt;PRE class="brush:bash; class-name:dark;"&gt;# gcc -Wall -O3 -o simple_flush simple_flush.c
# taskset -c 30 ./simple_flush
lfence overhead is 29
cache miss1 TSC is 279
cache miss2 (or hit due to prefetching) TSC is 209
cache hit TSC is 11
# taskset -c 30 ./simple_flush
lfence overhead is 29
cache miss1 TSC is 362
cache miss2 (or hit due to prefetching) TSC is 175
cache hit TSC is 6
# taskset -c 30 ./simple_flush
lfence overhead is 29
cache miss1 TSC is 466
cache miss2 (or hit due to prefetching) TSC is 166
cache hit TSC is 8
&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;As you can see, the first number (more than 250 TSC ticks) is reasonable for a miss, and about 8 TSC ticks is reasonable for a hit. However, what is 175? It sounds like array[70] is prefetched into L3. The description of MSR 0x1A4 says nothing about L3.&lt;/P&gt;

&lt;P&gt;Any idea?&lt;/P&gt;</description>
    <pubDate>Thu, 06 Sep 2018 17:06:41 GMT</pubDate>
    <dc:creator>morca</dc:creator>
    <dc:date>2018-09-06T17:06:41Z</dc:date>
    <item>
      <title>Disabling HW prefetcher</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176690#M6682</link>
      <description>&lt;P&gt;Hi&lt;/P&gt;

&lt;P&gt;With _mm_clflush(), I flushed an array from all cache levels. Next, I measured two accesses with __rdtsc(). Although I know the distance between the two accesses is larger than the cache line size (e.g. 80 bytes), the TSC delta for the first access looks like a miss (which is expected), while the TSC delta for the second element looks like a hit (which is not).&lt;/P&gt;

&lt;P&gt;It seems that the HW stride prefetcher brings in the second element. Is there any way to force the processor not to prefetch?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 25 Aug 2018 10:02:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176690#M6682</guid>
      <dc:creator>morca</dc:creator>
      <dc:date>2018-08-25T10:02:47Z</dc:date>
    </item>
    <item>
      <title>If you can find hints about</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176691#M6683</link>
      <description>&lt;P&gt;If you can find hints about the use of the MSR settings, you should be able (with full privilege) to control the various prefetchers independently.&amp;nbsp; It sounds like for your purpose it may be sufficient to double the distance between memory accesses so that they fall in separate cache line pairs.&lt;/P&gt;</description>
      <pubDate>Sun, 26 Aug 2018 11:23:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176691#M6683</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2018-08-26T11:23:34Z</dc:date>
    </item>
    <item>
      <title>Yes I can increase the</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176692#M6684</link>
      <description>&lt;P&gt;Yes, I can increase the distance. However, I don't want to do that.&lt;/P&gt;

&lt;P&gt;I am curious to know more about the MSR. What do you mean by privilege? The root account in Linux?&lt;/P&gt;

&lt;P&gt;How can I control MSR?&lt;/P&gt;</description>
      <pubDate>Sun, 26 Aug 2018 12:17:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176692#M6684</guid>
      <dc:creator>morca</dc:creator>
      <dc:date>2018-08-26T12:17:40Z</dc:date>
    </item>
    <item>
      <title>With msr-tools I want to</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176693#M6685</link>
      <description>&lt;P&gt;With &lt;CODE&gt;msr-tools&lt;/CODE&gt; I want to control the operation of the Intel prefetchers. The register address according to [1] is &lt;CODE&gt;0x1a4&lt;/CODE&gt;. The problem is that &lt;CODE&gt;wrmsr&lt;/CODE&gt; has no effect!&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;PRE&gt;&lt;STRONG&gt;&lt;CODE&gt;# modprobe msr
# rdmsr -p0 0x1a4
0
# wrmsr -p0 0x1a4 1
# rdmsr -p0 0x1a4
0
#

&lt;/CODE&gt;&lt;/STRONG&gt;&lt;/PRE&gt;

&lt;P&gt;CPU is reported as&lt;/P&gt;

&lt;PRE&gt;&lt;STRONG&gt;&lt;CODE&gt;
# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           2
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping:            1
CPU MHz:             2097.571
BogoMIPS:            4195.14
Hypervisor vendor:   VMware
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx hypervisor lahf_lm 3dnowprefetch epb pti dtherm ida arat pln pts

&lt;/CODE&gt;&lt;/STRONG&gt;&lt;/PRE&gt;

&lt;P&gt;Any thoughts?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;[1] &lt;A href="https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors" target="_blank"&gt;https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 27 Aug 2018 09:34:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176693#M6685</guid>
      <dc:creator>morca</dc:creator>
      <dc:date>2018-08-27T09:34:30Z</dc:date>
    </item>
    <item>
      <title>The Hypervisor is probably</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176694#M6686</link>
      <description>&lt;P&gt;The Hypervisor is probably intercepting the MSR writes and preventing them from taking effect.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;This should work as desired on "bare metal".&lt;/P&gt;</description>
      <pubDate>Mon, 27 Aug 2018 18:25:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176694#M6686</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-08-27T18:25:25Z</dc:date>
    </item>
    <item>
      <title>Yes you are right. I verified</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176695#M6687</link>
      <description>&lt;P&gt;Yes you are right. I verified that.&lt;/P&gt;

&lt;P&gt;Moreover, the Intel document about the HW prefetchers [1] seems to be old, because it has no information about the L3 cache. Also, bit #3 is described as reserved in the manual (volume 4, table 2-10), while in the document it controls the DCU IP prefetcher.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;[1] &lt;A href="https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors" target="_blank"&gt;https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 28 Aug 2018 09:31:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176695#M6687</guid>
      <dc:creator>morca</dc:creator>
      <dc:date>2018-08-28T09:31:17Z</dc:date>
    </item>
    <item>
      <title>I have written the following</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176696#M6688</link>
      <description>I have written the following code in order to measure the line size. I created an array and then flushed the first element from the cache. Then I measure the time to read the first element using rdtsc().

Since each element is 4 bytes, the distance between array[0] and array[20] is 80 bytes. I am pretty sure that they don't reside in the same cache line.

    int array[ 100 ];
    int i;
    for ( i = 0; i &amp;lt; 100; i++ )
       array[ i ] = i;   // bring array to the cache

    uint64_t t1, t2, ov, diff1, diff2;

    _mm_lfence();
    _mm_clflush( &amp;amp;array[ 0 ] );
    _mm_lfence();

    _mm_lfence();           // fence to keep load order
    t1 = __rdtsc();         // set start time
    _mm_lfence();
    int tmp = array[ 0 ];   // read the first element =&amp;gt; cache miss
    _mm_lfence();
    t2 = __rdtsc();         // set stop time
    _mm_lfence();

    diff1 = t2 - t1;       
    printf( "tmp is %d\ndiff1 is %lu\n", tmp, diff1 );

    _mm_lfence();
    t1 = __rdtsc();
    int tmp2 = array[ 20 ];
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    diff2 = t2 - t1;
    printf( "tmp2 is %d\ndiff2 is %lu\n", tmp2, diff2 );

    _mm_lfence();
    t1 = __rdtsc();
    _mm_lfence();
    _mm_lfence();
    t2 = __rdtsc();
    _mm_lfence();
    ov = t2 - t1;

    printf( "lfence overhead is %lu\n", ov );
    printf( "TSC1 is %lu\n", diff1-ov );
    printf( "TSC2 is %lu\n", diff2-ov );




Next, I disabled the prefetchers with wrmsr and then saw some weird results.


[root@compute-0-6 ~]# ./msr-tools-master/rdmsr 0x1a4
0
[root@compute-0-6 ~]# ./msr-tools-master/rdmsr -p0 0x1a4
f
[root@compute-0-6 ~]# ./msr-tools-master/wrmsr -p0 0x1a4 15
[root@compute-0-6 ~]# ./msr-tools-master/rdmsr -p0 0x1a4
f
[root@compute-0-6 ~]# ./simple_flush1
tmp is 0
diff1 is 771
tmp2 is 20
diff2 is 64
lfence overhead is 64
TSC1 is 707
TSC2 is 0
[root@compute-0-6 ~]# ./simple_flush1
tmp is 0
diff1 is 760
tmp2 is 20
diff2 is 52
lfence overhead is 68
TSC1 is 692
TSC2 is 18446744073709551600
[root@compute-0-6 ~]# ./simple_flush1
tmp is 0
diff1 is 660
tmp2 is 20
diff2 is 62
lfence overhead is 69
TSC1 is 591
TSC2 is 18446744073709551609
[root@compute-0-6 ~]#



Any guess?</description>
      <pubDate>Tue, 28 Aug 2018 12:24:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176696#M6688</guid>
      <dc:creator>morca</dc:creator>
      <dc:date>2018-08-28T12:24:00Z</dc:date>
    </item>
    <item>
      <title>I have written the following</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176697#M6689</link>
      <description>&lt;BLOCKQUOTE&gt;
	&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;I have written the following code in order to measure the line size.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;Why? It is documented in the fine manual, as is how to use cpuid (with EAX==01H) to read it from the processor on which you are running, if you are paranoid and think it will change. Alternatively, &lt;A href="https://www.google.com/search?q=x86+cache+line+size"&gt;Google also knows&lt;/A&gt; if you ask it.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 28 Aug 2018 12:40:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176697#M6689</guid>
      <dc:creator>James_C_Intel2</dc:creator>
      <dc:date>2018-08-28T12:40:16Z</dc:date>
    </item>
    <item>
      <title>FYI, the "lfence" operator</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176698#M6690</link>
      <description>&lt;P&gt;FYI, the "lfence" operator will have no impact on the ordering of the execution of the RDTSC instruction.&amp;nbsp;&amp;nbsp; It typically executes in program order, but can execute before the completion of a preceding long-latency load or mispredicted branch.&lt;/P&gt;

&lt;P&gt;As an alternative, the RDTSCP instruction will wait to execute until all prior instructions have executed.&amp;nbsp;&amp;nbsp; This means that RDTSCP will not execute until after preceding long-latency loads or mispredicted branches have executed.&amp;nbsp; So it can't execute early, but there is no way to prevent subsequent instructions from executing before the RDTSCP.&amp;nbsp;&amp;nbsp; (They typically don't, but the architecture makes no guarantees.)&lt;/P&gt;

&lt;P&gt;Another alternative approach to ordering is to use RDPMC instead (after programming one of the performance counters to measure either actual cycles not halted or reference cycles not halted).&amp;nbsp; The RDPMC instruction has an input argument (the counter number), and this can be used to force a dependency between the execution of prior instructions and the execution of the RDPMC.&amp;nbsp;&amp;nbsp; For example, if you are loading a value from memory, you can use that value in a simple formula to create the counter number -- this will force the RDPMC instruction to wait until after the load has completed.&amp;nbsp; Some cleverness is required to come up with a formula that does not depend on specific values of the data being loaded.&amp;nbsp;&amp;nbsp; One way that should work for all data is to pre-load a GPR with zero, then perform a logical AND of the data that you are waiting on with the zeroed GPR, then add whatever counter number you want.&amp;nbsp;&amp;nbsp; Don't use an immediate operand of zero for the AND operation -- the hardware may notice this idiom and eliminate the operation.&lt;/P&gt;</description>
      <pubDate>Tue, 28 Aug 2018 16:45:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176698#M6690</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-08-28T16:45:40Z</dc:date>
    </item>
    <item>
      <title>&gt;FYI, the "lfence" operator</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176699#M6691</link>
      <description>&lt;P&gt;&amp;gt;FYI, the "lfence" operator will have no impact on the ordering of the execution of the RDTSC instruction.&amp;nbsp;&amp;nbsp; It typically executes in program order, but can execute before the completion of a preceding long-latency load or mispredicted branch.&lt;/P&gt;

&lt;P&gt;What about using mfence? It seems that replacing lfence with mfence in the code and leaving other parts intact, will do what you say.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;gt;As an alternative, the RDTSCP instruction will wait to execute until all prior instructions have executed.&lt;/P&gt;

&lt;P&gt;And that is not suitable for stores. Am I right?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Back to the question: I still want to finish the code for my own purposes, to learn something. I want to evaluate different prefetching methods for data structures. So, as a simple case, I have an array and want to first check and measure the latencies of array[0] and array[20].&lt;/P&gt;

&lt;P&gt;Also, my &lt;STRONG&gt;&lt;A href="https://software.intel.com/en-us/forums/intel-isa-extensions/topic/785240#comment-1926318"&gt;previous post&lt;/A&gt;&lt;/STRONG&gt; has not been answered. I would appreciate it if you could give me some tips to help me understand.&lt;/P&gt;</description>
      <pubDate>Tue, 28 Aug 2018 19:36:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176699#M6691</guid>
      <dc:creator>morca</dc:creator>
      <dc:date>2018-08-28T19:36:39Z</dc:date>
    </item>
    <item>
      <title>The RDTSC instruction cannot</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176700#M6692</link>
      <description>&lt;P&gt;The RDTSC instruction cannot be ordered by anything short of a serializing instruction, and there are not many of those available in user mode.&amp;nbsp; CPUID is the preferred serializing instruction in user mode, but it has a very high latency. (I seem to recall measuring an overhead of &amp;gt;200 cycles on one of my systems, while Agner Fog's "instruction_tables.pdf" reports an overhead of 100-250 cycles on most Intel processors.)&lt;/P&gt;

&lt;P&gt;RDTSCP will not execute until all prior stores have executed, but if you want to defer execution until it is guaranteed that the results of the store have become visible, additional serialization is needed.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Aug 2018 16:41:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176700#M6692</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-08-29T16:41:52Z</dc:date>
    </item>
    <item>
      <title>John,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176701#M6693</link>
      <description>John,
I understand what you say, but I would like to know whether the previous code is technically wrong, or right but inefficient. Let me put it another way. Assume that I want to measure the latency of

int tmp = array[0];

What I wrote is

_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
int tmp = array[0];
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();

You say that, in the pipeline, the second RDTSC (the one that sets t2) may complete before the load for int tmp has executed. Am I right? If so, the measurement is technically wrong.
At the time I was writing the code, I thought that between t1 and t2 there are two lfence and a memory read. So, I have to subtract the two lfences since they are overhead.


It seems that you say that the code should be 

unsigned int aux;           // __rdtscp() stores IA32_TSC_AUX here
t1 = __rdtscp(&amp;aux);
int tmp = array[0];
t2 = __rdtscp(&amp;aux);

Is that right?</description>
      <pubDate>Thu, 30 Aug 2018 10:29:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176701#M6693</guid>
      <dc:creator>morca</dc:creator>
      <dc:date>2018-08-30T10:29:53Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt; Assume that I want to</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176702#M6694</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt; &lt;EM&gt;Assume that I want to measure the latency of&amp;nbsp; int tmp = array[0];&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;I assume from the effort you are putting in that you want the latency of a read from RAM, as opposed to a read served from some cache (or by a prefetch).&lt;/P&gt;

&lt;P&gt;Suggestion:&lt;/P&gt;

&lt;P&gt;Create two static arrays (iow not allocated from heap).&lt;BR /&gt;
	Each array size is to be much larger than total cache capacity (e.g. 2x).&lt;BR /&gt;
	(Note, total size must fit in physical memory)&lt;BR /&gt;
	Run a loop to initialize each array.&lt;BR /&gt;
	Run a loop a few times to read the first array (make sure the compiler does not optimize out the code)&lt;BR /&gt;
	Now then time the reading of specific cells of the second array...&lt;BR /&gt;
	... using constant values for array indexes...&lt;BR /&gt;
	... and with a separation of larger than page size&lt;/P&gt;

&lt;P&gt;*** Additional note: to generate the worst-case latency, you will need to ensure that the array sizes are each large enough to exceed the capacity of the TLB (Translation Lookaside Buffer).&lt;/P&gt;

&lt;P&gt;Repeat the test a few times, take worst case where it appears the O/S wasn't interfering with your test.&lt;/P&gt;

&lt;P&gt;Bear in mind that the worst case test will incur the overhead of reading the page table entry(s) plus the overhead of the RAM read.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 30 Aug 2018 12:40:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176702#M6694</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2018-08-30T12:40:22Z</dc:date>
    </item>
    <item>
      <title>Using RDTSCP instructions</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176703#M6695</link>
      <description>&lt;P&gt;Using RDTSCP instructions will provide ordering control that is closer to what you are looking for, and the LFENCE instructions only add overhead, not control.&lt;/P&gt;

&lt;P&gt;There are still some fundamental problems here.&lt;/P&gt;

&lt;P&gt;(1) A statement like&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;int tmp=array[0];&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;may not actually correspond to executable instructions (unless optimization is completely disabled).&lt;/P&gt;

&lt;P&gt;A compiler with good aliasing analysis can move the assignment upstream or downstream, or may replace the next use of tmp with a reference to array[0] (which may already be in a register), or may replace the next use of tmp with a reference to whatever source was used in the most recent write to array[0] (which may have been a constant, which may allow the compiler to eliminate the assignment entirely), or which may allow the hardware to eliminate the instruction at the register allocation stage.&lt;/P&gt;

&lt;P&gt;Careful inspection of the generated assembly code is required in this case.&amp;nbsp; You may need to fiddle with optimization levels, the "volatile" keyword, or inline assembly code to get exactly what you want.&lt;/P&gt;

&lt;P&gt;(2) Even if the statement is compiled as a load instruction from memory to a register, the overhead of the measurement is large compared to the execution time of the operation.&amp;nbsp;&amp;nbsp; In addition, all of the instructions that read the TSC or performance counters are microcoded, so they will interfere with the pipelining of the execution of surrounding instructions in ways that are difficult to predict or understand.&lt;/P&gt;

&lt;P&gt;As a general rule, you probably don't want to try to measure the execution time of any piece of code whose expected minimum execution time is less than 20x the overhead of the measurement instructions.&amp;nbsp;&amp;nbsp; Anything under 200 cycles is definitely problematic, and requires extremely careful attention to detail and lots of experimentation with variations of the coding to develop any confidence that the results mean what you think they mean.&amp;nbsp;&amp;nbsp; If the code you want to understand takes such a short amount of time, you probably need to add another loop to repeat it (often requiring extra tricks to prevent the compiler from eliminating the redundant operations).&amp;nbsp; For code that involves memory accesses, generating code that repeats a sequence of operations requires that you understand where the data is located in the cache hierarchy in the original case and that you figure out how to construct a test framework that ensures that each repetition obtains the data from the same place(s).&amp;nbsp; This can be a significant exercise.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Aug 2018 15:52:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176703#M6695</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-08-30T15:52:53Z</dc:date>
    </item>
    <item>
      <title>This seems like a good</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176704#M6696</link>
      <description>&lt;P&gt;This seems like a good opportunity to point to my recent discussion of some of the issues involved in timing short code sections on Intel processors: &lt;A href="http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/" target="_blank"&gt;http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Aug 2018 16:14:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176704#M6696</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-08-30T16:14:20Z</dc:date>
    </item>
    <item>
      <title>Quote:McCalpin, John wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176705#M6697</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;McCalpin, John wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;The RDTSC instruction cannot be ordered by anything short of a serializing instruction, and there are not many of those available in user mode.&amp;nbsp; CPUID is the preferred serializing instruction in user mode, but it has a very high latency. (I seem to recall measuring an overhead of &amp;gt;200 cycles on one of my systems, while Agner Fog's "instruction_tables.pdf" reports an overhead of 100-250 cycles on most Intel processors.)&lt;/P&gt;

&lt;P&gt;RDTSCP will not execute until all prior stores have executed, but if you want to defer execution until it is guaranteed that the results of the store have become visible, additional serialization is needed.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Are you sure about RDTSC? Everything that I have read and tried indicates that on Intel CPUs rdtsc will be ordered by lfence.&lt;/P&gt;

&lt;P&gt;In particular, on Intel, lfence is an execution barrier: all earlier instructions complete before the lfence executes, and no later instruction starts until the lfence executes. So lfence neatly segregates instructions before and after it. The only things that sneak across the lfence are stores: when they retire, they still sit in the store buffer and the lfence doesn't have any effect there, so stores before an lfence may still be sitting in the store buffer, which may slow down stores you do in the timed region (but often not). You can throw in an mfence before the lfence if you want to avoid that (on current Intel CPUs with up-to-date microcode mfence is probably &lt;EM&gt;all&lt;/EM&gt; you need, since it also serializes execution - but that's not guaranteed in the future).&lt;/P&gt;

&lt;P&gt;Assuming this is how lfence works, it's hard to see how it wouldn't order rdtsc, which after all is "just another instruction" until it executes.&lt;/P&gt;

&lt;P&gt;FWIW lfence is widely used to serialize execution exactly to make timing more reliable.&lt;/P&gt;</description>
      <pubDate>Sun, 02 Sep 2018 03:04:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176705#M6697</guid>
      <dc:creator>Travis_D_</dc:creator>
      <dc:date>2018-09-02T03:04:00Z</dc:date>
    </item>
    <item>
      <title>It looks like I was wrong</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176706#M6698</link>
      <description>&lt;P&gt;It looks like I was wrong about LFENCE.&amp;nbsp;&amp;nbsp; Intel has combined memory access ordering and instruction execution ordering in this instruction in a way that is not obvious from some of the descriptions.&amp;nbsp; There is a hint about this behavior of LFENCE in footnote 2 of Section 8.2.5, but it is not as clearly written as one might hope.&lt;/P&gt;

&lt;P&gt;The description of the RDTSC instruction in Volume 2 of the Intel SW Developer's Manual is very clear:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads are globally visible,&lt;SUP&gt;1&lt;/SUP&gt; it can execute LFENCE immediately before RDTSC.&lt;/LI&gt;
	&lt;LI&gt;If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC.&lt;/LI&gt;
	&lt;LI&gt;If software requires RDTSC to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute the sequence LFENCE immediately after RDTSC.&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Sun, 02 Sep 2018 20:29:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176706#M6698</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-09-02T20:29:55Z</dc:date>
    </item>
    <item>
      <title>Yes, the guarantees for</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176707#M6699</link>
      <description>&lt;P&gt;Yes, the guarantees for lfence have changed over time. Originally, the &lt;EM&gt;implementation&lt;/EM&gt; of lfence had the side effect of forming a barrier to out-of-order execution, since that's a simple way to fence loads (they become observable basically at the moment they execute - there is no load equivalent of a store buffer to confuse things).&lt;/P&gt;

&lt;P&gt;So I think lfence has always serialized execution, but at some point Intel decided to document the behavior in the SDM, and now you have this text in the SDM instruction reference:&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction. Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes. In particular, an instruction that loads from memory and that precedes an LFENCE receives data from memory prior to completion of the LFENCE. (An LFENCE that follows an instruction that stores to memory might complete &lt;STRONG&gt;before&lt;/STRONG&gt; the data being stored have become globally visible.) Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute until the LFENCE completes.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;In particular, the part starting with "Specifically, " was added to document the serializing behavior. When they say "completed &lt;EM&gt;locally&lt;/EM&gt;" it is a hint that it doesn't imply store buffer flushing, and they are explicit about this part later on.&lt;/P&gt;
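
&lt;P&gt;Since "completed locally" can leave earlier stores sitting in the store buffer, a timing read that must also wait for those stores can put an mfence in front, which is the MFENCE;LFENCE sequence the SDM suggests before RDTSC. A rough sketch, assuming GCC or Clang on x86-64; the name rdtsc_after_stores is hypothetical.&lt;/P&gt;

```c
/* Sketch only: GCC/Clang on x86-64 assumed; the function name is hypothetical. */
static inline unsigned long long rdtsc_after_stores(void)
{
    __builtin_ia32_mfence();   /* earlier loads and stores become globally visible */
    __builtin_ia32_lfence();   /* earlier instructions complete before rdtsc runs  */
    unsigned long long t = __builtin_ia32_rdtsc();
    __builtin_ia32_lfence();   /* later instructions wait until rdtsc has executed */
    return t;
}
```

&lt;P&gt;The extra mfence makes the read more expensive, so it only pays off when stores precede the timed region and their draining would otherwise leak into the measurement.&lt;/P&gt;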

&lt;P&gt;Note that AMD does not make the same guarantees, and indeed lfence doesn't serialize on some AMD chips (and it executes much faster). This kind of barrier (sometimes referred to as a speculation barrier) became important in the age of Spectre, so now even AMD makes lfence serializing if certain bits are set in an MSR.&lt;/P&gt;</description>
      <pubDate>Sun, 02 Sep 2018 21:10:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176707#M6699</guid>
      <dc:creator>Travis_D_</dc:creator>
      <dc:date>2018-09-02T21:10:00Z</dc:date>
    </item>
    <item>
      <title>I am sure that I read that</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176708#M6700</link>
      <description>&lt;P&gt;I am sure that I read that section many times, but it looked more like poor wording than a new feature!&amp;nbsp; :-(&lt;/P&gt;

&lt;P&gt;They start off talking about load fencing, then about instruction execution, then the "in particular" goes back to memory references again. It was not clear if "all prior instructions have completed locally" was intended to include non-memory instructions. I can see now that it was, but I would have included &lt;STRONG&gt;BIG BOLD WORDS&lt;/STRONG&gt; to point to the new execution serialization functionality if I had been trying to describe this. The comments with the description of the RDTSC instruction remove all doubt.&lt;/P&gt;

&lt;P&gt;(This does make me wonder if there is a semantic difference between "LFENCE; RDTSC" and "RDTSCP".&amp;nbsp; The description of the RDTSCP instruction in Volume 2 makes it look like these are the same?)&lt;/P&gt;</description>
      <pubDate>Mon, 03 Sep 2018 19:30:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176708#M6700</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-09-03T19:30:04Z</dc:date>
    </item>
    <item>
      <title>Yes, although in Intel's</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176709#M6701</link>
      <description>&lt;P&gt;Yes, although in Intel's defence the confusion seems to have arisen because the text wasn't written from scratch but edited from an earlier version. An &lt;A href="http://qcd.phys.cmu.edu/QCDcluster/intel/vtune/reference/vc153.htm"&gt;earlier version&lt;/A&gt; said:&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;Performs a serializing operation on all load instructions that were issued prior the LFENCE instruction. This serializing operation guarantees that every load instruction that precedes in program order the LFENCE instruction is globally visible before any load instruction that follows the LFENCE instruction is globally visible. The LFENCE instruction is ordered with respect to load instructions, other LFENCE instructions, any MFENCE instructions, and any serializing instructions (such as the CPUID instruction). It is not ordered with respect to store instructions or the SFENCE instruction.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Here you can clearly see how, when they decided to make it apply to all instructions, they just edited the second sentence to remove the reference to "load instruction" and change it to plain "instruction". Of course, the cohesion of the paragraph was left lacking as a result...&lt;/P&gt;

&lt;P&gt;As far as I know the out-of-order semantics of lfence; rdtsc are essentially the same as rdtscp, although the latter might in principle be a bit faster if it integrates the lfence behavior. Of course, with rdtscp you also get the IA32_TSC_AUX MSR read at the same time!&lt;/P&gt;

&lt;P&gt;I have seen reports that when comparing lfence; rdtsc; lfence to rdtscp; lfence (i.e., the two main "fully fenced tsc read" options), the former gives more stable results. That is, it might be slightly slower but has less run-to-run variation. Maybe something to consider for your low-level timers.&lt;/P&gt;
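
&lt;P&gt;For reference, the two "fully fenced" reads being compared might look roughly like this. It is only a sketch, assuming GCC or Clang on x86-64 and a CPU that supports rdtscp; the function names are made up, and which variant is more stable on a given machine is something to measure rather than assume.&lt;/P&gt;

```c
/* Sketch only: GCC/Clang on x86-64 assumed; function names are hypothetical. */

/* Option 1: lfence; rdtsc; lfence */
static inline unsigned long long tsc_lfence_rdtsc_lfence(void)
{
    __builtin_ia32_lfence();
    unsigned long long t = __builtin_ia32_rdtsc();
    __builtin_ia32_lfence();
    return t;
}

/* Option 2: rdtscp; lfence
   rdtscp waits for earlier instructions itself and also stores the
   IA32_TSC_AUX value through aux. */
static inline unsigned long long tsc_rdtscp_lfence(unsigned int *aux)
{
    unsigned long long t = __builtin_ia32_rdtscp(aux);
    __builtin_ia32_lfence();
    return t;
}
```

&lt;P&gt;Calling each variant back to back many times and comparing the spread of the differences is one simple way to check the run-to-run variation on your own hardware.&lt;/P&gt;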

</description>
      <pubDate>Tue, 04 Sep 2018 00:46:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Disabling-HW-prefetcher/m-p/1176709#M6701</guid>
      <dc:creator>Travis_D_</dc:creator>
      <dc:date>2018-09-04T00:46:40Z</dc:date>
    </item>
  </channel>
</rss>

