Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

PMC: counting cache misses. Why don't I get an L1 cache miss when accessing an address after a clflush?

linson__sebastien

I am trying to make some measurements with performance counters, but I am encountering some strange behaviour.
Basically, I am trying to measure a cache miss with MEM_LOAD_UOPS_RETIRED.L1_MISS.
For that, I did two experiments (the code can be found at https://pastebin.com/ZY2hDxZM and also in this post):

# Experiment 1.

I did the following steps:
1/ Allocate a buffer in such a way that its address maps to set 0 of the L1 data cache (this means that the 6 bits of the virtual address from bit 6 to bit 11 are equal to 0; a small sketch of this computation follows the list of steps). The register r12 stores this virtual address.
2/ Flush two addresses, [r12] and [r12+32768], so that they are not in the cache. (I use two addresses because one seems to influence the caching of the other, as the results show.)
3/ Access [r12]
4/ Access [r12+32768] while counting L1 misses with MEM_LOAD_UOPS_RETIRED.L1_MISS. Here I expect a cache miss when accessing [r12+32768], since it was flushed at step 2 and not accessed again.
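As a small illustration of the alignment in step 1, here is a sketch of the set-index computation (it assumes 64-byte lines and 64 sets, i.e. the 32 KiB, 8-way L1D of my CPU; the base address below is just a hypothetical set-0-aligned value):

#include <stdio.h>

// Sketch: L1D set index of an address, assuming 64-byte lines and 64 sets.
// The set index is bits 6..11 of the address.
static unsigned long l1_set(const void *p)
{
    return ((unsigned long)p >> 6) & 63;
}

int main(void)
{
    unsigned long base = 0x7f0000000000UL;  // hypothetical set-0-aligned virtual address
    printf("set(base)         = %lu\n", l1_set((void *)base));
    printf("set(base + 4096)  = %lu\n", l1_set((void *)(base + 4096)));   // same set, different line
    printf("set(base + 32768) = %lu\n", l1_set((void *)(base + 32768)));  // same set, different line
    return 0;
}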


Steps 2 to 4 are repeated 1000000 times.
Over the 1000000 repetitions, I get one cache miss 999785 times and no cache miss 215 times.
My first question, for this first experiment, is: why are there cache hits at all?
My intuition is that a load is performed speculatively without being retired, but that seems strange and I do not know how to check it.

# Experiment 2.

The second experiment is the same as the first one, but I replaced [r12] with [r12+4096] in steps 2 and 3.
I expected no significant change in the results but, quite surprisingly, when measuring the number of cache misses for the access to [r12+32768], I get NO cache miss 999958 times out of the 1000000 measurements.
I do not understand why: even though [r12+4096] and [r12+32768] map to the same L1 cache set (both offsets are multiples of 4096, so they do not change bits 6 to 11 of the set-0-aligned base, if my computation is correct), they are not the same cache line.


Note that I disabled all the hardware prefetchers with `wrmsr -a 0x1a4 15` (reading MSR 0x1A4 back with rdmsr returns 0xf, as expected).
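For completeness, the same check can also be done from C through the Linux msr driver (a sketch; it assumes the msr kernel module is loaded and the program has the required privileges):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

// Sketch: read MSR 0x1A4 (prefetcher control) on CPU 0 through the Linux msr
// driver. The driver uses the file offset as the MSR number, so a pread at
// offset 0x1A4 returns the 64-bit MSR value.
int main(void)
{
    uint64_t value;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
    if (pread(fd, &value, sizeof(value), 0x1a4) != sizeof(value)) {
        perror("pread MSR 0x1a4");
        close(fd);
        return 1;
    }
    printf("MSR 0x1a4 = 0x%llx (0xf = all four prefetchers disabled)\n",
           (unsigned long long)value);
    close(fd);
    return 0;
}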
My CPU is Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (cpu family: 06H, model : 5EH).
If required, I can give you more information.

 

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

#define CACHE_SIZE (64*64*8)
unsigned char *cache_L1;
unsigned long *T = NULL;

void measures(){

    // rax and rbx are overwritten inside the asm block (cpuid, rdpmc, and the
    // result-store pointer), so pass them as read-write operands through locals.
    unsigned long in_buf = (unsigned long)cache_L1;
    unsigned long in_T = (unsigned long)T;

    asm volatile(
        "mov $1000000,%%r11\n\t"// number of loop iteration
        "mov %%rax,%%r12\n\t"
        "mov %%rbx,%%r10\n\t"

        "loop:\n\t"

            // Serialize before beginning
            "xor %%rax,%%rax\n\tcpuid\n\t"  

            // Make sure that [r12+4096] is not in cache
            "mov %%r12,%%rax\n\t"
            "add $4096, %%rax\n\t" // For experiment 1, this line is changed to 'add $0,%%rax'
            "clflush (%%rax)\n\t"

            // Make sure that [r12 + 32768] is not in cache
            "mov %%r12,%%rax\n\t"
            "add $32768, %%rax\n\t"
            "clflush (%%rax)\n\t"
            

            // Serialize to make sure that all previous operation are finished before continuing.
            "xor %%rax,%%rax\n\tcpuid\n\t"  // 1 load

            // Put [r12 + 4096] in cache
            "mov %%r12,%%rax\n\t"
            "add $4096,%%rax\n\t" // For experiment 1, this line is changed to 'add $0,%%rax'
                        "mov (%%rax),%%r8\n\t"

            // Serialize to make sure that all previous operation are finished before continuing.
            "xor %%rax,%%rax\n\tcpuid\n\t"

            // ------ Now I will try to access [r12 + 32768]. Since it was flushed, I expect it is not in cache and I expect to measure a cache miss when reading it.

            // Read PMC configured with MEM_LOAD_UOPS_RETIRED.L1_MISS
            "xor %%rcx,%%rcx\n\trdpmc\n\tshl $32,%%rdx\n\tor %%rdx,%%rax\n\tmov %%rax,%%r9\n\t"
            
            // Serialize to make sure that all previous operation are finished before continuing.
            "xor %%rax,%%rax\n\tcpuid\n\t"  // 1 load


            // Access [r12 + 32768 ]
            "mov  %%r12, %%rax\n\t"
            "add $32768,%%rax\n\t"
            "mov (%%rax),%%r8\n\t"

            // Serialize to make sure that all previous operation are finished before continuing.
            "xor %%rax,%%rax\n\tcpuid\n\t" // 1 load
                        
            // Read PMC configured with MEM_LOAD_UOPS_RETIRED.L1_MISS. Subtract the first read from this one. Store the result in r15.
            "xor %%rcx,%%rcx\n\trdpmc\n\tshl $32,%%rdx\n\tor %%rdx,%%rax\n\tmov %%rax,%%r15\n\t"
            
            // Serialize to make sure that all previous operation are finished before continuing.
            "xor %%rax,%%rax\n\tcpuid\n\t" // 1 load

            // STORE: compute the per-iteration miss count (r15 - r9), write it to
            // the results array pointed to by r10, flush that line, and advance r10.
            "mov %%r10,%%rbx\n\t"
            "sub %%r9,%%r15\n\t"
            "movq %%r15, (%%rbx)\n\t"
            "clflush (%%r10)\n\t"
            "add $8,%%r10\n\t"
            
            // Serialize to make sure that all previous operation are finished before continuing.            
            "xor %%rax,%%rax\n\tcpuid\n\t" // 1 load


        "dec %%r11\n\t"
        "jnz loop\n\t"
        "endloop:"
            
            // rax/rbx are read-write operands; r12 and memory are also clobbered.
            :"+a"(in_buf),"+b"(in_T)
            :
            :"rcx","rdx","r8","r9","r10","r11","r12","r15","memory","cc");
    
    

}


int main()
{
    // Configure the PMC by writing to /dev/CONFIG_MODULE0 (this file is created by a kernel module I made)
    int pmcfd;
    char *pmcname;
    pmcname = "MEM_LOAD_UOPS_RETIRED.L1_MISS";
    pmcfd = open("/dev/CONFIG_MODULE0", O_RDWR);
    assert(pmcfd != -1);
    write(pmcfd, "CHANGEPMC=0",strlen("CHANGEPMC=0"));
    write(pmcfd, pmcname,strlen(pmcname));
    close(pmcfd);

    // Allocate a memory buffer and align it to L1 set 0
    cache_L1 = calloc(6*CACHE_SIZE,1);
    unsigned long currentSet = ((unsigned long)cache_L1 >> 6) & 63;
    while(currentSet != 0){
        cache_L1 += 64; // cache line size to jump to the next set 
        currentSet = ((unsigned long)cache_L1 >> 6) & 63;
    }

    // Allocate an array to store results
    T = calloc(1000000,sizeof(unsigned long));
    

    // Perform the measures
    measures();

    // Print the measures
    for(int nbrmes=0;nbrmes<1000000;++nbrmes)
        printf("%ld\n",T[nbrmes]);

}

 

McCalpinJohn
Honored Contributor III

I don't have specific answers to your questions, but I will note that when I have used CLFLUSH, I have needed MFENCE instructions between the CLFLUSH and the associated loads.  On SKX processors, I was able to get good results (>99% of the expected number of misses) with an inner loop structure like:

  • load
  • mfence
  • clflush
  • mfence

This is used in lines 660-667 of https://github.com/jdmccalpin/SKX-SF-Conflicts/blob/master/SF_test_offsets.c, for example.
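In C with compiler intrinsics, that inner-loop structure would look roughly like this (just a sketch of the ordering, not the exact code from the repository):

#include <x86intrin.h>

// Sketch of the load / mfence / clflush / mfence ordering for one line.
static inline void load_then_flush(unsigned long *p)
{
    volatile unsigned long dummy = *p;  // load the line
    (void)dummy;
    _mm_mfence();                       // fence between the load and the flush
    _mm_clflush(p);                     // evict the line
    _mm_mfence();                       // fence before any subsequent loads
}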

I have also seen that the specific ordering required to get the expected results with CLFLUSH depends on the processor model.   I have never gotten reliable results on Xeon Phi x200 (KNL), for example....

linson__sebastien

Thank you, I will try to play with those barriers.

I also have to add the following information: my problem seems to depend on whether the binary is compiled statically or not:

$ gcc -static -O0 test.c
$ ./a.out | sort | uniq -c

 999974 0    # <---------- no cache miss
     26 1

 

$ gcc -O0 test.c
$ ./a.out | sort | uniq -c

   2699 0
 997301 1    # <---------- one cache miss (997301 times)

 

Some extra information about my OS and gcc version:

$ uname -a

Linux archlinux 5.6.15-arch1-1 #1 SMP PREEMPT Wed, 27 May 2020 23:42:26 +0000 x86_64 GNU/Linux

$ gcc -v

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/10.1.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++,d --with-isl --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-cet=auto --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-install-libiberty --enable-linker-build-id --enable-lto --enable-multilib --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-libunwind-exceptions --disable-werror gdc_include_dir=/usr/include/dlang/gdc
Thread model: posix
Supported LTO compression algorithms: zlib zstd

 

Depending on whether my binary is compiled statically or dynamically, I observed differences in the following PMCs:

L2_RQSTS.DEMAND_DATA_RD_MISS
L2_RQSTS.ALL_DEMAND_DATA_RD
L2_RQSTS.ALL_DEMAND_MISS
L2_RQSTS.ALL_DEMAND_REFERENCES
L2_RQSTS.MISS
L2_RQSTS.REFERENCES
LONGEST_LAT_CACHE.REFERENCE
LONGEST_LAT_CACHE.MISS
CPU_CLK_UNHALTED.THREAD_P
CPU_CLK_THREAD_UNHALTED.REF_XCLK
L1D_PEND_MISS.PENDING
L1D.REPLACEMENT
CPL_CYCLES.RING123
RS_EVENTS.EMPTY_CYCLES
OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD
OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD
ILD_STALL.IQ_FULL
IDQ_UOPS_NOT_DELIVERED.CORE
UOPS_EXECUTED_PORT.PORT_1
UOPS_EXECUTED_PORT.PORT_6
CYCLE_ACTIVITY.CYCLES_L2_PENDING
CYCLE_ACTIVITY.CYCLES_LDM_PENDING
CYCLE_ACTIVITY.STALLS_L2_PENDING
CYCLE_ACTIVITY.CYCLES_L1D_PENDING
CYCLE_ACTIVITY.STALLS_L1D_PENDING
OFFCORE_REQUESTS.DEMAND_DATA_RD
OFFCORE_REQUESTS.ALL_DATA_RD
UOPS_EXECUTED.CORE
MEM_LOAD_UOPS_RETIRED.L1_HIT
MEM_LOAD_UOPS_RETIRED.L1_MISS
MEM_LOAD_UOPS_RETIRED.L2_MISS
MEM_LOAD_UOPS_RETIRED.L3_MISS
MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM
L2_TRANS.DEMAND_DATA_RD
L2_TRANS.L2_FILL
L2_TRANS.ALL_REQUESTS
L2_LINES_IN.E
L2_LINES_IN.ALL
L2_LINES_OUT.DEMAND_CLEAN

Maybe something is evicting my data from the LLC and, because of inclusivity, also from the L1.

I'm still investigating.

HadiBrais
New Contributor III

I have a few suggestions:

  • I don't think calloc is guaranteed to actually write to all bytes in the specified memory region. Add a loop after the allocation that explicitly initializes the buffer and see if the results are different (a small sketch follows this list).
  • Remove all load instructions from [r12] while keeping all other instructions and see if there are any unexpected L1D hit or miss counts.
  • Remove all the code related to measuring performance events and use "perf record" to capture samples on the L1 hit or miss events. These events support precise sampling, so append ":pp" to the event name to capture samples precisely on the instructions that caused them.
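For the first point, the initialization loop could be as simple as this (a sketch; the buffer name and size are taken from the posted code):

#include <stddef.h>

// Sketch: explicitly write every byte of the buffer so all pages are
// actually allocated and initialized before the measurements.
static void touch_buffer(unsigned char *buf, size_t size)
{
    for (size_t i = 0; i < size; ++i)
        buf[i] = (unsigned char)i;
}

// e.g. call touch_buffer(cache_L1, 6 * CACHE_SIZE); right after the calloc in main()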

Also it's not clear to me how you're measuring L1 hits. Are you measuring MEM_LOAD_UOPS_RETIRED.L1_HIT?

 
