Measuring latency between RAM and cache

NC_s_ · ‎10-18-2016

Hello,

I am trying to measure the difference between the time to access the cache and the time to access RAM.
My first idea is:

execute a load so that the data is brought into the cache
measure the time it takes to access it again (short, because it is a cache hit)
clflush the address. This invalidates the cache line at all levels.
measure the time it takes to access it again (long, because it is a cache miss)

The time for a cache hit is 66 cycles and the time for a cache miss is 270 cycles.

Then I discovered that the cache can be disabled by setting the CD bit in the CR0 register. So I disabled the caches and ran the same steps again; the results were 9699 and 9486 cycles.

I expected around 270.

Do you have any explanation for this big difference? Maybe CD disables all kinds of caches (including the TLB...) and not only L1, L2, and L3?

The code is the following:

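#include <stdio.h>

volatile char a[78];//volatile so the compiler keeps every load of a[50]

unsigned long long gettsc()
{
        unsigned long long t;
        asm volatile(
                "mfence\n\t"
                "rdtsc\n\t"
                "shl $32,%%rdx\n\t"
                "or %%rdx,%%rax\n\t":"=a"(t)::"rdx","memory");//"memory" keeps loads from being moved across the timestamp
        return t;
}
int main()
{
        unsigned long long t0,t1;
        int j;
        char b;
        t0=gettsc();//roughly put t0 and gettsc instructions in cache
        t1=gettsc();//same for t1
        for(j=0;j<2;j++)//the first iteration is here to put the following instructions in cache
        {
                b=a[50];
                t0=gettsc();
                b=a[50];
                t0=gettsc()-t0;
                asm volatile("mfence\n\tclflush (%0)\n\t"::"r"(&a[50]):"memory");//FLUSHADDR(&a[50]);
                t1=gettsc();
                b=a[50];
                t1=gettsc()-t1;
        }
        (void)b;//silence the unused-variable warning
        printf("%llu\n",t0);//cycles for the cached access
        printf("%llu\n",t1);//cycles for the access after clflush
        return 0;
}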

McCalpinJohn · ‎10-18-2016

Setting CR0.CD should disable caching of the instructions as well as data. This will also disable caching of page translation information in the cache hierarchy.

Your first approach was a better idea, but a more common approach is to perform a sequence of dependent loads so that the overhead of any CLFLUSH instructions, memory fences, and RDTSC calls is minimized.

Whenever you perform more than one load operation, you are implicitly making many decisions about how the addresses map into the address translation mechanisms (both documented and undocumented), the caches, and the DRAM. This rapidly becomes an extremely complex topic.
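As a rough illustration of the dependent-load approach mentioned above, here is a minimal sketch (not code from this thread). It assumes a POSIX system and uses clock_gettime instead of RDTSC for portability; the names build_chain, chase, and ns_per_load are hypothetical. Each load address comes from the previous load, so the loads cannot overlap and the timer overhead is amortized over many accesses.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Fill next[] with one random cycle over all n slots (Sattolo's algorithm),
   so following next[] from slot 0 visits every slot before returning to 0. */
static void build_chain(size_t *next, size_t n)
{
    assert(n > 1);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j < i is what guarantees a single cycle */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Perform `steps` dependent loads starting at `start`; return the final index. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    for (size_t s = 0; s < steps; s++)
        p = next[p];                     /* address depends on the previous load */
    return p;
}

/* Average wall-clock nanoseconds per load. The final index is written to
   *sink so the compiler cannot discard the loop. */
static double ns_per_load(const size_t *next, size_t steps, size_t *sink)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *sink = chase(next, 0, steps);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

To compare cache and DRAM latency with this sketch, one would run it once with a chain small enough to fit in L1 and once with a chain much larger than the last-level cache. Several size_t slots share one cache line here, so the large-array number understates the true miss latency; padding each slot to a full line is the usual refinement, and the address-mapping effects described above still apply.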

NC_s_ · ‎10-19-2016

Nice answer, thank you very much.

Just one more thing: is executing mfence equivalent to executing sfence then sfence?

McCalpinJohn · ‎10-19-2016

MFENCE and SFENCE have different semantics, so there is no general equivalence. The performance may be the same or different, depending on the presence of loads or stores in the instruction stream on either side of the fence.
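To make the difference in semantics concrete, here is a small sketch (not code from this thread) using the compiler intrinsics _mm_sfence and _mm_mfence rather than inline assembly. It assumes an x86 target with SSE2; the variables payload and ready and the functions publish and store_then_load are hypothetical. SFENCE orders earlier stores before later stores, which is what publishing a weakly ordered non-temporal store needs. MFENCE additionally orders loads, so it is the one to use when a later load must not complete before an earlier store is visible.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_stream_si32, _mm_mfence (SSE2) */
#include <xmmintrin.h>   /* _mm_sfence (SSE) */

static int payload;          /* hypothetical shared data */
static volatile int ready;   /* hypothetical flag another thread would poll */

/* Store-to-store ordering: the non-temporal store is weakly ordered, so an
   SFENCE is placed before the flag store that publishes it. */
static void publish(int value)
{
    _mm_stream_si32(&payload, value);
    _mm_sfence();
    ready = 1;
}

/* Store-to-load ordering: x86 may let a later load complete before an
   earlier store to a different address is globally visible. MFENCE
   prevents that; SFENCE does not order loads at all. */
static int store_then_load(volatile int *mine, volatile int *theirs)
{
    *mine = 1;
    _mm_mfence();
    return *theirs;
}
```

In a single thread both functions behave the same with or without the fences; the difference only matters to another core observing the order of the memory operations, which is also why the relative cost depends on what loads and stores surround the fence.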