Why is mfence behaviour different between CPU powersave mode and performance mode?

JasperMa · 02-15-2021

I'm running the following code to measure the load time when accessing one specific cache line many times. Ideally, every access after the first should hit in the cache. However, when I run the test 1000 times, most accesses appear to be cache misses. This is the result in CPU powersave mode (the time column shows CPU cycles):

Repeat access to one elems for 1000 times
time: 270, cnt: 436, percent:43.600%
time: 275, cnt: 386, percent:38.600%
time: 280, cnt: 105, percent:10.500%
time: 285, cnt: 11, percent:1.100%
.........
Cache hit ratio: 0.000%

Repeat access to one elems for 100000 times
time: 270, cnt: 41526, percent:41.526%
time: 275, cnt: 39436, percent:39.436%
time: 280, cnt: 9917, percent:9.917%
time: 285, cnt: 1285, percent:1.285%
......
Cache hit ratio: 0.000%

If I switch the CPU to performance mode, I get the following result, which shows that only the first access is a cache miss.

Repeat access to one elems for 1000 times
time: 90, cnt: 43, percent:4.300%
time: 95, cnt: 175, percent:17.500%
time: 100, cnt: 588, percent:58.800%
time: 105, cnt: 193, percent:19.300%
time: 270, cnt: 1, percent:0.100%
Cache hit ratio: 99.900%

Repeat access to one elems for 100000 times
time: 90, cnt: 4845, percent:4.845%
time: 95, cnt: 12269, percent:12.269%
time: 100, cnt: 57276, percent:57.276%
time: 105, cnt: 25587, percent:25.587%
time: 110, cnt: 8, percent:0.008%
time: 115, cnt: 14, percent:0.014%
time: 265, cnt: 1, percent:0.001%
Cache hit ratio: 99.999%

So I suspect the difference comes from how mfence behaves in powersave mode versus performance mode. Does anyone know anything about this? Or is my guess wrong?

Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>



void flush(void* p) {
    asm volatile ("clflush (%0)\n"
      :
      : "c" (p)
      : "rax");
}



/**
 * @brief Checks whether repeated accesses to one cache line hit in the cache.
 * Question: do we have to repeat the access many times before it starts to hit?
 * Ideally, after the first access to a cache line, the following accesses to it hit in the cache.
 */
void TestCacheHitCondition(int repeat_times){
    int temp12,a,b,c,d,e,f;
    int array[5*1024];
    int timer[200]={0}; 
    memset(array,-1,5*1024*sizeof(int));
    flush(&array[2048]); // only flush once before access it for many times

    for(int i=0; i< repeat_times; i++){
        asm volatile("mfence");
        asm volatile("rdtsc"
                            : "=a"(a), "=d"(d));
        temp12 = array[2048]; // the same element is accessed on every iteration
        asm volatile("mfence");
        asm volatile("rdtsc"
                            : "=a"(e), "=d"(f));
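        unsigned idx = ((unsigned)e - (unsigned)a) / 5; // bucket the TSC delta in steps of 5
        if (idx > 199) idx = 199;                       // clamp outliers so timer[] is never overrun
        timer[idx]++;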
    }
    printf("\nRepeat access to one elems for %d times\n", repeat_times);
    double hit_ratio = 0.0;
    for(int i=0; i< 200; i++){
        // "%%" prints a literal percent sign; a lone "%" before "\n" is an invalid conversion
        if(timer[i]>0) printf("time: %d, cnt: %d, percent:%.3f%%\n",5*i, timer[i], 1.0*timer[i]/repeat_times*100);
        if(i*5<150) hit_ratio += (1.0*timer[i]/repeat_times*100); // below 150 cycles counts as a hit
    }
    printf("Cache hit ratio: %.3f%%\n", hit_ratio);
}



int main(){
    TestCacheHitCondition(1000);
    TestCacheHitCondition(100000);

    return 0;
}

HadiBrais · 02-26-2021

Assuming the load from array[2048] was not optimized away by the compiler, which you can check by looking at the generated assembly code, the vast majority of loads are definitely hitting in the L1D, irrespective of the frequency scaling policy. The problem is that there is a clear misunderstanding of what is being measured.
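
One way to make sure the load survives optimization is to perform it through a volatile-qualified pointer. The following is only a minimal sketch: force_load is a hypothetical helper name that does not appear in the posted code, and it is still worth confirming the result in the disassembly (for example with gcc -S or objdump -d).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the posted code): perform one real load
 * from *p. Reading through a volatile-qualified pointer obliges the compiler
 * to emit the memory access even if the returned value is never used. */
static inline int force_load(const int *p) {
    assert(p != NULL);
    return *(const volatile int *)p;
}
```

In the timed region, the plain assignment could then be written as temp12 = force_load(&array[2048]);.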

Measurements taken with RDTSC are in units of TSC cycles, and the duration of a TSC cycle is determined by the TSC frequency. The latency of a load from array[2048], on the other hand, is determined by the core frequency; other factors come into play only if the demand request goes to the uncore on an L2 miss (on modern Intel processors), which is unlikely in this case. Core frequency doesn't affect the latency of an L1D or L2 hit in terms of core cycles, but it does affect the latency in terms of TSC cycles: the higher the core frequency, the smaller the load latency in TSC cycles, and vice versa. This is exactly why the average measured latency in TSC cycles is much smaller in "performance mode" than in "powersave mode," even though the load hits most of the time. It also explains why using a constant threshold (150 TSC cycles in your code) doesn't make sense in general.
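
As a rough illustration of this scaling, the conversion can be sketched as below. The function name and the frequencies used in the note afterwards are made-up values for illustration, not measurements from the machine in the question.

```c
#include <assert.h>

/* Convert a duration expressed in core cycles into the number of TSC cycles
 * that elapse over the same wall-clock time:
 *   tsc_cycles = core_cycles * (tsc_hz / core_hz)
 * tsc_hz and core_hz are assumed, illustrative frequencies. */
static double core_to_tsc_cycles(double core_cycles, double tsc_hz, double core_hz) {
    assert(core_hz > 0.0);
    return core_cycles * tsc_hz / core_hz;
}
```

With an assumed 3.0 GHz TSC, a region that takes 90 core cycles reads as 270 TSC cycles when the core runs at 1.0 GHz and as 90 TSC cycles when the core runs at 3.0 GHz, even though nothing about the cache behaviour has changed.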
The latency of a load usually refers to the dispatch-to-use latency, but this is clearly not the latency that is being measured here, so calling these measurements load latencies can be confusing or misleading. The MFENCE and RDTSC instructions themselves significantly impact the latency being measured. You can obtain a measure of the overhead by measuring the latency of the region without the load instruction. That said, you only seem to be interested in distinguishing between loads sourced from different units (the question is ambiguous in this regard).
On Intel processors, the MFENCE instruction offers no architectural guarantee to order RDTSC, so the second MFENCE may not prevent the following RDTSC from being executed before the load completes. If that happens, the measured latency may not include the load latency. Use LFENCE instead in this particular case.
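
The last two points can be sketched together as follows. This is an illustration under stated assumptions rather than a definitive harness: it assumes GCC or Clang on x86 with the x86intrin.h intrinsics, and fenced_rdtsc, time_load and time_empty are hypothetical names that are not in the posted code.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang, x86 only) */

/* Read the TSC with LFENCE on both sides. On Intel processors, LFENCE waits
 * until all earlier instructions have completed locally, and later
 * instructions do not begin executing until it completes. */
static inline uint64_t fenced_rdtsc(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();   /* full 64-bit counter, so no 32-bit wraparound */
    _mm_lfence();
    return t;
}

/* Time one load from *p in TSC cycles, fence and RDTSC overhead included. */
static uint64_t time_load(const int *p) {
    assert(p != 0);
    uint64_t start = fenced_rdtsc();
    (void)*(const volatile int *)p;   /* volatile read: the load is not elided */
    uint64_t end = fenced_rdtsc();
    return end - start;
}

/* Time the same region without the load to estimate the measurement overhead. */
static uint64_t time_empty(void) {
    uint64_t start = fenced_rdtsc();
    uint64_t end = fenced_rdtsc();
    return end - start;
}
```

Comparing the distribution of time_load(&array[2048]) against that of time_empty() over many iterations gives a rough idea of how much of each sample is overhead. Both values are still in TSC cycles, so they remain subject to the core-frequency effect described above.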

Other than the source of the load, the addressing mode, the address translation process, and the effective memory type of the target address also impact the dispatch-to-use latency in terms of core cycles of an isolated load instruction like the one in your region of interest. Current Intel processors don't support value prediction on the result of load instructions.

I guess this may help you with your other question on the L1 IP-based prefetcher (which I've not read yet).

By the way, I've discussed all of these issues before in multiple posts on Stack Overflow.

JasperMa · 02-26-2021

Thanks for your reply. I will read the other posts you made on Stack Overflow.