Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Overhead difference while flushing dirty and clean cache line

I measured the overhead difference between flushing a dirty and a clean cache line, but the results vary a lot from run to run. Could someone explain why?
Results (in cycles):
Dirty vs. Clean
356        228
988        44
992        56
56         44
1032       24
452        24

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
static size_t MM = 90000000;

static inline void cflush_yh(volatile int *p) {
	asm volatile ("clflush (%0)" :: "r"(p));
}

static __inline__ unsigned long long rdtsc(void){
	unsigned int lo, hi;
	/* the "=A" constraint only reads EDX:EAX correctly in 32-bit mode,
	 * so read the two halves explicitly and combine them */
	__asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
	return ((unsigned long long)hi << 32) | lo;
}
int main(void) {
    int *a, *b;
    size_t i;
    unsigned long long cycle;
    unsigned long long elapsed1, elapsed2;

    a = (int *) malloc(MM*sizeof(int));
    b = (int *) malloc(MM*sizeof(int));

    for(i = 0; i < MM; ++i){
        a[i] = i;
        b[i] = MM-i;
    }

    b[0] = a[0];                /* the write leaves b[0]'s cache line dirty */
    cycle = rdtsc();
    cflush_yh(b);               /* flush the dirty line */
    elapsed1 = rdtsc() - cycle;

    printf("b[0]=%d\n", b[0]);  /* the read brings the line back in clean */
    cycle = rdtsc();
    cflush_yh(b);               /* flush the clean line */
    elapsed2 = rdtsc() - cycle;

    printf("Time for dirty cache line is %llu\n", elapsed1);
    printf("Time for clean cache line is %llu\n", elapsed2);

    free(a);
    free(b);
    return 0;
}


Black Belt

It is not reasonable to try to measure the overhead of a single instruction on any modern out-of-order processor.   In particular, the RDTSC instruction is not ordered with respect to other instructions, so the second RDTSC could begin execution well before the instructions that you want to time have completed.  The RDTSCP instruction provides a bit more control, but it is still not enough to provide an unambiguous interpretation of the overhead of a single instruction.
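To make the caveat concrete, here is one way the serialization could be sketched with the RDTSCP pattern the reply mentions (a sketch, not a rigorous measurement; it assumes GCC/Clang on x86-64 with the `x86intrin.h` intrinsics, and the `timed_clflush` name is illustrative):

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence, _mm_clflush */

/* Time one CLFLUSH with serialized timestamps: LFENCE keeps RDTSC from
 * executing early, and RDTSCP waits for preceding instructions before
 * reading the counter.  Even so, the result still includes the timing
 * overhead itself, so treat it as an upper bound, not an exact cost. */
static inline uint64_t timed_clflush(const volatile void *p)
{
    unsigned int aux;            /* receives IA32_TSC_AUX; unused here */
    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_clflush((const void *)p);
    uint64_t end = __rdtscp(&aux);
    _mm_lfence();
    return end - start;
}
```

In practice you would call this in a loop and take the minimum over many trials to suppress interrupts and other noise, which is exactly why the block-slope approach below the single-instruction approach is more trustworthy.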

As an alternative approach, you might try:

  1. Copy a block of elements from a[] to b[], where the block size is variable.
  2. Follow this with a loop executing one CLFLUSH per cache line on the elements of b[] that you just wrote.
  3. Follow this with a loop executing one CLFLUSH per cache line on the elements of a[] that you just wrote.

If you vary the block size from 1 element to 4096 elements, all of the data should still be in the L1 Data Cache, and you will be able to look at the slope to estimate the overhead per CLFLUSH.  You will probably want to reverse the order of steps 2 and 3 and see if that changes the results.
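The three steps above might be sketched as follows (a sketch under the assumptions of a 64-byte cache line and GCC/Clang x86 intrinsics; the sizes and the `flush_block`/`flush_slope` names are illustrative, not from the reply):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_mfence */

#define LINE  64         /* assumed cache-line size in bytes */
#define MAXN  4096       /* largest block in ints; 16 KiB fits in L1D */

static int a[MAXN], b[MAXN];

/* Steps 2 and 3: issue one CLFLUSH per cache line covering n ints
 * starting at p, then fence and return the elapsed TSC cycles. */
static uint64_t flush_block(int *p, int n)
{
    uint64_t start = __rdtsc();
    for (char *c = (char *)p; c < (char *)(p + n); c += LINE)
        _mm_clflush(c);
    _mm_mfence();        /* wait for the flushes before reading the TSC */
    return __rdtsc() - start;
}

/* Vary the block size; the slope of cycles vs. block size estimates
 * the per-CLFLUSH cost for dirty (b[]) and clean (a[]) lines. */
static void flush_slope(void)
{
    int i, n;
    for (i = 0; i < MAXN; i++)
        a[i] = i;
    for (n = 64; n <= MAXN; n *= 2) {
        memcpy(b, a, n * sizeof(int));       /* step 1: reads a[], dirties b[] */
        uint64_t dirty = flush_block(b, n);  /* step 2: just-written lines */
        uint64_t clean = flush_block(a, n);  /* step 3: just-read lines */
        printf("%5d ints: dirty %6llu cycles, clean %6llu cycles\n",
               n, (unsigned long long)dirty, (unsigned long long)clean);
    }
}
```

Note that a[] is still dirty from its initialization on the first pass; after that first flush, the memcpy reads bring its lines back in clean. Swapping the two flush_block calls, as suggested, is a cheap check for ordering effects.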
