I measured the difference in overhead between flushing a dirty cache line and flushing a clean one, but the results vary a lot from run to run. Could someone explain why?
Results (in cycles):

    Dirty   Clean
    356     228
    988     44
    992     56
    56      44
    1032    24
    452     24

Code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    static size_t MM = 90000000;

    /* Flush the cache line containing *p. */
    static inline void cflush_yh(volatile int *p)
    {
        __asm__ volatile ("clflush (%0)" :: "r"(p) : "memory");
    }

    /* Read the time-stamp counter. The "=A" constraint only means EDX:EAX
       on 32-bit x86, so both halves are read explicitly here. */
    static inline unsigned long long rdtsc(void)
    {
        unsigned int lo, hi;
        __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    int main(void)
    {
        int *a, *b;
        size_t i;
        unsigned long long cycle;
        unsigned long long elapsed1, elapsed2;

        a = (int *) malloc(MM * sizeof(int));
        b = (int *) malloc(MM * sizeof(int));
        if (a == NULL || b == NULL)
            return 1;

        for (i = 0; i < MM; ++i) {
            a[i] = (int)i;
            b[i] = (int)(MM - i);
        }

        b[0] = a[0];                  /* dirty the line holding b[0] */
        cycle = rdtsc();
        cflush_yh(&b[0]);
        elapsed1 = rdtsc() - cycle;

        printf("b[0]=%d\n", b[0]);    /* reload the line; it is now clean */
        cycle = rdtsc();
        cflush_yh(&b[0]);
        elapsed2 = rdtsc() - cycle;

        printf("Time for dirty cache line is %llu\n", elapsed1);
        printf("Time for clean cache line is %llu\n", elapsed2);
        free(a);
        free(b);
        return 0;
    }
It is not reasonable to try to measure the overhead of a single instruction on any modern out-of-order processor. In particular, the RDTSC instruction is not ordered with respect to other instructions, so the second RDTSC could begin execution well before the instructions that you want to time have completed. The RDTSCP instruction provides a bit more control, but it is still not enough to provide an unambiguous interpretation of the overhead of a single instruction.
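To make the ordering problem concrete, here is a minimal sketch of a fenced timing wrapper. It assumes GCC or Clang on x86-64 with `<x86intrin.h>`, and the helper names (`tsc_begin`, `tsc_end`, `time_one_clflush`) are made up for illustration. The fences narrow the window in which the timestamp reads can drift relative to the CLFLUSH, but the reported number still includes the cost of the fences and of the timer reads themselves, so it should not be read as the latency of one instruction.

```c
#include <stdint.h>
#include <x86intrin.h>  /* GCC/Clang: __rdtsc, __rdtscp, _mm_lfence, _mm_mfence, _mm_clflush */

/* Start timestamp: MFENCE drains earlier stores, LFENCE waits for earlier
   instructions to complete before RDTSC is allowed to execute. */
static inline uint64_t tsc_begin(void)
{
    _mm_mfence();
    _mm_lfence();
    return __rdtsc();
}

/* End timestamp: RDTSCP waits for earlier instructions to execute; the
   trailing LFENCE keeps later instructions from starting before it. */
static inline uint64_t tsc_end(void)
{
    unsigned int aux;               /* receives IA32_TSC_AUX, unused here */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Hypothetical helper: cycles for one CLFLUSH plus the fence and timer
   overhead around it. */
static uint64_t time_one_clflush(const void *p)
{
    uint64_t t0 = tsc_begin();
    _mm_clflush(p);
    _mm_mfence();                   /* MFENCE orders the preceding CLFLUSH */
    return tsc_end() - t0;
}
```

Calling `time_one_clflush(&b[0])` back to back on an already flushed line gives a rough idea of the fixed overhead that is folded into every such measurement.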
As an alternative approach, you might try the following:
- Copy a block of elements from a[] to b[], where the block size is variable.
- Follow this with a loop executing one CLFLUSH per cache line on the elements of b[] that you just wrote.
- Follow this with a loop executing one CLFLUSH per cache line on the elements of a[] that you just read.
If you vary the block size from 1 element to 4096 elements (16 KiB per array with 4-byte ints), all of the data should still be in the L1 Data Cache, and you can use the slope of elapsed cycles versus number of lines flushed to estimate the overhead per CLFLUSH. You will probably want to swap the order of the two CLFLUSH loops and see if that changes the results.
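The procedure above could be sketched roughly as follows. This is only an illustration under stated assumptions: GCC or Clang on x86-64, a 64-byte cache line, and invented names (`tsc_fenced`, `flush_block`, `measure`, `sweep`). It copies a block from a[] to b[], flushes the lines of b[] (dirty) and then a[] (clean), and prints the cycle counts for doubling block sizes so that the slope can be read off.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>  /* GCC/Clang: __rdtsc, _mm_lfence, _mm_mfence, _mm_clflush */

#define LINE_BYTES 64                              /* assumed cache line size */
#define INTS_PER_LINE (LINE_BYTES / sizeof(int))

/* Timestamp with fences on both sides to limit reordering. */
static inline uint64_t tsc_fenced(void)
{
    _mm_mfence();
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Flush every line covering p[0..n-1] (p must be line-aligned); return cycles. */
static uint64_t flush_block(const int *p, size_t n)
{
    uint64_t t0 = tsc_fenced();
    for (size_t i = 0; i < n; i += INTS_PER_LINE)
        _mm_clflush(&p[i]);
    return tsc_fenced() - t0;      /* the leading MFENCE orders the flushes */
}

/* Copy n ints from a to b, then flush b (dirty lines) and a (clean lines).
   a_first != 0 reverses the order of the two flush loops. */
static void measure(const int *a, int *b, size_t n, int a_first,
                    uint64_t *dirty, uint64_t *clean)
{
    volatile int *dst = b;         /* volatile so the stores are not elided */
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i];
    if (a_first) {
        *clean = flush_block(a, n);
        *dirty = flush_block(b, n);
    } else {
        *dirty = flush_block(b, n);
        *clean = flush_block(a, n);
    }
}

/* Print cycles for n = 1, 2, 4, ..., max_n elements. Returns 0 on success. */
static int sweep(size_t max_n, FILE *out)
{
    size_t bytes = (max_n * sizeof(int) + LINE_BYTES - 1) / LINE_BYTES * LINE_BYTES;
    int *a = aligned_alloc(LINE_BYTES, bytes);
    int *b = aligned_alloc(LINE_BYTES, bytes);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1;
    }
    for (size_t i = 0; i < max_n; ++i) {           /* touch every page once */
        a[i] = (int)i;
        b[i] = 0;
    }
    flush_block(a, max_n);                         /* start with both arrays uncached */
    flush_block(b, max_n);

    fprintf(out, "%8s %6s %10s %10s\n", "elems", "lines", "dirty", "clean");
    for (size_t n = 1; n <= max_n; n *= 2) {
        uint64_t dirty, clean;
        measure(a, b, n, 0, &dirty, &clean);
        size_t lines = (n + INTS_PER_LINE - 1) / INTS_PER_LINE;
        fprintf(out, "%8zu %6zu %10llu %10llu\n", n, lines,
                (unsigned long long)dirty, (unsigned long long)clean);
    }
    free(a);
    free(b);
    return 0;
}
```

Calling `sweep(4096, stdout)` from `main` prints one row per block size. Individual rows are still noisy, so it is probably worth repeating each size several times and keeping the minimum before fitting a slope, and passing a nonzero `a_first` to `measure` to check whether the order of the two flush loops matters.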