Hi all,
I want to compare the execution time of a read instruction on an uncacheable page versus a cacheable page.
I tested some simple code, and the uncacheable page took about 10 times longer than the cacheable page.
Even though the array index is strided by the cache-line size so that every access misses the cache, it is hard to understand why there is still a 10x difference in execution time.
Could you please explain the reason for the difference (in terms of HW or SW)?
The code used in the experiment is attached below to help understanding.
#define BILLION 1000000000ULL  /* assumed; defined elsewhere in the original source */

void perf_time(unsigned int *pt, int size, int num_page, int &temp)
{
#ifdef TIME_PRINT
    struct timespec start, end;
    uint64_t diff;
#endif
    //volatile int sum=0;
    register int sum = 0;  /* initialized; the original left this indeterminate */
    int tsize = size / sizeof(int);

#ifdef TIME_PRINT
    clock_gettime(CLOCK_MONOTONIC, &start);  /* mark start time */
#endif
    //cache-line 64B
    for (int i = 0; i < tsize/16; i++) {
#ifdef TEST_WR
  #ifdef ALL_CACHE_HIT
        pt[i%16] = sum;   //write test
  #else
        pt[i*16] = sum;   //write test
  #endif
#else
  #ifdef ALL_CACHE_HIT
        //sum = pt[i%16]; //read test
        sum = pt[0];      //read test
  #else
        sum += pt[i*16];  //read test
  #endif
#endif
    }
#ifdef TIME_PRINT
    clock_gettime(CLOCK_MONOTONIC, &end);    /* mark the end time */
    diff = BILLION*(end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec);
    printf("CLOCK_MONOTONIC elapsed time = %llu nanoseconds page num %d\n",
           (long long unsigned int) diff, num_page);
#endif
#ifdef TEST_WR
    temp = pt[10];
#else
    printf("sum %d\r\n", sum);
#endif
}
//Page attribute set
//The kernel was modified to set the page attribute.
//unsigned int *pt = (unsigned int *)mmap(0, size, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS | MAP_WB, -1, 0);
//unsigned int *pt = (unsigned int *)mmap(0, size, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS | MAP_WT, -1, 0);
//unsigned int *pt = (unsigned int *)mmap(0, size, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS | MAP_WC, -1, 0);
unsigned int *pt = (unsigned int *)mmap(0, size, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS | MAP_UC, -1, 0);
//Disassembled code of the iterative section
  400af0: 8b 17          mov    (%rdi),%edx
  400af2: 48 83 c7 40    add    $0x40,%rdi
  400af6: 48 39 c7       cmp    %rax,%rdi
  400af9: 75 f5          jne    400af0 <_Z9perf_timePjiiRi+0x60>
Thank you
Changhyun
It is always a good idea to provide the absolute timings in addition to the ratio.
The main difference is that independent accesses to cached pages can execute concurrently, while accesses to uncacheable pages execute completely sequentially. Recent Intel processor cores support 10-12 concurrent L1 Data Cache misses, so it is entirely reasonable for the cached version to be 10x faster on independent accesses. The difference would be much smaller if the accesses were dependent -- such as in a pointer-chasing benchmark.
Depending on the specific processor you are using, the latency for uncached accesses should be in the range of 70-90 ns each.