Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Uncacheable page vs Cacheable page

Kim__Changhyun

Hi all,

I want to compare the execution time of read instructions on an uncacheable page and on a cacheable page.

I ran a simple test, and reads from the uncacheable page took about 10 times longer than reads from the cacheable page.

The array index is chosen so that every access misses the cache (a stride of one cache line), so I do not understand why the uncacheable case still takes 10 times longer.

Could you please explain the reason for the difference, in terms of hardware or software?

The code used in the experiment is included below to make the setup clear.

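#include <stdint.h>    // uint64_t
#include <stdio.h>     // printf
#include <time.h>      // clock_gettime, struct timespec
#include <sys/mman.h>  // mmap

#define BILLION 1000000000LL  // nanoseconds per second (assumed; the definition is not shown in the post)

void perf_time(unsigned int *pt, int size, int num_page, int &temp)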
{

#ifdef TIME_PRINT
    struct timespec start, end;
    uint64_t diff;
#endif    
    
    //volatile int sum=0;
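    // Start from zero: the read test accumulates with "sum += ...",
    // which would otherwise read an uninitialized value.
    int sum = 0;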

    int tsize = size / sizeof(int); 

#ifdef TIME_PRINT
    clock_gettime(CLOCK_MONOTONIC, &start); /* mark start time */
#endif        

    // One access per 64-byte cache line: 16 ints * 4 bytes = 64 bytes
    for (int i = 0; i < tsize/16; i++){

#ifdef TEST_WR
        #ifdef ALL_CACHE_HIT
            pt[i%16] = sum;   //write test
        #else
            pt[i*16] = sum;   //write test
        #endif

#else
        #ifdef ALL_CACHE_HIT
            //sum = pt[i%16]; //read test
            sum = pt[0]; //read test
        #else            
            sum += pt[i*16]; //read test
        #endif
#endif
    }
    
#ifdef TIME_PRINT    
    clock_gettime(CLOCK_MONOTONIC, &end); /* mark the end time */
    diff = BILLION*(end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec);
    printf("CLOCK_MONOTONIC elapsed time = %llu nanoseconds page num %d\n", (long long unsigned int) diff, num_page);
#endif
    
#ifdef TEST_WR    
    temp = pt[10];
#else    
    printf("sum %d\r\n", sum);
#endif    
    
}

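// Page attribute setup
// The kernel was modified to set the page attribute. MAP_WB, MAP_WT, MAP_WC and MAP_UC
// are custom mmap flags from that modification, not standard Linux flags.
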
//unsigned int *pt = (unsigned int *)mmap(0, size, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS | MAP_WB, -1, 0);
//unsigned int *pt = (unsigned int *)mmap(0, size, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS | MAP_WT, -1, 0);
//unsigned int *pt = (unsigned int *)mmap(0, size, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS | MAP_WC, -1, 0);
unsigned int *pt = (unsigned int *)mmap(0, size, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS | MAP_UC, -1, 0);

 

// Disassembly of the loop body

  400af0:	8b 17                	mov    (%rdi),%edx
  400af2:	48 83 c7 40          	add    $0x40,%rdi
  400af6:	48 39 c7             	cmp    %rax,%rdi
  400af9:	75 f5                	jne    400af0 <_Z9perf_timePjiiRi+0x60>

Thank you

Changhyun

1 Solution
McCalpinJohn

It is always a good idea to provide the absolute timings in addition to the ratio.

The main difference is that independent accesses to cached pages can execute concurrently, while accesses to uncacheable pages execute completely sequentially.  Recent Intel processor cores support 10-12 concurrent L1 Data Cache misses, so it is entirely reasonable for the cached version to be 10x faster on independent accesses.   The difference would be much smaller if the accesses were dependent -- such as in a pointer-chasing benchmark.
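The contrast between independent and dependent accesses can be tried on ordinary cacheable memory with a small sketch like the one below. Nothing in it comes from the original post: the names `Node`, `make_cycle`, `sum_independent`, `chase` and `ns_per_access` are hypothetical, the 64-byte line size is an assumption, and the buffer needs to be much larger than the last-level cache for the loads to miss at all.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// One node per 64-byte cache line (assumed line size), so each access
// touches a different line.
struct Node {
    std::uint32_t next;   // index of the next node in the chase
    std::uint32_t value;  // payload read by the independent loop
    char pad[56];         // pad the node out to 64 bytes
};
static_assert(sizeof(Node) == 64, "Node should fill one cache line");

// Link all n nodes into a single random cycle that visits each node once.
std::vector<Node> make_cycle(std::size_t n, std::uint32_t seed) {
    assert(n > 0);
    std::vector<std::uint32_t> order(n);
    std::iota(order.begin(), order.end(), 0u);
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    std::vector<Node> nodes(n);
    for (std::size_t i = 0; i < n; ++i) {
        nodes[order[i]].next = order[(i + 1) % n];
        nodes[order[i]].value = 1;
    }
    return nodes;
}

// Independent loads: no address depends on an earlier load, so the core
// can keep several cache misses in flight at once.
std::uint64_t sum_independent(const std::vector<Node>& nodes) {
    std::uint64_t sum = 0;
    for (const Node& n : nodes) sum += n.value;
    return sum;
}

// Dependent loads (pointer chasing): each address comes from the previous
// load, so the misses cannot overlap.
std::uint32_t chase(const std::vector<Node>& nodes, std::size_t steps) {
    std::uint32_t i = 0;
    for (std::size_t s = 0; s < steps; ++s) i = nodes[i].next;
    return i;
}

// Time one call of f() and return nanoseconds per access. The result is
// stored to a volatile so the compiler cannot drop the loop.
template <class F>
double ns_per_access(F f, std::size_t accesses) {
    auto t0 = std::chrono::steady_clock::now();
    volatile std::uint64_t sink = f();
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() /
           static_cast<double>(accesses);
}
```

For example, with `auto nodes = make_cycle(1u << 22, 1);` (256 MiB of nodes), one could compare `ns_per_access([&] { return sum_independent(nodes); }, nodes.size())` against `ns_per_access([&] { return chase(nodes, nodes.size()); }, nodes.size())`. The chase figure should be close to the full memory latency per access (plus TLB effects), while the independent loop also benefits from overlapping misses and hardware prefetching.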

Depending on the specific processor you are using, the latency for uncached accesses should be in the range of 70-90 ns each.
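As a rough sanity check on absolute timings, the latency and concurrency figures above can be combined into a back-of-the-envelope estimate. The model below is a simplification and an assumption, not a measurement: it treats total time as accesses times latency divided by the number of accesses in flight, and the helper names `estimated_ns` and `lines_in` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Number of 64-byte cache lines in a buffer (line size assumed),
// which is also the number of accesses the test loop performs.
std::size_t lines_in(std::size_t bytes) { return bytes / 64; }

// Simplified model: total time = accesses * latency / accesses in flight.
// Use in_flight = 1 for uncacheable memory, where accesses are serialized.
double estimated_ns(std::size_t accesses, double latency_ns, unsigned in_flight) {
    assert(in_flight > 0);
    return static_cast<double>(accesses) * latency_ns /
           static_cast<double>(in_flight);
}
```

For a 1 MiB buffer (16384 lines) and an assumed 80 ns per access, `estimated_ns(16384, 80.0, 1)` gives about 1.31 ms for the fully serialized uncacheable case, while `estimated_ns(16384, 80.0, 10)` gives about 0.13 ms with ten misses in flight, which is in line with the roughly 10x ratio reported. Measured times far from these estimates would suggest something else is going on.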
