Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Analyzing execution time

Yukyoung_L_

Hello. I am working on analyzing execution time.

I have two similar pieces of code, shown below.

 

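1st code:

int size = 256*1024*1024;
int stride = 256;
char *array = malloc(size);   /* char * so the pointer arithmetic is standard C */
unsigned long off;            /* declared outside the loop because it is used after it */
for (off = 0; off < size; off += stride) {
    *(unsigned int *)(array+off) = off+stride;   /* each slot stores the offset of the next slot */
}
*(unsigned int *)(array+off-stride) = 0;   /* last slot wraps back to offset 0 */
int i = 10000000;
unsigned int offset = 0;
struct timeval start, end;
gettimeofday(&start, NULL);
while (i >= 1) {
    offset = *(unsigned int *)(array+offset);   /* dependent load: follow the chain */
    i--;
}
gettimeofday(&end, NULL);
*(volatile unsigned int *)(array+offset);   /* use offset so the loop is not optimized away */
/* elapsed time in microseconds; the cast matches the %.2f format */
printf("%.2f\n", (double)((end.tv_sec-start.tv_sec)*1000000+(end.tv_usec-start.tv_usec)));

2nd code:

int size = 256*1024*1024;
int stride = 256;
char *array = malloc(size);   /* char * so the pointer arithmetic is standard C */
unsigned long off;            /* declared outside the loop because it is used after it */
for (off = 0; off < size; off += stride) {
    *(unsigned int *)(array+off) = off+stride;   /* each slot stores the offset of the next slot */
}
*(unsigned int *)(array+off-stride) = 0;   /* last slot wraps back to offset 0 */
int i = 10000000;
unsigned int offset = 0;
struct timeval start, end;
gettimeofday(&start, NULL);
/* ONE is a single dependent load; the macros below unroll it 100 times */
#define ONE offset = *(unsigned int *)(array+offset);
#define FIVE ONE ONE ONE ONE ONE
#define TEN FIVE FIVE
#define FIFTY TEN TEN TEN TEN TEN
#define HUNDRED FIFTY FIFTY
while (i >= 1000) {
    /* 10 x HUNDRED = 1000 loads per iteration */
    HUNDRED HUNDRED HUNDRED HUNDRED HUNDRED
    HUNDRED HUNDRED HUNDRED HUNDRED HUNDRED
    i -= 1000;
}
gettimeofday(&end, NULL);
*(volatile unsigned int *)(array+offset);   /* use offset so the loop is not optimized away */
/* elapsed time in microseconds; the cast matches the %.2f format */
printf("%.2f\n", (double)((end.tv_sec-start.tv_sec)*1000000+(end.tv_usec-start.tv_usec)));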
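
For anyone who wants to reproduce the measurement, here is a minimal, self-contained sketch of the pointer chase that both listings are meant to perform. The helper names build_chain, chase and time_chase_ns are hypothetical, and clock_gettime(CLOCK_MONOTONIC) is swapped in for gettimeofday as an assumption; this is not the exact program that produced the numbers in this post.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Store in every stride-th slot the offset of the next slot;
 * the last slot wraps back to offset 0 so the chain is a cycle. */
void build_chain(unsigned char *array, uint32_t size, uint32_t stride)
{
    assert(stride >= sizeof(uint32_t));
    assert(stride % sizeof(uint32_t) == 0 && size % stride == 0);
    for (uint32_t off = 0; off < size; off += stride) {
        uint32_t next = (off + stride < size) ? off + stride : 0;
        *(uint32_t *)(array + off) = next;
    }
}

/* Perform `steps` dependent loads starting at offset 0 and return
 * the final offset. The volatile access keeps the compiler from
 * removing or combining the loads. */
uint32_t chase(const unsigned char *array, uint32_t steps)
{
    uint32_t offset = 0;
    while (steps-- > 0) {
        offset = *(const volatile uint32_t *)(array + offset);
    }
    return offset;
}

/* Time `steps` dependent loads; returns elapsed nanoseconds and
 * writes the final offset to *final_offset. */
int64_t time_chase_ns(const unsigned char *array, uint32_t steps,
                      uint32_t *final_offset)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    *final_offset = chase(array, steps);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (end.tv_nsec - start.tv_nsec);
}
```

To mirror the setup in this post, allocate 256*1024*1024 bytes with malloc, call build_chain with a stride of 256, and then time 10000000 steps. Dividing the returned nanoseconds by the step count gives an average cost per load.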

 


Questions

1) The only difference between the two programs is the while loop.

Both measure the elapsed time of the while loop.

When I ran them on my computer (with hardware prefetching disabled), the first program reported 779,851,000 ns and the second reported 1,624,344,000 ns (2.1 times larger).

I thought this difference came from L1-i cache misses, so I measured L1-i cache misses with perf.

However, the first program had 34,541 L1-i cache misses and the second had 43,078 (1.2 times larger).

This cannot fully explain the difference in the elapsed times of the while loop.

What causes the large difference between the elapsed times of the two programs? Is there anything I am missing?
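
To put the two totals on a per-load basis, the short sketch below divides each elapsed time by the number of loads. It assumes that every one of the 10,000,000 iterations counted by i performs exactly one load and that the loop time is dominated by those loads; ns_per_load is a hypothetical helper name, not part of the original programs.

```c
#include <assert.h>
#include <stdio.h>

/* Average cost of one load, assuming the loop time is dominated by loads. */
double ns_per_load(double total_ns, double loads)
{
    assert(loads > 0.0);
    return total_ns / loads;
}

/* Print the per-load cost of both measurements and their ratio. */
void print_per_load_summary(void)
{
    const double loads = 10000000.0;        /* initial value of i */
    const double first_ns = 779851000.0;    /* reported by the first program */
    const double second_ns = 1624344000.0;  /* reported by the second program */

    printf("first:  %.1f ns/load\n", ns_per_load(first_ns, loads));
    printf("second: %.1f ns/load\n", ns_per_load(second_ns, loads));
    printf("ratio:  %.2f\n", second_ns / first_ns);
}
```

Under these assumptions the first program spends roughly 78 ns per load and the second roughly 162 ns per load, so the gap is about 84 ns per load rather than a fixed overhead.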

 

2) When I ran VTune's General Exploration (top-down) analysis on the first program, I got a total elapsed time of 1.125 s and a DRAM bound of 51.3 %. The elapsed time measured for the while loop (the program's output) was 926,295,000 ns.

I expected the DRAM stall time (1.125 * 51.3 / 100 = 0.577 s) to be equal to or larger than the program's output (0.926295 s), since every load in the while loop should miss in the LLC.

However, the measured elapsed time is about 1.6 times larger than the DRAM stall time.

Why do the two values differ?
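
The estimate in question 2 can be written out as a small sketch. It assumes, as the estimate itself does, that the DRAM bound percentage can be applied directly to the wall-clock elapsed time; whether the metric can be read that way is part of what is being asked. The function names category_seconds and ratio_to_loop are hypothetical.

```c
#include <assert.h>

/* Time attributed to a top-down category, assuming its percentage
 * applies directly to the wall-clock elapsed time. */
double category_seconds(double elapsed_s, double percent)
{
    assert(percent >= 0.0 && percent <= 100.0);
    return elapsed_s * percent / 100.0;
}

/* How many times larger the measured loop time is than the attributed time. */
double ratio_to_loop(double loop_s, double attributed_s)
{
    assert(attributed_s > 0.0);
    return loop_s / attributed_s;
}
```

With the values from this post, category_seconds(1.125, 51.3) gives about 0.577 s, and ratio_to_loop(0.926295, 0.577125) gives about 1.6.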

1 Reply
mayer__max

The only thing I noticed when I looked at your code is that the second one is faster.

But I have no idea how to answer your question.
