Why does number of cycles change a lot

Zhou__Fang · ‎12-10-2019

Hi all,

I'm writing a simple program and define a for-loop as a request. Then I want to test the latency, number of instructions, number of cycles during a request. The main part of my code is shown below.

int main() {
...
int i = 0, j = 0;
double tsc_freq = 2.2;
while (j < 10000) { // I just collect 10000 requests.
unsigned long start_time = read_tsc();
unsigned long start_ins = read_ins();
unsigned long start_cyc = read_cyc();
// request starts
for (i = 0; i < 250000; i++) {}
// request ends
unsigned long end_time = read_tsc();
unsigned long end_ins = read_ins();
unsigned long end_cyc = read_cyc();
unsigned long req_latency = end_time - start_time;
unsigned long req_ins = end_ins - start_ins;
unsigned long req_cyc = end_cyc - start_cyc;
printf("%f, %ld, %ld\n", req_latency/tsc_freq, req_ins, req_cyc);
...
}
}

Then I get several results whose number of instructions are the same, but the number of cycles and latency are different:
100032.727273 100050 250081
58414.545455 100050 175244

The first column is the latency in the unit of nanosecond, the second column is the number instructions, and the third column is the number of cycles.
I'm very confused why the number of cycles are so different between this two requests. I also checked some information I can get from MSR, like branch mis-prediction or recovery cycles from some states, or even L1, L2, and L3 cache miss. The difference of MSR values is quite small.

It seems the processor use more cycles to finish the same job. It does not make sense to me. Therefore, where does the extra number of cycles come from?

McCalpinJohn · ‎12-10-2019

Your first sample appears to be running at 2.5 GHz and your second sample appears to be running at 3.0 GHz.

With either BIOS or OS control of frequency, the runtime may be short enough to have measurements that are significantly contaminated by the overhead of changing frequency and the unpredictable latency between frequency changes. OS interrupts are also capable of introducing modestly large overheads in intervals of this size.

Zhou__Fang · ‎12-11-2019

Hi John,

Thank you for your reply. You are right. The two examples are not good to introduce my question. Sorry for the confusion.
In my test, most CPU frequency is about 3.0 GHz due to turbo-boost. I list several results with about 3.0 GHz:
90921.818182 100050 272766
82410.000000 100050 247232
74614.545455 100050 223845
66444.545455 100050 199335
58297.272727 100050 174893

I'm still a little confused. You mean the different numbers of cycles come from frequency changing? If so, where can I find some official documents to confirm? For OS interrupts, I didn't show some results here. But I can observe the number of instructions is increased a lot when interrupts happen. So, we can ignore the impact of interrupt for these requests.

McCalpinJohn · ‎12-11-2019

The numbers don't make a lot of sense to me, but it is a bit difficult to tell what is happening without the assembly code.... Your "dummy" loop has 250,000 increments, but only executes 100,000 instructions. It is difficult to guess what is happening because the inner loop does not do anything.

Recommendations:

Include a "real" computation in the inner loop and printing the output of the computation along with the timing results (to prevent the compiler from making unexpected changes to the code).
Store the counter deltas in memory and print them all at the end. Printing every few microseconds introduces too much uncertainty in the processor's instruction stream. I usually reduce the test size so that the timing results will fit in the L1 Data Cache -- up to about 1500 iterations with 24 Bytes of counter data per iteration. (Make sure the array holding the counter data is initialized before use to prevent page faults when you are storing the counter data.)
Generate the assembly code for each test and save it along with the results. (The Intel compiler option "-sox" is very helpful as well -- it includes the full set of compiler options in the output -- executable, assembler, or unlinked object file.)
Include an optional "warmup" before you start recording timings. This should be of the same form as the "real" computation, but without recording the per-iteration counter values. You should record and report the counter values for the "warm-up" as a whole, so you can tune its duration. Depending on BIOS options, it can take up to ~1 second for a processor to reach full frequency.
Make sure that the task is pinned to a single logical processor for each test.
If possible, record the configuration registers for the counters you are using. TSC has no configuration, but the counters for instructions and for cycles can be configured to measure in user space or kernel space or both, and can be configured to count for the owning thread context or for any thread context in the same physical core.

Zhou__Fang · ‎12-11-2019

Hi John,

Thank you for the suggestions.

I double check my code and find my "dummy" loop runs 25,000 times, not 250,000 times. Sorry for the confusion.

I will follow your suggestions and see what happens.

The cache part is interesting for me to think about. Suppose we have 32KB L1D cache, I should test less than 32 KBytes / (3 * 8 bytes) = 1365 to avoid the impact of cache miss. My last question: Is the number of tests/requests too small to be accepted? I mean can we convince others only with thousands tests?

Thanks, again.

McCalpinJohn · ‎12-11-2019

The overhead of storing the results won't be large in this particular case, but for shorter timing intervals it can be important. If the memory for the timing result array has not been instantiated, storing will cause page faults that take ~microseconds each. If the memory for the timing array is instantiated, but not in any cache, storing the results can take anywhere between "not very long" and a few tens of nanoseconds.

Zhou__Fang · ‎12-16-2019

Hi John,

I have modified the code with your suggestions. I still observe the similar problem: requests with the same number of instructions have different number of cycles. The columns are latency, number of instructions, and number of cycles

83330.000000 100064 250061 (around P99.9)
82836.363636 100064 248579 (around P99)
58710.000000 100064 176207 (around P50)
58207.272727 100064 174696 (bottom latency)

You mentioned "the counters for instructions and for cycles can be configured to measure in user space or kernel space or both, and can be configured to count for the owning thread context or for any thread context in the same physical core." Can I find it from the developer's manuals?

Below is part of my code. Should I insert any fence before reading pmc?

Thank you so much. Look forward to your reply.

#define TOTAL_REQ 500000
......
unsigned long res_time[TOTAL_REQ];
unsigned long res_ins[TOTAL_REQ];
unsigned long res_cycle[TOTAL_REQ];

int main() {
        int count = 0;
        int i = 0 , j = 0;
        int nid = 0;
        unsigned long stime, etime, total_time;
        unsigned long sins, eins, scycle, ecycle, sref, eref;

        int total = 25000;
        int local_req_id = 0;

        // prevent from page fault
        for (i = 0; i < TOTAL_REQ; i++) {
                res_time = 0;
                res_ins = 0;
                res_cycle = 0;
        }

        // warm up
        for (i = 0; i < 100000; i++) {
                for (count = 0; count < total; count++) {
                }
        }

        // start to test
        for (i = 0; i < TOTAL_REQ/100; i++) {
                // fit in cache
                for (j = 0; j < 100; j++) {
                        res_time[i*100+j] = 0;
                        res_ins[i*100+j] = 0;
                        res_cycle[i*100+j] = 0;
                }

                for (j = 0; j < 100; j++) {
                        local_req_id = i * 100 + j;
                        sins = rdpmc_ins();
                        scycle = rdpmc_cycle();
                        stime = rdpmc_tsc();

                        for (count = 0; count < total; count++) {
                        }

                        etime = rdpmc_tsc();
                        ecycle = rdpmc_cycle();
                        eins = rdpmc_ins();

                        res_time[local_req_id] = etime - stime;
                        res_ins[local_req_id] = eins - sins;
                        res_cycle[local_req_id] = ecycle - scycle;
                }
        }

        // generate the output
        for (i = 0; i < TOTAL_REQ; i++) {
                printf("%f %ld %ld %ld\n", 1.0 * res_time / 2.2, res_ins, res_cycle, res_ref);
        }
}

McCalpinJohn · ‎12-16-2019

There should not be any need to use FENCE instructions here. I usually use the RDTSCP instruction to get the TSC, and this has a one-sided fence functionality.

The documentation for the performance counter controls is in Chapter 18 of Volume 3 of the Intel Architectures SW Developer's Manual (document 325384-071, October 2019). Section 18.2.2 describes the control register for the fixed-function counters, IA32_FIXED_CTR_CTRL (MSR 0x38d).

I don't know what is underneath your functions "rdpmc_tsc", "rdpmc_cycle", and "rdpmc_ins". The versions that I use are inline assembler and are available at https://github.com/jdmccalpin/low-overhead-timers. If your code is not performing inline assembly calls to the RDPMC instruction, then the counts will depend on what the lower layers of software are doing. The standard "perf events" library on Linux will virtualize the counters, so that you only see explicit counts accumulate while your code is running. This is not a bad thing, but it can hide causes of performance issues. Virtualization also adds a lot of overhead -- the RDPMC instruction only takes 24-48 cycles (depending on processor model), while a system call to get the virtualized counter values can easily cost 100 times as much.

Your instruction counts look like they include only your user code, so you need to check to see if the fixed counters are counting for only user mode (MSR 0x38d = 0x111) or for both user and kernel mode (MSR 0x38d = 0x333).

I am still quite concerned about what the compiler is generating for your code -- the timings don't make sense to me. The fastest case is only executing one instruction every 1.75 cycles, which seems much too slow for a simple spin loop. It should execute a minimum of 1 instruction per cycle.

Zhou__Fang · ‎12-16-2019

Thank you so much, John! I compare my code to the github repository you shared. I think most parts are the same. But I find a bug that I didn't use __attribute__((always_inline)) when I define the functions rdpmc_tsc(), rdpmc_ins(), rdpmc_cycle(). I have added that part in my code now.

Thank you for pointing out. I double checked MSR 0x38d. The value is 0xbb (1011,1011). This means I have enabled the number of instructions and number of actual cycles to record at all ring levels.

Then I re-test the application get the following values, which are similar to previous results:

83683.636364 100074 251119 (around P99.9)
78048.181818 100074 234207 (around P99)
62543.636364 100073 187699 (around P50)
58541.818182 100074 175688 (around the bottom)

I have attached my rdpmc functions and part of assembly code below. If you have time, please help me have a quick check. Thank you so much.

rdpmc functions:

static inline __attribute__((always_inline)) unsigned long rdpmc_tsc(void)
{
        unsigned hi, lo;
        __asm__ __volatile__ ("rdtscp" : "=a"(lo), "=d"(hi));
        return ( (unsigned long)lo)|( ((unsigned long)hi)<<32 );
}

static inline __attribute__((always_inline)) unsigned long rdpmc_ins(void)
{
    unsigned a, d, c;

    c = (1<<30);
    __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));
    return ((unsigned long)a) | (((unsigned long)d) << 32);;
}

static inline __attribute__((always_inline)) unsigned long rdpmc_cycle(void) {
    unsigned a, d, c;

    c = (1<<30)+1;
    __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));
    return ((unsigned long)a) | (((unsigned long)d) << 32);;
}

Assembly code:
   0x0000000000400c72 <+420>:   rdpmc
   0x0000000000400c74 <+422>:   mov    %eax,-0x54(%rbp)
   0x0000000000400c77 <+425>:   mov    %edx,-0x50(%rbp)
   0x0000000000400c7a <+428>:   mov    -0x54(%rbp),%eax
   0x0000000000400c7d <+431>:   mov    -0x50(%rbp),%edx
   0x0000000000400c80 <+434>:   shl    $0x20,%rdx
   0x0000000000400c84 <+438>:   or     %rdx,%rax
   0x0000000000400c87 <+441>:   mov    %rax,-0x28(%rbp)
   0x0000000000400c8b <+445>:   rdtscp
   0x0000000000400c8e <+448>:   mov    %eax,-0x5c(%rbp)
   0x0000000000400c91 <+451>:   mov    %edx,-0x58(%rbp)
   0x0000000000400c94 <+454>:   mov    -0x5c(%rbp),%eax
   0x0000000000400c97 <+457>:   mov    -0x58(%rbp),%edx
   0x0000000000400c9a <+460>:   shl    $0x20,%rdx
   0x0000000000400c9e <+464>:   or     %rdx,%rax
   0x0000000000400ca1 <+467>:   mov    %rax,-0x20(%rbp)
   0x0000000000400ca5 <+471>:   movl   $0x0,-0x88(%rbp)
   0x0000000000400caf <+481>:   jmp    0x400cb8 <main+490>
   0x0000000000400cb1 <+483>:   addl   $0x1,-0x88(%rbp)
   0x0000000000400cb8 <+490>:   mov    -0x88(%rbp),%eax
   0x0000000000400cbe <+496>:   cmp    -0x78(%rbp),%eax
   0x0000000000400cc1 <+499>:   jl     0x400cb1 <main+483>
   0x0000000000400cc3 <+501>:   rdtscp
   0x0000000000400cc6 <+504>:   mov    %eax,-0x64(%rbp)
   0x0000000000400cc9 <+507>:   mov    %edx,-0x34(%rbp)
   0x0000000000400ccc <+510>:   mov    -0x64(%rbp),%eax
   0x0000000000400ccf <+513>:   mov    -0x34(%rbp),%edx
   0x0000000000400cd2 <+516>:   shl    $0x20,%rdx
   0x0000000000400cd6 <+520>:   or     %rdx,%rax
   0x0000000000400cd9 <+523>:   mov    %rax,-0x18(%rbp)
   0x0000000000400cdd <+527>:   movl   $0x40000001,-0x60(%rbp)
   0x0000000000400ce4 <+534>:   mov    -0x60(%rbp),%eax
   0x0000000000400ce7 <+537>:   mov    %eax,%ecx
   0x0000000000400ce9 <+539>:   rdpmc
   0x0000000000400ceb <+541>:   mov    %eax,-0x3c(%rbp)
   0x0000000000400cee <+544>:   mov    %edx,-0x38(%rbp)
   0x0000000000400cf1 <+547>:   mov    -0x3c(%rbp),%eax
   0x0000000000400cf4 <+550>:   mov    -0x38(%rbp),%edx
   0x0000000000400cf7 <+553>:   shl    $0x20,%rdx
   0x0000000000400cfb <+557>:   or     %rdx,%rax

McCalpinJohn · ‎12-17-2019

The setting in MSR 0x38d includes "interrupt on overflow" for Fixed Counter 0 and Fixed Counter 1. If the counters overflow and generate interrupts, it will definitely add microseconds to the execution time for those intervals. It is common for the interrupt handling code to reset the counter value, but that does not appear to be happening here.

The code was clearly compiled without optimization -- the loop at lines 20-23 of the assembly code includes two loads and one store, which explains the high cycle count.... This may not make any difference, but including memory references where they are not absolutely essential provides opportunities for unexpected stalls. For example, the OS could shoot down the TLB entry for the page you are accessing, causing the next iteration to require an unnecessary TLB walk. For this case it should be possible to create a kernel that does not require memory references and which can still be compiled with some level of optimization. An easy approach would be to maintain a global "dummy" counter -- initialize to zero above line 31 of the C code, increment the counter by the value of "stime" inside each iteration of the loop at lines 45-46, then print the value at the end of the program.

Zhou__Fang · ‎12-17-2019

Thank you, John.

Thank you for your explanation for the "interrupt" on MSR 0x38d.

I will add a dummy counter inside the loop and see what happens.

Can I ask a question: why does this can solve TLB miss?

McCalpinJohn · ‎12-19-2019

Adding a dummy counter inside the loop won't solve the TLB miss problem, but it will allow you to compile the code with optimization -- which should eliminate the memory references in the inner loop. I was answering a related question yesterday (https://stackoverflow.com/a/59397480/1264917) and realized that with the default settings, recent Linux kernels will periodically shoot down TLB entries for pages of active user processes so that it can monitor affinity when the address is accessed again.

You check to see if automatic NUMA page migration is enabled with

$ cat /proc/sys/kernel/numa_balancing

and you can disable it with

$ echo 0 > /proc/sys/kernel/numa_balancing

Of course it is better to have the compiler remove all of the memory references from the inner loop. Memory references are not needed for the "spin-wait" activity, so removing them reduces opportunities for confusing behavior.