You said your kprobe handler

Zhou__Fang · ‎02-28-2019

I'm using rdmsr to read cache reference and cache miss when the benchmark is running.

However, I found a weird phenomenon that the cache ref and cache miss number are smaller when the timestamp is larger.

For example,

TSC1: 120245427914623, #CacheRef: 822658374

TSC2: 120245427921009, #CacheRef: 822656949

Thank you for any help in advance.

HadiBrais · ‎03-01-2019

Are you taking these samples on the same logical core? Are you taking these samples immediately one after the other? In that case, what is the sequence of assembly instruction being used to read the counters?

Zhou__Fang · ‎03-01-2019

Hi Hadi,

Thank you for your reply.

I use rdtsc() to read time stamp counter and rdmsr() to read corresponding msr. Both operations are done in a kernel module.

The code is shown below:
static __inline__ unsigned long long tsc_now(void)
{
unsigned hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

Use rdmsrl_safe(0xc1, &tmp_cref) to read MSR value

Is there anything wrong? Do you think instruction reorder happens?

Thanks.

HadiBrais · ‎03-01-2019

Yes I think instruction reordering may possibly explain the apparent inconsistency, but it's not clear to me how that can happen in your particular case because you have not shown all the relevant pieces of code. The RDTSC and RDMSR instructions have no serialization properties. But I'm not sure yet because I need to see the code that takes the two consecutive samples, i.e., the code that calls tsc_now and rdmsrl_safe to capture the two samples. Preferably both the minimal C/C++ code and the corresponding assembly code generated by your compiler. Also how frequently does the inconsistency occur across many runs?

Zhou__Fang · ‎03-01-2019

Thank you, Hadi.

One event is recorded in kprobe function I registered in advance. The other event is recorded in user application including issue operation on a module to read MSR values. I try to compare the values between these two events to see what happens in the system, like how many instructions have run, how many cache references and misses happen during this period, and so on.

The code I used in kprobe is like this:

kprobe_function(...) {
......
tmp_ins = rdpmc_ins();
rdmsrl_safe(0xc1, &tmp_cref);
rdmsrl_safe(0xc2, &tmp_cmiss);
rdmsrl_safe(0xc3, &tmp_l2ref);
rdmsrl_safe(0xc4, &tmp_l2miss);

fresult.perf_ins = tmp_ins;
fresult.perf_cref = tmp_cref;
fresult.perf_cmiss = tmp_cmiss;
fresult.perf_l2ref = tmp_l2ref;
fresult.perf_l2miss = tmp_l2miss;

fresult.ts = tsc_now();

......
}

User application use similar code:
record_event(......) {
ioctl(fd_lperf, IOCTL_PERF, perf_res); // I use a kernel module to get MSR values.
ur.ins = perf_res[0];
ur.cref = perf_res[1];
ur.cmiss = perf_res[2];
ur.l2ref = perf_res[3];
ur.l2miss = perf_res[4];

ur.ts = tsc_now();
}

The kernel module code is here:
......
case IOCTL_PERF:
perf_res[0] = rdpmc_ins();
rdmsrl_safe(0xc1, &perf_res[1]);
rdmsrl_safe(0xc2, &perf_res[2]);
rdmsrl_safe(0xc3, &perf_res[3]);
rdmsrl_safe(0xc4, &perf_res[4]);
copy_to_user((void*)arg, perf_res, 5 * sizeof(long));
......

Is there anything wrong?

Below are some inconsistency results I got from tests:

Case 1:
#Core TSC #Instructions #CacheRef #CacheMiss
24 120245427914623 140741506190124 140737535585929 140737489655453
24 120245427921009 140741506076483 140737535584483 140737489654399

Case 2:
#Core TSC #Instructions #CacheRef #CacheMiss
29 120255606620839 140738542376673 140737499721683 140737488750264
29 120255606621427 140738542298976 140737499721203 140737488750140

Is there anything wrong?

Looking forward to your reply. Thanks.

HadiBrais · ‎03-01-2019

Based on the information you've provided so far, there are a few potential issues that I can think of:

Suppose that the user application has called ioctl and you kernel module has read some or all of the performance counters. At this point, the tsc_now() call from the user application has not been executed yet. If your kprobe handler got invoked at this point then it will read larger counts for the performance events and it will read the current (smaller) TSC value. Later the user application will read the TSC value, which will be larger than the one read from kprobe, resulting in an inconsistency like the one you're seeing.
Optimized kprobes can be interrupted (but cannot be preempted). I don't know whether kprobe optimization is enabled in your kernel build or if it can be applied to your kprobe. But if your kprobe handler got interrupted and somehow that interrupt handler ended up invoking the user application where the performance counters are read, then an inconsistency may occur.
Is it guaranteed that the kprobe handler will be executed on the same logical core as the user application at the time the performance counters are read? If the kprobe handler got invoked on a different logical core, different counters will be read which may not be related or comparable to the counts obtained by the user application on different core. You have not specified exactly which PMU events Instructions, CacheRef, and CacheMiss represent.

I think you can easily check whether one or more of these issues may occur in your code.

Zhou__Fang · ‎03-02-2019

Thank you, Hadi.

The first event is recorded in a kprobe function in exit_to_usermode_loop(), which is the last function from kernel back to user space.
The second event is recorded in user application.

So, I think may not be the first assumption.

I don't quite understand the second possibility. Even an interrupt happens during kprobe context, later when hardirq and softirq finishes, it should return to kprobe context, right?

For third one. When recording events, I also record the core number. The number is not changed.

I will run a small test to keep reading LLC miss value (setting 0x0043412e to MSR 0x186) to make sure the value in register keeps increasing.

Have you ever met with this problem before? Can we trust the register value?

HadiBrais · ‎03-02-2019

The user application captures the event counter values using the following code:

ioctl(...)
.
.
.
ur.ts = tsc_now();

Note how there is no guarantee that the instructions that read counters in the kernel module will be executed *atomically* as a group. In addition, there is no guarantee that the TSC counter will be read after returning from ioctl in an atomic fashion as a single group with all of the other counters.

One thing you can try is to reorder the call to tsc_now() with the calls to rdmsrl_safe and see whether inconsistencies still occur. For example, you can move rtsc_now() from the user application to the kernel module either after, before, or somewhere in the middle with respect to the calls to rdmsrl_safe and see what happens.

Regarding the second point from my previous comment, an optimized krpobe may be interrupted. For more information, refer to https://www.kernel.org/doc/Documentation/kprobes.txt.

I think the value in any counter for any event can only get larger, unless the counter overflows back to zero because it has a limited size.

Zhou__Fang · ‎03-02-2019

Thank you, Hadi.

I can understand there's no guarantee that read TSC happens after rdmsr in kernel module. But the gap between two events is really large, if tsc value is right. I still do not understand why later rdmsr's value is smaller than the previous rdmsr.

You're right. I will move tsc_now() into the kernel module and modify the order with other rdmsrl_safe to see what will happen.

I also write a simple application keep calling ioctl to read MSR values. I didn't find any inconsistency case in a 30-minute test.

I'll report here once I have any findings about this weird phenomena.

Thank you so much.

HadiBrais · ‎03-02-2019

You said your kprobe handler is executed on exit_to_usermode_loop(). It seems to me that this is what is happening:

ioctl is called from the user application to read the performance counters.
rdmsrl_safe is called 4 times in the kernel module to read the specified performance counters. At this point, the user application has not called tsc_now yet.
ioctl is about return to use mode and so exit_to_usermode_loop is called.
Your kprobe handler gets executed. It reads the same performance counters and the values obtained will be larger than the ones from step 2. In addition, tsc_now is called to read the current TSC value. This is what you're calling the first sample.
The kernel returns to the user application where it calls tsc_now. This is the second sample. Note how the TSC value of the second sample is larger than that of the first.

This perfectly explains the inconsistency between the first sample and the second sample in:

#Core TSC #Instructions #CacheRef #CacheMiss
24 120245427914623 140741506190124 140737535585929 140737489655453
24 120245427921009 140741506076483 140737535584483 140737489654399

The Instructions, CacheRef, and CacheMiss values of the first sample and second sample are read in step 4 and step 2, respectively. The values read in step 4 are larger than those read in step 2. However, TSC of the first sample is read in step 4 but TSC of the second sample is read in step 5 and therefore it is indeed larger.

Zhou__Fang · ‎03-03-2019

@Hadi, ioctl() does not invoke exit_to_usermode_loop() function in my operating system. exit_to_usermode_loop() is a function called when a softirq finishes.

The first event is recorded in kprobe function, which is in kernel context.
The second event is recorded in an application with calling ioctl() to get MSR values.

If this is caused by instruction reordering, do you know which function we can try? I have tried to add smp_rmb() before reading MSRs, but no effects.

Thank you.

Travis_D_ · ‎03-04-2019

Zhou, Fang wrote:
However, I found a weird phenomenon that the cache ref and cache miss number are smaller when the timestamp is larger.

You should indicate whether this phenomenon is consistent: i.e., are all your readings like that? Most, or at least many? Or 1 out 1,000? 1 in a million? Etc.

There are plenty of reasons that might happen now and then, but it's all the time it's more interesting.

Why are the results of rdmsr not consistency