Trouble using wrmsr and rdmsr to record total LLC misses

Ryan1
Beginner

Hello,

I am trying to use wrmsr and rdmsr to program and read the performance monitoring counters so that I can record the total number of L3 cache misses. According to the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3, section 18.2.1.1, the IA32_PERFEVTSELx MSRs start at 0x186 and map to the IA32_PMCx MSRs, which start at 0x0C1. Further, according to table 18.2.1.2, the LLC Misses UMask is 0x41 and the Event Select is 0x2E. However, despite following this information, I consistently get "0" for LLC misses. I am very new to kernel module development, so please let me know if any of my code is wrong or in bad form.

Side note: I am aware that there are already tools out there that will measure this information for me, but, if possible, I would like to gather the information myself.

Thank you!

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/delay.h>

void enable_l3_cache_miss(void);
unsigned long total_l3_cache_misses(void);

 
int __init PMC_init(void){
	printk(KERN_INFO "Inside the %s function.\n", __FUNCTION__);
	enable_l3_cache_miss();
	return 0;
}
 
void __exit PMC_exit(void){
	unsigned long total_misses = -1;

	printk(KERN_INFO "Inside the %s function.\n", __FUNCTION__);
	total_misses = total_l3_cache_misses();
	printk(KERN_INFO "Total L3 cache miss: %lu\n", total_misses);
	
}

void enable_l3_cache_miss(void){
	int reg_addr = 0x186; 		/* IA32_PERFEVTSELx MSRs start address */
	int event_num = 0x002e; 	/* L3 cache miss event number */
	int umask = 0x4100; 		/* L3 cache miss umask */
	int enable_bits = 0x430000; 	/* Enables user mode, OS mode, counters*/
	int event = enable_bits | umask | event_num;

	__asm__ ("wrmsr" : : "c"(reg_addr), "a"(event), "d"(0x00));
}

unsigned long total_l3_cache_misses(void){
	unsigned long total_misses;
	unsigned long eax_low, edx_high;
	int reg_addr = 0x0C1;		/* IA32_PMCx MSRs start address */

	__asm__("rdmsr" : "=a"(eax_low), "=d"(edx_high) : "c"(reg_addr));
	total_misses = ((long int)eax_low | (long int)edx_high<<32);

	
	return total_misses;
}
 
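module_init(PMC_init);
module_exit(PMC_exit);
MODULE_LICENSE("GPL");	/* declare a license so the module builds and loads without a taint warning */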

 

McCalpinJohn
Honored Contributor III

Welcome to the low-level performance counter masochism society!

Two things that come to mind:

  1. You need to be sure that the counters are enabled globally in IA32_PERF_GLOBAL_CTRL (MSR 0x38f).
  2. You need to explicitly control the processor that executes the WRMSR and RDMSR instructions.    The kernel that I am using (3.10.0-693) uses the functions "wrmsrl_on_cpu()" and "rdmsrl_on_cpu()" (defined in arch/x86/lib/msr-smp.c) to read and write MSRs on a specific logical processor.

If you want an independent way to test the counter programming, you can do it from user space (with root privileges) using the /dev/cpu/*/msr device drivers and the "rdmsr" and "wrmsr" executables from msr-tools 1.3. You still have to manage everything, but the command-line tools make it easy to choose which cores to work with and provide nice options for output formats and bit-field extraction. For example, you can enable this event on all cores and read the values on all cores before and after a test with a few simple shell commands:

wrmsr -a 0x38f 0x70000000f        # write IA32_PERF_GLOBAL_CTRL on all cores to enable the 3 fixed and 4 programmable counters

wrmsr -a 0x38d 0x0333                # write IA32_FIXED_CTR_CTRL on all cores to enable the 3 fixed counters for user+system counting

wrmsr -a 0x186 0x0043412e   # write IA32_PERFEVTSEL0 on all cores to enable the architectural LLC miss event

rdmsr -a -d 0xc1             # read IA32_PMC0 on all cores and print the values in decimal (one line per logical processor)

[execute program that you want to test]

rdmsr -a -d 0xc1             # read IA32_PMC0 on all cores and print the values in decimal (one line per logical processor)
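The three hex constants written above can be derived from the register layouts rather than memorized. The following is a minimal sketch, not code from this thread: the helper names are hypothetical, and it assumes the architectural layout in which IA32_PERFEVTSELx holds the event in bits 7:0, the umask in bits 15:8, and the USR, OS and EN flags in bits 16, 17 and 22.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits in IA32_PERFEVTSELx (architectural performance monitoring). */
#define PERFEVTSEL_USR (1ULL << 16)  /* count while in user mode */
#define PERFEVTSEL_OS  (1ULL << 17)  /* count while in kernel mode */
#define PERFEVTSEL_EN  (1ULL << 22)  /* enable this counter */

/* Hypothetical helper: event-select value counting in user and kernel mode. */
static uint64_t perfevtsel_encode(uint8_t event, uint8_t umask)
{
    return (uint64_t)event | ((uint64_t)umask << 8) |
           PERFEVTSEL_USR | PERFEVTSEL_OS | PERFEVTSEL_EN;
}

/* Hypothetical helper: IA32_PERF_GLOBAL_CTRL keeps the programmable-counter
 * enables in the low bits and the fixed-counter enables from bit 32 up. */
static uint64_t global_ctrl_mask(unsigned n_pmc, unsigned n_fixed)
{
    assert(n_pmc < 32 && n_fixed < 32);
    return ((1ULL << n_pmc) - 1) | (((1ULL << n_fixed) - 1) << 32);
}

/* Hypothetical helper: IA32_FIXED_CTR_CTRL has one 4-bit field per fixed
 * counter; 0x3 in a field enables counting in both kernel and user mode. */
static uint64_t fixed_ctr_ctrl_usr_os(unsigned n_fixed)
{
    uint64_t value = 0;
    assert(n_fixed < 16);
    for (unsigned i = 0; i < n_fixed; i++)
        value |= 0x3ULL << (4 * i);
    return value;
}
```

With these, perfevtsel_encode(0x2e, 0x41) gives 0x43412e, global_ctrl_mask(4, 3) gives 0x70000000f, and fixed_ctr_ctrl_usr_os(3) gives 0x333, matching the values passed to wrmsr above. The counts of 4 programmable and 3 fixed counters are taken from the comments above and may differ on other processors.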

Ryan1
Beginner

Thank you so much! I had no idea that I needed to enable the counters globally. Would you be able to go more in depth about what I have to do with explicitly controlling the processor that executes the WRMSR and RDMSR instructions? I've implemented the enabling of global counters, and now I'm getting it to sometimes spit out numbers other than zeros, but many times it's still just 0.

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>


void enable_l3_cache_miss(void);
void disable_l3_cache_miss(void);
void enable_global_counters(void);
void disable_global_counters(void);
unsigned long total_l3_cache_misses(void);


 
int __init PMC_init(void){
	printk(KERN_INFO "Inside the %s function.\n", __FUNCTION__);
	enable_global_counters();
	enable_l3_cache_miss();
	
	return 0;
}
 
void __exit PMC_exit(void){
	unsigned long total_misses = -1;

	printk(KERN_INFO "Inside the %s function.\n", __FUNCTION__);
	total_misses = total_l3_cache_misses();
	printk(KERN_INFO "Total L3 cache miss: %lu\n", total_misses);

	disable_l3_cache_miss();
	disable_global_counters();
	
}

void enable_l3_cache_miss(void){
	int reg_addr = 0x186; 		/* IA32_PERFEVTSELx MSRs start address */
	int event_num = 0x002e; 	/* L3 cache miss event number */
	int umask = 0x4100; 		/* L3 cache miss umask */
	int enable_bits = 0x430000; 	/* Enables user mode, OS mode, counters*/
	int event = enable_bits | umask | event_num;

	__asm__ ("wrmsr" : : "c"(reg_addr), "a"(event), "d"(0x00));
}

unsigned long total_l3_cache_misses(void){ 
	unsigned long total_misses;
	unsigned long eax_low, edx_high;
	int reg_addr = 0x0C1;		/* IA32_PMCx MSRs start address */

	__asm__("rdmsr" : "=a"(eax_low), "=d"(edx_high) : "c"(reg_addr));
	total_misses = ((long int)eax_low | (long int)edx_high<<32);


	
	return total_misses;
}

void enable_global_counters(void){
	int reg_addr = 0x38f;				/* IA32_PERF_GLOBAL_CTRL address */
	unsigned long enable_bits = 0x70000000f;	/* enables the 3 fixed and 4 programmable counters */

	/* WRMSR takes the value as EDX:EAX, so the upper 32 bits must be passed in EDX */
	__asm__("wrmsr" : : "c"(reg_addr), "a"((unsigned int)enable_bits), "d"((unsigned int)(enable_bits >> 32)));
}

void disable_l3_cache_miss(void){
	int reg_addr_PEREVTSEL = 0x186;						/* IA32_PERFEVTSELx MSRs start address */
	int reg_addr_PMCx = 0x0C1;						/* IA32_PMCx MSRs start address */
	
	__asm__("wrmsr" : : "c"(reg_addr_PEREVTSEL), "a"(0x00), "d"(0x00));	/* Clears  IA32_PERFEVTSELx MSRs */
	__asm__("wrmsr" : : "c"(reg_addr_PMCx), "a"(0x00), "d"(0x00));		/* Clears counter */
}

void disable_global_counters(void){
	int reg_addr = 0x38f;						/*  IA32_PERF_GLOBAL_CTRL start address */		
	
	__asm__("wrmsr" : : "c"(reg_addr), "a"(0x00), "d"(0x00));	/* Clears IA32_PERF_GLOBAL_CTRL */
}
 
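module_init(PMC_init);
module_exit(PMC_exit);
MODULE_LICENSE("GPL");	/* declare a license so the module builds and loads without a taint warning */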

 

Output: 

[  431.744257] Inside the PMC_init function.
[  451.192163] Inside the PMC_exit function.
[  451.192170] Total L3 cache miss: 0
[ 2082.502522] Inside the PMC_init function.
[ 2094.454753] Inside the PMC_exit function.
[ 2094.454759] Total L3 cache miss: 0
[ 2295.911107] Inside the PMC_init function.
[ 2328.540815] Inside the PMC_exit function.
[ 2328.540822] Total L3 cache miss: 0
[ 3421.700951] Inside the PMC_init function.
[ 3431.077258] Inside the PMC_exit function.
[ 3431.077265] Total L3 cache miss: 0
[ 4093.143332] Inside the PMC_init function.
[ 4100.874691] Inside the PMC_exit function.
[ 4100.874697] Total L3 cache miss: 3048995487
[ 4176.410929] Inside the PMC_init function.
[ 4181.621208] Inside the PMC_exit function.
[ 4181.621216] Total L3 cache miss: 0
[ 4218.784471] Inside the PMC_init function.
[ 4227.928742] Inside the PMC_exit function.
[ 4227.928751] Total L3 cache miss: 0
[ 4244.728500] Inside the PMC_init function.
[ 4248.639759] Inside the PMC_exit function.
[ 4248.639767] Total L3 cache miss: 3064212811
[ 4266.180755] Inside the PMC_init function.
[ 4267.922943] Inside the PMC_exit function.
[ 4267.922949] Total L3 cache miss: 0
[ 4285.079945] Inside the PMC_init function.
[ 4286.702810] Inside the PMC_exit function.
[ 4286.702816] Total L3 cache miss: 290335770

 

McCalpinJohn
Honored Contributor III

The kernel code that implements the /dev/cpu/*/msr device drivers is probably the best place to look for examples of how to control which processor executes the RDMSR and WRMSR instructions. In the kernel source, the file is arch/x86/kernel/msr.c. This interface uses the "minor device number" of the device driver to control which core to run on. You will have to decide what core(s) you want to use for these core performance counters. The file "kernel/smp.c" includes lots of functions for running other functions on various cores, including "smp_call_function_single()", "smp_call_function_any()", "smp_call_function_many()", and "smp_call_function()".

The "uncore" performance counters are a little different -- the ones that are accessed by MSRs are not core-specific, they are chip-specific.  So for those counters you can use any core on the target chip to perform the reads.   In a multi-chip system, this still means that you have to specify where to run it (though you could use "smp_call_function_any()" with a mask that allows the function to be run on any core of the target chip).  
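Seen from user space, the minor-number scheme described above shows up as one device file per logical processor. The following is a minimal sketch of reading an MSR on a chosen CPU through that interface. It assumes the msr driver is loaded and the caller has root privileges, and the function name is hypothetical. It is the user-space counterpart of the per-CPU access, not the kernel-module approach itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one MSR on one logical processor through /dev/cpu/<cpu>/msr.
 * Returns 0 on success, -1 on failure (driver not loaded, no permission,
 * nonexistent CPU, or an MSR that the hardware or hypervisor rejects). */
static int msr_read_on_cpu(unsigned cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%u/msr", cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The file offset selects the MSR address; each access is 8 bytes. */
    ssize_t n = pread(fd, value, sizeof(*value), (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof(*value) ? 0 : -1;
}
```

Calling msr_read_on_cpu(0, 0xc1, &v) would read IA32_PMC0 on logical processor 0, which is what "rdmsr -p 0 0xc1" does from the shell.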

Ryan1
Beginner

Thanks so much. I finally got it working after looking at msr.c (I had a silly mistake). Are the cores just numbered 0, 1, 2, 3, 4, etc.? To test, I just used 0. What would I use the functions in smp.c for? Would that be for initializing the counters on more than one core? Is that any different from just looping through and calling wrmsr_on_cpu for each individual core?

Also, I found that running the program on a virtual machine (VMware), I kept on getting zeros, but when I tested it on my lab computer, I began to get actual results. Is that just because the virtual machine does not virtualize those specific registers? Or is there something I'm missing?

Thank you again!

McCalpinJohn
Honored Contributor III

The core performance counters are thread-private, so programming the counters only affects the core that you use for the WRMSR commands, and reading the counters only gives you the value for whatever core runs the RDMSR command. Any counter that you have programmed will only increment due to activity on that specific logical processor. In user space, the kernel can move processes from core to core at any time. It is not at all clear to me which core your kernel extension will run on, or whether it might get migrated from one core to another during its execution.

In user space, the most common tools used to control where a process runs are the command-line tools "taskset" and "numactl", or run-time libraries that use the "sched_setaffinity()" interface.
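As a minimal sketch of the "sched_setaffinity()" route, the function below restricts the calling thread to one logical processor so that the counters it reads belong to the CPU it runs on. The function name is hypothetical, and the code assumes Linux with glibc.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Restrict the calling thread to a single logical processor.
 * Returns 0 on success, -1 on error (errno is set by sched_setaffinity). */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

The command-line equivalent for a whole program is "taskset -c 0 ./program", which applies the same kind of mask before the program starts.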

The Linux kernel performance monitoring infrastructure (usually referred to as "perf events") typically virtualizes the counters, saving and restoring the control registers and counts on context switches and process migrations.  This extra effort (and overhead) is necessary if you don't control where the process runs.
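For comparison, here is a minimal sketch of counting cache misses for the calling thread through perf events instead of raw MSR writes. The function names are hypothetical. It uses the generic PERF_COUNT_HW_CACHE_MISSES event, which, as far as I know, the kernel maps to the architectural LLC miss event on Intel cores. Counting kernel-mode activity may require elevated privileges, depending on the system's perf_event_paranoid setting.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Describe a counter for the generic "cache misses" hardware event. */
static struct perf_event_attr llc_miss_attr(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;    /* start stopped; enabled explicitly below */
    attr.exclude_hv = 1;  /* do not count hypervisor activity */
    return attr;
}

/* Count cache misses caused by work() on the calling thread.
 * Returns 0 on success, -1 if the counter cannot be opened or read. */
static int llc_miss_count(void (*work)(void), uint64_t *count)
{
    struct perf_event_attr attr = llc_miss_attr();
    /* glibc has no wrapper for perf_event_open, so call it via syscall():
     * pid 0 = this thread, cpu -1 = any CPU, group_fd -1 = no group. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    ssize_t n = read(fd, count, sizeof(*count));
    close(fd);
    return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

Because the kernel saves and restores the counter as the thread is scheduled, the result follows the thread across cores, which the raw MSR approach cannot do without pinning.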

Virtual machines typically must intercept low-level hardware accesses like RDMSR and WRMSR and decide what to do with them. This is discussed in Chapter 31 of Volume 3 of the Intel 64 and IA-32 Architectures Software Developer's Manual (document 325384).
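One way to check whether a guest (or any machine) exposes the architectural counters at all is CPUID leaf 0x0A, which reports the performance monitoring version and counter layout. The following is a minimal sketch under that assumption. The struct and function names are hypothetical, and a version of 0 is taken to mean that no architectural PMU is visible, which would be consistent with counters that always read zero.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */
#endif

struct pmu_info {
    unsigned version;        /* 0 means no architectural PMU is exposed */
    unsigned num_counters;   /* programmable counters per logical processor */
    unsigned counter_width;  /* bits per programmable counter */
    int      llc_miss_event; /* 1 if the architectural LLC miss event is reported */
};

/* Decode the EAX and EBX values returned by CPUID leaf 0x0A. */
static struct pmu_info decode_cpuid_0a(uint32_t eax, uint32_t ebx)
{
    struct pmu_info info;
    unsigned ebx_len = (eax >> 24) & 0xff;  /* number of valid EBX bits */

    info.version       = eax & 0xff;
    info.num_counters  = (eax >> 8) & 0xff;
    info.counter_width = (eax >> 16) & 0xff;
    /* EBX bit 4 set means the LLC miss event is NOT available. */
    info.llc_miss_event = info.version > 0 && ebx_len > 4 && !(ebx & (1u << 4));
    return info;
}

/* Query the running processor; reports version 0 if the leaf is missing. */
static struct pmu_info query_pmu(void)
{
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
#if defined(__x86_64__) || defined(__i386__)
    if (!__get_cpuid(0x0a, &eax, &ebx, &ecx, &edx))
        eax = ebx = 0;
#endif
    (void)ecx;
    (void)edx;
    return decode_cpuid_0a(eax, ebx);
}
```

Running query_pmu() inside the virtual machine and on the bare-metal host and comparing the version and num_counters fields would show whether the hypervisor is hiding the counters.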

 
