
How to configure the Performance Monitoring Units (PMUs)

Xiaoxin_T_
Beginner

Hi,

I recently ran into a problem configuring the PMUs on Xeon Phi.

According to the document "Intel® Xeon Phi™ Coprocessor (codename: Knights Corner) Performance Monitoring Units", configuring the PMUs on Xeon Phi requires a tool that has "Ring 0" access to the kernel. VTune Amplifier is able to use these PMUs. However, because we want fine-grained control over which code is profiled, we would like to collect the PMU data from our own code, which has only "Ring 3" access, instead of using VTune directly.

My question is: are there any APIs provided by MPSS that can do the configuration? Third-party tools like PAPI require patching the uOS kernel. However, we believe that the uOS must already have a module that can configure the PMUs (otherwise VTune could not work properly). How can we communicate with this kernel module so that we can configure the PMUs ourselves?

Thank you,

Xiaoxin

 

2 Replies
Sumedh_N_Intel
Employee

Hi, 

I don't know of any APIs besides PAPI that you can use to configure the PMUs. You can only program the PMU from ring-0 code, but you can read the counter values directly from ring-3 code.

Even with Intel VTune Amplifier XE, there is no API that lets users program the PMU in their own way and read the counter values back directly. Users can use the PMU through the Intel VTune Amplifier XE GUI or command line, but reading the counter values directly on the coprocessor side may be tricky, since Intel VTune Amplifier XE reads and resets the counters during collection.

What exactly do you mean by "fine-grained control"? VTune has a pause/resume API that allows users to control which portions of their code are profiled. Is this the functionality that you are looking for?

 

McCalpinJohn
Honored Contributor III

I program the counters using a slightly modified version of the "wrmsr.c" code from the "msrtools-1.2" package.  (The modification was a simple change to the path for the /dev/msr* files, which are named differently on Xeon Phi than they are with most Linux distributions).   

This must be run by root on most systems to access the /dev/msr* device driver files, or you can mark the "wrmsr" binary as "setuid root", or you can "chmod" the /dev/msr* files to allow group write permission, then "chgrp" the /dev/msr* files to a group that your user account belongs to.

I usually use a script to set up the counters, read them, run the user job, then read the counters again.

If you don't need to change the counters often, you can execute the "wrmsr" program from inside your code using the "system()" call. This allows you to keep your program running with normal user permissions and only elevate to "root" permission when you are actually changing the MSRs.
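
As a rough illustration of the "system()" approach, here is a minimal sketch in C. The helper names (format_wrmsr_cmd, run_wrmsr) are hypothetical, and the "-p <cpu> <reg> <value>" argument order is that of the stock msr-tools "wrmsr"; a locally modified build may differ. The binary path, register number, and value are all supplied by the caller, and the sketch assumes the "wrmsr" binary has been marked "setuid root" as described above.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Build a "wrmsr" command line into buf.
 * Assumption: the argument order "-p <cpu> <reg> <value>" matches the
 * stock msr-tools wrmsr; adjust for a locally modified binary.
 * Returns 0 on success, -1 if the command does not fit in buf. */
static int format_wrmsr_cmd(char *buf, size_t len, const char *wrmsr_path,
                            unsigned cpu, uint32_t reg, uint64_t value)
{
    int n = snprintf(buf, len, "%s -p %u 0x%x 0x%llx",
                     wrmsr_path, cpu, (unsigned)reg,
                     (unsigned long long)value);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Run the command through system(). The calling program keeps its normal
 * user permissions; only the (setuid root) wrmsr binary runs as root.
 * Returns 0 if wrmsr exited successfully, -1 otherwise. */
static int run_wrmsr(const char *wrmsr_path, unsigned cpu,
                     uint32_t reg, uint64_t value)
{
    char cmd[256];
    if (format_wrmsr_cmd(cmd, sizeof cmd, wrmsr_path, cpu, reg, value) != 0)
        return -1;
    return system(cmd) == 0 ? 0 : -1;
}
```

A call such as run_wrmsr("./wrmsr", 0, reg, value) would then program one register on logical CPU 0, with reg and value taken from the PMU documentation for the event of interest.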

When I need to inline the MSR reads and/or writes, I pull in the code from "rdmsr.c" and/or "wrmsr.c" and either run the code from the root account or mark the code as "setuid root" and run it from my user account.
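
For readers who have not looked at "rdmsr.c" and "wrmsr.c", the core of what they do can be sketched roughly as follows: open the MSR device file and use the register number as the file offset for an 8-byte pread() or pwrite(). This is an illustrative sketch, not the msr-tools source. The function names msr_read and msr_write are hypothetical, and the device path is left as a parameter because the /dev/msr* naming on Xeon Phi differs from most Linux distributions.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Read one 64-bit MSR through the msr device driver.
 * dev_path is the per-CPU MSR device file (caller-supplied, since the
 * naming differs between systems); reg is used as the file offset.
 * Returns 0 on success, -1 on any failure. */
static int msr_read(const char *dev_path, uint32_t reg, uint64_t *value)
{
    int fd = open(dev_path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, value, sizeof *value, (off_t)reg);
    close(fd);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}

/* Write one 64-bit MSR through the msr device driver.
 * Needs write permission on dev_path (root, setuid root, or group write).
 * Returns 0 on success, -1 on any failure. */
static int msr_write(const char *dev_path, uint32_t reg, uint64_t value)
{
    int fd = open(dev_path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pwrite(fd, &value, sizeof value, (off_t)reg);
    close(fd);
    return n == (ssize_t)sizeof value ? 0 : -1;
}
```

Opening and closing the device on every access keeps the sketch short; code that touches the MSRs frequently would more likely keep one descriptor open per core.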

You will have to check to see if your kernel is set up to allow execution of the RDPMC instruction in user space.  If this is enabled, then you can read the counters with a simple inline assembly language macro with very low overhead.  If this is not enabled you will have to read the counters using the code from "rdmsr.c" via the /dev/msr* device driver.  This is several orders of magnitude slower.   A simple loadable kernel module can change this bit (CR4.PCE) to allow user-mode access to the RDPMC instruction.   This really should be the default on the system, but it has not been the default on any of the MPSS versions that we have run at TACC.
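
The inline assembly for RDPMC can look roughly like the sketch below (GCC-style syntax; the names read_pmc and combine_halves are hypothetical). RDPMC takes the counter index in ECX and returns the count in EDX:EAX. If CR4.PCE is not set, executing it in user mode faults, so this is only usable once the kernel allows it.

```c
#include <stdint.h>

/* Combine the two 32-bit halves that RDPMC returns in EDX:EAX. */
static inline uint64_t combine_halves(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

#if defined(__x86_64__) || defined(__i386__)
/* Read performance counter `index` with the RDPMC instruction.
 * The index goes in ECX; the result comes back in EDX:EAX.
 * Faults in user mode unless CR4.PCE is set. */
static inline uint64_t read_pmc(uint32_t index)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(index));
    return combine_halves(lo, hi);
}
#endif
```

Typical use is to call read_pmc() immediately before and after the region of interest and subtract the two values.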
