
Getting the per slice last level cache miss details

Dutta__Sankha
Beginner
418 Views

Hi 

I am new to performance measurement using counters. I am using a 7th-generation Kaby Lake processor and I want to measure the LLC misses per slice for a portion of code, very similar to what is described here. It mentions first setting the MSR to measure LLC_LOOKUP and then reading the counter for each C-Box. So I was wondering if someone could give me an idea of how the MSR should be set, and with what value, to select LLC_LOOKUP and then measure it per slice on this processor. I am trying to use libpfc for this, but the method of gathering the measurement is not very important to me. I am also not sure whether such a counter is available on the processor I am using, so it would be helpful to know if I can use the methodology described in the paper. Any help would be greatly appreciated. Thank you.

9 Replies
Thomas_G_4
New Contributor II

There are different ways to get the counts. The examples are for Intel Skylake; Kaby Lake might differ slightly, although the changes between Skylake and Kaby Lake are minor as far as hardware performance monitoring is concerned.

- Like in the paper:
Get number of LLC slices (Bits 0-3)
# rdmsr 0x396
Stop all counting in all LLC slices
# wrmsr 0xE01 0x0
Configure the LLC_LOOKUP event in all 4 LLC slices of Intel Skylake desktop chips (bit 22 is the enable bit and bit 20 enables overflow forwarding; 0x34 selects LLC_LOOKUP and the umask 0x8f combines 0x80: any request, 0x1: modified state, 0x2: exclusive state, 0x4: shared state, 0x8: invalid state)
# wrmsr 0x700 0x508f34
# wrmsr 0x710 0x508f34
# wrmsr 0x720 0x508f34
# wrmsr 0x730 0x508f34
Now we need to start all counters (bit 29 for unfreeze, 0xf: one enable bit per LLC slice)
# wrmsr 0xE01 0x2000000f

> Run your code

Stop all counting
# wrmsr 0xE01 0x0
Read all counter registers
# rdmsr 0x706
# rdmsr 0x716
# rdmsr 0x726
# rdmsr 0x736
 

You should set the config registers (0x700, 0x710, 0x720, 0x730) to zero when done
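As a rough sketch (not part of the original reply), the values and register numbers used in the commands above can be composed in C as follows. All names are hypothetical, and the layout is only what the commands above imply for Skylake client parts; check it against the SDM for your processor.

```c
#include <stdint.h>

/* Register numbers copied from the commands above; the names are hypothetical. */
enum {
    UNC_PERF_GLOBAL_CTRL = 0xE01, /* global start/stop register */
    UNC_CBO_CONFIG       = 0x396, /* bits 0-3: number of LLC slices */
    CBO_CTL_BASE         = 0x700, /* config register of slice 0 */
    CBO_CTR_BASE         = 0x706, /* counter register of slice 0 */
    CBO_STRIDE           = 0x10   /* address distance between slices */
};

/* Config value: event in bits 0-7, umask in bits 8-15,
   plus bits 20 and 22 as in the value written above. */
static inline uint64_t cbo_event_config(uint8_t event, uint8_t umask)
{
    return (1ULL << 22) | (1ULL << 20) | ((uint64_t)umask << 8) | event;
}

/* Address of the config register of a given slice. */
static inline uint32_t cbo_ctl_msr(unsigned slice)
{
    return CBO_CTL_BASE + CBO_STRIDE * slice;
}

/* Address of the counter register of a given slice. */
static inline uint32_t cbo_ctr_msr(unsigned slice)
{
    return CBO_CTR_BASE + CBO_STRIDE * slice;
}

/* Global control value: bit 29 plus one enable bit per slice. */
static inline uint64_t unc_global_enable(unsigned nslices)
{
    return (1ULL << 29) | ((1ULL << nslices) - 1);
}

/* Slice count field (bits 0-3) of the value read from 0x396. */
static inline unsigned cbo_config_count(uint64_t value)
{
    return (unsigned)(value & 0xF);
}
```

With these helpers, cbo_event_config(0x34, 0x8f) gives the 0x508f34 written to 0x700 through 0x730, and unc_global_enable(4) gives the 0x2000000f written to 0xE01.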

 

- Using perf (/proc/sys/kernel/perf_event_paranoid needs to be 0)
 

perf stat -e uncore_cbox_0/event=0x34,umask=0x8f/,uncore_cbox_1/event=0x34,umask=0x8f/,uncore_cbox_2/event=0x34,umask=0x8f/,uncore_cbox_3/event=0x34,umask=0x8f/ <executable>

The paper does not mention which umask value they are using. Although 0x80 specifies 'any request', you can also try 0xf0 for (read, write, snoop and any).

Dutta__Sankha
Beginner

Hi Thomas

Thank you so much for the comments. I tried the msr tools, and rdmsr 0x396 returns 5. My CPU has 4 cores, so it should have 4 slices. I am not sure whether the MSR for getting the number of slices is different on Kaby Lake, or whether there really are 5 slices, which would be odd. Can you also tell me where I can find the MSR numbers and their details for a particular processor architecture (Kaby Lake in my case)?

Dutta__Sankha
Beginner

There is another issue as well. I am trying to execute rdmsr and wrmsr from inline assembly. My sample code is below:

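#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Read MSR 0x10 (the time-stamp counter) with the RDMSR instruction. */
static inline uint64_t rdmsr(void)
{
	uint32_t low, high;
	asm volatile (
		"rdmsr"
		: "=a"(low), "=d"(high)
		: "c"(0x10)
	);
	return ((uint64_t)high << 32) | low;
}

int main(int argc, char *argv[])
{
	printf("%" PRIu64 "\n", rdmsr());

	return 0;
}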

Running this gives a segmentation fault. I guessed that this could be a privilege-level issue, which is also mentioned in this post. But I tried running it even with root access and it still gave me a segmentation fault. dmesg gave me the following output:

[ 4859.889760] traps: LLC_RE[2224] general protection ip:4005e5 sp:7ffd11f14030 error:0 in LLC_RE[400000+1000]

So even though I am running as root, it still causes a protection error. I was wondering how this can be solved.

 

Thomas_G_4
New Contributor II

Hi,

You can find the register addresses in the Intel SDM, Volume 4, Chapter 2. If your Kaby Lake has 5 CPU cores, you probably also have 5 LLC slices. The Skylake mentioned in my post has 4 cores.

You can use the rdmsr and wrmsr instructions only in ring 0 (kernel) and not in user-space. You have to use either the msr-tools or run as root and use the msr kernel devices (/dev/cpu/*/msr with pread()/pwrite()).

That's why I posted the perf command: you can run it as a regular user. You can also use LIKWID. If you need the counts inside your C program as a regular user, you can use the perf_event_open system call or the LIKWID library.
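A minimal sketch of the perf_event_open route (not from the original reply), assuming the kernel exposes the same uncore_cbox_N PMUs that the perf command uses and that perf_event_paranoid is set as noted for perf. The helper names are hypothetical. The PMU type id is read from sysfs, and because uncore counters are not tied to a task, the event is opened with pid = -1 on one CPU.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a PMU's dynamic type id from a sysfs "type" file; -1 on failure. */
static inline int read_pmu_type(const char *path)
{
    int type = -1;
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}

/* perf config for uncore_cbox: event in bits 0-7, umask in bits 8-15. */
static inline uint64_t cbox_config(uint8_t event, uint8_t umask)
{
    return ((uint64_t)umask << 8) | event;
}

/* Open LLC_LOOKUP (0x34, umask 0x8f) on C-Box `box`, counted on `cpu`.
   Returns a file descriptor, or -1 on failure. */
static inline int open_cbox_llc_lookup(int box, int cpu)
{
    char path[128];
    struct perf_event_attr attr;
    int type;

    snprintf(path, sizeof(path),
             "/sys/bus/event_source/devices/uncore_cbox_%d/type", box);
    type = read_pmu_type(path);
    if (type < 0)
        return -1;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = (uint32_t)type;
    attr.config = cbox_config(0x34, 0x8f);

    /* pid = -1: not bound to a task; group_fd = -1; flags = 0 */
    return (int)syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0UL);
}

/* Read the current 64-bit count from an event fd; 0 on success. */
static inline int read_count(int fd, uint64_t *count)
{
    return read(fd, count, sizeof(*count)) == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

One possible use is to open one descriptor per slice before the code of interest, call read_count() on each before and after it, and take the differences.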

Dutta__Sankha
Beginner

Hi Thomas 

Thanks for the answer. But as I mentioned in my second comment, I have 4 physical cores, and I don't understand why reading 0x396 reports 5 slices. Or has the MSR address changed? I guess I don't really need this inside my code, so I can use the MSR tools. But I still don't understand why there are 5 slices.

 

Thomas_G_4
New Contributor II

The current SDM (January 2019) lists four MSR_UNC_CBO_* sections in Table 2-40 (Uncore PMU MSRs Supported by 6th Generation, 7th Generation, and 8th Generation Intel® Core™ Processors, and Future Intel® Core™ Processors), but MSR_UNC_PERF_GLOBAL_CTRL (0xE01) has bits for LLC slices 0-4, thus 5 slices. So there are some inconsistencies.

The opposite issue exists for Cannon Lake (Table 2-43). There you have 8 MSR_UNC_CBO_* sections, but in MSR_UNC_PERF_GLOBAL_CTRL (0xE01) you can select only 5 LLC slices.

Dutta__Sankha
Beginner

Hi Thomas

I am now using the msr kernel devices (/dev/cpu/*/msr with pread()/pwrite()) to access the MSRs. I took the core part of rdmsr/wrmsr from the MSR tools to read and write them. Thanks again.
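For readers taking the same route, here is a minimal sketch of reading and writing an MSR through the msr device, assuming the msr kernel module is loaded and the program runs as root. msr_read and msr_write are hypothetical helper names, not functions from the MSR tools. The register number is passed as the file offset, and each access transfers 8 bytes.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Read one 64-bit MSR through a device file such as /dev/cpu/0/msr.
   Returns 0 on success, -1 on failure. */
static inline int msr_read(const char *dev, uint32_t reg, uint64_t *value)
{
    ssize_t n;
    int fd = open(dev, O_RDONLY);
    if (fd < 0)
        return -1;
    n = pread(fd, value, sizeof(*value), (off_t)reg);
    close(fd);
    return n == (ssize_t)sizeof(*value) ? 0 : -1;
}

/* Write one 64-bit MSR through the same kind of device file.
   Returns 0 on success, -1 on failure. */
static inline int msr_write(const char *dev, uint32_t reg, uint64_t value)
{
    ssize_t n;
    int fd = open(dev, O_WRONLY);
    if (fd < 0)
        return -1;
    n = pwrite(fd, &value, sizeof(value), (off_t)reg);
    close(fd);
    return n == (ssize_t)sizeof(value) ? 0 : -1;
}
```

For example, msr_read("/dev/cpu/0/msr", 0x396, &v) would read the register that reports the number of slices, and msr_write("/dev/cpu/0/msr", 0xE01, 0x0) would stop all counting.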

 

Dutta__Sankha
Beginner

Hello 

I have a few questions as I work on the problem. I am going through the events (Chapter 19 of Volume 3B) and the MSRs (Chapter 2 of Volume 4). However, I didn't see any documentation of which events are applicable to which MSRs. I believe not all events are applicable to all MSRs, and I was wondering if there is a map from events to their corresponding MSRs.

Also, in the first reply Thomas mentioned that LLC_LOOKUP is event 0x34. However, when I went through the event list for Kaby Lake (i7-7700K), I couldn't find any such event in the Kaby Lake event table. It is listed in the uncore event list (Table 19-12) for Intel 4th-generation CPUs. So I was wondering whether it is applicable to my 7th-generation CPU and, if it is, which other event lists are applicable to my processor generation and how I can find that out.

edit:

After googling, I came across this document describing all the MSRs in detail. It also lists the events in Chapter 3. I was wondering if those events are applicable to my 7th-generation Kaby Lake processor as well. The reason I am asking is that I ran the per-slice LLC miss experiment as suggested in the first reply by Thomas, but the results I have been getting make me doubtful.

 

Thomas_G_4
New Contributor II

Luckily, Intel did a pretty good job with its performance events and you can use most events in any config-counter pair. Moreover, Intel publishes the event lists as JSON documents, too. See https://download.01.org/perfmon/ . You can check mapfile.csv to see which subfolder to select for your architecture. I assume your system has the model id 0x8E or 0x9E, which points to the SKL (Skylake) folder. There you find an Uncore file which contains LLC_LOOKUP (0x34). Each event has a field 'Counter' which defines on which counters you can measure the event. For LLC_LOOKUP it is "0,1", so both available config-counter pairs are suitable.
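To find out which model id a system has, the display family and model can be decoded from the CPUID leaf 1 signature (the EAX value). The sketch below is not from the original reply. The function names are hypothetical, and the "GenuineIntel-6-9E" key format for mapfile.csv is an assumption about how that file is laid out, so compare it with the actual file.

```c
#include <stdint.h>
#include <stdio.h>

/* Display family: the base family, plus the extended family
   field when the base family is 0xF. */
static inline unsigned cpuid_family(uint32_t eax)
{
    unsigned family = (eax >> 8) & 0xF;
    if (family == 0xF)
        family += (eax >> 20) & 0xFF;
    return family;
}

/* Display model: the extended model field forms the high nibble
   when the base family is 0x6 or 0xF. */
static inline unsigned cpuid_model(uint32_t eax)
{
    unsigned base_family = (eax >> 8) & 0xF;
    unsigned model = (eax >> 4) & 0xF;
    if (base_family == 0x6 || base_family == 0xF)
        model |= ((eax >> 16) & 0xF) << 4;
    return model;
}

/* Build a lookup key such as "GenuineIntel-6-9E" (assumed format).
   Returns the number of characters snprintf would write. */
static inline int mapfile_key(uint32_t eax, char *buf, size_t len)
{
    return snprintf(buf, len, "GenuineIntel-%X-%02X",
                    cpuid_family(eax), cpuid_model(eax));
}
```

On GCC or Clang the signature can be obtained with __get_cpuid(1, ...) from <cpuid.h>. As an example value, the signature 0x906E9 decodes to family 6 and model 0x9E, which /proc/cpuinfo would show as model 158.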

There is not much difference between Skylake and Kaby Lake at the performance monitoring level, so the document you found should be valid for Kaby Lake as well.

 
