OK, as an update: I have downloaded the PCM code and compared it with how I implement my performance monitoring. There were some slight differences, which I changed to match PCM. I also modified my code so that my rdmsr macro performs only a single rdmsr assembly instruction. This lets me perform two back-to-back reads of the performance counter 0 MSR and detect no L3 cache miss. Again, I am using the 0F.20 performance event. However, every instruction that I execute between the two reads causes an L3 cache miss. The performance event should only count L3 misses on retired loads, so why does every instruction cause an L3 cache miss? Even when I run a large loop I still get cache misses. The only instructions that do not cause cache misses are rdmsr, wrmsr, and rdpmc.
In the VMM I have verified that the cache is enabled, paging is enabled, the PAT indicates WB memory, and the MTRRs indicate WB memory for all user space. My performance counting code is as follows:
(In the VMM before VMLAUNCH)
#define wrmsrl(MSR, val) \
do { \
unsigned long eax, edx; \
eax = (u32)(0x00000000FFFFFFFF & (val)); \
edx = (u32)((u64)(val) >> 32); \
__asm__ __volatile__ ("wrmsr" : : "c" (MSR), "a" (eax), "d" (edx)); \
} while (0)
#define rdmsr(MSR, eax, edx) \
__asm__ __volatile__ ("rdmsr" : "=a" (eax), "=d" (edx) : "c" (MSR))
#define IA32_PERF_GLOBAL_CTRL 0x38F
#define PMC0_EN 1UL
#define IA32_PERFEVTSEL0 0x186
#define PMC_UMASK 8
#define PMC_EN (1UL<<22)
#define PMC_USR (1UL <<16)
#define PMC_OS (1UL <<17)
#define IA32_PMC0 0xC1
#define MEM_LOAD_RETIRED_L3_MISS_EVENT 0x0F
#define MEM_LOAD_RETIRED_L3_MISS_UMASK 0x20
unsigned long MSR_val1, MSR_val2;
//disable counters while programming
wrmsrl(IA32_PERF_GLOBAL_CTRL, (u64)0);
//setup the performance event selector for performance counter 0 to count the number of retired loads that miss l3
MSR_val1 = MEM_LOAD_RETIRED_L3_MISS_EVENT | (MEM_LOAD_RETIRED_L3_MISS_UMASK << PMC_UMASK) | PMC_EN | PMC_USR | PMC_OS;
wrmsrl(IA32_PMC0, (u64)0);
wrmsrl(IA32_PERFEVTSEL0, MSR_val1);
//enable the performance counter 0
MSR_val1 = PMC0_EN;
wrmsrl(IA32_PERF_GLOBAL_CTRL, MSR_val1);
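As a cross-check on the selector value built above, the field layout can be written as a small encoder. This is only a sketch: `perfevtsel_encode` is a hypothetical helper name (it is not taken from PCM), and it covers only the event-select, unit-mask, USR, OS, and EN fields that the code above uses.

```c
#include <stdint.h>

/* Hypothetical helper, not from PCM: packs the IA32_PERFEVTSELx fields used
 * in this post. Bit positions: event select 7:0, unit mask 15:8,
 * USR bit 16, OS bit 17, EN bit 22. */
static uint64_t perfevtsel_encode(uint8_t event, uint8_t umask,
                                  int usr, int os, int en)
{
    uint64_t v = 0;

    v |= (uint64_t)event;        /* event select */
    v |= (uint64_t)umask << 8;   /* unit mask */
    if (usr)
        v |= 1ULL << 16;         /* count while CPL > 0 */
    if (os)
        v |= 1ULL << 17;         /* count while CPL == 0 */
    if (en)
        v |= 1ULL << 22;         /* enable the counter */
    return v;
}
```

With event 0x0F, unit mask 0x20, and all three flags set, this gives 0x43200F, which is the value the code above writes to IA32_PERFEVTSEL0.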
(In the VM after VMLAUNCH)
#define READ_SIZE 0x400000
unsigned long cur_eax, cur_edx, pre_eax, pre_edx;
unsigned long MSR_val1;
int n = 0;
int i = 0;
unsigned long temp = 0;
unsigned long *pt;
//memory region to read
pt = (unsigned long *) 0x26100000;
//These two back-to-back rdmsr reads indicate that no cache miss has occurred.
rdmsr(IA32_PMC0, pre_eax, pre_edx);
rdmsr(IA32_PMC0, cur_eax, cur_edx);
//These two rdmsr reads indicate that one cache miss has occurred. Should this be the case? Will each instruction between the reads cause an L3 cache miss?
rdmsr(IA32_PMC0, pre_eax, pre_edx);
n++;
rdmsr(IA32_PMC0, cur_eax, cur_edx);
//These two rdmsr reads indicate that 25165825 cache misses have occurred. But I am only reading half the amount of memory available in the L3 cache. I should not be seeing so many cache misses. It seems like no caching is done.
for (n = 0; n < 10; n++)
{
    rdmsr(IA32_PMC0, pre_eax, pre_edx);
    for (i = 0; i < READ_SIZE; i++)
    {
        temp += pt[i];
    }
    rdmsr(IA32_PMC0, cur_eax, cur_edx);
    printf("Cache misses: %llu\n", (unsigned long long)((cur_eax | (u64)cur_edx<<32) - (pre_eax | (u64)pre_edx<<32)));
}
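The subtraction in the printf above can be pulled out into two small helpers, which also makes it easier to handle a counter that wraps between the two reads. This is a sketch under assumptions: `msr_combine` and `pmc_delta` are hypothetical names, and the counter width passed in (48 bits is common for general-purpose counters) would need to be confirmed for the actual CPU.

```c
#include <stdint.h>

/* Combine the EDX:EAX halves returned by rdmsr into one 64-bit value. */
static uint64_t msr_combine(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Difference between two counter samples, tolerating a single wraparound.
 * ASSUMPTION: width_bits is the implemented counter width; the caller has to
 * supply it (for example 48), since this sketch does not query the CPU. */
static uint64_t pmc_delta(uint64_t before, uint64_t after, unsigned width_bits)
{
    uint64_t mask = (width_bits >= 64) ? ~0ULL : ((1ULL << width_bits) - 1);

    return (after - before) & mask;
}
```

The result is a 64-bit unsigned value, so it should be printed with `%llu` (after a cast to `unsigned long long`) or `PRIu64`; `%d` would truncate it.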
When I set the CD bit in the CR0 register in the VMM I get the same results for the loop but larger cache miss values for back to back reads.
Please, any suggestions would be helpful.
Is this still an issue?
If I'm understanding you correctly, you are saying that the above loop gives you:
L3_misses = number_of_loops * READ_SIZE = 10 * 0x400000 = 41,943,040 misses
or
10 occurrences of 4,194,304 misses
Is this correct?
I would expect each of the READ_SIZE loops to fetch about 33.5 MB of memory (0x400000 reads of 8 bytes each, assuming an 8-byte unsigned long) and generate 33,554,432 / 64 = 524,288 misses, one per 64-byte cache line.
And if you reduce READ_SIZE to something that fits into, say, half of your L3, then I'd expect your count to go to zero.
But doing things inside a VM throws a whole new wrinkle into the mix.
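The per-loop estimate above is just the number of distinct cache lines the loop touches. A minimal sketch of that arithmetic follows; `lines_touched` is a hypothetical name, and the 64-byte line size, 8-byte element size, and line-aligned start address are assumptions rather than values read from the hardware.

```c
#include <stdint.h>

/* Number of distinct cache lines touched by one sequential pass over
 * n_elems elements of elem_size bytes.
 * ASSUMPTIONS: the buffer starts on a line boundary and line_size is
 * nonzero (64 bytes on the CPUs discussed here). */
static uint64_t lines_touched(uint64_t n_elems, uint64_t elem_size,
                              uint64_t line_size)
{
    uint64_t bytes = n_elems * elem_size;

    /* Round up so a partially used last line still counts. */
    return (bytes + line_size - 1) / line_size;
}
```

For READ_SIZE = 0x400000 and 8-byte elements, `lines_touched(0x400000, 8, 64)` is 524,288, which is an upper bound on compulsory misses per pass if caching works as expected. A count several times larger than the number of loop iterations suggests the event is not counting what it was meant to count.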
You could (sort of) easily check your loop on a standard Linux system. You might have to use the /dev/cpu/*/msr interface for rdmsr/wrmsr. That would tell you whether the issue is with the test program or with how VMs handle counters (or memory accesses).
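For reference, a read through the Linux msr driver looks roughly like the sketch below. The driver exposes each MSR as 8 bytes at a file offset equal to the MSR number; it needs the msr kernel module loaded and root privileges. `read_msr_file` is a hypothetical helper name, and the device path is a parameter so you can pick the CPU (for example "/dev/cpu/0/msr").

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: read one 64-bit MSR via the Linux msr device file.
 * path is e.g. "/dev/cpu/0/msr"; msr is the register number, e.g. 0xC1 for
 * IA32_PMC0. Returns 0 on success, -1 on any failure. */
static int read_msr_file(const char *path, uint32_t msr, uint64_t *out)
{
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    /* The file offset selects the MSR; each read returns 8 bytes. */
    n = pread(fd, out, sizeof *out, (off_t)msr);
    close(fd);
    return (n == (ssize_t)sizeof *out) ? 0 : -1;
}
```

Reading IA32_PMC0 (0xC1) before and after the test loop on bare-metal Linux, with the same event programmed, would give a baseline to compare against the numbers seen inside the VM. Keep the test thread pinned to the CPU whose msr file you read.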
Pat
