Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Cache Identifier

robchip
Beginner
1,140 Views
Hi everyone!

I'd like to find out which cores share a particular cache. With the 'cpuid' command I found lots of useful information, but I still need some sort of unique cache identifier to really determine which cores use which cache.

Does anyone know how to get this information? Or is there another way to get the information?

Thanks in advance!
Robert
0 Kudos
15 Replies
Roman_D_Intel
Employee
1,140 Views
Robert,

Did you try the cache topology enumeration algorithm/utility described in the article "Intel 64 Architecture Processor Topology Enumeration"?

Roman
0 Kudos
robchip
Beginner
1,140 Views
Hey Roman

Yes, I looked at it quickly, but I stumbled over these 'affinity masks'. As far as I understand, this information comes from the operating system, right? (Or can it be obtained from the hardware?)

The problem is that I cannot rely on an operating system to do the work, since this code is for an operating system :)
0 Kudos
Patrick_F_Intel1
Employee
1,140 Views
The affinity masks are only used to get the cpuid information from each cpu.
We have to read the cpuid info from all the cpus.
What are you trying to accomplish?
An OS-independent way of figuring out cache sharing?
You can't really get a method that doesn't use some aspect of an OS.
Or do you want a way that works on multiple OSes?
Pat
0 Kudos
robchip
Beginner
1,140 Views
Yes, it should be an OS-independent way of figuring out cache sharing, because this code is the part of the operating system that gathers this information...

So there is no way to get this information from the hardware directly? ... Then I guess I have to come up with another technique to determine which caches belong to which core ...





0 Kudos
Patrick_F_Intel1
Employee
1,140 Views

I would say it like this:
I don't see how you can get the cpuid information from all the cpus without using some facility of the OS to switch your software from one cpu to another cpu.
Certainly you can get the cpuid info from the cpu you are currently running on without the OS, but you need the cpuid info from ALL the cpus.

Most of the enumeration library is OS independent (it is shared between Windows and Linux), and the OS-specific code is in util_os.c.
The library code is not "part of the operating system", but util_os.c does call OS routines to move the thread from one cpu to the next.
Hope this helps,
Pat

0 Kudos
Patrick_F_Intel1
Employee
1,140 Views
Also, on Windows, system routines like GetLogicalProcessorInformationEx() will detail which cpus share a cache. See http://msdn.microsoft.com/en-us/library/windows/desktop/dd405488%28v=vs.85%29.aspx
0 Kudos
robchip
Beginner
1,140 Views
Yes the switching is necessary and this is already done!

Edit:
Thanks, then I'll have a closer look at the enumeration algorithm since it can be executed independently!
0 Kudos
robchip
Beginner
1,140 Views
Hi again

I looked at the topology enumeration algorithm provided by Intel. I think I understood the basic concept, but there are still some things that work incorrectly.
For example I wrote some lines to gather the information about a cache at a specific level (see below).
The log_roundToNearestPof2 performs the same operation as described in the documentation (and cpuid just calls CPUID and stores the values of all registers in the parameters).
This piece of code is then executed on all levels (subLevelIndex) and on all processors.
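For readers without the helper at hand, here is a minimal sketch of what a function like log_roundToNearestPof2 might compute. It assumes the helper rounds its argument up to the next power of two and returns the exponent, which is how the mask width is usually derived in the topology enumeration article. The name `mask_width` is made up for this illustration and is not taken from Robert's code or from Intel's library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for log_roundToNearestPof2: returns the number of
 * APIC ID bits needed to address `count` IDs, i.e. ceil(log2(count)). */
unsigned mask_width(uint32_t count)
{
    assert(count >= 1);  /* CPUID reports (count - 1), so count is at least 1 */

    unsigned width = 0;
    /* Grow the width until 2^width is large enough to cover `count`. */
    while (((uint64_t)1 << width) < count)
        width++;
    return width;
}
```

With the values from this thread, `mask_width(2)` gives 1 and `mask_width(16)` gives 4.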



[cpp]
uint32_t eax, ebx, ecx, edx;

eax = 1;
ecx = 0;
cpuid(&eax, &ebx, &ecx, &edx);
const uint8_t initialAPICID = 0xff & (ebx >> 24);

eax = 4;
ecx = subLevelIndex;
cpuid(&eax, &ebx, &ecx, &edx);

const uint8_t levelType = 0xf & eax;
const char* levelName[] = {"Invalid", "Data Cache ", "Instruction Cache", "Unified Cache"};

const uint16_t cacheMaskWidth = log_roundToNearestPof2(((eax >> 14) & 0xfff) + 1);
const uint32_t mask = ~((-1) << cacheMaskWidth);
const uint8_t threadsSharingCache = ((eax >> 14) & 0xfff) + 1;
const uint32_t cacheID = mask & initialAPICID;

printf("Level: %d (%s),\t %d threads/cache, \tCache ID = %d\n", levelType, levelName[levelType], threadsSharingCache, cacheID);
[/cpp]

Does anyone see where the problem lies in this code?

Thanks in advance!
Robert
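As a reading aid for the leaf 4 fields used above, the following sketch decodes a raw EAX value returned by CPUID leaf 4. The bit positions follow the CPUID description in the Intel SDM (type in bits 4:0, level in bits 7:5, sharing count in bits 25:14). The struct and function names are made up for this illustration. Note that under this layout the value printed as "Level" in the posted code appears to be the cache type field, which would explain why the four sub-leaves print as 1, 2, 3, 3.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative container for the CPUID leaf 4 EAX fields (names are made up). */
struct cache_params {
    unsigned type;             /* EAX[4:0]: 0 = none, 1 = data, 2 = instruction, 3 = unified */
    unsigned level;            /* EAX[7:5]: 1 = L1, 2 = L2, 3 = L3 */
    unsigned threads_sharing;  /* EAX[25:14] + 1: maximum number of addressable
                                  logical-processor IDs sharing this cache
                                  (an upper bound, not the actual count) */
};

struct cache_params decode_leaf4_eax(uint32_t eax)
{
    struct cache_params p;
    p.type = eax & 0x1f;
    p.level = (eax >> 5) & 0x7;
    p.threads_sharing = ((eax >> 14) & 0xfff) + 1;
    return p;
}

/* Self-check with a hand-built EAX value: type 1, level 1, 2 threads sharing. */
void decode_leaf4_selfcheck(void)
{
    struct cache_params p = decode_leaf4_eax(0x00004021u);
    assert(p.type == 1 && p.level == 1 && p.threads_sharing == 2);
}
```

Because the sharing count is an upper bound on addressable IDs, a value of 16 for the L3 cache on a package with 8 logical processors is not a contradiction.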
0 Kudos
Patrick_F_Intel1
Employee
1,140 Views

Hello Robert,
Can you give us a clue?
Perhaps include the output?
Thanks,
Pat

0 Kudos
robchip
Beginner
1,140 Views
Sorry, I forgot all about that.
I executed the code on all 8 cores (with taskset) on my Linux system. On each core, the code was executed for subLevelIndex values 0 to 3.

This is the output:

[bash]
Running on core 0
Level: 1 (Data Cache ), 2 threads/cache, Cache ID = 0
Level: 2 (Instruction Cache), 2 threads/cache, Cache ID = 0
Level: 3 (Unified Cache), 2 threads/cache, Cache ID = 0
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 0
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running on core 1
Level: 1 (Data Cache ), 2 threads/cache, Cache ID = 1
Level: 2 (Instruction Cache), 2 threads/cache, Cache ID = 1
Level: 3 (Unified Cache), 2 threads/cache, Cache ID = 1
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running on core 2
Level: 1 (Data Cache ), 2 threads/cache, Cache ID = 0
Level: 2 (Instruction Cache), 2 threads/cache, Cache ID = 0
Level: 3 (Unified Cache), 2 threads/cache, Cache ID = 0
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 2
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running on core 3
Level: 1 (Data Cache ), 2 threads/cache, Cache ID = 1
Level: 2 (Instruction Cache), 2 threads/cache, Cache ID = 1
Level: 3 (Unified Cache), 2 threads/cache, Cache ID = 1
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 3
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running on core 4
Level: 1 (Data Cache ), 2 threads/cache, Cache ID = 0
Level: 2 (Instruction Cache), 2 threads/cache, Cache ID = 0
Level: 3 (Unified Cache), 2 threads/cache, Cache ID = 0
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 4
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running on core 5
Level: 1 (Data Cache ), 2 threads/cache, Cache ID = 1
Level: 2 (Instruction Cache), 2 threads/cache, Cache ID = 1
Level: 3 (Unified Cache), 2 threads/cache, Cache ID = 1
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 5
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running on core 6
Level: 1 (Data Cache ), 2 threads/cache, Cache ID = 0
Level: 2 (Instruction Cache), 2 threads/cache, Cache ID = 0
Level: 3 (Unified Cache), 2 threads/cache, Cache ID = 0
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 6
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running on core 7
Level: 1 (Data Cache ), 2 threads/cache, Cache ID = 1
Level: 2 (Instruction Cache), 2 threads/cache, Cache ID = 1
Level: 3 (Unified Cache), 2 threads/cache, Cache ID = 1
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 7
[/bash]


When I execute cpu_topology (the enumeration utility provided by Intel), I get the following output:

[bash]
Software visible enumeration in the system:
Number of logical processors visible to the OS: 8
Number of logical processors visible to this process: 8
Number of processor cores visible to this process: 4
Number of physical packages visible to this process: 1

Hierarchical counts by levels of processor topology:
# of cores in package 0 visible to this process: 4 .
# of logical processors in Core 0 visible to this process: 2 .
# of logical processors in Core 1 visible to this process: 2 .
# of logical processors in Core 2 visible to this process: 2 .
# of logical processors in Core 3 visible to this process: 2 .

Affinity masks per SMT thread, per core, per package:
Individual:
    P:0, C:0, T:0 --> 1
    P:0, C:0, T:1 --> 2
Core-aggregated:
    P:0, C:0 --> 3
Individual:
    P:0, C:1, T:0 --> 4
    P:0, C:1, T:1 --> 8
Core-aggregated:
    P:0, C:1 --> c
Individual:
    P:0, C:2, T:0 --> 10
    P:0, C:2, T:1 --> 20
Core-aggregated:
    P:0, C:2 --> 30
Individual:
    P:0, C:3, T:0 --> 40
    P:0, C:3, T:1 --> 80
Core-aggregated:
    P:0, C:3 --> c0
Pkg-aggregated:
    P:0 --> ff

APIC ID listings from affinity masks
Affinity mask 00000001 - apic id 0
Affinity mask 00000002 - apic id 1
Affinity mask 00000004 - apic id 2
Affinity mask 00000008 - apic id 3
Affinity mask 00000010 - apic id 4
Affinity mask 00000020 - apic id 5
Affinity mask 00000040 - apic id 6
Affinity mask 00000080 - apic id 7

Package 0 Cache and Thread details

Box Description:
Cache is cache level designator
Size is cache size
OScpu# is cpu # as seen by OS
Core is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache
CmbMsk will differ from AffMsk if > 1 hw_thread/cache
Extended Hex replaces trailing zeroes with 'z#'
where # is number of zeroes (so '8z5' is '0x800000')
L1D is Level 1 Data cache, size(KBytes)= 32, Cores/cache= 2, Caches/package= 4
L1I is Level 1 Instruction cache, size(KBytes)= 32, Cores/cache= 2, Caches/package= 4
L2 is Level 2 Unified cache, size(KBytes)= 256, Cores/cache= 2, Caches/package= 4
L3 is Level 3 Unified cache, size(KBytes)= 6144, Cores/cache= 8, Caches/package= 1

      +-----------+-----------+-----------+-----------+
Cache |    L1D    |    L1D    |    L1D    |    L1D    |
Size  |    32K    |    32K    |    32K    |    32K    |
OScpu#|    0     1|    2     3|    4     5|    6     7|
Core  |c0_t0 c0_t1|c1_t0 c1_t1|c2_t0 c2_t1|c3_t0 c3_t1|
AffMsk|    1     2|    4     8|   10    20|   40    80|
CmbMsk|    3      |    c      |   30      |   c0      |
      +-----------+-----------+-----------+-----------+
Cache |    L1I    |    L1I    |    L1I    |    L1I    |
Size  |    32K    |    32K    |    32K    |    32K    |
      +-----------+-----------+-----------+-----------+
Cache |    L2     |    L2     |    L2     |    L2     |
Size  |   256K    |   256K    |   256K    |   256K    |
      +-----------+-----------+-----------+-----------+
Cache |                      L3                       |
Size  |                      6M                       |
CmbMsk|                      ff                       |
      +-----------------------------------------------+
[/bash]

I hope this helps!
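The CmbMsk values in the cpu_topology box can be reproduced from the APIC ID listing with a small grouping step: logical processors whose APIC IDs agree once the low "thread within cache" bits are dropped share a cache. The sketch below is not the library's code. The function names are made up for illustration, and it assumes one affinity bit per logical CPU in the same order as the APIC ID listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Number of low APIC ID bits that select a thread within one cache:
 * ceil(log2(threads_sharing)). */
unsigned cache_id_bits(uint32_t threads_sharing)
{
    unsigned bits = 0;
    while (((uint64_t)1 << bits) < threads_sharing)
        bits++;
    return bits;
}

/* For each logical CPU i (affinity bit i, APIC ID apic_ids[i]), OR together
 * the affinity bits of every CPU that shares the cache with CPU i. */
void combined_masks(const uint8_t *apic_ids, unsigned ncpus,
                    uint32_t threads_sharing, uint32_t *cmb_out)
{
    assert(ncpus <= 32);  /* one bit per CPU in a 32-bit mask */

    uint32_t low = ((uint32_t)1 << cache_id_bits(threads_sharing)) - 1;
    for (unsigned i = 0; i < ncpus; i++) {
        cmb_out[i] = 0;
        for (unsigned j = 0; j < ncpus; j++) {
            /* Same upper APIC ID bits means same cache. */
            if ((apic_ids[i] & ~low) == (apic_ids[j] & ~low))
                cmb_out[i] |= (uint32_t)1 << j;
        }
    }
}
```

For APIC IDs 0 to 7, a sharing count of 2 yields the masks 3, c, 30 and c0, and a sharing count of 16 yields ff for every CPU, matching the box above.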
0 Kudos
Patrick_F_Intel1
Employee
1,140 Views
Thanks Robchip,
Your output looks reasonable.
But, as to whether the code will work on all Intel chips, I would have to go through the cpu_topology code, extract the relevant lines, and compare them to what you've done.
I don't have the time to go through the code like this right now.
If you've extracted out the relevant code from the library correctly then it should work.
Sorry to not be more helpful,
Pat
0 Kudos
robchip
Beginner
1,140 Views
Hey Pat

thanks for the answer!
You said that the output looks reasonable, but then I have trouble understanding it.

How should I interpret the last line printed for each core?

[bash]
Running on core 0
...
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 0
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running on core 1
...
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running on core 2
...
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 2
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running on core 3
...
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 3
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running on core 4
...
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 4
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running on core 5
...
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 5
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running on core 6
...
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 6
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Running on core 7
...
Level: 3 (Unified Cache), 16 threads/cache, Cache ID = 7
[/bash]
Shouldn't this cache ID always be the same, since there is only one L3 cache shared by all threads?
0 Kudos
Patrick_F_Intel1
Employee
1,140 Views
Your cache ID probably should be the same.
Are you doing the same code method as in the cpu_topology library?
If not, why not just use the library?
Pat
0 Kudos
robchip
Beginner
1,140 Views
Hmm, I looked at the code but I'm only querying the hardware not interpreting the information.
Yeah - using the library is a good idea, of course - but at the moment I'd just like to find the bug in the code :)

Thanks a lot for your help!
0 Kudos
robchip
Beginner
1,140 Views
Hey everyone

just for completeness I'd like to post the solution to the problem:
the bug was in line 18 of the original code, where cacheID is computed. The cache ID has to be calculated differently:

[cpp]const uint32_t cacheID = initialAPICID & (-1 ^ mask);[/cpp]
That way every cache gets a unique ID, and all threads sharing that cache report the same ID.
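To see why the fix works: `mask` covers the low APIC ID bits that distinguish threads within one cache, so the original `mask & initialAPICID` kept exactly the bits that should be discarded. The corrected expression clears them instead. Below is a minimal, self-contained sketch of the corrected calculation. The function names are made up for illustration, and the mask-width helper is an assumed equivalent of log_roundToNearestPof2 rather than the original.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed equivalent of log_roundToNearestPof2: ceil(log2(threads_sharing)). */
unsigned cache_mask_width(uint32_t threads_sharing)
{
    assert(threads_sharing >= 1);
    unsigned width = 0;
    while (((uint64_t)1 << width) < threads_sharing)
        width++;
    return width;
}

/* Cache ID = APIC ID with the low "thread within cache" bits cleared. */
uint32_t cache_id(uint32_t apic_id, uint32_t threads_sharing)
{
    uint32_t low_mask = ((uint32_t)1 << cache_mask_width(threads_sharing)) - 1;
    return apic_id & ~low_mask;
}
```

With the APIC IDs 0 to 7 from this thread, a sharing count of 2 gives cache IDs 0, 0, 2, 2, 4, 4, 6, 6, and a sharing count of 16 gives 0 for every thread, which is the single shared L3.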

Thanks everyone for the help!
Robert
0 Kudos