Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Address of all requests made to L2/L3

narran
Beginner
416 Views

I am using an Intel Xeon X5675 (Westmere) CPU, which has hardware performance counters (HPCs). How can I possibly get the addresses of the cache lines (under MESI) for which there is an access request from a process/thread? I need to find out the sharing patterns of threads (or just find out whether they share data from a cache).

0 Kudos
11 Replies
Bernard
Valued Contributor I
416 Views

Can you consult the Xeon E5-2600 Uncore Guide?

0 Kudos
Patrick_F_Intel1
Employee
416 Views

Hello Narran,

I think the only way you can get this sort of info is by sampling the system with Intel VTune using the PEBS (precise event based sampling) events. I have a question into the VTune folks to see how much info VTune reports from PEBS.

The PEBS events are described in the SDM vol 3. Search for PEBS. Table 18-21 shows the PEBS events. You would have to see if these events will satisfy your needs.

If you really want EVERY load and store, then I don't think that is possible. And if you are trying to do real-time scheduling (which would imply analysis of data access patterns in real time), then you would need to have something like PEBS data collection/analysis built into your kernel. Ouch.
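
To give a feel for what that looks like on Linux, here is a minimal sketch of PEBS-style data-address sampling through the perf_event_open interface. This is not the VTune mechanism, and the raw event encoding below is just a placeholder; the real encoding has to be looked up in the SDM (or 'perf list') for your particular CPU.

/* Minimal sketch: sample the data addresses of retired loads via perf_event_open.
 * Assumes Linux with PEBS support; attr.config is a placeholder and must be
 * replaced with a precise load event encoding from the SDM for your CPU. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.type           = PERF_TYPE_RAW;
    attr.config         = 0x0;            /* placeholder: precise load event */
    attr.sample_period  = 1000;           /* take one sample per 1000 events */
    attr.sample_type    = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR;
    attr.precise_ip     = 2;              /* request PEBS ("precise") mode   */
    attr.disabled       = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }
    /* A real tool would mmap the ring buffer here and parse PERF_RECORD_SAMPLE
     * records; each one carries the sampled data address in its 'addr' field. */
    close(fd);
    return 0;
}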

Pat

0 Kudos
Bernard
Valued Contributor I
416 Views

Hi Narran,

Looking at the documentation provided by Pat, it seems there is no information about the cache line address. I was only able to find the linear address of the data source and the linear address of the data store.

0 Kudos
Patrick_F_Intel1
Employee
416 Views

I'm pretty sure it can be done. I think I've seen it done inside Intel. It might just be at the node level though (which threads are hitting on the LLC of a different socket).

I have some questions in to the author of the utility and I'll get back when he responds.

Pat

0 Kudos
Patrick_F_Intel1
Employee
416 Views

Hello Narran,

I'm guessing you are looking at Linux primarily. Here is a reference to the NumaTop utility (developed by some Intel folks). See https://01.org/numatop. Below is the description of the utility. It sounds like it does things similar to what you'd like to do. They use the PEBS events in their analysis. Source code is available on the website.

Pat

Most modern systems use a Non-Uniform Memory Access (NUMA) design for multiprocessing. In NUMA systems, memory and processors are organized in such a way that some parts of memory are closer to a given processor, while other parts are farther from it. A processor can access memory that is closer to it much faster than the memory that is farther from it. Hence, the latency between the processors and different portions of the memory in a NUMA machine may be significantly different.

NumaTOP is an observation tool for runtime memory locality characterization and analysis of processes and threads running on a NUMA system. It helps the user characterize the NUMA behavior of processes and threads and identify where the NUMA-related performance bottlenecks reside. The tool uses Intel performance counter sampling technologies and associates the performance data with Linux system runtime information, to provide real-time analysis in production systems.

The tool can be used to:

  • Characterize the locality of all running processes and threads to identify those with the poorest locality in the system.
  • Identify the “hot” memory areas, report average memory access latency, and provide the location where accessed memory is allocated. Note: A “hot” memory area is where process/thread(s) accesses are most frequent. NumaTOP has a metric called “ACCESS%” that specifies what percentage of memory accesses are attributable to each memory area.
  • Provide the call-chain(s) in the process/thread code that accesses a given hot memory area.
  • Provide the call-chain(s) when the process/thread generates certain counter events. The call-chain(s) helps to locate the source code that generates the events.
  • Provide per-node statistics for memory and CPU utilization. Note: A node is a region of memory in which every byte is the same distance from each CPU.
  • Show, using a user-friendly interface, the list of processes/threads sorted by some metrics (by default, sorted by CPU utilization), with the top process having the highest CPU utilization in the system and the bottom one having the lowest CPU utilization. Users can also use hotkeys to resort the output by these metrics: Remote Memory Accesses (RMA), Local Memory Accesses (LMA), RMA/LMA ratio, Cycles Per Instruction (CPI), and CPU utilization.

NumaTOP is a GUI tool that periodically tracks and analyzes the NUMA activity of processes and threads and displays useful metrics. Users can scroll up/down by using the up or down key to navigate in the current window and can use several hot keys, shown at the bottom of the window, to switch between windows or to change the running state of the tool. For example, hotkey 'R' refreshes the data in the current window.

0 Kudos
narran
Beginner
416 Views

Thanks, Patrick and iliyapolak. As usual, you people save me. I will try your suggestions and report my observations.

0 Kudos
Bernard
Valued Contributor I
416 Views

Hi Pat,

Is the processor cache's physical implementation partially visible to the programmer?

0 Kudos
Patrick_F_Intel1
Employee
416 Views

Hey Illyapolak,

It depends on what you mean by 'processor cache physical implementation'. The size, cache line size, etc. are available via CPUID. Below is the decoded cache output from CPUID for my laptop:

Data TLB: 4-MB Pages, 4-way set associative, 32 entries
Instruction TLB: 4-KB Pages, 4-way set associative, 128 entries
Instruction TLB: 4-MB Pages, 4-way set associative, 4 entries; Instruction TLB: 2-MB Pages, 4-way set associative, 8 entries
L1 Data TLB: 4-MB pages, 4-way set associative, 16 entries
L1 Data TLB: 4-KB pages, 4-way set associative, 16 entries
64-Byte Prefetching
1st-level data cache: 32-KB, 8-way set associative, 64-byte line size
Data TLB: 4-KB Pages, 4-way set associative, 256 entries
1st-level instruction cache: 32-KB, 8-way set associative, 64-byte line size
2nd-level cache: 4MB, 16-way set associative, 64-byte line size

Explicit cache info:
Cache Level= 1, Data cache, self-initializing= yes
Cache Level= 1, Data cache, fully associative= no
Cache Level= 1, Data cache, max number of threads sharing this cache= 1
Cache Level= 1, Data cache, Maximum of addressable ID for processor cores in the physical package= 2
Cache Level= 1, Data cache, system coherency line size= 64
Cache Level= 1, Data cache, Physical Line Partitions= 1
Cache Level= 1, Data cache, Ways of associativity= 8
Cache Level= 1, Data cache, Number of sets= 64
Cache Level= 1, Data cache, Cache size= 32 KBytes, 32768 bytes
Cache Level= 1, Instruction cache, self-initializing= yes
Cache Level= 1, Instruction cache, fully associative= no
Cache Level= 1, Instruction cache, max number of threads sharing this cache= 1
Cache Level= 1, Instruction cache, Maximum of addressable ID for processor cores in the physical package= 2
Cache Level= 1, Instruction cache, system coherency line size= 64
Cache Level= 1, Instruction cache, Physical Line Partitions= 1
Cache Level= 1, Instruction cache, Ways of associativity= 8
Cache Level= 1, Instruction cache, Number of sets= 64
Cache Level= 1, Instruction cache, Cache size= 32 KBytes, 32768 bytes
Cache Level= 2, Unified cache, self-initializing= yes
Cache Level= 2, Unified cache, fully associative= no
Cache Level= 2, Unified cache, max number of threads sharing this cache= 2
Cache Level= 2, Unified cache, Maximum of addressable ID for processor cores in the physical package= 2
Cache Level= 2, Unified cache, system coherency line size= 64
Cache Level= 2, Unified cache, Physical Line Partitions= 1
Cache Level= 2, Unified cache, Ways of associativity= 16
Cache Level= 2, Unified cache, Number of sets= 4096
Cache Level= 2, Unified cache, Cache size= 4096 KBytes, 4194304 bytes
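
For reference, here is a minimal sketch (assuming an Intel CPU and a GCC/Clang toolchain that provides <cpuid.h>) of how the 'Explicit cache info' above is derived from CPUID leaf 4:

/* Minimal sketch: enumerate cache parameters with CPUID leaf 4, the same data
 * the "Explicit cache info" dump above is decoded from. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    for (unsigned int index = 0; ; index++) {
        __cpuid_count(4, index, eax, ebx, ecx, edx);

        unsigned int type = eax & 0x1f;              /* 0 means no more caches */
        if (type == 0)
            break;

        unsigned int level      = (eax >> 5) & 0x7;
        unsigned int line_size  = (ebx & 0xfff) + 1;
        unsigned int partitions = ((ebx >> 12) & 0x3ff) + 1;
        unsigned int ways       = ((ebx >> 22) & 0x3ff) + 1;
        unsigned int sets       = ecx + 1;
        unsigned int size       = ways * partitions * line_size * sets;

        printf("L%u %s cache: %u-way, %u sets, %u-byte lines, %u KB\n",
               level,
               type == 1 ? "data" : type == 2 ? "instruction" : "unified",
               ways, sets, line_size, size / 1024);
    }
    return 0;
}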

0 Kudos
Bernard
Valued Contributor I
416 Views

Pat thanks for the detailed answer.

I meant the physical addresses of the cache lines. I suppose that such information is not visible to the programmer and is used by the hardware/microcode.

0 Kudos
Patrick_F_Intel1
Employee
416 Views

Hello Illyapolak,

The data returned by the PEBS load and store events includes the linear address of the load/store. That is not quite the physical address, but together with the paging-table information it does identify the physical address.
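
For completeness, here is a minimal sketch (assuming Linux and enough privilege to read the PFN field of /proc/self/pagemap) of how a linear address could be translated to a physical address by hand; the cache line address would then just be that value rounded down to 64 bytes:

/* Minimal sketch: translate a linear (virtual) address to a physical address
 * on Linux by reading /proc/self/pagemap.  Assumes root (or CAP_SYS_ADMIN)
 * so that the kernel exposes the PFN field. */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

int virt_to_phys(const void *vaddr, uint64_t *paddr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t entry;
    FILE *f = fopen("/proc/self/pagemap", "rb");

    if (!f)
        return -1;
    /* Each pagemap entry is 8 bytes; index by virtual page number. */
    if (fseek(f, ((uintptr_t)vaddr / page_size) * sizeof(entry), SEEK_SET) ||
        fread(&entry, sizeof(entry), 1, f) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);

    if (!(entry & (1ULL << 63)))                     /* bit 63: page present */
        return -1;
    *paddr = (entry & ((1ULL << 55) - 1)) * page_size /* bits 0-54: PFN      */
             + (uintptr_t)vaddr % page_size;
    return 0;
}

int main(void)
{
    int x = 42;
    uint64_t pa;

    if (virt_to_phys(&x, &pa) == 0)
        printf("virtual %p -> physical 0x%llx\n",
               (void *)&x, (unsigned long long)pa);
    return 0;
}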

Pat

 

0 Kudos
Bernard
Valued Contributor I
416 Views
Hi Pat, I have found the internal structure of the cache in a Wikipedia article: https://en.wikipedia.org/wiki/L1_cache
0 Kudos