Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Calculate the number of bytes being accessed, and the number accessed from memory

Yonghong_Y_
Beginner

Dear all, 

I would like to calculate the total number of bytes accessed by a function of a binary library, and the number of bytes that must be fetched from memory (cache misses), using hardware counters. I am thinking of using the L1 D$ access event, which counts the number of L1 accesses, and then, assuming all data accesses are 64 bits wide on a 64-bit machine, multiplying to calculate the size. Is this a correct or reasonable way to do it? Also, what is the easiest way to calculate the number of bytes read from and written to memory?

Thank you!

Yonghong

3 Replies
McCalpinJohn
Honored Contributor III

The availability of counters to measure cache and memory access depends on the specific processor model you are working with.

For the Xeon E5/E7 v1/v2/v3/v4 processors there are counters in the memory controllers that count the number of read and write accesses to the DRAMs.  It is not possible to identify which process or IO activity causes these accesses, but on a dedicated system the counts are very close to expected values.  These counters are described in the "uncore performance monitoring" manual for each product generation, and they can be accessed using Intel's VTune Amplifier XE, or by other performance monitoring tools such as "perf", "likwid", the recently discontinued "Intel Performance Counter Monitor", or its replacement, the Processor Counter Monitor (https://github.com/opcm/pcm).
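As a concrete illustration (a minimal sketch, not the implementation used by any of the tools above), the Linux perf_event_open() interface can read these IMC counters directly. The PMU name uncore_imc_0 and the CAS_COUNT.RD encoding (event 0x04, umask 0x03) below are assumptions based on the Xeon E5 uncore manuals; verify them for your processor generation.

/* sketch: read one IMC CAS counter around a region of interest */
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_imc_counter(int pmu_type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = (uint32_t)pmu_type;   /* from .../uncore_imc_0/type */
    attr.size = sizeof(attr);
    attr.config = config;             /* umask<<8 | event */
    /* uncore events are per-socket, not per-process: pid = -1 and a
       CPU number on the socket of interest are required */
    return (int)syscall(SYS_perf_event_open, &attr, -1, 0, -1, 0);
}

int main(void)
{
    int pmu_type;
    FILE *f = fopen("/sys/bus/event_source/devices/uncore_imc_0/type", "r");
    if (f == NULL || fscanf(f, "%d", &pmu_type) != 1) {
        perror("reading PMU type");
        return 1;
    }
    fclose(f);

    /* CAS_COUNT.RD assumed to be event 0x04, umask 0x03 -- check the
       uncore performance monitoring manual for your CPU */
    int fd = open_imc_counter(pmu_type, 0x0304);
    if (fd < 0) {
        perror("perf_event_open (usually needs root)");
        return 1;
    }

    uint64_t before = 0, after = 0;
    read(fd, &before, sizeof(before));
    /* ... region of interest runs here ... */
    read(fd, &after, sizeof(after));

    /* each DRAM CAS transfers one full 64-byte cache line */
    printf("DRAM read bytes: %llu\n",
           (unsigned long long)((after - before) * 64));
    close(fd);
    return 0;
}

The same pattern with event 0x04, umask 0x0C (CAS_COUNT.WR on the generations I have checked) gives write traffic; remember that each socket typically has several IMC PMU instances whose counts must be summed.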

For the Core i3/i5/i7 processors there are also DRAM counters in the memory controllers, but these are not supported by very much software.  Details are provided at https://software.intel.com/en-us/articles/monitoring-integrated-memory-controller-requests-in-the-2nd-3rd-and-4th-generation-intel.

Looking at accesses to the caches is much more difficult.  Most recent processors can give accurate counts of load operations and store operations, but those operations can be accessing 8, 16, 32, 64, 80, 128, 256, or 512 bits.  Unless you know the sizes in advance, the results may not be helpful.

Outside of the L1 Data Cache, all accesses to normal cacheable memory are full cache lines, but many of the cache-related performance counter events have subtleties in their definitions (e.g., counting only "demand" load accesses and not "prefetch" load accesses), or they are unclear about whether they count retried transactions, or they have significant overcounting or undercounting bugs, or all of the above.  It is particularly difficult to derive useful information from the L2 counters on most recent Intel processors.  For the Xeon E5/E7 v1/v2/v3/v4 processors, I have had pretty good results with the L3 counters in the uncore, but the L3 counts available through the core performance counters have significant limitations and/or bugs.
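To make the width problem concrete, converting operation counts into bytes requires knowing the instruction mix in advance. A sketch of the arithmetic follows; the per-width fields are hypothetical inputs (e.g., from knowing the kernel's code), not names of hardware events.

#include <stdint.h>

/* hypothetical per-width load counts for a known kernel */
struct load_mix {
    uint64_t scalar_8B;   /* 8-byte scalar loads   */
    uint64_t sse_16B;     /* 16-byte SSE loads     */
    uint64_t avx_32B;     /* 32-byte AVX loads     */
    uint64_t avx512_64B;  /* 64-byte AVX-512 loads */
};

/* total bytes loaded -- only meaningful if the mix is known */
static uint64_t load_bytes(const struct load_mix *m)
{
    return m->scalar_8B * 8 + m->sse_16B * 16
         + m->avx_32B * 32 + m->avx512_64B * 64;
}

A single event such as MEM_UOPS_RETIRED.ALL_LOADS gives you only the sum of the four counts above, which is why the byte total cannot be recovered from the counters alone.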

Yonghong_Y_
Beginner

Thank you John for the information. 

So getting the size of memory accesses is possible per process, or even per code region, if I am able to enclose the region of interest with start/stop calls for the related uncore counters. Reliably getting the size of total memory accesses (hit or miss in the cache), on the other hand, is basically very hard.

I may be able to do an initial static binary analysis to identify the load/store instructions and their access sizes, and then combine those with the counts. My intention is to avoid instruction tracing but still get that information with high accuracy. I need to think a bit more about whether this is really what I need.
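For what it's worth, a disassembly library can recover the access width of each memory operand. Below is a rough sketch using the Capstone library (my choice for illustration; the thread does not name a tool). Extracting the function's bytes from the ELF binary is assumed to happen elsewhere.

/* compile with -lcapstone; print the width of every memory operand */
#include <capstone/capstone.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static void report_mem_widths(const uint8_t *code, size_t size, uint64_t addr)
{
    csh handle;
    cs_insn *insn;
    size_t n, i;
    int j;

    if (cs_open(CS_ARCH_X86, CS_MODE_64, &handle) != CS_ERR_OK)
        return;
    cs_option(handle, CS_OPT_DETAIL, CS_OPT_ON);  /* operand details needed */

    n = cs_disasm(handle, code, size, addr, 0, &insn);
    for (i = 0; i < n; i++) {
        const cs_x86 *x86 = &insn[i].detail->x86;
        for (j = 0; j < x86->op_count; j++) {
            if (x86->operands[j].type == X86_OP_MEM)
                printf("0x%" PRIx64 ": %s %s touches %u bytes\n",
                       insn[i].address, insn[i].mnemonic, insn[i].op_str,
                       (unsigned)x86->operands[j].size);
        }
    }
    cs_free(insn, n);
    cs_close(&handle);
}

Multiplying each width by a per-instruction execution count (from sampling or basic-block counts) would then give an estimate of total bytes accessed.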

Thank you again.

Yonghong

McCalpinJohn
Honored Contributor III

Yes, with inline performance counters on an otherwise idle system it is possible to get very good measurements of DRAM traffic and other uncore events.   Due to the nature of writeback caches, the DRAM write traffic that you see during an interval will contain some data that was written before the measurement started, and some of the writes that take place during the interval won't generate DRAM traffic until after you stop measuring.  To avoid confusion due to these delayed writebacks, it is often necessary to carefully examine (and possibly instrument) the code before and after your primary region under test.

You don't usually need to stop/start the counters -- just read them before and after the region of interest and take the difference.  Most of the counters are 48 bits wide, so if the ending value is smaller than the beginning value you just add pow(2,48) to get the correct difference.  (Some counters are 40 bits wide, and the DRAM counters on the Core i3/i5/i7 systems are 32 bits wide, so you need to check the documentation for the specific processor.)

The only time there is a strong incentive to stop (or "freeze") the counters is if you know that your monitoring software is going to take a lot of time doing something -- typically when its buffers become full and it needs to dump a lot of results to disk.  My code is never set up this way -- I hold the counter values in memory until the end of the job, then process and write the results after the program under test has completed all of its work -- but more general tools need to operate with finite-sized buffers and should stop or freeze the counters during these processing steps to avoid contaminating the results.  (This is not a perfect approach, since data written to the cache by your code may end up getting written back to DRAM while the performance monitoring code is processing its data.  You don't want to ignore those counts, but you also don't want to count activity that is due *only* to the performance monitoring code.  The best approach is to keep the performance monitoring code as tight and lean as possible, and defer processing until after the process under test, or the section(s) you are interested in, has completed execution.)
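In code, the wraparound correction looks something like this (a minimal sketch; COUNTER_WIDTH must match the width of the counter you are reading, as noted above):

#include <stdint.h>

/* width of the counter being read: 48 bits for most core counters,
   40 on some older parts, 32 for the client IMC counters -- check
   the documentation for your processor */
#define COUNTER_WIDTH 48

/* difference between two raw reads, assuming the counter wrapped at
   most once in between */
static uint64_t counter_delta(uint64_t start, uint64_t end)
{
    if (end >= start)
        return end - start;
    return end + (1ULL << COUNTER_WIDTH) - start;  /* add 2^48, as above */
}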

 
