Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Evaluating different software prefetch schemes on Sandy Bridge and later processors

andy-nisbet
Beginner
Hello,
I'm experimenting with tuning a few different approaches to software prefetching and would benefit from some information or advice. For example, how can I count occurrences of load/prefetch instruction issue stalls due to full load buffers in the micro-architecture (LD_BLOCKS.ALL_BLOCK?)? I presume prefetch requests go through the load buffers? I'd like to measure the number of L1 (*), L2 (L2_DATA_RQSTS.DEMAND.*) and LLC (*) misses due to demand loads in order to determine which software prefetching scheme is better. I am happy to use PAPI or raw MSR approaches, or an Intel Amplifier-based method.

In other words, I'm doing some tuning of software prefetching and am looking for a performance-counter approach to identify potential issues.
I know I can probably turn off hardware prefetching in the BIOS but have not tried this yet in my runs.

Advice on the specific events (*) to count, and/or on derived calculations combining multiple events, would be really useful. I'd like to separate demand misses from prefetch misses so that I can tune my scheme.
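For context, the kind of measurement harness I have in mind is roughly the sketch below, using PAPI; the native event names in it are only placeholders until I know which events are the right ones to add (I would check papi_native_avail for the exact spellings on my machine).

#include <stdio.h>
#include <papi.h>

int main(void)
{
    int evset = PAPI_NULL;
    int code;
    long long counts[2];
    /* Placeholder native event names -- substitute whichever events
       turn out to be the right ones for demand vs. prefetch misses. */
    char ev0[] = "LD_BLOCKS:ALL_BLOCK";
    char ev1[] = "L2_RQSTS:ALL_DEMAND_DATA_RD";

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&evset);

    PAPI_event_name_to_code(ev0, &code);
    PAPI_add_event(evset, code);
    PAPI_event_name_to_code(ev1, &code);
    PAPI_add_event(evset, code);

    PAPI_start(evset);
    /* ... region of interest: the loop being prefetch-tuned ... */
    PAPI_stop(evset, counts);

    printf("event0 = %lld, event1 = %lld\n", counts[0], counts[1]);
    return 0;
}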

Thanks,

Andy
3 Replies
TimP
Honored Contributor III
Amplifier has preset event groups for analyzing memory access questions. I would take those as a starting point.
Intel compilers have built-in schemes to generate software prefetch for single-level indirection, which is one of the more common situations where the hardware prefetchers can't do the job without help but the compiler can.
Large stride with frequent DTLB misses is a situation where you may find benefit from such investigations.
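If you want to experiment with an explicit prefetch by hand for the indirect case, it typically looks something like the sketch below; the prefetch distance of 16 iterations is only an illustrative placeholder and has to be tuned for the particular loop and machine.

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

#define PF_DIST 16       /* illustrative prefetch distance, needs tuning */

double gather_sum(const double *data, const int *idx, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        /* idx[] is read sequentially, so the index used for the prefetch
           is almost always already in cache; only data[] needs help. */
        if (i + PF_DIST < n)
            _mm_prefetch((const char *)&data[idx[i + PF_DIST]], _MM_HINT_T0);
        sum += data[idx[i]];
    }
    return sum;
}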
andy-nisbet
Beginner
Quoting TimP (Intel)
Amplifier has preset event groups for analyzing memory access questions. I would take those as a starting point.
Intel compilers have built-in schemes to generate software prefetch for single-level indirection, which is one of the more common situations where the hardware prefetchers can't do the job without help but the compiler can.
Large stride with frequent DTLB misses is a situation where you may find benefit from such investigations.

Many thanks, I'll recheck Amplifier, but I had hoped someone in the community had done this kind of analysis before and could point me to a paper, or to a list of appropriate custom events/MSRs to track in order to create a custom event group for Amplifier or for use via PAPI, etc.

I should have mentioned that my application is parallel, irregular (sparse), and iterative (but with repeatable access patterns), so we want to tune the prefetching performed. The software prefetching generated by the Intel compilers may or may not be adequate for this problem, but we want to find/evaluate the "optimal" scheme. One of my goals is to evaluate useful versus useless prefetches and the effects of disabling the hardware prefetcher.

I've seen plenty of papers on the topic, but very few that clearly explain the details or the nuts and bolts of their evaluation (unless they use simulation/Simics, in which case it's clear how they get their numbers, since whatever counters are necessary can be instantiated). Of course, the goal is to lower execution time (and potentially power), so we can always measure that ...

Thanks,

Andy
McCalpinJohn
Honored Contributor III
Disabling the hardware prefetchers is definitely the easiest way to determine which fraction of cache misses are due to "normal" cache replacement behavior and which are due to either failures of prefetchers to get data into the cache on time or to evictions of useful data by prefetches. Your BIOS may or may not provide options to separately enable/disable the two L1 data prefetchers and the two L2 prefetchers.

Using the terminology of the Intel SW Optimization manual (document 248966, version 026, section 2.1, specifically Table 2-7), it is probably safe to assume that L1 hardware prefetches use "Line Fill Buffers", but probably don't use "Load buffers" or "Store buffers". The former handle the transfer of the data into the L1 Data Cache, while the latter are used to route the requested parts of the cache line back into the registers and out-of-order execution engine, and to maintain (the appearance of) sequential execution of memory references.

Unfortunately there does not appear to be a repository of "collected wisdom" on the detailed interpretation of the performance counter events for any current microprocessor, especially not something as new as Sandy Bridge. Every once in a while I think that I am going to have time to start such a repository, but it never quite seems to happen.

For Sandy Bridge, Event 48h counts the number of outstanding L1D misses and so it should provide a number that is highly correlated with the average number of Line Fill Buffers in use. Event A2h counts stall cycles due to lack of load buffers or lack of store buffers. Events D1h, D2h, F0h, F1h, F2h look useful for measuring L1 and L2 hits and misses, while the offcore response counters can be used to differentiate sources of data for L2 misses of different types.

The biggest problem with most systems is figuring out which events "overcount" due to retries and whether or not these retried events actually impair performance. I usually run a bunch of carefully configured microbenchmarks and compare the counts with expectations, but have not had time to do this on Sandy Bridge -- still working on the infrastructure required to get access to the !@*&^%# uncore performance counters that were just moved from MSRs to PCI configuration space.
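If you want to program these events outside of Amplifier, one option on Linux is the perf_event interface with a raw event code. Below is a bare-bones sketch for Event 48h; I believe the Umask is 01h for the L1D pending-miss count, but verify the event/umask pair against the SDM event tables before trusting the numbers.

/* Minimal raw-event sketch using Linux perf_event (Event 48h, assumed
   Umask 01h); check the SDM tables before trusting the encoding. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
    struct perf_event_attr attr;
    uint64_t count;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = 0x0148;          /* bits 15:8 = umask, bits 7:0 = event */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* Count for the calling process on any CPU. */
    fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... region of interest ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    read(fd, &count, sizeof(count));
    printf("raw 0x0148 count = %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}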