I ran a microarchitecture analysis on an 8280 processor and I am looking for usage metrics related to cache utilization, like the L1, L2, and L3 hit/miss rates (total L1 misses / total L1 requests, ..., total L3 misses / total L3 requests) for the overall application. I was unable to see these on the VTune GUI summary page, and from this article it seems I may have to figure them out using a "custom profile".
From the explanation here (for Sandy Bridge), it seems we have the following for calculating "cache hit/miss rates" for demand requests:
Demand Data L1 Miss Rate => cannot be calculated.
Demand Data L2 Miss Rate =>
(sum of all types of L2 demand data misses) / (sum of L2 demanded data requests) =>
(MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS + MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS) / (L2_RQSTS.ALL_DEMAND_DATA_RD)
Demand Data L3 Miss Rate =>
L3 demand data misses / (sum of all types of demand data L3 requests) =>
MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS / (MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS + MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS)
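To make the arithmetic concrete, here is a minimal Python sketch applying the two formulas above. The event counts are made-up placeholders, not real measurements; substitute the values from your own result:

```python
# Sketch: demand-data L2/L3 miss rates per the Sandy Bridge formulas above.
# All counts below are placeholder values, NOT real measurements.
counts = {
    "MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS":           4_000_000,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS":    300_000,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS":   100_000,
    "MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS":       600_000,
    "L2_RQSTS.ALL_DEMAND_DATA_RD":               20_000_000,
}

# Sum of all demand loads that missed L2 (i.e., resolved in L3 or beyond).
l2_misses = (counts["MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS"]
             + counts["MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS"]
             + counts["MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS"]
             + counts["MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS"])

l2_miss_rate = l2_misses / counts["L2_RQSTS.ALL_DEMAND_DATA_RD"]
l3_miss_rate = counts["MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS"] / l2_misses

print(f"Demand data L2 miss rate: {l2_miss_rate:.3f}")
print(f"Demand data L3 miss rate: {l3_miss_rate:.3f}")
```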
Q1: Since that post was for Sandy Bridge and I am using Cascade Lake, I wanted to ask whether the formulas above change for the newer platform, and whether some events have changed or been added on the newer platform that could help calculate:
- L1 demand data hit/miss rate
- L1, L2, L3 prefetch and instruction hit/miss rates
Also, in this post here, the events mentioned to get the cache hit rates do not include the ones mentioned above (e.g., MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS):
amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.REF_TSC,MEM_LOAD_UOPS_RETIRED.L1_HIT_PS,MEM_LOAD_UOPS_RETIRED.L1_MISS_PS,MEM_LOAD_UOPS_RETIRED.L3_HIT_PS,MEM_LOAD_UOPS_RETIRED.L3_MISS_PS,MEM_UOPS_RETIRED.ALL_LOADS_PS,MEM_UOPS_RETIRED.ALL_STORES_PS,MEM_LOAD_UOPS_RETIRED.L2_HIT_PS:sa=100003,MEM_LOAD_UOPS_RETIRED.L2_MISS_PS -knob collectMemBandwidth=true -knob dram-bandwidth-limits=true -knob collectMemObjects=true
Q2: What would be the formula to calculate cache hit/miss rates with the aforementioned events?
Q3: Is it possible to get a few of these metrics (like MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS, ...) from the raw data of the microarchitecture analysis I already ran via:
mpirun -np 56 -ppn 56 amplxe-cl -collect uarch-exploration -data-limit 0 -result-dir result_uarchexpl -- $PWD/app.exe
So, would the following be the correct way to run the custom analysis via the command line?
mpirun -np 56 -ppn 56 amplxe-cl -collect-with runsa -data-limit 0 -result-dir result_cacheexpl -knob event-config=MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS,MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS,MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS,L2_RQSTS.ALL_DEMAND_DATA_RD,MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS,CPU_CLK_UNHALTED.REF_TSC,MEM_LOAD_UOPS_RETIRED.L1_HIT_PS,MEM_LOAD_UOPS_RETIRED.L1_MISS_PS,MEM_LOAD_UOPS_RETIRED.L3_HIT_PS,MEM_LOAD_UOPS_RETIRED.L3_MISS_PS,MEM_UOPS_RETIRED.ALL_LOADS_PS,MEM_UOPS_RETIRED.ALL_STORES_PS,MEM_LOAD_UOPS_RETIRED.L2_HIT_PS:sa=100003,MEM_LOAD_UOPS_RETIRED.L2_MISS_PS -- $PWD/app.exe
(Please let me know if I need to use more or different events for the cache hit calculations.)
Q4: I noted that to calculate the cache miss rates, I need to get/view the data as "Hardware Event Counts", not as "Hardware Event Sample Counts" (https://software.intel.com/en-us/forums/vtune/topic/280087). How do I ensure this via the VTune command line, given that I generate the summary via:
vtune -report summary -report-knob show-issues=false -r <my_result_dir>
Let me know if I need to use a different command line to generate results/event values for the custom analysis type.
I was able to get values for the following events with the mpirun statement mentioned in my previous post:
Event summary
-------------
Hardware Event Type          Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample  Precise:Self
---------------------------  -------------------------  --------------------------------  -----------------  ------------
CPU_CLK_UNHALTED.REF_TSC     339611609416650            3396111                           2000003            False
L2_RQSTS.ALL_DEMAND_DATA_RD  7106609797548              1544892                           200003             False
I came across the list of supported events on Skylake (hoping it will be the same for Cascade Lake) here.
It seems most of the events mentioned in the post (for cache hit/miss rates) are not valid for the Cascade Lake platform.
Which events could I use for the cache miss rate calculation on Cascade Lake?
The web pages at https://download.01.org/perfmon/index/ don't expose the differences between client and server processors cleanly. The Xeon Platinum 8280 is a "Cascade Lake Xeon" with performance monitoring events detailed in the files in https://download.01.org/perfmon/CLX/
The list of events you point to for "Skylake" (https://download.01.org/perfmon/index/skylake.html) looks like Skylake *Client* events, but I only checked a few. The Skylake *Server* events are described in https://download.01.org/perfmon/SKX/
These files provide lists of events with full detail on how they are invoked, but with only a few words about what the events mean.
For more descriptions, I would recommend Chapter 18 of Volume 3 of the Intel Architectures SW Developer's Manual -- document 325384. The SW developer's manuals can be found at https://software.intel.com/en-us/articles/intel-sdm
Chapter 19 provides lists of the events available for each processor model. These tables have less detail than the listings at 01.org, but are easier to browse by eye. The lists at 01.org are easier to search electronically (in part because searching PDFs does not work well when words are hyphenated or contain special characters) and the lists at 01.org provide full details on how to use some of the trickier features, such as the OFFCORE_RESPONSE counters.
One question that needs to be answered up front is "what do you want the cache miss rates for?". The MEM_LOAD_UOPS_RETIRED events indicate where the demand load found the data -- they don't indicate whether the cache line was transferred to that location by a hardware prefetch before the load arrived. So these events are good at finding long-latency cache misses that are likely to cause stalls, but are not useful for estimating the data traffic at various levels of the cache hierarchy (unless you disable the hardware prefetchers).
I'll go through the shared links and try to figure out the "overall" misses (which include both instructions and data) at the various cache levels, if possible.
I believe I have a Cascade Lake server, as per lscpu (Intel(R) Xeon(R) Platinum 8280M).
After my previous comment, I came across a blog. Taking cues from it, I used the following PMU events:
MEM_LOAD_RETIRED.FB_HIT, MEM_LOAD_RETIRED.FB_HIT_PS,
MEM_LOAD_RETIRED.L1_HIT, MEM_LOAD_RETIRED.L1_HIT_PS, MEM_LOAD_RETIRED.L1_MISS, MEM_LOAD_RETIRED.L1_MISS_PS,
MEM_LOAD_RETIRED.L2_HIT, MEM_LOAD_RETIRED.L2_HIT_PS, MEM_LOAD_RETIRED.L2_MISS, MEM_LOAD_RETIRED.L2_MISS_PS,
MEM_LOAD_RETIRED.L3_HIT, MEM_LOAD_RETIRED.L3_HIT_PS, MEM_LOAD_RETIRED.L3_MISS, MEM_LOAD_RETIRED.L3_MISS_PS
and used the following formulas (also mentioned in the blog):
L1 miss rate = (HIT_LFB + L1_MISS) / (HIT_LFB + L1_MISS + L1_HIT)
L1 hit rate = L1_HIT / (HIT_LFB + L1_MISS + L1_HIT)
L2 miss rate = L2_MISS / L1_MISS
L2 hit rate = L2_HIT / L1_MISS
Local L3 miss rate = L3_MISS / L2_MISS
Local L3 hit rate = L3_HIT / L2_MISS
Global L3 miss rate = L3_MISS / L1_MISS
Global L3 hit rate = L3_HIT / L1_MISS
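As a sanity check on the arithmetic, here is a small Python sketch applying the blog's formulas. The event counts are invented placeholders, not real measurements; substitute the counts from your own report:

```python
# Sketch: cache hit/miss rates from MEM_LOAD_RETIRED.* counts, per the
# formulas from the blog. Placeholder counts, NOT real measurements.
FB_HIT  = 1_000_000   # MEM_LOAD_RETIRED.FB_HIT (load hit an in-flight fill buffer)
L1_HIT  = 90_000_000  # MEM_LOAD_RETIRED.L1_HIT
L1_MISS = 9_000_000   # MEM_LOAD_RETIRED.L1_MISS
L2_HIT  = 6_000_000   # MEM_LOAD_RETIRED.L2_HIT
L2_MISS = 3_000_000   # MEM_LOAD_RETIRED.L2_MISS
L3_HIT  = 2_400_000   # MEM_LOAD_RETIRED.L3_HIT
L3_MISS = 600_000     # MEM_LOAD_RETIRED.L3_MISS

all_l1_lookups = FB_HIT + L1_MISS + L1_HIT

l1_miss_rate = (FB_HIT + L1_MISS) / all_l1_lookups
l1_hit_rate  = L1_HIT / all_l1_lookups
l2_miss_rate = L2_MISS / L1_MISS
l2_hit_rate  = L2_HIT / L1_MISS
local_l3_miss_rate  = L3_MISS / L2_MISS
local_l3_hit_rate   = L3_HIT / L2_MISS
global_l3_miss_rate = L3_MISS / L1_MISS
global_l3_hit_rate  = L3_HIT / L1_MISS

print(f"L1 miss rate: {l1_miss_rate:.3f}  L1 hit rate: {l1_hit_rate:.3f}")
print(f"L2 miss rate: {l2_miss_rate:.3f}  L2 hit rate: {l2_hit_rate:.3f}")
print(f"Local L3 miss rate: {local_l3_miss_rate:.3f}")
```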
Is this the correct method to calculate the data misses (demand loads, hardware & software prefetch) at the various cache levels?
Though what I am looking for is the overall utilization of a particular cache level (data + instruction) while my application was running.
In the aforementioned formulas, I am not using any events that capture instruction hit/miss data.
In this manual (https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf) I just glanced over a few topics and saw:
L1 Data Cache Miss Rate = L1D_REPL / INST_RETIRED.ANY
L2 Cache Miss Rate = L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY
but I can't see an L3 miss rate formula.
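For what it's worth, those two manual formulas are misses per retired instruction (MPI) rather than per-access hit/miss ratios, which may explain why they look different from the earlier ones. A minimal sketch with placeholder counts (the variable names mirror the events in the formulas above; the counts are invented):

```python
# Sketch: the optimization manual's "miss rates" are misses per retired
# instruction (MPI), not per-access ratios. Placeholder counts, NOT real data.
L1D_REPL          = 5_000_000    # lines replaced in L1D (L1D miss proxy)
L2_LINES_IN       = 1_000_000    # lines allocated into L2 (L2 miss proxy)
INST_RETIRED_ANY  = 500_000_000  # INST_RETIRED.ANY

l1d_mpi = L1D_REPL / INST_RETIRED_ANY     # L1 data cache misses per instruction
l2_mpi  = L2_LINES_IN / INST_RETIRED_ANY  # L2 misses per instruction

print(f"L1D misses per instruction: {l1d_mpi:.4f}")
print(f"L2 misses per instruction:  {l2_mpi:.4f}")
```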
Thanks for the pointer to Hadi's blog -- it is a very interesting read....
The MEM_LOAD_RETIRED PMU events will only increment due to the activity of load operations -- not code fetches, not store operations, and not hardware prefetches. So the formulas based on those events will only relate to the activity of load operations.
Looking at the other primary causes of data motion through the caches:
- Store operations: Stores that miss in a cache will generate an RFO ("Read For Ownership") to send to the next level of the cache. This looks like a read, and returns data like a read, but has the side effect of invalidating the cache line in all other caches and returning the cache line to the requester with permission to write to the line. Depending on the structure of the code and the memory access patterns, these "store misses" can generate a large fraction of the total "inbound" cache traffic. After the data in the cache line is modified and re-written to the L1 Data Cache, the line is eligible to be victimized from the cache and written back to the next level (eventually to DRAM). This accounts for the overwhelming majority of the "outbound" traffic in most cases.
- Hardware prefetch: Note again that these counters only track where the data was when the load operation found the cache line -- they do not provide any indication of whether that cache line was found in the location because it was still in that cache from a previous use (temporal locality) or if it was present in that cache because a hardware prefetcher moved it there in anticipation of a load to that address (spatial locality).
- Software prefetch: Hadi's blog post implies that software prefetches can generate L1_HIT and HIT_LFB events, but they are not mentioned as contributors to any of the other sub-events. (I would guess that they will increment the L1_MISS counter on misses, but it is not clear whether they increment the L2/L3 hit/miss counters.) Although software prefetch instructions are not commonly generated by compilers, I would want to double-check whether the PREFETCHW instruction (prefetch with intent to write, opcode 0f 0d) is counted the same way as the PREFETCHh instruction (prefetch with hint, opcode 0f 18).
- There are many other more complex cases involving "lateral" transfer of data (cache-to-cache). These are usually a small fraction of the total cache traffic, but are performance-critical in some applications.
- Streaming stores are another special case -- from the user perspective, they push data directly from the core to DRAM. (If the corresponding cache line is present in any caches, it will be invalidated.) This traffic does not use the storage of the caches, but it does access the cache tags and it does use the same data pathways that the caches use.
These counters and metrics are definitely helpful for understanding where loads are finding their data. This is important because long-latency load operations are likely to cause core stalls (due to limits in the out-of-order execution resources).
These counters and metrics are not helpful in understanding the overall traffic in and out of the cache levels, unless you know that the traffic is strongly dominated by load operations (with very few stores). This almost always requires that the hardware prefetchers be disabled as well, since they are normally very aggressive.