Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Compare cache performance of different data structures for the same algorithm

Rinaldi__Fabio
Beginner

I'm testing the performance of different data structures on an algorithm that is strongly memory bound. I obtain different execution times depending on the choice of data structure (SoA, AoS, or a hybrid of the two).
At the moment I'm trying to profile the cache behaviour of each data structure, hoping to find a correlation with the execution time of the algorithm. Due to my lack of experience in the field, I'm not sure how to collect and interpret the relevant data.
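
For context, here is a stripped-down sketch of the two basic layouts I'm comparing (the type and field names are invented for illustration, not my actual code):

// AoS vs SoA sketch -- illustrative only, the fields are placeholders.
#include <cstddef>
#include <cstdio>
#include <vector>

// Array of Structures: one contiguous record per element.
struct ParticleAoS { double x, y, z, mass; };

// Structure of Arrays: one contiguous array per field.
struct ParticlesSoA { std::vector<double> x, y, z, mass; };

// A kernel that reads a single field per element:
// with AoS each cache line brings in whole 32-byte records but only 8 bytes of each are used;
// with SoA the same loop streams through a dense array of just that field.
double sum_mass(const std::vector<ParticleAoS>& p) {
    double s = 0.0;
    for (const auto& e : p) s += e.mass;
    return s;
}

double sum_mass(const ParticlesSoA& p) {
    double s = 0.0;
    for (double m : p.mass) s += m;
    return s;
}

int main() {
    const std::size_t n = 1 << 20;
    std::vector<ParticleAoS> aos(n, ParticleAoS{1.0, 2.0, 3.0, 4.0});
    ParticlesSoA soa{std::vector<double>(n, 1.0), std::vector<double>(n, 2.0),
                     std::vector<double>(n, 3.0), std::vector<double>(n, 4.0)};
    std::printf("AoS sum: %.0f  SoA sum: %.0f\n", sum_mass(aos), sum_mass(soa));
    return 0;
}

The point of the sketch is only that the same loop touches very different amounts of memory per element depending on the layout.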

I'm using an Intel Xeon E5-2630v3, therefore I've followed this tutorial published by Intel. For each data structure I've collected the following data:

% of cycles spent on memory access (Local DRAM): (193 * MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM) / CPU_CLK_UNHALTED.THREAD
% of cycles spent on local LLC Hits: (47 * MEM_LOAD_UOPS_RETIRED.L3_HIT_PS) / CPU_CLK_UNHALTED.THREAD
DTLB Overhead: ((DTLB_LOAD_MISSES.STLB_HIT_4K * 7) + (DTLB_LOAD_MISSES.STLB_HIT_2M * 7) + DTLB_LOAD_MISSES.WALK_DURATION) / CPU_CLK_UNHALTED.THREAD
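
For reference, this is roughly how I turn the raw event counts into those three ratios (a minimal sketch; the counter values below are invented placeholders just to make it run, in my measurements they come from the profiler):

// Sketch of the metric computation -- counter values are invented placeholders.
#include <cstdint>
#include <cstdio>

int main() {
    std::uint64_t cycles        = 1000000000ULL; // CPU_CLK_UNHALTED.THREAD
    std::uint64_t l3_miss_dram  = 1000000ULL;    // MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM
    std::uint64_t l3_hit        = 2000000ULL;    // MEM_LOAD_UOPS_RETIRED.L3_HIT_PS
    std::uint64_t stlb_hit_4k   = 500000ULL;     // DTLB_LOAD_MISSES.STLB_HIT_4K
    std::uint64_t stlb_hit_2m   = 0ULL;          // DTLB_LOAD_MISSES.STLB_HIT_2M
    std::uint64_t walk_duration = 10000000ULL;   // DTLB_LOAD_MISSES.WALK_DURATION

    const double c = static_cast<double>(cycles);
    const double dram_pct = 100.0 * (193.0 * l3_miss_dram) / c;
    const double llc_pct  = 100.0 * (47.0 * l3_hit) / c;
    const double dtlb_pct = 100.0 * (7.0 * (stlb_hit_4k + stlb_hit_2m) + walk_duration) / c;

    std::printf("local DRAM: %.2f%%  local LLC hit: %.2f%%  DTLB overhead: %.2f%%\n",
                dram_pct, llc_pct, dtlb_pct);
    return 0;
}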

  1. Does "% of cycles spent on memory access (Local DRAM)" refer only to memory accesses caused by L3 cache misses? Or does it take into account other events that require memory reads?
  2. What does "DTLB Overhead" represent?

Thank you,
Fabio.

McCalpinJohn
Honored Contributor III

The first metric ("% of cycles spent on memory access") counts only load operations that miss in the L3 cache. For data that is prefetched into the L3 cache, the load should hit in the L3, and the second metric ("% of cycles spent on local LLC hits") applies. For data that is prefetched into the L2 cache, the accesses are assumed to be "free". Under heavy loads, the assumption of fixed latency for memory accesses and for L3 hits becomes less accurate, but this is a good start. It is also possible to stall on stores that miss in the caches, but it is not as easy to estimate the cost for those cases.
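
As a purely illustrative example of the fixed-latency model (the counts are invented): in a run with 10^9 unhalted cycles, 2x10^6 retired loads serviced from local DRAM, and 5x10^6 retired loads that hit the L3, the metrics charge (193 * 2x10^6) / 10^9 = ~39% of the cycles to local DRAM access and (47 * 5x10^6) / 10^9 = ~24% to local LLC hits.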

The metric "DTLB Overhead" attempts to estimate the amount of time spent in data address translations that miss in the Translation Lookaside Buffer.   The "Data TLB" can hold 64 address translation entries for 4KiB pages and can hold 32 address translation entries for 2MiB pages.  (Intel's documentation is never clear about whether these are separate entries, or whether the 4KiB and 2MiB translations compete for the same entries in the TLB.)   If the translation is not in the Data TLB, the hardware looks in the "Secondary TLB" (which holds 1024 entries), which adds a 7 cycle delay to the load.   This can often be overlapped with other work by the out-of-order execution engine, so many performance models don't include this term.  If the translation is not in the STLB, the hardware "Page Table Walker" is invoked, which looks for the address translation in memory.  The cost of these "Page Table Walks" is variable, but the performance counter "DTLB_LOAD_MISSES.WALK_DURATION" tracks the total number of cycles.  Page Table Walks are generally considerably more expensive than STLB hits, so these do typically result in processor stalls (because the out-of-order execution engine runs out of work to do).

Rinaldi__Fabio
Beginner

Hello John, thank you for the detailed answer.

The metric "DTLB Overhead" attempts to estimate the amount of time spent in data address translations that miss in the Translation Lookaside Buffer

Is the "amount of time" the percentage of cycles spent on data address translations that miss in the Translation Lookaside Buffer? If yes, would it make sense to sum it with the % of cycles spent on memory access as a metric of comparison between the 3 data structures? 

McCalpinJohn
Honored Contributor III

The "DTLB Overhead" metric as defined above is a fraction of the total cycles.   Whether these cycles overlap with the memory access cycles of the first metric is much harder to say.  The events don't double-count -- memory accesses that are part of Page Table Walks do not show up in the "% of cycles spent on local memory access" metric.  On the other hand, these cycles can very easily overlap -- e.g., a load that misses in the L3 cache followed by a load that misses in the TLB and STLB would be expected to have a lot overlap in the cycle counts.

In practice, I would start by saying that the DTLB metric should be very small on recent processors (including your Xeon E5 v3).  If it is more than 10%, it should be addressed as an independent performance problem.  If it is less than 5%, it is probably not worth the effort to look at the details.
