Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

performance benefit in using large pages

Chris_A_
Beginner

I'm trying to get to the bottom of what the performance benefit is in using large memory pages. There is a very good discussion thread on this here:

https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/332852

This covers large memory pages from a TLB angle, specifically the fact that they allow the TLB to cover a larger area of memory and reduce the DTLB mapping structure (some form of tree?) by one level.

However, I would like some guidance as to how this helps with regard to memory prefetching and a reduction in random accesses within the CPU cache once the page's cache lines are loaded into the CPU cache.

Thanks in advance to whoever may help me get to the bottom of this.

Chris

 

TimP
Honored Contributor III

I suspect further detail will involve VTune analysis of your own applications and hardware targets, if you think it worthwhile.  Intel architectures support in-cache DTLB misses (a DTLB miss doesn't necessarily invalidate the data cache).  It may still be a performance issue, even if it is confined to TLB page-walk performance.

I gratefully accepted the introduction of transparent huge pages when they doubled the performance of some large-stride cases, without wanting to learn the lower-level detail, as that seemed to do all that could be hoped for.
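
(In case it is useful to anyone reading along, here is a minimal Linux-only sketch of hinting transparent huge pages for one buffer with madvise(MADV_HUGEPAGE).  It assumes THP is enabled in "madvise" or "always" mode; the buffer size is purely illustrative.)

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define TWO_MIB (2UL << 20)

int main(void)
{
    size_t len = 64 * TWO_MIB;              /* illustrative size: 128 MiB */
    void *buf = NULL;

    /* 2 MiB alignment makes it easy for the kernel to back the whole
       range with huge pages. */
    if (posix_memalign(&buf, TWO_MIB, len) != 0) {
        perror("posix_memalign");
        return 1;
    }

    /* Hint (not a guarantee) that this range should use transparent
       huge pages. */
    if (madvise(buf, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    /* ... initialize and use buf as usual ... */
    free(buf);
    return 0;
}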

Intel put medium page support into MIC KNC hardware but seems to have given up on putting it to use.  So I would think even experts have difficulty generalizing on these questions.

McCalpinJohn
Honored Contributor III

Large pages won't make any difference in hardware prefetching, since the hardware prefetchers only work within 4KiB pages.  (There is one exception -- the "next page prefetcher" in Ivy Bridge and Haswell crosses page boundaries, but it is not a very aggressive prefetcher -- it looks like it is there primarily to get the 4KiB page table walk done in advance of the page crossing done by the demand accesses.)

Software prefetches are only significantly affected by large pages on Xeon Phi (which drops software prefetches that require page table walks).
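
(For readers who have not used software prefetches: a minimal sketch of a loop that issues explicit prefetch hints with the _mm_prefetch intrinsic.  The 16-cache-line prefetch distance is an arbitrary illustrative choice, not a tuned value.)

#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

double sum_with_sw_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Prefetch 128 doubles (16 cache lines of 64 B) ahead of the
           demand load; the hardware may drop the hint. */
        if (i + 128 < n)
            _mm_prefetch((const char *)&a[i + 128], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}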

Large pages typically make the caches behave more repeatably by eliminating "page colors".  The L1 Data Cache has no page colors -- every 4KiB page maps to the same sets in the L1 cache.  The 256KiB L2 cache in most recent Intel processors has 8 colors.  Derivation: 256 KiB = 2^18 bytes, i.e., 18 address bits.  Subtract 3 address bits for the 8-way associativity and 12 address bits that are unchanged in the virtual-to-physical translation.  This leaves 3 address bits that are used to index into the L2 cache but are changed by the virtual-to-physical translation.  Typically this translation is pseudo-random, but that just means that you can't control when it will cause you trouble....  (Note that the associativity of the L2 cache is decreased to 4-way in the Skylake client core, giving 16 page colors.)
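
To make the arithmetic concrete, here is a small sketch that just recomputes the color counts above from cache size, associativity, and page size (the cache parameters are the ones quoted above, used purely for illustration):

#include <stdio.h>

/* colors = 2^(log2(bytes per way) - untranslated page-offset bits) */
static unsigned page_colors(size_t cache_bytes, unsigned ways, unsigned page_bits)
{
    size_t bytes_per_way = cache_bytes / ways;   /* sets * line size */
    unsigned way_bits = 0;
    while ((1UL << way_bits) < bytes_per_way)
        way_bits++;                              /* log2(bytes_per_way) */
    return (way_bits > page_bits) ? 1u << (way_bits - page_bits) : 1u;
}

int main(void)
{
    printf("L1D  32 KiB, 8-way, 4 KiB pages: %u color(s)\n",
           page_colors(32u << 10, 8, 12));       /* 1  */
    printf("L2  256 KiB, 8-way, 4 KiB pages: %u color(s)\n",
           page_colors(256u << 10, 8, 12));      /* 8  */
    printf("L2  256 KiB, 4-way, 4 KiB pages: %u color(s)\n",
           page_colors(256u << 10, 4, 12));      /* 16 */
    return 0;
}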

Large (2MiB) pages leave the bottom 21 bits untranslated, so all of the bits used to index into the L2 cache are visible to the user (they are the same as the virtual address bits).   This often allows you to use ~100% of the L2 cache without causing conflicts, while 4KiB pages will very seldom map uniformly across the 8 colors to allow use of all of the L2 cache.

One downside to large pages in the L2 cache is that if you have a cache conflict associated with a stride of 8KiB or 16KiB, large pages will guarantee that the conflict happens every time.  In this case the address randomization of the page color bits using 4KiB pages would spread the conflicting accesses over 2 or 4 different locations in the L2 cache.
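
A small sketch tying the last two paragraphs together: since 2 MiB pages leave the L2 index bits untranslated, the set index can be read straight from the virtual address, and an 8 KiB stride can be seen to hit only 4 of the 512 sets.  The cache parameters are the ones quoted above (256 KiB, 8-way), with an assumed 64 B line size.

#include <stdint.h>
#include <stdio.h>

#define LINE_BITS 6u    /* 64 B cache lines                  */
#define L2_SETS   512u  /* 256 KiB / 8 ways / 64 B per line  */

/* Valid only when the index bits are untranslated, i.e. 2 MiB pages. */
static unsigned l2_set(uintptr_t va)
{
    return (unsigned)((va >> LINE_BITS) & (L2_SETS - 1));
}

int main(void)
{
    uintptr_t base = 0;                              /* offsets within one 2 MiB page */
    for (unsigned i = 0; i < 8; i++) {
        uintptr_t va = base + (uintptr_t)i * 8192;   /* 8 KiB stride */
        printf("access %u (offset %3u KiB) -> L2 set %u\n", i, i * 8, l2_set(va));
    }
    /* Only sets 0, 128, 256, 384 repeat: 4 sets * 8 ways = 32 usable lines. */
    return 0;
}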

The L3 is similar, but there are more configuration options.  For the Xeon E5 chips the hash of physical addresses to L3 slices is not published, so there is some uncertainty about which bits are used to compute the target L3 slice number.  For a Xeon E5-2680 (8 cores, 20 MiB L3), the L3 cache is 20-way set-associative, meaning that each way holds 1 MiB, so at least the low 20 address bits are used to select the index in the L3 cache.  I have not checked, but it seems likely that some additional bits are hashed in to provide more effective randomization of L3 accesses.  If this is the case, then you will still have page colors (i.e., translated address bits used to index into the cache), but probably not very many of them.
