Hello,
I have been working with the Knights Corner platform for some time. Along the lines of what libnuma and DPDK provide, I have been wondering whether I could write cache- and memory-controller-aware memory allocation code for Xeon Phi. Last time I asked, I didn't get much information on the subject (https://software.intel.com/en-us/comment/1799811#comment-1799811), but then I came across this passage while browsing the datasheet (http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-phi-coprocessor-datasheet.pdf).
"Communication around the ring follows a Shortest Distance Algorithm (SDA). Coresident with each core structure is a portion of a distributed tag directory. These tags are hashed to distribute workloads across the enabled cores. Physical addresses are also hashed to distribute memory accesses across the memory controllers."
I believe a full description of this scheme is the answer I'm seeking.
If someone from Intel could provide a more detailed explanation, or point me to where I can find one, I would be very grateful. Frankly, I NEED to know this.
For instance,
a) How does it hash physical addresses? Does it divide the 40-bit physical address space by the cache line size (64 B) and distribute the resulting 0x400000000 (2^34) cache lines across the DTDs by taking the ordinal number of each cache line modulo the number of directories?
b) Is the segmentation of the L2 address space in PA space somewhat preserved in VA space as well? For instance, would every 60th cache line belong to a specific core's L2 tag directory?
c) Which of the following does "enabled cores" mean: i) all on-board cores, ii) cores with any executing threads, or iii) cores not disabled by some means I'm not aware of? If iii) is the case, how does one disable a core?
For your information, I am currently using a Xeon Phi 5110P, and may purchase/use more 31S1Ps.
Thank you for your attention.
Jun
None of the information you are requesting is published (or even available to the software groups inside Intel!)
There are two hashes here -- (1) the mapping of cache line addresses to distributed duplicate tag directory and (2) the mapping of cache line addresses from "physical address" to [memory controller,channel,row,column].
Neither mapping is documented. The mapping of cache line addresses to duplicate tag directories is likely to be extremely difficult to derive due to the lack of performance counters in the DTDs.
I have derived the mapping of addresses to memory controller & channel using the Memory Controller (GBox/FBox) performance counters and a test code that systematically loaded from each address many times and then checked to see which memory controller counter incremented. By measuring the latency from different cores and using the die photo (available from http://newsroom.intel.com/docs/DOC-3126), the relative positions of the cores and memory controllers on the ring could be determined.
The short answer is that when ECC is enabled (the only case I tested), the "physical" address is divided into "blocks" of 62 cache lines, which are then interleaved around the 8 memory controllers and 2 channels per controller using a fixed repeating pattern. The other 2 cache lines in each 4KiB range are used to hold the ECC data. (The ratio is clear from the output of "cat /proc/meminfo" which shows that the total memory installed is 31/32 of the nominal memory size.)
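As a quick sanity check of that ratio: a card with a nominal 8 GiB of GDDR5 (e.g., the 5110P mentioned above) should report roughly 8 GiB * 31/32 = 7.75 GiB of usable memory when ECC is enabled.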
The effect is that contiguous virtual addresses map to a single memory controller for anywhere between 2 and 62 cache lines before jumping to another controller. The reason for the wide range is that virtual memory is (by default) mapped in 4KiB pages, which intersect the 62-cache-line DRAM blocks at any possible (even) offset. With large (2MiB) pages, the memory will be mapped to the various memory controllers in 62-cache-line blocks, except for the beginning and end of the page, where the page addresses are (typically) not aligned with the 62-cache-line blocks used by the memory controllers.
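To make the "between 2 and 62" behavior concrete, here is a minimal sketch (my own illustration, not John's code) that assumes the simple model above -- 62-line data blocks tiling the physical address space, ECC line placement ignored -- and prints how the 64 cache lines of a 4 KiB page split across two adjacent blocks for each possible page alignment:

    #include <stdio.h>

    #define LINES_PER_PAGE 64   /* 4 KiB page / 64 B cache line           */
    #define LINES_PER_BLK  62   /* data lines per DRAM "block" (per post) */

    int main(void)
    {
        /* A 4 KiB page starts at cache-line number 64*n, so its offset into
         * the current 62-line block is (64*n) % 62 = (2*n) % 62 -- always even. */
        for (unsigned n = 0; n < 31; n++) {
            unsigned off  = (2 * n) % LINES_PER_BLK;    /* page start within block  */
            unsigned run1 = LINES_PER_BLK - off;        /* lines left in this block */
            unsigned run2 = LINES_PER_PAGE - run1;      /* lines spilling into next */
            printf("physical page %2u: first %2u lines on one channel, last %2u on the next\n",
                   n, run1, run2);
        }
        return 0;
    }

The runs come out as even values covering every length from 2 to 62, which matches the range quoted above.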
This mapping scheme allows contiguous accesses to exploit spatial locality in the ECC bits stored in the 2 hidden cache lines of each 4KiB DRAM page. If contiguous cache lines were interleaved directly among the 16 channels, a stream of accesses would have to be at least 32 cache lines long just to use the ECC data for each page twice. Another nice feature of the scheme is that the base of each 4 KiB (virtual) page is hashed around the 16 DRAM channels -- so you can have lots of short page-aligned data structures and they won't all be mapped to a single DRAM channel.
As far as the meaning of "enabled cores"--you'll recall that there are a few different SKUs of Xeon Phi with different numbers of cores (e.g. the 3120A has 57, the 7120P has 61), while the die photo linked by John clearly has 62 cores. It's this discrepancy to which that language is referring (i.e. the 3120A has 57 "enabled cores").
A few more comments....
(1) The L2 cache on Xeon Phi is not "distributed" -- it is a standard private L2 cache. A core can only allocate into its own private L2 cache and only queries its own private L2 cache on an L1 cache miss.
On an L2 cache miss, the other L2 caches have to be queried, since one of them might have a modified version of the cache line requested. The aggregate bandwidth of the Xeon Phi is too high to snoop all 60 of the other L2 caches, so Intel added a Distributed Duplicate Tag structure and mapped each physical cache line address to exactly one of these Distributed Tag Directories. This is similar in principle to the directories used in large ccNUMA systems, but it differs in that it explicitly trades away locality to reduce "hot spots" in directory lookups.
Any cache-coherent system has to have some mechanism (e.g. snooping or directories) to locate modified copies of cache lines so that each processor will only be able to access the current version of the data. In Xeon Phi the directories are also used to find copies of *unmodified* data in other L2 caches and copy such data to the requesting processor's cache. This can be done either to reduce the latency of getting shared data or to reduce memory utilization by getting the data from another cache instead of from memory. In Xeon Phi the latency of shared interventions is not a lot lower than the latency of DRAM accesses, so the motivation was probably to reduce the DRAM access rate.
(2) Although it would be extremely challenging to derive the arithmetic function used to map physical addresses to Distributed Tag Directories, it is quite easy to measure the latency required for two cores to "ping-pong" updates to any specific address. My observations suggest that the 64 cache lines in each 4 KiB page are distributed all the way around the ring, so you only need to test 64 cache lines to be confident that you have seen (effectively) the full range of possible latencies.
So for any application that is doing explicit "producer-consumer" or other explicit shared-memory synchronization, you can quickly find an address that gives a good latency for interactions between any pair of cores. Under normal circumstances the page mapping will not be changed once the page is instantiated, but it is probably a good idea to "pin" the page anyway, using either the "mlock()" call or the MAP_LOCKED option to "mmap()".
Cores on the Xeon Phi appear to be numbered consecutively along the ring, so core numbers that are close together also correspond to cores that are physically close together. For adjacent core numbers, I have seen a ratio of best-case to worst-case "ping-pong" latency of almost 3:1. Presumably the best-case latency occurs when the Distributed Tag Directory for a given address is co-located with one of the cores, and the worst-case latency occurs when the Distributed Tag Directory for a given address is all the way on the other side of the ring from the two cores.
For cores that are farther apart, the ratio of "best case" to "worst case" latency decreases, with cores all the way across the ring from each other showing almost constant "ping pong" latency for varying addresses. This constant latency is almost twice the "best case" latency for adjacent cores, so there is a potential benefit in exploiting core-number locality in codes. Interestingly, the *average* latency (averaged over all addresses) is almost independent of the relative position of the cores on the ring.
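For anyone who wants to reproduce this kind of measurement, below is a minimal ping-pong sketch (not Dr. McCalpin's code). It bounces a flag in one shared cache line between two pinned threads and reports the average round-trip time in TSC ticks. The core numbers and iteration count are arbitrary placeholders, and the page holding the flag is locked with MAP_LOCKED as suggested above.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define ITERS 100000            /* round trips to average over (placeholder) */

    static _Atomic uint64_t *flag;  /* lives in one cache line of a locked page  */

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    static void pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    /* Responder: waits for each odd value, replies with the next even value. */
    static void *responder(void *arg)
    {
        pin_to_core((int)(intptr_t)arg);
        for (uint64_t i = 1; i < 2 * ITERS; i += 2) {
            while (atomic_load(flag) != i)
                ;                           /* spin until the other side writes i */
            atomic_store(flag, i + 1);
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int core_a = (argc > 1) ? atoi(argv[1]) : 1;   /* placeholder core numbers */
        int core_b = (argc > 2) ? atoi(argv[2]) : 2;

        /* Allocate and lock one page so its physical mapping cannot change. */
        flag = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
        if (flag == MAP_FAILED) { perror("mmap"); return 1; }
        atomic_store(flag, 0);

        pthread_t t;
        pthread_create(&t, NULL, responder, (void *)(intptr_t)core_b);
        pin_to_core(core_a);

        uint64_t start = rdtsc();
        for (uint64_t i = 0; i < 2 * ITERS; i += 2) {
            atomic_store(flag, i + 1);              /* ping */
            while (atomic_load(flag) != i + 2)
                ;                                   /* wait for pong */
        }
        uint64_t stop = rdtsc();

        pthread_join(t, NULL);
        printf("cores %d<->%d: %.1f TSC ticks per round trip\n",
               core_a, core_b, (double)(stop - start) / ITERS);
        return 0;
    }

Compile with something like icc -std=c11 -pthread; to sweep the 64 cache lines of a page, point flag at successive 64-byte offsets within the locked page and rerun.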
Dr. McCalpin,
Thank you for your very comprehensive summary of what you have evaluated and surmised so far.
I am currently building a kernel module that exposes a character device for accessing the MIC performance counters from the host (using the MMIO register addresses from the System Developer's Guide), so that I can confirm your findings for myself. Needless to say, this is reinventing the wheel, but I need it verified on my own terms, as it is part of my ongoing research.
The next step for me would be to add a memory-controller-aware allocation scheme to the kernel module: upon a userspace request with a memory size and a cpuset-like affinity mask as parameters, the module would return the starting userspace VA of a VA-contiguous memory region of size not less than the amount requested. (Hints as to whether the allocation was optimal or simply a best guess under the given affinity configuration would be useful.) I also intend to implement userspace microbenchmarks running on the Xeon Phi that measure the ping-pong latency on a shared cache line mapped to a specific memory controller, which could hopefully give hints as to which TD the line maps to. A remote TD access would obviously add extra multi-hop messages on the address and acknowledgement rings of the bidirectional ring interconnect.
The challenging part for me would be to find out whether kernel-level allocations can be done with a physical-address requirement, and whether pages that are fragmented at cache-line granularity across different memory controllers can be mapped to a contiguous virtual address space. Also, I wonder whether the TSC precision is sufficient to detect the subtle latency differences between configurations.
Again, thank you for your valuable input. I will post notable discoveries here as I progress.
I also thank Mr. Powers for his helpful comment.
Jun
For reading the memory controller performance counters I just opened /dev/mem (as root) and used mmap() to provide a pointer to the address range used by the memory controller control registers and counters. This allowed me to do all my experiments in user space, which I find much easier than doing code development in kernel space.
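A minimal sketch of that approach is below. The base address and counter offset are placeholders (BOX_MMIO_BASE, COUNTER_OFFSET are names I made up); the real GBox/FBox register block base and counter offsets have to come from the System Developer's Guide, and the program must run as root to open /dev/mem.

    #include <fcntl.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Placeholder values -- substitute the real register block base and the
     * offset of the counter of interest from the System Developer's Guide. */
    #define BOX_MMIO_BASE   0x0ULL      /* physical base of the register block */
    #define BOX_MMIO_SIZE   0x1000      /* size to map                          */
    #define COUNTER_OFFSET  0x0         /* offset of one performance counter    */

    int main(void)
    {
        int fd = open("/dev/mem", O_RDONLY | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        /* Map the MMIO page holding the memory controller counters. */
        volatile uint8_t *base = mmap(NULL, BOX_MMIO_SIZE, PROT_READ,
                                      MAP_SHARED, fd, BOX_MMIO_BASE);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        uint64_t before = *(volatile uint64_t *)(base + COUNTER_OFFSET);
        /* ... issue the loads you want to attribute to a controller here ... */
        uint64_t after  = *(volatile uint64_t *)(base + COUNTER_OFFSET);

        printf("counter delta: %" PRIu64 "\n", after - before);
        munmap((void *)base, BOX_MMIO_SIZE);
        close(fd);
        return 0;
    }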
You may need a kernel module to set CR4.PCE so that you can execute the RDPMC instruction in user space. I did not use it for these tests, but most of my test codes use this approach.
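For completeness, once CR4.PCE is set, reading a counter from user space looks something like the fragment below (programming the counter to the desired event is a separate step, not shown here):

    #include <stdint.h>

    /* Read performance counter 'ctr' with RDPMC.  In user space this faults
     * (#GP, delivered as SIGSEGV) unless CR4.PCE has been set, e.g. by a
     * small kernel module. */
    static inline uint64_t rdpmc(uint32_t ctr)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(ctr));
        return ((uint64_t)hi << 32) | lo;
    }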
To determine the locations of the memory controllers relative to the cores, I ran a version of the "lat_mem_rd" pointer-chasing code with a pointer chain composed of addresses from a single memory controller. The RDTSC instruction on Xeon Phi is very fast (~5 cycles), so it can be used to time the latency of individual loads -- you just need to be careful that the array used to hold the timing data is already "dirty" in the L1 cache before you start.
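A heavily stripped-down version of that timing loop might look like the sketch below. The chain here is just a dummy fixed-stride permutation; in the real experiment the chain would be built only from addresses that the mapping table further down places on a single controller.

    #include <stdint.h>
    #include <stdio.h>

    #define NLINES   (1 << 16)   /* 4 MiB chain -- larger than the 512 KiB L2     */
    #define NSAMPLES 2048        /* timing records -- small enough to stay in L1  */

    struct line { struct line *next; char pad[64 - sizeof(struct line *)]; };

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        static struct line chain[NLINES];
        static uint64_t lat[NSAMPLES];

        /* Dummy chain: a fixed-stride permutation of the cache lines. */
        for (int i = 0; i < NLINES; i++)
            chain[i].next = &chain[(i + 617) % NLINES];  /* 617 is odd, coprime with 2^16 */

        /* Warm the timing array so it is already dirty in L1 before measuring. */
        for (int i = 0; i < NSAMPLES; i++)
            lat[i] = 0;

        struct line *p = &chain[0];
        for (int i = 0; i < NSAMPLES; i++) {
            uint64_t t0 = rdtsc();
            p = p->next;                 /* one dependent load */
            uint64_t t1 = rdtsc();
            lat[i] = t1 - t0;
        }

        for (int i = 0; i < NSAMPLES; i++)
            printf("%d %llu\n", i, (unsigned long long)lat[i]);
        return (p == NULL);              /* keep the chase from being optimized away */
    }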
The 62-cache-line blocks extend through the entire physical address range, so 4 KiB virtual pages will have a uniform distribution of 2, 4, 6, ..., 62 cache lines associated with a single memory controller. The next block of 62 cache lines will be placed on a memory controller that is not adjacent to the current controller -- there is a repeating hash of 8 blocks across channel 0 of each of the eight memory controllers, then the next 8 blocks are hashed to channel 1 of the same permutation of memory controllers, and then the pattern repeats.
So if you want 90% of the data to be associated with a particular controller, you only have a small fraction of the 4 KiB pages to work with -- those with 58, 60, or 62 lines allocated to the target controller. From the die photo (and my results), it looks like the memory controllers are placed along the ring in pairs, so from a locality perspective you should be able to treat the system as having 4 memory domains. Using the numbering from the memory controller performance counters, the pairs are (0,1),(2,3),(4,5),(6,7).
On the system I tested (Xeon Phi SE10P) the memory controller mapping was pretty simple. Just take the physical address, divide by 62, and look up the memory controller and channel from the low-order 4 bits of the result. Note that none of the results below transition from one memory controller to a physically adjacent memory controller. It is relatively easy to turn the hash in the table below into a bit function, but I find it easier to just use a table lookup.
select = (PhysAddr/62) & 0xF
select MemCntrl Channel
0 3 0
1 4 0
2 2 0
3 5 0
4 1 0
5 6 0
6 0 0
7 7 0
8 3 1
9 4 1
A 2 1
B 5 1
C 1 1
D 6 1
E 0 1
F 7 1
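Here is one way to wrap that table in code. The interpretation is mine: reading the formula together with the 62-cache-line block description above, this sketch assumes "PhysAddr/62" means the physical address in cache-line units (PhysAddr >> 6) divided by 62, i.e. the block number, and that the table applies to the ECC-enabled SE10P mapping described in the post.

    #include <stdint.h>
    #include <stdio.h>

    /* select -> (memory controller, channel), transcribed from the table above.
     * The channel is just bit 3 of select; the controller pattern repeats. */
    static const uint8_t mc_of_select[16] = { 3, 4, 2, 5, 1, 6, 0, 7,
                                              3, 4, 2, 5, 1, 6, 0, 7 };
    static const uint8_t ch_of_select[16] = { 0, 0, 0, 0, 0, 0, 0, 0,
                                              1, 1, 1, 1, 1, 1, 1, 1 };

    /* Assumed reading of the post: take the cache-line number (PhysAddr >> 6),
     * divide by 62 to get the block number, and use its low-order 4 bits. */
    static void map_address(uint64_t phys_addr, int *mc, int *ch)
    {
        uint64_t block  = (phys_addr >> 6) / 62;
        unsigned select = block & 0xF;
        *mc = mc_of_select[select];
        *ch = ch_of_select[select];
    }

    int main(void)
    {
        int mc, ch;
        /* Walk 32 consecutive 62-line blocks: two full passes of the pattern. */
        for (uint64_t addr = 0; addr < 32 * 62 * 64; addr += 62 * 64) {
            map_address(addr, &mc, &ch);
            printf("block starting at 0x%08llx -> MC %d, channel %d\n",
                   (unsigned long long)addr, mc, ch);
        }
        return 0;
    }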
