Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Topology, Bandwidth, and Latency of the Cache Coherence Network

Chris_W_1
Beginner

Hello,

I'm trying to build a "topology graph" for an Intel CPU (in this case an i7-3770, but I may need to do the same for other CPUs), where each logical core is represented by a vertex, and edges indicate communication channels between cores.  These edges should be labeled with bandwidth and latency values.

I can see two ways of generating such a graph.  The first would be to assume that the graph is fully connected, and then obtain edge labels by using cache access latency and bandwidth values (e.g. logical cores sharing the same physical core would be connected via the L1 cache, and logical cores on different physical cores would have edge labels based on L3 access times).
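
To make the first method concrete, here is a rough sketch in C of the kind of graph I have in mind.  The latency/bandwidth constants and the logical-to-physical core mapping are placeholders rather than measured values; on Linux the real sibling information can be read from /sys/devices/system/cpu/cpuN/topology/core_id.

/*
 * Method 1 sketch: fully connected graph over logical cores, with edge labels
 * taken from the shared cache level.  All numbers below are placeholders.
 */
#include <stdio.h>

#define NLOGICAL 8   /* i7-3770: 4 physical cores x 2 hyperthreads */

typedef struct { double latency_ns; double bandwidth_GBps; } edge_label;

int main(void)
{
    /* placeholder logical->physical mapping; read the real one from sysfs */
    int core_id[NLOGICAL] = {0, 1, 2, 3, 0, 1, 2, 3};

    /* placeholder labels -- substitute benchmarked L1 and L3 numbers */
    const edge_label via_l1 = { 1.3, 100.0};  /* SMT siblings share the L1 */
    const edge_label via_l3 = {10.0,  30.0};  /* cross-core traffic goes through the L3 */

    edge_label graph[NLOGICAL][NLOGICAL];

    for (int i = 0; i < NLOGICAL; i++)
        for (int j = 0; j < NLOGICAL; j++) {
            if (i == j)
                graph[i][j] = (edge_label){0.0, 0.0};   /* no self edge */
            else if (core_id[i] == core_id[j])
                graph[i][j] = via_l1;                   /* same physical core */
            else
                graph[i][j] = via_l3;                   /* different physical cores */
        }

    printf("edge 0->4: %.1f ns, %.1f GB/s\n",
           graph[0][4].latency_ns, graph[0][4].bandwidth_GBps);
    return 0;
}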

The second method, which seems like it would be more representative of the reality of the hardware, would be to use information about the cache coherence network when building the graph.  From my understanding, Intel chips use a ring topology for this network, which is quite different from the fully-connected network I would have to use if I followed the first method.  However, I have been unable to find latency and bandwidth values for the data lines (as opposed to the channels reserved for management messages) of the cache coherence network.  Is this data publicly available and, if so, where might I find it?

This second approach raises the question of what values to use when labeling edges between logical cores that share the same physical core.  My first guess would be to use the L1 cache latency and bandwidth, but I'm not sure if that would be the best solution.  Does anyone have any suggestions on what might be a better source of values for these edges?

Thanks,

Chris Wailes

3 Replies
McCalpinJohn
Honored Contributor III

Recent Intel architectures are not directly amenable to such a topological description.   For the ring-based Xeon E5 v1, v2, v3, v4, every physical address is mapped (via an undocumented hash) to one of the L3 slices, and mapped (via partially documented mechanisms) to a Home Agent/Memory Controller, and then to a DRAM channel.   The cache coherence protocol is not documented directly, but there is some information implicit in the system configuration registers (described in volume 2 of the datasheet for each product) and a lot of information implicit in the uncore performance counter documentation.

At the very least, the communication between one core and the L1 Data Cache of another core depends not only on the relative positions of the cores, but also on the position of the L3 slice (the "CBo") that owns the physical address being accessed.  If "N" is the number of cores (which is usually, but not always, equal to the number of L3 slices), then this expands the N^2 relations to N^3 relations.  With careful study of the uncore performance counters and carefully constructed and executed microbenchmarks, it is possible to determine which links are used for each triplet (Requesting Core, Core with data in L1/L2, L3 slice owning the address).  For a single 12-core Xeon E5 v3 processor, the tests required to determine this mapping generated about one million pieces of data (a sketch of the loop structure follows the list):

  • 12 different cores to request the data
  • 12 different cores whose L1 cache holds a modified copy of the data
  • 12 different L3 slices that can own the physical address being accessed
  • 16 different uncore "boxes" to measure traffic (12 CBos + 4 SBos)
  • 4 performance counters per uncore box
  • 10 trials
  • 12 x 12 x 12 x 16 x 4 x 10 = 1,105,920 measurements
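
The loop structure that generates those counts looks roughly like the sketch below.  The two commented-out helper calls are placeholders for the affinity, benchmarking, and uncore-PMU plumbing; only the enumeration itself is shown.

/*
 * Measurement enumeration sketch.  Each (requester, holder, slice, trial)
 * combination records 16 boxes x 4 counters = 64 values, for
 * 12 x 12 x 12 x 10 x 64 = 1,105,920 readings in total.
 */
#define NCORES  12
#define NSLICES 12
#define NBOXES  16    /* 12 CBos + 4 SBos */
#define NCTRS    4
#define NTRIALS 10

/* results[requester][holder][slice][trial][box][counter] */
static unsigned long long results[NCORES][NCORES][NSLICES][NTRIALS][NBOXES][NCTRS];

void run_all_measurements(void)
{
    for (int req = 0; req < NCORES; req++)
        for (int holder = 0; holder < NCORES; holder++)
            for (int slice = 0; slice < NSLICES; slice++)
                for (int trial = 0; trial < NTRIALS; trial++) {
                    /* placeholder: put a line owned by 'slice' into modified
                       state in 'holder's L1, then load it from 'req' while the
                       uncore counters in all 16 boxes are running */
                    /* prime_modified_line_in_l1(holder, slice);            */
                    /* load_line_and_read_uncore(req, slice,
                           &results[req][holder][slice][trial][0][0]);      */
                    (void)req; (void)holder; (void)slice; (void)trial;
                }
}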

First, I have to determine how cache line addresses are mapped to L3 slices.  So I allocate a single 2 MiB page and pin it.  For each of the 32,768 cache lines in the page, the code performs a few thousand repeated accesses to the same address (separated by CLFLUSH instructions), and uses the L3 CBo access counters to determine which CBo "owns" the physical address in question.  The code includes a number of "sanity checks" on these performance counter measurements, and repeats the measurement loop if any of these checks fail.  This routine determines the mapping of each of the 32,768 lines to the 12 L3 slices in about 5 seconds.  For each of the 12 L3 slices, the routine returns a vector of array indices for the first 8-Byte-aligned element in each cache line mapped to that L3 slice.  On the system I tested, 8 of the 12 L3 slices have 2688 cache lines mapped from the 2 MiB page, while 4 of the 12 L3 slices have 2816 cache lines.  This is a surprisingly large difference in loading (4.76%).  Presumably different 2 MiB pages have different "overloaded" slices, so that the mapping is uniform in a global sense.
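
A sketch of that mapping step is below.  The uncore programming details (MSR numbers, event encodings) are deliberately omitted: read_cbo_lookup_count() is a placeholder for whatever programs and reads an LLC-lookup-style counter in each CBo (e.g. through /dev/cpu/N/msr or the perf_event uncore interface), and the real code adds the sanity checks and retries described above.

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <emmintrin.h>        /* _mm_clflush */

#define PAGE_2M  (2u * 1024 * 1024)
#define LINE     64
#define NLINES   (PAGE_2M / LINE)   /* 32,768 cache lines */
#define NSLICES  12
#define REPS     2000

static int owner[NLINES];     /* which CBo owns each line of the 2 MiB page */

static uint64_t read_cbo_lookup_count(int cbo)
{
    /* placeholder: program and read a CBo lookup counter here */
    (void)cbo;
    return 0;
}

void map_lines_to_slices(void)
{
    /* one pinned 2 MiB huge page */
    volatile char *buf = mmap(NULL, PAGE_2M, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if ((void *)buf == MAP_FAILED)
        return;
    mlock((const void *)buf, PAGE_2M);
    memset((void *)buf, 1, PAGE_2M);

    for (size_t line = 0; line < NLINES; line++) {
        uint64_t before[NSLICES], after[NSLICES];

        for (int cbo = 0; cbo < NSLICES; cbo++)
            before[cbo] = read_cbo_lookup_count(cbo);

        /* hammer one address; each CLFLUSH forces the next load back to the L3/CBo */
        for (int r = 0; r < REPS; r++) {
            (void)buf[line * LINE];
            _mm_clflush((const void *)&buf[line * LINE]);
        }

        for (int cbo = 0; cbo < NSLICES; cbo++)
            after[cbo] = read_cbo_lookup_count(cbo);

        /* the CBo whose lookup count moved by roughly REPS owns this address */
        int best = 0;
        for (int cbo = 1; cbo < NSLICES; cbo++)
            if (after[cbo] - before[cbo] > after[best] - before[best])
                best = cbo;
        owner[line] = best;
    }
}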

Given these lists of addresses mapping to a single L3 slice, one can then perform standard pointer-chasing latency tests with full control over the data placement.  It is still important to disable the hardware prefetchers before doing these pointer-chasing tests, since an average of about 5 consecutive entries in the list will be mapped to the same 4 KiB page (4 KiB = 64 cache lines, and 64 cache lines / 12 L3 slices = 5.33 lines/slice).  One can also perform bandwidth measurements using the same lists, but the limitations of the processor make the results difficult to interpret.  Each core can only support about 10 outstanding L1 Data Cache misses, but more than 10 concurrent transfers are required to fully tolerate the L3 cache latency.  In rough numbers, the peak bandwidth is 32 Bytes/cycle and the L3 load latency (for clean data -- i.e., not dirty in another L1/L2) is about 42 cycles, giving a latency-bandwidth product of 1344 Bytes, or 21 cache lines.  The only way for a single core to get more than 10 cache lines in flight is by way of the L2 hardware prefetchers, but these won't limit their fetches to lines from the target L3 cache -- they will fetch from any address that looks like it is an extension of a "stream" of requests.  So you might get more bandwidth for the loads from the target L3, or you might get less.  I would guess the latter.  Without HW prefetch, the best you can get is 10 cache lines per latency, or about 640 Bytes / 42 cycles = 15.2 Bytes/cycle.  But this is a concurrency-limited value and does not tell you anything about the actual bandwidth.
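
For the pointer-chasing part, a minimal sketch is below.  It assumes 'page' is the pinned 2 MiB buffer, 'slice_lines' is the list of line indices that the mapping step assigned to the target L3 slice, and the hardware prefetchers have already been disabled.

#include <stdint.h>
#include <x86intrin.h>        /* __rdtscp */

#define LINE 64

/* build a chain: each chosen line holds a pointer to the next chosen line */
static void build_chain(char *page, const int *slice_lines, int n)
{
    for (int i = 0; i < n; i++) {
        void **slot = (void **)(page + (size_t)slice_lines[i] * LINE);
        int next = (i + 1 < n) ? slice_lines[i + 1] : slice_lines[0];
        *slot = page + (size_t)next * LINE;
    }
}

/* walk the chain and return average TSC ticks per load
   (convert to core cycles if the core is not running at the TSC frequency) */
static double chase(char *page, const int *slice_lines, int n, int laps)
{
    void *p = page + (size_t)slice_lines[0] * LINE;
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    for (long i = 0; i < (long)n * laps; i++)
        p = *(void * volatile *)p;          /* dependent load: one cache line per step */
    uint64_t t1 = __rdtscp(&aux);
    *(void * volatile *)page = p;           /* keep the walk from being optimized out */
    return (double)(t1 - t0) / ((double)n * laps);
}

(In practice the list would also be shuffled before building the chain, so the walk is not simply sequential through the page.)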

I did not run either the latency or bandwidth tests, but I did run a set of load tests for lines mapped to a single L3 while measuring the ring traffic at all 12 CBos and all 4 SBos.   This enabled me to determine the mapping of the core numbers and the L3 CBo numbers to the various boxes on the topology diagram -- e.g., figure 1-2 of the Xeon E5 v3 Uncore Performance Monitoring Guide.   The mapping is not even remotely intuitive, and the mapping of core numbers to physical locations is completely unrelated to the mapping of CBo numbers to physical locations.

All of this is straightforward, if unpleasantly labor-intensive.  Unfortunately, the general situation is worse.  I did my tests on a 12-core Xeon E5 v3 based on a 12-core die -- so all cores were active and all L3 slices were active.  Subsequent experiments on dies with disabled cores and L3 slices (e.g., a Broadwell 14-core processor based on a 15-core die) make it clear that there is another level of mapping underneath the core APIC numbers and the CBo numbers (implied by the MSR addresses used to access the CBos).  There are PCI configuration space registers showing "bit maps" of the enabled cores and enabled L3 slices.  These show 14 enabled cores and 14 enabled L3 slices on each chip, but the disabled L3 slice appeared in seven different positions across the set of 20 Xeon E5 v4 processors that I tested.  But the L3 CBos are still accessed (via MSRs) as if they were numbered 0-13 -- i.e., the missing L3 slice is "mapped over".  It looks like the same thing happens with cores, but I don't have that data at hand.  The net of all this is that there appears to be an unknown mapping function from a core's physical position to its APIC number, and an unknown mapping function from an L3 CBo's physical position to its implied slice number -- so without further information, every single chip needs to be reverse-engineered independently.  After doing a number of these, it might be possible to derive some rules for how the enabled/disabled bit maps relate to the physical layout, but it seems like an awful lot of work, and it would be difficult to know whether all of the patterns might be swept away by a BIOS change.

Gregg_S_Intel
Employee

   > it would be difficult to know whether all of the patterns might be swept away by a BIOS change

Indeed, the patterns very well could be swept away by a BIOS change.  This is a primary reason why the core layout is not published, and why the recommendation is not to rely on a core's physical location for tuning.

 

Chris_W_1
Beginner

Thank you both for your responses.  This clarified a lot of things for me, and I now think modeling the processor as a mesh (with latency and bandwidth values determined through benchmarking) is the best method for my research.
