
Intel Community › Software Development Topics › Intel® Moderncode for Parallel Architectures

Physical Address Mapping in a Non-Power of Two Cache


Waqar_A_

Beginner


05-27-2016
09:45 AM


Hi,

I am trying to figure out the physical address mapping on my platform in order to implement cache coloring in a Linux system. The platform uses an Intel Xeon E5-2608L v3 processor, which has a 15 MB, 20-way set-associative L3 (last-level) cache. Since 15 MB is not a power of two, I am having trouble identifying the set-index bits in the physical address of a memory access for this cache. Below are the details of my calculation:

system_page_size_kb = 4
llc_size_kb         = 15360
llc_ways            = 20
llc_line_size_bytes = 64
llc_sets   = llc_size_kb * 1024 / llc_ways / llc_line_size_bytes  // This gives us llc_sets = 12 x 1024
llc_colors = llc_size_kb / llc_ways / system_page_size_kb         // This gives us llc_colors = 192

Thus the LLC has 12K sets and 192 colors. Here I run into trouble.
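The arithmetic above can be double-checked with a short script (the variable names mirror the calculation; nothing here is specific to the Xeon beyond the cache parameters already stated):

```python
# Cache-geometry arithmetic for the 15 MB, 20-way L3 described above.
llc_size_bytes = 15360 * 1024      # 15 MB total L3 capacity
llc_ways = 20                      # associativity
line_size = 64                     # cache line size in bytes
page_size = 4096                   # 4 KiB system pages

# Sets = capacity / associativity / line size; colors = capacity / associativity / page size.
llc_sets = llc_size_bytes // llc_ways // line_size
llc_colors = llc_size_bytes // llc_ways // page_size

print(llc_sets)    # 12288 (12K sets)
print(llc_colors)  # 192 colors
```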

For a 32-bit memory address, the bits can be divided in the following way:

Physical Address:

     31                             12 11                          0
     ----------------------------------------------------------------
     |       Page Frame Number        | Page Offset = lg(page_size)  |
     ----------------------------------------------------------------
                          | Color     |
                          | selection |
                          | bits      |
                          | lg(192)?  |
     ----------------------------------------------------------------
     |  Tag Bits   |  Index Bits = lg(num_sets)      | Block Offset  |
     ----------------------------------------------------------------
     31       (19 or 20)?           lg(12K) = 13.58  6              0

Address mapping in the cache

**Question-1:**

How can I determine the exact number of set index bits in the LLC?

**Question-2(a):**

If I assume cache slicing, with each core (the system has 6 physical cores) having its own dedicated cache slice, I get 192 / 6 = 32 colors available for cache coloring, which is a power of two but much smaller than 192 colors. If my guess about cache slicing is correct, is it true that there are only 32 cache colors available on this system?

**Question-2(b):**

And in that case, do I need to use bits 12, 13, 14, 15, and 16 for cache coloring?

I would be grateful for any help!

Thanks!


Accepted Solutions

McCalpinJohn

Black Belt


06-02-2016
04:47 PM


The 1536 KiB of L2 is actually six separate private L2 caches of 256 KiB each. As far as I know, they are indexed exactly as you would expect an 8-way-associative 256 KiB cache to be indexed, with 8 colors.

At the L3 level it is much harder to understand how to think about the cache allocation in terms of "colors". Each cache line address is mapped to exactly one L3 slice, which is 20-way set-associative for that address. But the "zero line" in each 4KiB page can be mapped to any of the 6 slices, depending on *all* of the upper address bits. So depending on these upper address bits, the effective associativity of any particular offset in a set of 4KiB pages will be 20-way, 40-way, 60-way, 80-way, 100-way, or 120-way, depending on how many L3 slices the different pages are mapped to. The hash is reasonably uniform, but without a closed-form expression, it is not possible to get the kind of deterministic behavior that page coloring is intended to provide. It may reduce variability and increase effective cache capacity to color on bits 16:12, but because of the undocumented L3 slice select, this would be a modest improvement in the statistics rather than a full elimination of cache conflicts.
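A minimal sketch of what coloring on bits 16:12 would look like (the helper below is hypothetical, not from any kernel; the 5 bits yield the 32 colors discussed above, and, as noted, the undocumented slice hash means this only improves the statistics rather than eliminating conflicts):

```python
PAGE_SHIFT = 12          # 4 KiB pages
COLOR_BITS = 5           # physical address bits 16:12 -> 2**5 = 32 colors

def page_color(phys_addr: int) -> int:
    """Return the color (0..31) of the 4 KiB page containing phys_addr,
    using physical address bits 16:12 as discussed above."""
    return (phys_addr >> PAGE_SHIFT) & ((1 << COLOR_BITS) - 1)

# Pages 128 KiB (2**17 bytes) apart repeat the same color...
assert page_color(0x00000) == page_color(0x20000)
# ...while pages 4 KiB apart get consecutive colors.
assert page_color(0x1000) == 1
```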

I have been experimenting with the distribution of lines using 2MiB pages on a 12-core Xeon E5 v3 system. For the 32768 cache lines in each 2 MiB, I typically see that either (1) 8 of the 12 L3 slices have 1.6% fewer lines than the expected value of 32768/12 and 4 of the 12 slices have 3.2% more lines than the expected value, or (2) 8 of the 12 slices have 1.6% more lines than the expected value and 4 slices have 3.2% fewer lines than expected. Random 4KiB pages may have, in aggregate, either more or less variability than this.
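A quick arithmetic check shows that the two observed patterns are each consistent with the total line count (the percentages are the measured deviations quoted above; the expected value per slice is 32768/12 ≈ 2730.67):

```python
# Verify the two observed deviation patterns sum back to 32768 lines.
expected = 32768 / 12

# Pattern (1): 8 slices at -1.6%, 4 slices at +3.2%
total1 = 8 * expected * (1 - 0.016) + 4 * expected * (1 + 0.032)
# Pattern (2): 8 slices at +1.6%, 4 slices at -3.2%
total2 = 8 * expected * (1 + 0.016) + 4 * expected * (1 - 0.032)

assert round(total1) == 32768
assert round(total2) == 32768
```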

These tests only show the bulk allocation of lines to L3 slices, and don't say anything about where in the 256 KiB (4096 cache line) address space these addresses land. It is easy enough to test any specific hypothesis about the mapping of cache lines to the congruence classes in each L3 slice, but if the easy guesses don't prove to be correct, the number of other possible hashes that would need to be considered is nearly unbounded.

5 Replies

McCalpinJohn

Black Belt


05-30-2016
09:59 AM


The hash of physical addresses to L3 "slice" is not documented, and it takes a fair amount of work to determine the hash using the CBo performance counters. One paper that investigates this for processors with a power-of-two number of L3 slices is

http://www.s3.eurecom.fr/docs/raid15_maurice.pdf

Each of the "slices" of the L3 cache is 2.5 MiB, 20-way associative, which corresponds to 128 4KiB pages -- so at least part of the calculation is straightforward. Unfortunately, the 64 cache lines in each 4KiB physical page are going to be hashed across the 6 L3 slices using the undocumented hash. 64 items each taking one of 6 values allows for 6^64 possible combinations. I don't know how many of these combinations are actually generated by the address hash, but it is likely to be a large number. It may be possible to use the results of the paper above (Table 3) to determine how many different combinations of L3 slices are used across the set of 4KiB physical addresses available in your system.
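For power-of-two slice counts, the paper linked above recovers hash functions that are XORs of selected physical address bits. As an illustration of that structure only (the masks below are made up for the example, not the real, undocumented Intel hash, and the 6-slice case is not a simple linear hash like this):

```python
# Illustrative slice hash in the style recovered by Maurice et al. for a
# power-of-two number of slices. The masks are HYPOTHETICAL placeholders,
# not the actual Intel hash. Each mask selects the address bits whose
# parity (XOR) produces one bit of the slice index.
HYPOTHETICAL_MASKS = [
    0x140,   # bits XORed together to form slice-index bit 0
    0x280,   # bits XORed together to form slice-index bit 1
]

def slice_of(phys_addr: int) -> int:
    """Map a physical address to a slice index (0..3 with two hash bits)
    by XOR-folding the address bits selected by each mask."""
    s = 0
    for i, mask in enumerate(HYPOTHETICAL_MASKS):
        parity = bin(phys_addr & mask).count("1") & 1
        s |= parity << i
    return s
```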

Another consideration is the difference between the L2 and L3 mappings. Page coloring for the L3 cache won't be the same as page coloring for the L2 cache. A page coloring allocator for the Level 2 cache would only need to deal with 8 colors (256KiB/8-ways/4KiB), and there appears to be no undocumented hash on these addresses.
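For the L2, the color computation follows directly from the arithmetic above: 256 KiB / 8 ways = 32 KiB per way, i.e. 8 pages, so physical address bits 14:12 select the color (a sketch; the helper name is made up):

```python
# L2 page-coloring arithmetic: 256 KiB, 8-way, 4 KiB pages -> 8 colors.
L2_SIZE = 256 * 1024
L2_WAYS = 8
PAGE_SIZE = 4096

l2_colors = L2_SIZE // L2_WAYS // PAGE_SIZE   # 8 colors
assert l2_colors == 8

def l2_page_color(phys_addr: int) -> int:
    """L2 color of the page containing phys_addr: bits 14:12 (3 bits)."""
    return (phys_addr >> 12) & (l2_colors - 1)
```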

Waqar_A_

Beginner


06-01-2016
10:56 PM


Waqar_A_

Beginner


06-01-2016
11:08 PM



Waqar_A_

Beginner


06-02-2016
05:14 PM


Thank you very much for the detailed response. I understand your point now.
