Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Physical Address Mapping in a Non-Power of Two Cache

Waqar_A_
Beginner

Hi,

    I am trying to figure out the physical address mapping on my platform in order to implement cache coloring in a Linux system. The platform uses an Intel Xeon E5-2608L v3 processor, which has a 15 MB, 20-way set-associative L3 (last-level) cache. Since 15 MB is not a power of two, I am having trouble working out the set-index bits in the physical address of a memory access for this cache. Here is the calculation I am doing:

system_page_size_kb = 4
llc_size_kb         = 15360
llc_ways            = 20
llc_line_size_bytes = 64

llc_sets = (llc_size_kb * 1024) / llc_ways / llc_line_size_bytes

// This gives llc_sets = 12288 (12 x 1024)

llc_colors = llc_size_kb / llc_ways / system_page_size_kb

// This gives llc_colors = 192
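
To double-check the arithmetic, here is the same calculation as a small C program (just the numbers above, nothing platform-specific):

#include <stdio.h>

int main(void)
{
    const long page_size_kb = 4;      /* 4 KiB pages          */
    const long llc_size_kb  = 15360;  /* 15 MB L3             */
    const long llc_ways     = 20;     /* 20-way associative   */
    const long line_bytes   = 64;     /* 64-byte cache lines  */

    long sets   = llc_size_kb * 1024 / llc_ways / line_bytes;  /* 12288 */
    long colors = llc_size_kb / llc_ways / page_size_kb;       /*   192 */

    printf("llc_sets = %ld, llc_colors = %ld\n", sets, colors);
    return 0;
}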

Thus the LLC has 12K sets and 192 colors. Here I run into trouble.

For a 32-bit physical address, the bits can be divided in the following way:

                       Physical Address
 31                           12 11                          0
 --------------------------------------------------------------
|       Page Frame Number       | Page Offset = lg(page_size) |
 --------------------------------------------------------------
                    |   Color   |
                    | selection |
                    |   bits    |
                    | lg(192) ? |
 --------------------------------------------------------------
|   Tag Bits   |  Index Bits = lg(num_sets)  |  Block Offset  |
 --------------------------------------------------------------
 31     (19 or 20)?         lg(12K) = 13.58? 6               0

                     Address Mapping in Cache
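
To make the problem concrete: if the number of sets were a power of two, the index would just be a shift and a mask, as in the sketch below, and that is exactly what breaks down for 12K sets:

#include <stdint.h>

#define LINE_OFFSET_BITS 6  /* 64-byte cache lines */

/* Standard indexing for a power-of-two number of sets: shift off the
 * block offset and mask.  With num_sets = 12288, (num_sets - 1) is not
 * of the form 2^n - 1, so the index is not a contiguous bit field and
 * cannot be read directly off the address. */
static inline uint64_t set_index(uint64_t paddr, uint64_t num_sets)
{
    return (paddr >> LINE_OFFSET_BITS) & (num_sets - 1);
}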

Question-1:

How can I determine the exact number of set index bits in the LLC?

Question-2(a):

If I assume cache slicing, with each core (the system has 6 physical cores) having its own dedicated cache slice, I get 192 / 6 = 32 colors available for cache coloring, which is a power of two but much smaller than 192. If my guess about cache slicing is correct, is it true that there are only 32 cache colors available in this system?

Question-2(b):

And in this case, do I need to use bits 12, 13, 14, 15, and 16 for cache coloring?

I would be grateful for any help!

Thanks!

5 Replies
McCalpinJohn
Honored Contributor III

The hash of physical addresses to L3 "slice" is not documented, and it requires a fair amount of work to determine it using the CBo performance counters. One paper that reports such an investigation for processors with a power-of-two number of L3 slices is

http://www.s3.eurecom.fr/docs/raid15_maurice.pdf

Each of the "slices" of the L3 cache is 2.5 MiB, 20-way associative, which corresponds to 128 4KiB pages -- so at least part of the calculation is straightforward.   Unfortunately, the 64 cache lines in each 4KiB physical page are going to be hashed across the 6 L3 slices using the undocumented hash.  64 items taking one of 6 values allows for 64 billion combinations.  I don't know how many of these combinations are actually generated by the address hash, but it is likely to be a large number.  It may be possible to use the results of the paper above (Table 3) to determine how many different combinations of L3 slices are used across the set of 4KiB physical addresses available in your system.

Another consideration is the difference between the L2 and L3 mappings.  Page coloring for the L3 cache won't be the same as page coloring for the L2 cache.   A page coloring allocator for the Level 2 cache would only need to deal with 8 colors (256KiB/8-ways/4KiB), and there appears to be no undocumented hash on these addresses.
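
For the L2 case, the color computation would be the conventional one. A sketch (the helper name is mine), assuming power-of-two indexing with no hash, which appears to hold for the private L2s:

#include <stdint.h>

/* 256 KiB / 8 ways / 4 KiB pages = 8 colors, so bits 14:12 of the
 * physical address select the color (assuming conventional indexing). */
static inline unsigned l2_color(uint64_t paddr)
{
    return (unsigned)((paddr >> 12) & 0x7);
}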

Waqar_A_
Beginner

Thanks for the response. I studied the paper you mentioned, and I understand the point you made about the complex addressing in the Intel LLC. However, if I assume that the hash function Intel uses distributes physical addresses evenly across the cache slices, then I can in principle use the 5 bits (bits 12-16) of the physical address for page coloring. Is that correct?
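
In code, the assumption I am asking about would look like this (a hypothesis only, since the slice hash is undocumented):

#include <stdint.h>

/* HYPOTHESIS: treat bits 16:12 as a 5-bit page color (32 colors),
 * assuming the undocumented hash spreads addresses evenly across the
 * 6 L3 slices.  This is exactly the assumption in question. */
static inline unsigned l3_color_guess(uint64_t paddr)
{
    return (unsigned)((paddr >> 12) & 0x1f);
}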

Waqar_A_
Beginner

Also, regarding page coloring for the L2 cache: the L2 cache size in my system is 1536 KB, which is again not a power of two. 1536 KiB / 8 ways / 4 KiB leads to 48 colors, so how should I select the page-coloring bits for the L2 cache?

McCalpinJohn
Honored Contributor III

The L2 of 1536 KiB is really six separate private L2 caches of 256 KiB each. As far as I know they are indexed exactly as you would expect an 8-way associative 256 KiB cache to be indexed, with 8 colors.

At the L3 level it is much harder to understand how to think about the cache allocation in terms of "colors".   Each cache line address is mapped to exactly one L3 slice, which is 20-way set-associative for that address.  But the "zero line" in each 4KiB page can be mapped to any of the 6 slices, depending on *all* of the upper address bits.   So depending on these upper address bits, the effective associativity of any particular offset in a set of 4KiB pages will be 20-way, 40-way, 60-way, 80-way, 100-way, or 120-way, depending on how many L3 slices the different pages are mapped to.   The hash is reasonably uniform, but without a closed-form expression, it is not possible to get the kind of deterministic behavior that page coloring is intended to provide.  It may reduce variability and increase effective cache capacity to color on bits 16:12, but because of the undocumented L3 slice select, this would be a modest improvement in the statistics rather than a full elimination of cache conflicts.
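
To spell out the associativity arithmetic above (an illustration of the counting, not a statement about the hash):

/* Lines at a given page offset that land in k of the 6 L3 slices
 * compete in k * 20 ways in total (k = 1..6 -> 20..120 ways). */
static inline int effective_ways(int slices_used)
{
    return 20 * slices_used;
}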

I have been experimenting with the distribution of lines using 2MiB pages on a 12-core Xeon E5 v3 system.  For the 32768 cache lines in each 2 MiB, I typically see that either (1) 8 of the 12 L3 slices have 1.6% fewer lines than the expected value of 32768/12 and 4 of the 12 slices have 3.2% more lines than the expected value, or (2) 8 of the 12 slices have 1.6% more lines than the expected value and 4 slices have 3.2% fewer lines than expected.   Random 4KiB pages may have, in aggregate, either more or less variability than this.

These tests only show the bulk allocation of lines to L3 slices, and don't say anything about where in the 256KiB (4096 cache line) address space these addresses land. It is easy enough to test any specific hypothesis about the mapping of cache lines to the congruence classes in each L3 slice, but if the easy guesses don't prove to be correct, the number of other possible hashes that would need to be considered is nearly unbounded.

Waqar_A_
Beginner

Thank you very much for the detailed response. I understand your point now.
