Shared vs Unshared L3 hits?

dsf423 · 06-30-2014

I'm new to performance monitoring and I want to make sure I understand everything that goes into calculating the L2 hit ratio.

In the source code for the Intel PCM software, the L2 hit ratio is calculated as follows:

uint64 hits = L2Hit;
uint64 all = L2Hit + L2HitM + L3UnsharedHit + L3Miss;
if (all) return double(hits) / double(all);
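
For readers who want to experiment with the formula outside PCM, here is a minimal standalone sketch of the same calculation. The function name `l2HitRatio`, the parameter list, and the fallback value returned when nothing was counted are assumptions made for illustration, not taken from the PCM source; `std::uint64_t` stands in for PCM's `uint64`.

```cpp
#include <cstdint>

// Hypothetical standalone version of the ratio quoted above.
// The four inputs are counter deltas over one measurement interval.
double l2HitRatio(std::uint64_t l2Hit, std::uint64_t l2HitM,
                  std::uint64_t l3UnsharedHit, std::uint64_t l3Miss)
{
    const std::uint64_t hits = l2Hit;
    const std::uint64_t all = l2Hit + l2HitM + l3UnsharedHit + l3Miss;
    if (all) return double(hits) / double(all);
    return 1.0; // assumption: report a perfect ratio when no events were counted
}
```

The denominator only covers the event types listed, so anything those events do not count is invisible to the ratio.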

The variable name L3UnsharedHit seems to imply that there's something else called an L3SharedHit, which would presumably happen when a load request misses in the L2 but is present in a separate socket's L3 cache. Is there such a thing? Do modern processors with QPI derive any benefit from finding a cache line in another socket's L3, versus having to go out to memory?

Also, I assume the variable L2HitM means the number of misses in the L2, but that doesn't make sense with the variable's name. I haven't been able to track down exactly what event number and umask that corresponds to in the PMU. Is there a better interpretation?

Thanks for your time,
David

McCalpinJohn · 06-30-2014

None of this stuff is easy... :-(

There are two important caveats with any attempt to measure cache hit ratios on recent Intel processors:

1. The primary event used to count hits (MEM_LOAD_UOPS_RETIRED.L2_HIT) only counts demand loads that miss the L1 and hit in the L2. It does not count prefetches that bring the data from the L2 to the L1 in advance of the load. Whether that is what you want to count as a "hit rate" or not depends on whether you are thinking about spatial locality (prefetchability) or temporal locality (data re-use).

This event also does not count store misses (RFOs) that miss in the L1 and hit in the L2.

2. When using AVX 32-Byte loads, the MEM_LOAD_UOPS_RETIRED.L2_HIT counter never increments. Instead all L1 misses increment the MEM_LOAD_UOPS_RETIRED.HIT_LFB counter, which normally only increments when there are multiple loads that miss the L1 but point to (usually different parts of) the same cache line.

dsf423 · 06-30-2014

Thanks Dr. McCalpin. Do you happen to have any general resources for doing this kind of work?

Just out of curiosity, I'm trying to understand why a parallel program suffers performance degradation when I involve two processor sockets (as opposed to just one). For example, a program might run slower on 10 cores across two sockets than on five cores within a single socket. There are lots of general hand-wavy explanations, but I want to rigorously explain the behavior I see in this specific instance. Do you have any suggestions?

Thanks again,

David

TimP · 06-30-2014

I believe HitM stands for "hit modified"; that is, another core owns a modified copy of the cache line.

Sharing cache lines between threads on different sockets is not efficient. However, BIOS upgrades released while I was testing IvyTown improved this significantly.

I doubt you will be able to test this with any rigor unless you can set affinity and avoid dynamic scheduling.
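
As a rough illustration of setting affinity, the sketch below pins the calling thread to one logical CPU on Linux using `sched_setaffinity`. It assumes Linux with glibc; the helper names `pinCurrentThreadToCpu` and `firstAllowedCpu` are made up for this example. Which logical CPU numbers belong to which socket is system-specific (on Linux, `/sys/devices/system/cpu/cpuN/topology/physical_package_id` reports it), so the CPU numbers you pass in are up to you.

```cpp
#include <sched.h>  // sched_setaffinity, sched_getaffinity, CPU_* macros (Linux, glibc)

// Pin the calling thread to a single logical CPU. Returns true on success.
bool pinCurrentThreadToCpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Return the lowest-numbered CPU the calling thread is allowed to run on,
// or -1 if the affinity mask cannot be read.
int firstAllowedCpu()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set)) return cpu;
    return -1;
}
```

Each worker thread would call `pinCurrentThreadToCpu` once at startup, before the timed region, so that the one-socket and two-socket runs differ only in thread placement.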

Bernard · 06-30-2014

>>>Just out of curiosity, I'm trying to understand why a parallel program suffers performance degradation when I involve two processor sockets (as opposed to just one). For example, a program might run slower on 10 cores across two sockets than on five cores within a single socket.>>>

You should also take the NUMA distance issue into account.
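
On Linux, one way to inspect NUMA distances is to read the per-node `distance` file under sysfs. The sketch below assumes that interface is available; the function names `parseNumaDistances` and `readNumaDistances` are hypothetical. Each row holds relative distances from one node to every node, with the local node conventionally reported as 10.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Parse one whitespace-separated row of distances, e.g. "10 21".
std::vector<int> parseNumaDistances(const std::string& line)
{
    std::istringstream in(line);
    std::vector<int> distances;
    for (int value; in >> value; ) distances.push_back(value);
    return distances;
}

// Read the distance row for one NUMA node from sysfs (Linux only).
// Returns an empty vector if the file is missing or unreadable.
std::vector<int> readNumaDistances(int node)
{
    std::ifstream file("/sys/devices/system/node/node" +
                       std::to_string(node) + "/distance");
    std::string line;
    if (!file || !std::getline(file, line)) return {};
    return parseNumaDistances(line);
}
```

These values are relative hints supplied by the firmware, not measured latencies, so they are a starting point rather than a substitute for measurement.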

McCalpinJohn · 07-01-2014

Having a shared L3 makes core-to-core data sharing very fast. It is not quite as fast as getting unshared data from the L3, but the L3 knows which core has the modified data and is able to arrange for a fast cache-to-cache transfer. Table 2-10 in section 2.2.5.1 of the Intel Optimization Reference Manual (document 248966-029, March 2014) says that a "clean" L3 hit has a latency of 26-31 cycles (I measure an average of ~35 cycles for pointer-chasing code), while a hit that is "dirty" in another L1 or L2 on the same chip has a reported latency of 60 cycles (20 ns at 3.0 GHz).

Latency to modified cache lines in the other socket is much higher -- similar to the remote memory latency of ~135 ns. This gives a latency ratio of between 6:1 and 7:1 in favor of the shared cache configuration.
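
The arithmetic behind that ratio can be written out as a small sketch. The helper names `cyclesToNs` and `latencyRatio` are invented for illustration; the inputs are the figures quoted above (60 cycles at 3.0 GHz for a same-chip dirty hit, ~135 ns for the remote case).

```cpp
#include <cassert>

// Convert a latency in core cycles to nanoseconds at a given clock frequency.
double cyclesToNs(double cycles, double ghz)
{
    assert(ghz > 0.0);
    return cycles / ghz;  // 1 GHz is one cycle per nanosecond
}

// Ratio of remote (cross-socket) latency to local (same-chip) latency.
double latencyRatio(double remoteNs, double localNs)
{
    assert(localNs > 0.0);
    return remoteNs / localNs;
}
```

With those inputs, `cyclesToNs(60, 3.0)` gives 20 ns and `latencyRatio(135, 20)` gives 6.75, which falls inside the 6:1 to 7:1 range.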

The sharing can be either deliberate or accidental (false sharing). Given the intervention latency ratio of ~6.5:1, either case could account for 5 cores on 1 socket running faster than 10 cores on 2 sockets. It is often difficult to come up with an automated test to detect false sharing, but in general one looks for cache-to-cache transfer rates that increase very rapidly with thread count (much faster than a linear increase). You should be able to see this in both the one-socket and two-socket systems, but due to the high ratio of intervention latency between the two cases, you can have false sharing that is tolerable in the single-socket case and intolerable in the two-socket case.
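
To make the false-sharing case concrete, here is a minimal sketch of the usual experiment: two threads each update their own counter, once with the counters packed next to each other and once with each counter aligned to its own cache line. The 64-byte line size and the names `PackedCounters`, `PaddedCounters`, and `hammer` are assumptions for this example. The code only sets up the two layouts; timing `hammer` on each, with the threads pinned to one socket and then to two, is left to the reader.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

constexpr std::size_t kLineBytes = 64;  // assumed cache-line size

// Two counters that will usually land in the same cache line.
struct PackedCounters {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Each counter aligned to its own (assumed) cache line.
struct PaddedCounters {
    alignas(kLineBytes) std::atomic<std::uint64_t> a{0};
    alignas(kLineBytes) std::atomic<std::uint64_t> b{0};
};

// Run two threads, each incrementing only its own counter `iters` times.
template <class Counters>
void hammer(Counters& c, std::uint64_t iters)
{
    auto work = [iters](std::atomic<std::uint64_t>& x) {
        for (std::uint64_t i = 0; i < iters; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(work, std::ref(c.a));
    std::thread t2(work, std::ref(c.b));
    t1.join();
    t2.join();
}
```

Neither thread ever reads the other's counter, so any slowdown of the packed layout relative to the padded one comes from the cache line moving back and forth between cores, not from logical data sharing.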

dsf423 · 07-01-2014

Thanks again everyone. I'll do some digging with this and see what I come up with.

David