Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Cache misses for sequential vs. random access patterns

Danilcha
Beginner
2,359 Views

Hello,

Could you please help me?

I'm measuring L3 cache misses and accesses when scanning a huge array vs. traversing a huge linked list whose nodes were randomized in memory beforehand. The element size is 32 bytes.

The performance difference is huge, 1–2 orders of magnitude, which is understandable. However, the cache miss measurements are not that different.

For the linked list it's 1.00 access and 1.00 miss per element, which is understandable.

For the array it is 0.45 accesses per element, 33% of which are misses. And that is strange. Why 0.45 and not 0.50? And why a 33% miss rate? It is as if the prefetcher only loads 3 lines upon an L3 miss. But in that case the performance difference wouldn't be 100x.

It is also clear that these counters do not explain the performance implications of L3 misses, because a miss during random memory access is apparently much more expensive than an L3 miss during a sequential scan. Do you know if there is a way to count 'real' L3 misses, where a full random RAM access and page table lookup are performed?
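In case it helps, here is roughly what my two kernels look like. This is a simplified sketch, not my real code; the names (Node, scan_array, build_random_list, chase_list) are just for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// 32-byte element, as in my experiment.
struct Node {
    Node*    next;      // used only by the linked-list variant
    uint64_t payload;
    uint64_t pad[2];    // pad the element to 32 bytes
};
static_assert(sizeof(Node) == 32, "element should be 32 bytes");

// Sequential scan: all load addresses are known in advance.
uint64_t scan_array(const std::vector<Node>& v) {
    uint64_t sum = 0;
    for (const Node& n : v) sum += n.payload;
    return sum;
}

// Link the nodes in a random order, so that traversal jumps around in memory.
Node* build_random_list(std::vector<Node>& v) {
    std::vector<std::size_t> order(v.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    for (std::size_t i = 0; i + 1 < order.size(); ++i)
        v[order[i]].next = &v[order[i + 1]];
    v[order.back()].next = nullptr;
    return &v[order.front()];
}

// Pointer chase: the next address is known only after the current load returns.
uint64_t chase_list(const Node* head) {
    uint64_t sum = 0;
    for (const Node* p = head; p != nullptr; p = p->next) sum += p->payload;
    return sum;
}
```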

6 Replies
McCalpinJohn
Honored Contributor III

If I understand what you are doing, the biggest difference between the two cases is concurrency.   When you are traversing a linked list (or chasing a pointer chain), you will only have one outstanding load at a time.  The next load can't be issued until the data from the current cache miss provides the address of the next one.

In contrast, scanning sequentially allows the processor to issue as many reads as the out-of-order hardware will support, since their addresses are all known in advance.  Most recent Intel processors support 10 L1 Data Cache misses, which should give roughly a factor of 10 improvement in performance.   Sequential accesses also enable the L2 streaming prefetcher to fetch data into the L3 and/or the L2 in advance, which reduces the average time per load further.

You will need to be very specific about what performance counter interface software you are using so that we can figure out exactly how the hardware performance counters are programmed.  There are lots of ways for the hardware to count certain events (especially cache transactions), and many of these counters have limitations, bugs, or definitions that may not line up with what you expect.
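If you want to program the counters at a low level yourself, one option on Linux is the perf_event_open() interface.  The sketch below uses the kernel's generic "last-level cache" read events; how those generic events map onto the actual L3 counters depends on the processor and the kernel version, so treat the absolute numbers with care:

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// perf_event_open() has no glibc wrapper, so call it via syscall().
static int perf_open(uint32_t type, uint64_t config) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.type           = type;
    attr.config         = config;
    attr.disabled       = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv     = 1;
    // Count for the calling thread, on any CPU.
    return static_cast<int>(syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0));
}

int main() {
    // Generic "last-level cache" read accesses and read misses.
    const uint64_t ll_read_access =
        uint64_t(PERF_COUNT_HW_CACHE_LL) |
        (uint64_t(PERF_COUNT_HW_CACHE_OP_READ) << 8) |
        (uint64_t(PERF_COUNT_HW_CACHE_RESULT_ACCESS) << 16);
    const uint64_t ll_read_miss =
        uint64_t(PERF_COUNT_HW_CACHE_LL) |
        (uint64_t(PERF_COUNT_HW_CACHE_OP_READ) << 8) |
        (uint64_t(PERF_COUNT_HW_CACHE_RESULT_MISS) << 16);

    int fd_access = perf_open(PERF_TYPE_HW_CACHE, ll_read_access);
    int fd_miss   = perf_open(PERF_TYPE_HW_CACHE, ll_read_miss);
    if (fd_access < 0 || fd_miss < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd_access, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd_miss,   PERF_EVENT_IOC_RESET, 0);
    ioctl(fd_access, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_miss,   PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the array scan or the pointer chase here ... */

    ioctl(fd_access, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_miss,   PERF_EVENT_IOC_DISABLE, 0);

    uint64_t accesses = 0, misses = 0;
    if (read(fd_access, &accesses, sizeof(accesses)) != (ssize_t)sizeof(accesses)) accesses = 0;
    if (read(fd_miss,   &misses,   sizeof(misses))   != (ssize_t)sizeof(misses))   misses   = 0;
    std::printf("LLC read accesses: %llu, misses: %llu\n",
                (unsigned long long)accesses, (unsigned long long)misses);
    return 0;
}
```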

You might also try running the cases with the hardware prefetchers disabled, as described at: https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
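If the Linux msr driver is available, a minimal sketch of toggling the control bits described in that article looks like the following (this assumes the documented layout of MSR 0x1A4 and requires root; the setting is per-core, so it has to be applied to every core you run on):

```cpp
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// MSR 0x1A4 per the article above: bit 0 = L2 streamer, bit 1 = L2 adjacent-line,
// bit 2 = DCU streamer, bit 3 = DCU IP prefetcher.  Setting a bit disables that
// prefetcher.  Needs root and the Linux msr driver ("modprobe msr").
constexpr uint32_t PREFETCH_CONTROL_MSR = 0x1A4;

bool set_prefetch_disable_bits(int cpu, uint64_t bits) {
    char path[64];
    std::snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDWR);
    if (fd < 0) return false;

    uint64_t value = 0;
    bool ok = pread(fd, &value, sizeof(value), PREFETCH_CONTROL_MSR) == (ssize_t)sizeof(value);
    if (ok) {
        value = (value & ~0xFULL) | (bits & 0xFULL);   // touch only bits 0-3
        ok = pwrite(fd, &value, sizeof(value), PREFETCH_CONTROL_MSR) == (ssize_t)sizeof(value);
    }
    close(fd);
    return ok;
}

// Example: set_prefetch_disable_bits(0, 0xF) disables all four prefetchers on
// CPU 0; set_prefetch_disable_bits(0, 0x0) re-enables them.
```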

Danilcha
Beginner

Dear John,

Thank you.

That's totally not how I imagined it. Here is how I pictured it: memory bandwidth on my machine is 50 GB/s, which allows reading one 32-byte element sequentially every ~0.7 ns. One random RAM access, on the other hand, takes about 70 ns (http://www.sisoftware.net/?d=qa&f=ben_mem_latency). This is because RAM is not really 'random' access memory and needs time to be set up for a new address. I then imagined the process this way: when I traverse the linked list, each jump costs one random RAM access, which gives 70 ns (this corresponds to my measurements). When I scan an array, the CPU understands that I am doing a scan and simply tells the RAM to read the bytes at the speed of the RAM bandwidth, and this is what is called "prefetch".

So I was wrong? Instead of one super-fast stream that reads RAM at the maximum bandwidth, we have 10 parallel outstanding requests that all read chunks from different (i.e. random) places? But in that case each request would still spend 70 ns before its data arrives, and the whole process would be limited to 70 ns / 10 = 7 ns per element. Yet the array scan is much faster than that, at least five times faster (though not as fast as the theoretical 0.7 ns).

McCalpinJohn
Honored Contributor III

Most systems with four DRAM channels cannot reach full bandwidth with a single thread, because a single thread can't generate enough concurrent accesses to "fill the pipeline".   For example, at 50 GB/s and 70 ns latency, Little's Law says that you need 50 GB/s * 70 ns = 3500 Bytes of memory references "in flight" at all times.  3500 Bytes is about 55 cache lines.    

A core can generate 10 L1 Data Cache misses (10 lines * 64 Bytes = 640 Bytes in flight), which is nowhere near enough concurrency to "fill the pipeline", so the sustained bandwidth is "concurrency-limited" in this case.   Rearranging Little's Law says that the average bandwidth should be 640 Bytes / 70 ns = 9.1 GB/s.  (Note that the L1 hardware prefetchers don't help because they share the same 10 cache miss buffers.)

The L2 hardware prefetchers help by bringing the data into the L3 and/or L2 caches in advance.  This reduces the average "latency" in the equation.  For example, if everything were fetched into the L3 in advance and the L3 latency is 12 ns, then Little's Law says that the bandwidth limit is 640 Bytes / 12 ns = 53 GB/s.    Obviously this can't be maintained if the data is being moved from memory to the L3 at no more than 50 GB/s, so with perfect prefetch the problem would be bandwidth-limited, not concurrency-limited.
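To make the arithmetic explicit, here are the same three Little's Law calculations written out as a tiny program:

```cpp
#include <cstdio>

// Little's Law for memory:  bytes in flight = bandwidth * latency,
// or equivalently  bandwidth = bytes in flight / latency.
int main() {
    const double line = 64.0;                            // cache line size in bytes

    // Concurrency needed to sustain 50 GB/s at 70 ns DRAM latency.
    double needed = 50e9 * 70e-9;                        // = 3500 bytes, ~55 lines
    std::printf("needed in flight: %.0f bytes (%.1f lines)\n", needed, needed / line);

    // Bandwidth with only 10 cache lines (640 bytes) in flight at 70 ns DRAM latency.
    std::printf("10 misses to DRAM: %.1f GB/s\n", 640.0 / 70e-9 / 1e9);   // ~9.1

    // Same 10 lines in flight, but hitting in the L3 at ~12 ns thanks to prefetch.
    std::printf("10 misses to L3:   %.1f GB/s\n", 640.0 / 12e-9 / 1e9);   // ~53
    return 0;
}
```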

In practice the details are more complex.  You can't prefetch everything, since the hardware prefetchers stop at the end of every 4 KiB page (and have to be restarted for the next 4 KiB page).   The prefetchers don't generate the 64 prefetches for a 4 KiB page instantaneously -- the number of prefetches starts at zero at the beginning of the 4 KiB page and is then ramped up as the hardware sees more and more loads for that page.  The prefetchers also have a limited number of outstanding transactions, though it is hard to find documentation at this level of detail.

For a code consisting of only contiguous reads, the combination of L1 misses and L2 prefetches typically gives 12-15 GB/s on 2-socket systems -- definitely better than the L1-concurrency-limited value, but not close to full DRAM bandwidth.   The best I have been able to achieve with a single thread on a Xeon E5-2670 (Sandy Bridge EP) system is about 18.4 GB/s, which corresponds to about 20 cache lines in flight at the minimum latency of ~67 ns.  
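For reference, a single-threaded contiguous-read kernel of the kind I am describing looks roughly like this -- just a sketch, since a careful measurement would pin the thread, repeat the pass several times, and make sure the compiler cannot eliminate the reads:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Time one pass of contiguous 64-bit reads over a buffer much larger than the
// L3 cache and report the achieved read bandwidth.
int main() {
    const size_t bytes = size_t(1) << 30;            // 1 GiB, far larger than any L3
    std::vector<uint64_t> buf(bytes / sizeof(uint64_t), 1);

    auto t0 = std::chrono::steady_clock::now();
    uint64_t sum = 0;
    for (uint64_t x : buf) sum += x;                 // contiguous reads only
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("sum=%llu  read bandwidth: %.1f GB/s\n",
                (unsigned long long)sum, double(bytes) / seconds / 1e9);
    return 0;
}
```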

Danilcha
Beginner

Dear John,

You're cool!

And you are right. I'm attaching the results of my experiments scanning an array of 40-byte elements ("gapless" on the plot) vs. traversing a linked list. The latency of getting one element with sequential access for large working sets is ~2.8 ns, which for 40-byte elements gives ~14.3 GB/s. This is on a 2-socket system with Xeon X5550 processors (32 GB/s memory bandwidth).

I'm now trying to imagine the whole picture of how everything works from the beginning. So, the core notices that I'm doing a sequential scan and can predict that I will soon need the next lines. Instead of asking for a whole page of memory, which could have been read at the speed of the memory bandwidth, it just issues 10 L1d cache misses for the next 10 lines, and these requests are independent. The L3 cache sees this and issues 10 simultaneous independent requests to RAM. The RAM also does not know that the lines are actually sequential and satisfies each request independently. For each request it spends 70 ns to set up the address and then ~2 ns to transfer the line to the CPU at the speed of the memory bandwidth. Somehow it does this simultaneously (how?!), so the total latency for setting up all 10 request addresses is just 70 ns / 10 = 7 ns. But then it also has to transfer all 10 lines, which takes 2 ns * 10 = 20 ns. And here my model stops making sense.

McCalpinJohn
Honored Contributor III

It is difficult to find out what is happening at the lowest levels of the hardware, but we can make some general observations.

For the contiguous case, the core will continue to fetch and issue instructions after a cache miss. This will include any additional loads whose addresses can be computed (which should be all of them in the sequential case).  Eventually the core will fill up its re-order buffer and will have to wait for the oldest instruction to complete -- this is typically the load that missed in the cache.   The number of cache misses that can be generated by this out-of-order execution depends on the details of the instruction stream, but is typically at least 3-4, and may be higher (but no more than 10).   

If the Data Cache streaming hardware prefetcher notices loads to consecutive addresses, it will issue hardware prefetches to the next cache line(s).  It is difficult to tell how many of these are issued -- only a few of the performance counters even mention L1 DC prefetches.  In any case these use the same 10 "Line Fill Buffers" as L1 Data Cache Demand misses, so the maximum number of L1 Data Cache misses "in flight" is 10 (for a single core).

The L2 hardware prefetchers operate completely asynchronously from the core and L1 Data Cache hardware prefetchers, and can issue prefetch requests many lines in advance of the current L1 Data Cache demand load miss (or L1 DC Hardware Prefetch) request.   The details are not well documented and it is difficult to build test cases to measure these features.   From my single-thread experiments, I know that the L2 streaming prefetcher can support at least 20 requests to memory, but that test was fetching from several independent 4 KiB pages --- it is not clear whether the L2 prefetcher can request that many lines from a single 4KiB page.

The L2 hardware prefetch requests go first to the L3 (where we assume that they miss), and then to the memory controllers.  The memory controllers fetch the data and return it to the L3, with some of the data being returned to the L2.   The decision of whether to fetch the data into the L3 or into the L2 is based on several factors that are not visible to the user, such as the number of demand misses outstanding at the time.

Once you get to the memory controller there are usually plenty of buffers to handle concurrent transactions.  The data cannot be fetched as a whole page because there is only one size of read transaction on a DRAM channel -- a single cache line.  The memory controller only sees cache line fetch requests that miss the L3 cache, and it forwards these to the 2/3/4 DRAM channel controllers that it is in charge of.  The memory controller and DRAM controllers do work together to use "open page mode" in the DRAMs whenever possible.  This reduces the latency of the reads (or writes) and uses less power, but each cache line must still be requested individually. 

Some systems have supported multi-line prefetch instructions (e.g., IBM POWER), but I have not seen these in Intel processors.  Such instructions may be more helpful in the future as power becomes more of a concern (i.e., allowing one core to move data at full bandwidth while allowing the other cores to idle at lower power consumption), but adding this complexity to an already terrifyingly complex processor is not something to be done lightly.

Anyway, most of the lines in a page will be prefetched with varying degrees of effectiveness.  Some hardware prefetches will be sent out only a few ns before the corresponding demand load (so they won't help very much), while other hardware prefetches will get out well in advance of the demand load and save a lot of time.  The very steady average performance masks the underlying complexity -- this is very much not a "steady-state" system at fine scales.  The time required to transfer a 4 KiB page is about the same as the latency for one access, so for every set of 64 cache line loads the behavior spans the full range of "not prefetched" to "effectively prefetched".

Danilcha
Beginner

Dear John,

 

Thank you again. Everything is much clearer now.

I was trying to educate myself to understand some of the things you said that are unknown to me (like open page mode). One of the main sources I counted on is the article "What Every Programmer Should Know About Memory" (http://www.akkadia.org/drepper/cpumemory.pdf). I have heard about it many times from different sources; people seem to think very highly of it.

It contains all the cool plots and seems to talk about the same low-level things as we discuss here. But I couldn't finish it. It felt like the author explains things using the first explanation he thought of, and I'm not sure he is always right. He might be, but in that case I cannot follow his logic, because the explanations are also very concise.

For example, there is a plot (Figure 3.13) with the performance of three different algorithms. The first ("Follow") traverses a linked list, the second ("Inc") additionally increments a value in each element (this->value++), and the third ("Addnext0") instead adds the value of the next list element to the value of the current element (this->value += this->next->value). And while the working set fits in L2, the second is slower than the first (because there are writes, not just reads), but the third is as fast as the first!
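As I understand it, the three loops are roughly the following (my own reconstruction, not Drepper's exact code):

```cpp
// My reconstruction of the three kernels behind Figure 3.13 (not Drepper's exact code).
struct Elem {
    Elem* next;
    long  value;
    // (plus padding, depending on the element size used in the test)
};

// "Follow": just chase the pointers.
long follow(const Elem* head) {
    long n = 0;
    for (const Elem* e = head; e != nullptr; e = e->next) ++n;
    return n;
}

// "Inc": additionally modify the current element, so the cache lines become dirty.
void inc(Elem* head) {
    for (Elem* e = head; e != nullptr; e = e->next) e->value++;
}

// "Addnext0": read the *next* element's value into the current one; the read of
// e->next->value acts as a forced prefetch of the element about to be visited.
void addnext0(Elem* head) {
    for (Elem* e = head; e != nullptr && e->next != nullptr; e = e->next)
        e->value += e->next->value;
}
```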

He explains it on page 22 like this:

The naive assumption would be that the “Addnext0” test runs slower because it has more work to do. Before advancing to the next list element a value from that element has to be loaded. This is why it is surprising to see that this test actually runs, for some working set sizes, faster than the “Inc” test. The explanation for this is that the load from the next list element is basically a forced prefetch. Whenever the program advances to the next list element we know for sure that element is already in the L1d cache. As a result we see that the “Addnext0” performs as well as the simple “Follow” test as long as the working set size fits into the L2 cache. (End of citation)

This seems just wrong to me. Or I cannot understand it.

He then continues:

The “Addnext0” test runs out of L2 faster than the “Inc” test, though. It needs more data loaded from main memory. This is why the “Addnext0” test reaches the 28 cycles level for a working set size of 2^21 bytes. The 28 cycles level is twice as high as the 14 cycles level the “Follow” test reaches. This is easy to explain, too. Since the other two tests modify memory an L2 cache eviction to make room for new cache lines cannot simply discard the data. Instead it has to be written to memory. This means the available bandwidth on the FSB is cut in half, hence doubling the time it takes to transfer the data from main memory to L2. (End of citation)

This also seems strange. 

Do you know this article? Could you please share your opinion of it and of the example I presented?
