bandwidth of memory hierarchy

maw__abe · ‎07-23-2018

A few questions here:

Can cache bandwidth become a bottleneck? When and are there well-known examples (such as software / benchmarks)?

How to tell memory bandwidth become a bottleneck?

I think memory bandwidth issue must appear together with high cache miss rate. But how to know if the memory bandwidth is also a bottleneck? To your knowledge, which well known software have memory bandwidth issues and is there any systematic research on how to amelorate memory bandwidth issues?

Thanks

McCalpinJohn · ‎07-24-2018

It is easy to come up with examples of cache bandwidth becoming a bottleneck, without even considering pathological cases....

In the olden days, vector supercomputers were implemented to support main memory bandwidth of approximately two 64-bit loads plus one 64-bit store for each arithmetic pipeline (composed of a 64-bit floating-point adder and a 64-bit floating-point multiplier). This allowed execution of the DAXPY kernel at (close to) the peak arithmetic rate of the system. As an example, each CPU in the Cray C90 from 1991 had two arithmetic pipelines running at 0.244 GHz, for a peak performance of 0.976 GFLOPS. The STREAM Triad performance for a single processor was reported as 9500 MB/s, corresponding to 0.792 GFLOPS. (9.500 GB/s * 2 FLOPS / 24 Bytes = 0.792 GFLOPS) This is about 81% of the peak performance of 0.976 GFLOPS. Some earlier vector machines did a bit better, but by the early to mid 1990's, it was clear that it was not going to be practical to create memory systems that could keep up with the rate of increase possible for arithmetic performance.

I have not tested these specific numbers myself, but Intel's Optimization Reference Manual (document 248966) includes peak and sustained cache bandwidth numbers that can be compared with peak arithmetic rates. Table 2-1 gives the values for Broadwell and Skylake Xeon processors, which give the following percentages of peak arithmetic performance for DAXPY (requiring 24 Bytes / 2 FLOPS) running at the stated bandwidths:

Processor Peak FP/cycle L1 sustained L1 DAXPY L2 sustained L2 DAXPY
                         Bytes/cycle  %Peak    Bytes/cycle    %Peak
Broadwell      16           93        48.4%        25         13.0%
Skylake Xeon   32          133        34.6%        52         13.5%

So cache bandwidth limits performance of vector kernels even for L1-contained data. The L1 cache is typically not large enough for cache blocking, so algorithms that are blocked are typically blocked to fit into the L2 cache.

maw__abe · ‎07-24-2018

Thanks a lot John.

1. Cray C90 has dual-pipeline at 244MHz, so 2*0.244=0.488 GFLOPS. Where does the other factor of 2 come from to make it 0.976?

2. btw, is 9500MB/s from https://www.cs.virginia.edu/stream/standard/Bandwidth.html ("1991.11.13 Cray_C90 1 6965.4 6965.4 9378.7 9500.7 data")

To be preicse, the document number should be 248966-040.

3. DAXPY means "z = a*x + y", right? When you mention "24 bytes / 2FLOPS". I guess 24 bytes mean x, y and z? And the two insturcitons mean FMA and store?

4. How about memory bandwidth for integer workload? If the memory access is not contiguous, how much possibility that memory bandwidth could become the bottleneck?

5. Any well-known workload/benchmarks have bottlenecks due to cache bandwidth and memory bandwidth?

6. How to measure PMCs to know that the bandwidth become the bottleneck? Especially for cloud environment, there could be many processes/threads access cache/memory simultaneously. I guess we need to whole-system PMC counting to know if bandwidth is a bottleneck, right? Pointers to any similar measurement/work/document would be appreciated.

Thanks!

McCalpinJohn · ‎07-24-2018

Each of the Cray C90 pipelines has both an adder and a multiplier, so the peak floating-point performance is four operations per cycle (2 adds, 2 multiplies).

For DAXPY [ x(i) = x(i) + q*y(i) ] there are two 8-byte loads and one 8-byte store, giving 24 bytes per iteration. Each iteration includes one multiply and one add, for a total of 2 FLOPS.

Integer data is moved around at the same speeds as floating-point data, but fewer integer workloads have the advantage of full vectorization. Cache bandwidth is dramatically reduced without SIMD loads and stores. For example, with 32-bit integers, both Broadwell and Skylake are limited to 12 bytes per cycle of L1 cache bandwidth (2*4-Byte loads + 1*4-Byte store). Transfers from the L2 to the L1 still take place in full cache lines, but it takes the core extra cycles to process the individual words in the cache line, so throughput is significantly reduced.

Non-contiguous memory accesses make matters much worse (for data outside of the L1 cache) because (1) not all of the data moved is used, and (2) the hardware prefetchers are less likely to pull data into the caches before it is needed. If the hardware prefetchers are not working effectively (or if only a small number of cores are in use), the performance needs to be modeled as a combination of latency and bandwidth effects. This requires understanding the average concurrency (number of outstanding cache misses at each level of the cache hierarchy), which can be challenging on modern architectures.

Some performance counter events are useful in evaluating whether memory access limits application performance. The event CYCLE_ACTIVITY (EventSelect 0xA3) can be configured to count dispatch stalls that occur while there is at least one demand load outstanding at various levels of the memory hierarchy. The out-of-order pipeline can often handle the latency of an L1 miss/L2 hit (so the occurrence of a dispatch stall while there is an outstanding L1 miss can be a coincidence), but when there are demand loads missing the L2, most of the stalls are expected to be due to waiting on the cache miss to be resolved.

In general, however, the characterization of application sensitivity to memory bandwidth and memory latency remains a challenging topic.