Single-core bandwidth: Haswell vs. Broadwell vs. Skylake

Newman__Chuck · ‎01-03-2019

Some of my customers' workloads use only a single core in the code segments that need significant memory bandwidth. Particularly for read-intensive workloads, we see better performance on Haswell than on Skylake, with Broadwell somewhere in between, depending on the access pattern.

Attached is a table of single-core memory performance as measured by "mlc --max_bandwidth -b200m -m8".

Also attached is a table with pertinent information about the servers used. Note that the Haswell server is 1P with 2 DPC and the others are 2P with 1 DPC, which I understand will have some impact on performance.

Of course, using eight cores of a single processor shows the expected result of Skylake turning in the best values, followed by Broadwell and then Haswell.

I have a couple of questions:
> Is this expected?
> What guidance can you offer to give optimal Skylake single core performance?
> mlc offers only a limited number of access patterns. Is there another tool that allows a finer granularity of read-write ratios, and in particular 100% writes?

McCalpinJohn · ‎01-04-2019

I have not tested Haswell in the single socket configuration, but your results show a large boost in performance relative to the 2-socket configurations.

Single-thread STREAM Triad results on my Xeon E5-2690 v3 systems are almost always very close to 19.9 GB/s when configured with one dual-rank DIMM (DDR4/2133) per channel and booted in Home Snoop mode -- so your single-thread Haswell result is a full 35% higher than I see in 2-socket Haswell systems.

Part of this is due to the higher frequency of your system (max Turbo of 3.8 GHz, vs max Turbo of 3.5 GHz on my systems), but the larger impact is almost certainly the reduction in latency in the single-socket system.

Single-thread bandwidth is governed by "Little's Law" on these systems:
Bandwidth = Concurrency / Latency
Here "concurrency" is the maximum number of cache lines "in flight". Each of these processor cores supports 10 outstanding L1 Data Cache misses, while the L2 hardware prefetchers are able to generate additional concurrency (or reduced "effective latency", depending on how you choose to model the system).

The simplest case to analyze is single-thread, read-only, with hardware prefetching disabled. In this case, the bandwidth cannot exceed the maximum concurrency divided by the minimum latency. For my Xeon E5-2690 v3 in Home Snoop mode, the corresponding numbers are:
(10 cachelines * 64 Bytes/cacheline) / 90 ns = 7.1 GB/s
With the L2 hardware prefetchers enabled, the average latency is reduced because some/many of the L1 Data Cache misses find their data in the L2 or L3 cache (and so have reduced "effective latency").
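The bound above can be worked as a few lines of C. This is only a sketch of the arithmetic in this post: the function name `littles_law_gbs` is made up for illustration, and the 10 outstanding misses, 64-byte lines, and 90 ns latency are the figures quoted above, not values measured by the code.

```c
#include <assert.h>

/* Upper bound on single-thread bandwidth from Little's Law:
 *   bandwidth = concurrency / latency
 * Bytes per nanosecond is numerically the same as GB/s (10^9 bytes/s),
 * so no unit conversion is needed. Hypothetical helper, for illustration. */
double littles_law_gbs(int lines_in_flight, int line_bytes, double latency_ns)
{
    assert(lines_in_flight > 0 && line_bytes > 0 && latency_ns > 0.0);
    return (double)lines_in_flight * (double)line_bytes / latency_ns;
}
```

Calling `littles_law_gbs(10, 64, 90.0)` reproduces the roughly 7.1 GB/s figure for the no-prefetch, read-only case.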

There is a lot more discussion of this topic in a series of posts starting at http://sites.utexas.edu/jdm4372/2010/11/03/optimizing-amd-opteron-memory-bandwidth-part-1-single-thread-read-only/. That series does not include my Haswell results, but the properly vectorized results vary between ~12.8 GB/s (for Version002) and 16.6 GB/s (for Version011 and Version012). All of these were run with the L2 HW prefetchers enabled -- Version011 and Version012 improve the effectiveness of the HW prefetchers by increasing the number of read streams being accessed concurrently (interleaving accesses across independent 4KiB pages).

Your SKX "read-only" results look a little low. My Xeon Platinum 8160 single-threaded read-only numbers average about 15.2 GB/s. The difference is probably that I used the "-Y" option in this test to use 256-bit SIMD loads (while yours should default to 128-bit SIMD). Your "Triad-like" numbers are similar to the STREAM Triad values that I get -- the performance reduction relative to the 3:1, 2:1, and 1:1 results is due to reduced performance of streaming stores in the SKX processor. (Streaming stores are still useful when using enough cores to get close to saturating memory bandwidth, but they are definitely slower with a single thread.)

Best single-thread bandwidth on all of these platforms will typically require that you spread the memory accesses across more "streams", since the L2 HW prefetchers operate within 4KiB pages. Accessing more pages concurrently allows a larger average number of prefetches to be in flight.
I have not tested my ReadOnly code on SKX, but based on earlier tests, I would expect the best results to come from doubling or quadrupling the number of pages being accessed. E.g.,

    for (i=0; i<N; i++) a[i] = b[i];

would be replaced with

    for (i=0; i<N/2; i++) {
       a[i] = b[i];
       a[i+N/2] = b[i+N/2];
    }

etc....
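For the read-only case, the same idea with four streams might look like the sketch below. It is not taken from the ReadOnly package; the name `sum_four_streams` is hypothetical, and the streams only land in different 4KiB pages when each quarter of the array is at least a page long (512 doubles or more).

```c
#include <assert.h>
#include <stddef.h>

/* Sum n doubles while reading from four widely separated regions at once,
 * so that the L2 hardware prefetchers can track four pages concurrently.
 * Hypothetical helper, for illustration only. */
double sum_four_streams(const double *b, size_t n)
{
    assert(b != NULL || n == 0);
    size_t q = n / 4;                 /* length of each stream */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;

    for (size_t i = 0; i < q; i++) {
        s0 += b[i];                   /* stream 0: first quarter  */
        s1 += b[i + q];               /* stream 1: second quarter */
        s2 += b[i + 2 * q];           /* stream 2: third quarter  */
        s3 += b[i + 3 * q];           /* stream 3: fourth quarter */
    }

    double sum = (s0 + s1) + (s2 + s3);
    for (size_t i = 4 * q; i < n; i++) /* leftover elements when n % 4 != 0 */
        sum += b[i];
    return sum;
}
```

Separate accumulators keep the four load streams independent. Note that the floating-point summation order differs from a simple sequential loop, so results can differ in the last bits.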

Newman__Chuck · ‎01-04-2019

Thanks for the pointer to your blog; that will take some digesting on my part.

On my Haswell server, I'm running the processor at 4.5 GHz, also with Home Snoop, and mlc reports my idle latency as 68.8 ns, which is better than the 90 ns you used in your calculation. With Little's Law, that points to a bandwidth of 9.3 GB/s, but I'm unclear how to correlate that to my measured rate of 16.9 GB/s (from the L2 hardware prefetchers? I probably have to read your blog more closely).
The DIMMs are "HP 16GB (1 x 16GB) Dual Rank x4 DDR4-2133 CAS-15-15-15 Registered Memory"

The 6137 processors run at 4.1 GHz, and mlc says the idle latency is 74.9 ns.
Using the mlc_avx512 binary gives the same bandwidth rates as with the mlc binary.
The DIMMs are "HPE 16GB (1 x 16GB) Dual Rank x8 DDR4-2666 CAS-19-19-19 Registered Smart Memory"

McCalpinJohn · ‎01-04-2019

The latency ratio of 90 ns / 68.8 ns is 1.31, which is pretty close to the ~1.35x ratio that STREAM Triad shows.... This may be a coincidence -- there is a lot of detail that Intel does not document, and I don't have any single-socket systems for testing.

The mlc_avx512 binary will only give different behavior if you also include the "-Z" option to enable 512-bit SIMD instructions.

I don't expect ranking to make much difference with single-thread workloads.

Newman__Chuck · ‎01-04-2019

From before, using mlc I got 13.1 GB/s, and now mlc_avx512 -Z brought that up to 14.3 GB/s.

I downloaded your package that you referenced in your blog and I ran it on my Skylake server.

Version 12 had the best peak, at 17.8 GB/s, which ran on node 0/core 0 (Average was 14.1 GB/s). A second run saw the same thing.
(Version 12 did 100 repetitions; Version 15 does only 1 and turned in a rate of 14.3 GB/s, which is very similar to Version 12's average.)
When I ran it on CPUs that were booted with isolcpus, however, I got a peak of 19.4 GB/s on node 0 (Average of 15.0) and 18.8 GB/s on node 1 (Average of 14.7).
Background "stuff" mucking up the cache?
All three of those runs had worst-case results in the low-12 GB/s range (ignoring the couple of bad outliers).
Of course, the system was otherwise idle during those three runs.

McCalpinJohn · ‎01-05-2019

All of the codes do a lot of repetitions (internally), but the output depends on whether the test case has a tunable parameter.

For example, Version015 does 100 iterations internally. Two output files are produced -- one with all of the results, and one containing only the average result for the 100 iterations. Version012 uses the same structure, but adds software prefetching. The tunable parameter is the distance between the current load address and the address of the software prefetch. This distance is varied from 0 to 1023 array elements. For each distance, 10 iterations are performed. One output file contains all of the results, while the other output file just contains the average values for each of the 1024 software prefetch-ahead distances. The values plotted in http://sites.utexas.edu/jdm4372/files/2014/05/Version012b.png, for example, are the average (of 10 iterations) for each of the 1024 "prefetch-ahead" distances. The idea for these "tunable" versions is to understand the pattern and be able to pick a single prefetch distance.
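A software prefetch placed a tunable distance ahead of the current load can be sketched as below. This is not the Version012 source: `sum_with_prefetch` and `dist` are hypothetical names, the loop is scalar rather than vectorized, and `__builtin_prefetch` is the GCC/Clang builtin, so other compilers would need a different intrinsic.

```c
#include <assert.h>
#include <stddef.h>

/* Read-only sum with a software prefetch issued `dist` elements ahead of
 * the element currently being loaded. Sweeping `dist` and timing each
 * setting is the kind of experiment described above.
 * Hypothetical helper; assumes a GCC- or Clang-compatible compiler. */
double sum_with_prefetch(const double *b, size_t n, size_t dist)
{
    assert(b != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Stay inside the array so the address expression is always valid. */
        if (i + dist < n)
            __builtin_prefetch(&b[i + dist], 0 /* read */, 3 /* keep in cache */);
        sum += b[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the returned sum is the same for every value of `dist`; only the run time changes. A real benchmark would also issue one prefetch per cache line rather than one per element.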

I have attached a plot from Version012 run on a (slightly) less archaic Xeon E5-2690 (v1) node. Without software prefetch (Version015), this system gives an average performance of 16.3 GB/s, while with software prefetches placed at 80-95 lines ahead of the current load, the system gives 16.8 GB/s -- an increase of about 3%. For the nominal 67ns latency of the system, 16.8 GB/s corresponds to a concurrency of (16.8 GB/s * 67 ns =) 1126 Bytes, or about 17.6 cache lines. The best result I obtained on that system was with Version010 with a "spinner" program running on a core in the other socket. This "spinner" prevents the remote chip's uncore from dropping to minimum frequency (which would increase the response time for snoops, and therefore increase the effective latency). Version010 without the spinner delivered a hair under 17.5 GB/s, while with the spinner it reached 18.4 GB/s -- an increase of over 5%. It is sad to see essentially no change in the maximum available single-thread bandwidth over five generations of Xeon processors (SNB, IVB, HSW, BDW, SKX).
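The concurrency figure above is Little's Law run in reverse, which can be written out as a short C sketch. The function name `implied_lines_in_flight` is made up for illustration; the inputs are the 16.8 GB/s and 67 ns quoted in this post, with 64-byte cache lines.

```c
#include <assert.h>

/* Little's Law solved for concurrency:
 *   bytes in flight = bandwidth * latency
 * GB/s times ns gives bytes directly; dividing by the line size gives
 * the average number of cache lines in flight. Hypothetical helper. */
double implied_lines_in_flight(double bandwidth_gbs, double latency_ns,
                               int line_bytes)
{
    assert(bandwidth_gbs > 0.0 && latency_ns > 0.0 && line_bytes > 0);
    return bandwidth_gbs * latency_ns / (double)line_bytes;
}
```

`implied_lines_in_flight(16.8, 67.0, 64)` gives about 17.6 lines, well above the 10 outstanding L1 Data Cache misses a core supports, which is one way to see the contribution of the L2 hardware prefetchers.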

The improvement in performance with isolcpus is interesting -- I have not worked with systems in that mode under Linux.

Version012 and Version015 both split the reads across 8 independent streams. Your systems all have dual-rank DDR4 memory, so they have 32 DRAM banks, which is plenty for 8 read streams. When running in multi-threaded mode, the "optimization" of splitting the read streams becomes a performance problem -- e.g., 8 threads reading from 8 streams each would generate concurrent access to 64 DRAM pages. If you only have 32 DRAM banks, the banks will have to be closed and re-opened repeatedly, reducing overall performance and increasing the DRAM power consumption. Intel's memory controllers try hard to reorder transactions to minimize the penalty, but there will still be a drop of a few percent. The penalty for oversubscribing the DRAM banks can be much worse when the memory access patterns contain a mixture of reads and writes.
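The bank-count argument above amounts to a simple comparison, sketched here in C. The function names are hypothetical, and the model is deliberately crude: it assumes every stream needs its own open DRAM page and ignores channel interleaving and the memory controller's reordering.

```c
#include <assert.h>

/* Number of DRAM pages that would need to be open at once if every
 * thread reads from its own set of independent streams.
 * Hypothetical helper, for illustration only. */
int open_pages_needed(int threads, int streams_per_thread)
{
    assert(threads > 0 && streams_per_thread > 0);
    return threads * streams_per_thread;
}

/* Returns 1 when the streams outnumber the available DRAM banks, the
 * case where banks must be closed and re-opened repeatedly; else 0. */
int banks_oversubscribed(int threads, int streams_per_thread, int dram_banks)
{
    assert(dram_banks > 0);
    return open_pages_needed(threads, streams_per_thread) > dram_banks;
}
```

With the numbers from this post, `open_pages_needed(8, 8)` is 64, so `banks_oversubscribed(8, 8, 32)` is true, while a single thread with 8 streams fits comfortably within 32 banks.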