Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Single-threaded memory performance for dual socket Xeon E5-* systems

Thomas_B_2
Beginner

Hi all,

I had originally asked this question in a separate Intel community forum (https://communities.intel.com/thread/50808), but it was suggested that I repost here. There is also a stackoverflow question from another user linked in the other posting (http://stackoverflow.com/questions/22793669/poor-memcpy-performance-on-linux/) that provides more details on a specific test platform.

To summarize the core question/observation: When benchmarking the memory performance of (pinned) single-threaded operations on large buffers (larger than the last-level cache), we observe substantially lower copy bandwidth on dual-socket E5-26XX and E5-26XX v2 Xeon systems than on other systems tested, including older Westmere systems, i7 CPUs, etc. This result can be seen using CacheBench (http://icl.cs.utk.edu/projects/llcbench/cachebench.html) as shown in the stackoverflow posting. I realize that the aggregate bandwidth numbers can be increased substantially by using multiple threads pinned to each core on a socket, but I am currently primarily interested in understanding the performance of a single thread. All of the test systems run either CentOS 6.5 or Fedora 19, and all of the dual-socket systems have Supermicro boards.

For some concrete numbers, running 'taskset -c 2 ./cachebench -p -x1 -m26 -d2 -e1' on several systems generates the following copy bandwidths for 64 MiB buffers:

  • Dual-socket Xeon E5-2650 v2:   6249 MB/s
  • Dual-socket Xeon E5-2670:      5896 MB/s
  • Dual-socket Xeon X5660:        9283 MB/s
  • Core i7-3960X:                11525 MB/s

I can run the tests for longer (-d2 denotes 2 seconds per buffer size), but the trend is clear. Does anyone know why the results for the E5 Xeons lag so far behind other systems?

Thanks and regards,
Thomas Benson

Thomas_B_2
Beginner

To add one additional data point to the above results:

  • Single socket Xeon E3-1275 v3:      19045 MB/s
McCalpinJohn
Honored Contributor III

Some of those numbers look funny, but there are lots of things that might be happening, so a bit of patience is required.

There are a couple of factors at play here:

1. Idiom Substitution:   
You have to be very careful to avoid having the cachebench memcpy kernel replaced with a call to a library routine.  The library routine is usually fast, but is often not the fastest option, and since its assembly code is not easy to locate, the results are much harder to interpret.   With the STREAM benchmark I usually add the "-ffreestanding" compiler option to tell the compiler not to make memcpy substitutions -- then I double-check the assembly code to be sure.   (Note that STREAM Copy counts both read and write traffic, so the values are twice as big as the memcpy results above -- see the discussion at http://www.cs.virginia.edu/stream/ref.html#counting )

2. Memory Latency:   
Single threaded memory bandwidth is concurrency-limited on these systems.  Each of these Intel cores can handle 10 L1 Data Cache misses, so (in the absence of L2 hardware prefetching), your read bandwidth is going to be limited to 10 cache lines per memory latency.    I don't have exactly the same set of processors, but I do have some very similar processors that show:
              Dual-Socket Xeon E5-2680:   79 ns    (running at max Turbo speed of 3.1 GHz)    "Sandy Bridge EP"
              Dual-Socket Xeon X5680:     69 ns    (running at nominal 3.33 GHz)              "Westmere EP"
              Xeon E3-1270:               53.6 ns  (running at nominal 3.4 GHz)               "Sandy Bridge" (with "client" uncore)
(A back-of-the-envelope translation of these latencies into single-threaded bandwidth limits is sketched below, after item 3.)

3. Streaming Stores:
Depending on how the code is compiled, it may or may not contain non-temporal ("streaming") stores.  Streaming stores reduce the overall memory traffic by eliminating the read of the target cache lines before they are overwritten.  This provides a large performance boost in the multicore case, but for a single thread streaming stores often reduce performance because they cannot be prefetched.  (Prefetching store targets reduces the length of time that the store transactions hold on to the L1 Data Cache Line Fill Buffers, and so improves overall throughput.)    The performance of streaming stores for a single thread differs significantly across Intel processors, but the details are not easy to investigate.
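
To make the arithmetic in item 2 concrete, here is a back-of-the-envelope sketch (assuming 64-byte cache lines and the 10 outstanding L1 Data Cache misses mentioned above, and ignoring the L2 hardware prefetchers) of the read bandwidth limit implied by those latencies:

    #include <stdio.h>

    int main(void)
    {
        /* ~10 outstanding L1D misses per core, 64-byte cache lines:          */
        /* single-threaded read bandwidth <= 10 * 64 bytes per memory latency */
        /* (ignoring L2 hardware prefetch).                                   */
        const double lines_in_flight = 10.0;
        const double line_bytes      = 64.0;
        const struct { const char *name; double latency_ns; } sys[] = {
            { "2S Xeon E5-2680 (Sandy Bridge EP)", 79.0 },
            { "2S Xeon X5680   (Westmere EP)",     69.0 },
            { "1S Xeon E3-1270 (Sandy Bridge)",    53.6 },
        };

        for (int i = 0; i < 3; i++) {
            double bw_GBs = lines_in_flight * line_bytes
                            / sys[i].latency_ns;    /* bytes per ns == GB/s */
            printf("%-36s ~%.1f GB/s concurrency-limited read bandwidth\n",
                   sys[i].name, bw_GBs);
        }
        return 0;
    }

This works out to roughly 8, 9, and 12 GB/s respectively -- the right order of magnitude for the single-threaded results in this thread -- and it is the L2 hardware prefetchers that allow real codes to do better than this simple limit.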

 

Here are some numbers from the STREAM benchmark to show how these factors interact.   All are single-threaded STREAM Copy values run with processor pinning and (where applicable) enforced NUMA memory affinity.  They were compiled with various versions of the Intel C compiler (versions 11 through 13, though there is very little difference in performance for this test):
                                         with streaming stores    without streaming stores    with memcpy substitution
      Dual-Socket Xeon E5-2680                 7528 MB/s                12545 MB/s                   8640 MB/s
      Dual-Socket Xeon X5680                   8140 MB/s                10215 MB/s                      ???
      Single-Socket Xeon E3-1270              17950 MB/s                11970 MB/s                      ???

Recall that these STREAM numbers should be about twice the corresponding cachebench values, so half of the 12545 MB/s STREAM Copy result on the Xeon E5-2680 (about 6270 MB/s) is only about 6% higher than the 5896 MB/s cachebench memcpy result on the Xeon E5-2670.

On both the Westmere EP and Sandy Bridge EP, streaming stores reduce the performance of STREAM Copy.  This is not terribly surprising -- without streaming stores the DRAM utilization is low enough that the extra read traffic is not a problem, and the prefetching of the store targets allows the 10 Line Fill Buffers to handle more transactions per unit time.    On the Xeon E3-1270 the situation is reversed because this system has only 2 DRAM channels (providing 21.33 GB/s peak BW), so the 11970 MB/s without streaming stores actually corresponds to 11970*3/2 = 17955 MB/s total DRAM traffic, which is 84% of peak and close to lots of other limits.  At this high level of utilization there is a large benefit in eliminating the extra memory traffic associated with reading the target array before overwriting it.
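
Spelling out the traffic accounting behind that 3/2 factor as a tiny illustrative calculation (the 11970 MB/s and 21.33 GB/s figures are the ones quoted above):

    #include <stdio.h>

    int main(void)
    {
        /* STREAM Copy reports read + write traffic: 128 B per 64 B line copied. */
        /* Without streaming stores the destination line is also read before it  */
        /* is overwritten, so actual DRAM traffic is 192 B per line, i.e. the    */
        /* reported bandwidth times 3/2.  With streaming stores that extra read  */
        /* is eliminated and reported == actual.                                 */
        double reported_MBs  = 11970.0;   /* Xeon E3-1270, no streaming stores   */
        double dram_peak_MBs = 21333.0;   /* 2-channel peak quoted above         */
        double actual_MBs    = reported_MBs * 3.0 / 2.0;

        printf("actual DRAM traffic: %.0f MB/s (%.0f%% of peak)\n",
               actual_MBs, 100.0 * actual_MBs / dram_peak_MBs);
        return 0;
    }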

With streaming stores, the older Westmere EP is about 8% faster than the newer Sandy Bridge EP.    My argument is that this is mostly due to the approximately 13% lower memory latency on the older system.    Without streaming stores, the newer Sandy Bridge EP is faster than the older Westmere EP.   In this case I suspect that the difference is due to more aggressive hardware prefetching on the Sandy Bridge EP, but I don't have quantitative evidence to support that hypothesis.

I am puzzled by the Xeon X5660 result in the initial posting -- I have not seen any cases in which the same software runs that much faster (57%) on the Westmere EP than on the Sandy Bridge EP.   Note that the cachebench value of 9283 MB/s is 82% faster than the STREAM Copy result that I see on my (faster) X5680 processors.  This suggests something funny --- maybe a multi-threaded memcpy got called by accident?

The improvement on the Core i7-3960X is partly due to the reduced latency from being in a single socket (so no need to wait for snoop responses from another chip) and partly due to the reduced latency due to the higher frequency.  Unlike most previous Intel processors, the "uncore" on the "Sandy Bridge EP" runs at the same frequency as the fastest core, so latency is a fairly strong function of processor frequency.   My latency model suggests that the Core i7-3960X latency should be no higher than about 60 ns (and certainly could be lower), compared to 80 ns on the Xeon E5-2670.  This would account for a 33% improvement -- a decent chunk of the 95% difference observed, but not enough to be entirely satisfying.

Thomas_B_2
Beginner

John,

Thank you for the detailed response. I have run STREAM on each of the above systems to remove memcpy() from consideration (at least initially). We had also seen high variability based on the particular memcpy() implementation in use, including enabling/disabling the use of non-temporal stores when using Agner Fog's asmlib library (cf. the stackoverflow discussion).  I compiled STREAM via

gcc -O2 stream.c -o stream_c -ffreestanding -mtune=native

on each system. I do not see a memcpy call in stream.c, and I also confirmed via nm that memcpy is not in the symbol table of the executable. The results on the same systems as above for the copy test are as follows:

  Dual-socket Xeon E5-2650 v2:    12950 MB/s
  Dual-socket Xeon E5-2670:       12202 MB/s
  Dual-socket Xeon X5660:          9750 MB/s
  Core i7-3960X:                  14970 MB/s
  Single socket Xeon E3-1275 v3:  13112 MB/s

I re-ran CacheBench on the X5660 system and obtained similar results to the original posting, so (remembering the factor of two in how STREAM Copy counts traffic) the X5660, i7-3960X, and E3-1275v3 results differ quite substantially between CacheBench and STREAM (presumably all three benefit from the specific memcpy implementation). The E5 results, on the other hand, are consistent between CacheBench and STREAM.

The STREAM results trend approximately as I would expect based solely on when the hardware became available (assuming, perhaps naively, that performance improves over time). However, the CacheBench results indicate that the X5660, i7-3960X, and E3-1275v3 CPUs can all exhibit substantially higher single-threaded copy performance using whatever optimizations are included in the memcpy() implementation[1]. A natural question, then, is whether the E5 CPUs have some hardware limitation that prevents their single-threaded copy performance from matching (or even approaching) that of the other CPUs, or whether there is a more optimal software approach for copying data on the E5 CPUs.

Thanks again for your feedback,

Thomas

[1] This is assuming that the implementation on those systems is not "cheating" somehow, such as copying from a single zero page. I modified the CacheBench code to fill the initial buffer such that all pages are unique, but that did not change the performance results.
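
For example, one way to do this is to write a distinct value into every 8-byte word, so that no page is a zero page and no two pages are identical (a minimal sketch here, not the exact change I made to CacheBench):

    #include <stdint.h>
    #include <stdlib.h>

    /* Write a distinct value into every 8-byte word so that no page is all */
    /* zeros and no two pages have identical contents.                      */
    static void fill_unique(void *buf, size_t bytes)
    {
        uint64_t *p = buf;
        size_t n = bytes / sizeof(uint64_t);
        for (size_t i = 0; i < n; i++)
            p[i] = i + 1;                  /* never zero, never repeats */
    }

    int main(void)
    {
        size_t bytes = 64UL << 20;         /* e.g. a 64 MiB source buffer */
        void *src = malloc(bytes);
        if (src != NULL)
            fill_unique(src, bytes);
        free(src);
        return 0;
    }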

Thomas_B_2
Beginner

I ran 3 tests on each machine using the code posted in the stackoverflow thread. The three versions use (1) asmlib (http://www.agner.org/optimize/) with streaming stores, (2) asmlib without streaming stores, and (3) a naive copy. The buffer size is 1 GiB, aligned at a page boundary. The single thread is pinned to a single core and the cores are all in performance mode (i.e. not dynamically clocking, except for the Haswell machine which uses the intel_pstate driver). The results are rounded to 10MB/s increments.

                                  Streaming asmlib   Non-streaming asmlib   Naive (*dst++ = *src++)      Max
Dual-socket Xeon E5-2650 v2:          6580 MB/s            11360 MB/s              11360 MB/s         11360 MB/s
Dual-socket Xeon E5-2670:             6540 MB/s            10420 MB/s              10300 MB/s         10300 MB/s
Dual-socket Xeon X5660:               9040 MB/s            10300 MB/s               8660 MB/s         10300 MB/s
Core i7-3960X:                       13700 MB/s            13700 MB/s              13700 MB/s         13700 MB/s
Single socket Xeon E3-1275 v3:       20200 MB/s            12900 MB/s              12900 MB/s         20200 MB/s
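
For reference, the naive-copy case and the pinning look roughly like the following (a simplified sketch with timing omitted -- the actual test code is in the stackoverflow thread linked above):

    /* Compile with -O2 -ffreestanding so the compiler does not replace the */
    /* naive loop below with a call to memcpy (see the earlier discussion). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        /* Pin this thread to core 2, as with 'taskset -c 2' earlier. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);
        sched_setaffinity(0, sizeof(set), &set);

        size_t bytes = 1UL << 30;                  /* 1 GiB buffers          */
        char *src = aligned_alloc(4096, bytes);    /* page-aligned           */
        char *dst = aligned_alloc(4096, bytes);
        if (src == NULL || dst == NULL)
            return 1;
        memset(src, 1, bytes);                     /* touch the source pages */

        const char *s = src;
        char *d = dst;
        for (size_t i = 0; i < bytes; i++)
            *d++ = *s++;                           /* naive copy             */

        free(src);
        free(dst);
        return 0;
    }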

Looking only at the maximum bandwidths, the values all seem reasonable.  I am fine with the single socket results being higher for the reasons that John stated above. Is there a way to directly measure the impact that snooping traffic has on bandwidth? I checked the BIOS to see if a socket could be disabled, but there was no such option. I assume there is no way to disable cache coherency between sockets?

The primary anomaly seems to be the initial X5660 results. I will have to look more closely at that test case. I have not seen anything obvious that can account for the difference. I ran it through gdb, which would have reported any created threads, but there were none. I tested on several identically configured Westmere machines and the result was reproducible.

Regards,

Thomas

McCalpinJohn
Honored Contributor III

On these systems snooping only impacts bandwidth indirectly -- by increasing the latency in two-socket systems.   It seems unlikely that snooping will contribute much to the increase in latency under heavy loads -- most of that increase should be due to queuing delays in the data path.

Historically speaking, there have been a number of processors whose maximum snoop rate has limited the overall bandwidth of large SMP systems.  Examples include IBM POWER4 and POWER5 processors, and the first two generations of the AMD Opteron Family 10h processors.   This is one reason why recent 4-socket (and larger) systems typically include "snoop filters" (also called "probe filters", "directories", "directory filters", or in AMD's case: "HyperTransport Assist").   

It is clear that none of Intel's recent 2-socket systems run into snoop limits on memory bandwidth -- at least when they are running at full speed.  Nehalem EP, Westmere EP, and Sandy Bridge EP all show perfect scaling from 1 socket to 2 sockets on the STREAM benchmark, and Ivy Bridge should as well.  The Sandy Bridge EP uncore performance monitoring guide mentions that the Performance Management Unit monitors snoop rates and may choose to keep processor core frequencies from dropping too low if snoop rates are high.  This is probably more important for latency than for multicore bandwidth, but of course single-thread bandwidth is closely tied to latency.

The Xeon Phi uses "distributed duplicate tags" to reduce the snoop rate required for any single set of cache tags.   Addresses are hashed to one of 64 Duplicate Tag Directories, so each only has to handle (on average) 1/64th of the global snoop rate.    This is definitely required -- the STREAM Triad benchmark can sustain more than 175 GB/s of memory bandwidth, which corresponds to 2.75 billion cache lines per second.  If every core had to snoop at that rate, they would all be doing 2.5 tag lookups per core cycle just for the external probes -- not counting the cache tag accesses required for their own cache hits.   Maximum rates for external snoops are typically much less than one per cycle.
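
Working out that arithmetic (the ~1.1 GHz core clock below is an assumption, chosen to be consistent with the 2.5-per-cycle figure quoted above):

    #include <stdio.h>

    int main(void)
    {
        double triad_bw    = 175e9;            /* bytes/s sustained          */
        double line        = 64.0;             /* bytes per cache line       */
        double core_clk    = 1.1e9;            /* Hz (assumed core clock)    */
        double lines_per_s = triad_bw / line;  /* ~2.73e9 lines/s            */

        printf("global line rate: %.2f billion cache lines/s\n", lines_per_s / 1e9);
        printf("unfiltered: %.1f tag lookups per core cycle\n", lines_per_s / core_clk);
        printf("per Duplicate Tag Directory (64-way hash): %.0f million lookups/s\n",
               lines_per_s / 64.0 / 1e6);
        return 0;
    }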

If you have physical control over your Sandy Bridge EP system, you should be able to remove one of the processors and boot as a single-socket server.  This would eliminate the snoop response delay in the latency equation.   I tried this once on my systems, but the system booted with a non-optimal memory interleaving, and I never got around to trying to work past that problem.
