STREAM benchmark runs slower on 8-core E5-2680 when vectorized

QIAOMIN_Q_ · ‎08-29-2013

I know there are many hardware geeks in this forum, so I'd like to ask my question here.

I compiled stream.c with 'icc -O -g stream.c -o stream-icc.out -vec-report2' and ran it on an 8-core E5-2680 and a 4-core i7-2600K for comparison. You can see the differences below.

On the 8-core E5-2680:
Function    Best Rate MB/s Avg time     Min time     Max time
Copy:            6935.4     0.023127     0.023070     0.023201
Scale:           6928.5     0.023124     0.023093     0.023184
Add:             9395.9     0.026508     0.025543     0.026677
Triad:           9361.4     0.026323     0.025637     0.026437

Interestingly, on the 4-core i7-2600K I got:
Function    Best Rate MB/s Avg time     Min time     Max time
Copy:           15771.4     0.010163     0.010145     0.010179
Scale:          17654.2     0.009080     0.009063     0.009096
Add:            17832.0     0.013481     0.013459     0.013569
Triad:          17727.7     0.013560     0.013538     0.013597

After compiling with 'icc -O -g -no-vec stream.c -o stream-icc.out -vec-report2' and running the non-vectorized STREAM on both machines, you can see that we get better performance on the E5-2680 this time.

On the 4-core i7-2600K:
Function    Best Rate MB/s Avg time     Min time     Max time
Copy:           16111.4     0.009949     0.009931     0.009970
Scale:          12010.1     0.013338     0.013322     0.013367
Add:            13300.1     0.018077     0.018045     0.018134
Triad:          13278.0     0.018119     0.018075     0.018238

On the 8-core E5-2680:
Function    Best Rate MB/s Avg time     Min time     Max time
Copy:            7039.5     0.022788     0.022729     0.022834
Scale:          12362.8     0.012959     0.012942     0.012983
Add:            13268.6     0.018137     0.018088     0.018179
Triad:          13360.8     0.017970     0.017963     0.017984

Can someone share their experience with this?

I have also attached Virginia's stream.c and the compiled, vectorized assembly file, as well as the VTune project, so you can dig into the related events (such as fewer L1 hits when vectorized).

Thanks,

Joey

Patrick_F_Intel1 · ‎08-29-2013

Hello QIAOMIN,

You may want to post this to the Intel compiler forum. Without digging through the assembly code and/or VTune files it is hard to know why the performance is different.

Pat

McCalpinJohn · ‎08-29-2013

Without looking at the assembly, I would guess that the vectorized version is using streaming stores, while the non-vectorized version is using ordinary cacheable stores. On Westmere and Sandy Bridge processors (and almost certainly Ivy Bridge as well), single-thread performance is limited by the occupancy of the core's 10 Line Fill Buffers (LFBs). When you use streaming stores, the LFBs are occupied until the data can be handed off to the memory controller. When you run ordinary cacheable stores, the hardware prefetcher can bring in the target cache line for the store in advance, so that the LFB is only occupied until the cache line can be handed off to the cache -- a much shorter period of time. Shorter occupancy in the LFBs means that the LFBs can handle more transactions per second, thus providing more bandwidth.

The penalty you pay when using cached stores is extra read traffic, since each target cache line must be moved into the cache before it can be written. When you are only using one core, the total DRAM utilization is so low that this extra traffic is not a problem. For example, the Xeon E5-2680 running without vectorization reports 13.36 GB/s on the Triad kernel, but the extra reads of the store targets result in a total traffic of 17.8 GB/s --- much much lower than the 51.2 GB/s peak DRAM bandwidth of the Xeon E5-2680 chip.

You should see the behavior reverse when you compile for OpenMP and use all the cores. At these higher levels of utilization the extra read traffic gets in the way of performance, so the version with streaming stores is typically faster by almost 1.5x on the Copy and Scale kernels and by almost 1.33x on the Add and Triad kernels.
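The 1.5x and 1.33x figures, and the 17.8 GB/s total for Triad, follow from counting one extra read per stored cache line. A minimal sketch of that arithmetic, assuming every ordinary store misses and triggers a read of its target line (the function names are hypothetical, not part of STREAM):

```c
#include <assert.h>

/* DRAM traffic relative to the bytes STREAM counts, when stores are
 * ordinary cacheable stores: every stored line is first read into the
 * cache, so each store costs one extra read. */
double write_allocate_traffic_factor(int loads, int stores)
{
    assert(loads >= 0 && stores > 0);
    return (double)(loads + 2 * stores) / (double)(loads + stores);
}

/* Actual DRAM traffic (GB/s) behind a STREAM-reported rate (GB/s). */
double actual_dram_traffic_gbs(double reported_gbs, int loads, int stores)
{
    return reported_gbs * write_allocate_traffic_factor(loads, stores);
}
```

Copy and Scale (1 load, 1 store) give a factor of 1.5, Add and Triad (2 loads, 1 store) give 4/3, and `actual_dram_traffic_gbs(13.36, 2, 1)` comes out at about 17.8 GB/s.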

QIAOMIN_Q_ · ‎08-29-2013

Thanks for your comments.

After experimenting with the '-opt-streaming-stores' and '-no-vec' options on the E5-2680, I think the slowdown may be caused by the vectorization. Does vectorization correlate with the use of streaming stores?

$ icc -O stream.c -o stream_icc_O2sse2
$ ./stream_icc_O2sse2
-------------------------------------------------------------
Function    Best Rate MB/s Avg time     Min time     Max time
Copy:            8737.9     0.019362     0.018311     0.022907
Scale:           7354.6     0.022629     0.021755     0.026725
Add:             (values not recoverable)
Triad:           9623.1     0.025625     0.024940     0.030861
-------------------------------------------------------------

$ icc -O3 -xAVX stream.c -o stream_icc_O3AVX
$ ./stream_icc_O3AVX
-------------------------------------------------------------
Function    Best Rate MB/s Avg time     Min time     Max time
Copy:            8772.4     0.018804     0.018239     0.022841
Scale:           7304.9     0.022505     0.021903     0.026893
Add:             (values not recoverable)
Triad:           9318.2     0.025918     0.025756     0.026797
-------------------------------------------------------------

$ icc -O -no-vec stream.c -o stream_icc_O2novec
$ ./stream_icc_O2novec
-------------------------------------------------------------
Function    Best Rate MB/s Avg time     Min time     Max time
Copy:            7279.3     0.023134     0.021980     0.027058
Scale:          12918.7     0.013260     0.012385     0.016197
Add:             (values not recoverable)
Triad:          13663.5     0.018187     0.017565     0.022969
-------------------------------------------------------------

$ icc -O -ansi-alias -ip -opt-streaming-stores always stream.c -o stream_icc_O2_alias
$ ./stream_icc_O2_alias
-------------------------------------------------------------
Function    Best Rate MB/s Avg time     Min time     Max time
Copy:            7234.6     0.022261     0.022116     0.022311
Scale:           7253.9     0.022142     0.022057     0.022195
Add:             (values not recoverable)
Triad:           9565.6     0.025238     0.025090     0.025312
-------------------------------------------------------------

$ icc -O -no-vec -ansi-alias -ip -opt-streaming-stores always stream.c -o stream_icc_O2novec_alias
$ ./stream_icc_O2novec_alias
-------------------------------------------------------------
Function    Best Rate MB/s Avg time     Min time     Max time
Copy:           12211.9     0.013144     0.013102     0.013244
Scale:          12587.5     0.012735     0.012711     0.012767
Add:             (values not recoverable)
Triad:          13566.2     0.017842     0.017691     0.017893

As you can see from the last set of compilation options, -no-vec matters more than -opt-streaming-stores here.

As far as I can see from stream.s (in the stream.zip attachment), the only differences between the vectorized and -no-vec builds are packed loads/stores versus scalar loads/stores for the benchmark.

Part of the assembly from stream.s:
..LN344:
   .loc    1 315 is_stmt 1
        movl      $c, %edi                                      #315.6
..LN345:
        movq      %r13, %rsi                                    #315.6
..LN346:
        movl      $80000000, %edx                               #315.6
..LN347:
        call      _intel_fast_memcpy                            #315.6

Thanks,

Qiaomin

QIAOMIN_Q_ · ‎08-30-2013

I have uploaded a VTune event-sampling result as an attachment; it compares the assembly code of the no-vec and vectorized builds.

To conclude, for single-threaded STREAM on the E5-2680:

The best score comes from: (1) -no-vec, and (2) -no-vec with -opt-streaming-stores never (it seems that when -no-vec is specified, the compiler does not generate streaming stores even if we add the '-opt-streaming-stores' option).

A lower score comes from: vectorization with -opt-streaming-stores never.

The worst performance comes from: vectorization with -opt-streaming-stores auto (the compiler default).

From VTune, we can see that after vectorization there are fewer MEM_LOAD_UOPS_RETIRED.L1_HIT_PS events; this is expected, because fewer instructions are retired after vectorization.

The key point in the picture is that there are more MEM_LOAD_UOPS_RETIRED.LLC_MISS events after vectorization than in the no-vec build.

So it seems that vectorization intensifies the stress that streaming stores put on the E5-2680's LLC and causes more cache misses.

Thanks ,

Qiao

Patrick_F_Intel1 · ‎08-30-2013

I don't know how applicable this is, but there is a report that some versions of the Intel C++ compiler do not generate streaming stores even if you request them. What version are you using?

http://software.intel.com/en-us/articles/hpcc-stream-performance-loss-with-the-11-0-compiler

Pat

TimP · ‎08-30-2013

Even if you are testing single thread stream performance on a multi-core CPU, you may need to set affinity, e.g. by taskset. Stream normally is run using all cores (but not with multiple threads per core).

As John McCalpin pointed out, with a single thread, you aren't testing full CPU bandwidth (the usual objective of this benchmark). I don't know what conclusions you expect to draw. I've seen cases where non-temporal stores show more advantage with single-thread execution than when using all cores.

McCalpinJohn · ‎08-30-2013

I compiled the code using each of the command lines above, then recompiled to save the .s files.
My compiler version ("icc --version") is "icc (ICC) 13.1.0 20130121"

Then a simple "grep movnt stream_[version].s" showed that each of the "slow" runs contained movnt instructions, while each of the "fast" versions did not. I followed that with "grep mulpd stream_[version].s" to see which codes used packed double multiplications, and found that each code that used non-temporal stores also used packed arithmetic. So for these five sets of command-line options, there is no independent test of packed vs non-packed arithmetic with normal stores (or packed vs non-packed with non-temporal stores).

I don't have time to follow this up further today, but you might have better luck with differential control of vectorization and non-temporal stores using inline pragmas rather than compiler command-line options. It seems clear that you need to inspect the assembly code to determine what the compiler is doing, rather than trusting the compiler options to do what you might expect.

QIAOMIN_Q_ · ‎08-30-2013

@Patrick: Hello Pat,

$ icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.0.072 Build 20130710

I have checked that this version has no problem generating streaming stores.

@John: I have inspected the assembly code and found the following:

1. When '-no-vec' is specified, the compiler generates non-packed arithmetic with normal stores, and only with this option does the STREAM benchmark get its highest score on the E5-2680.

2. When vectorization is enabled with -opt-streaming-stores never, the compiler generates packed arithmetic with normal stores, which performs worse than the '-no-vec' version. So in this case, as I said, vectorization increases the memory stress a little, as I can see from the lower STREAM score with this option and from the increase in MEM_LOAD_UOPS_RETIRED.LLC_MISS compared to the '-no-vec' build.

@TimP: As you said, 'set affinity, e.g. by taskset. Stream normally is run using all cores (but not with multiple threads per core).' Maybe this is the point; I will test this single-threaded application pinned to a single core.

Thanks, your comments have made the single-thread behaviour of STREAM on a multi-core processor clearer to me.

Qiao

McCalpinJohn · ‎08-31-2013

It took me a while to figure out how to get the 13.1 compiler to generate the required combinations for comparison, but by adding
#pragma vector temporal
to the four STREAM kernel loops and then compiling with and without the "-no-vec" compiler option I was able to get vector vs non-vector comparisons for which both cases used ordinary "temporal" stores.
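As a rough illustration of where the pragma goes, here is a sketch of a STREAM-style Triad loop. It is not copied from stream.c; the function name and signature are hypothetical, and the pragma is an Intel compiler directive that other compilers simply ignore (possibly with a warning).

```c
#include <assert.h>
#include <stddef.h>

/* STREAM-style Triad: a[j] = b[j] + scalar * c[j].
 * With the Intel compiler, "#pragma vector temporal" asks for ordinary
 * (cacheable) stores in the loop that follows; replacing it with
 * "#pragma vector nontemporal" asks for streaming stores instead. */
void triad_temporal(double *a, const double *b, const double *c,
                    double scalar, size_t n)
{
    assert(a && b && c);
#pragma vector temporal
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}
```

Whichever pragma is used, the generated assembly still needs to be checked (for example for movnt instructions) to confirm what the compiler actually emitted.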

For both SSE2 and AVX targets, I can now confirm that there is a systematic bias of 11%-12% in favor of the non-vectorized versions for the Triad kernel.

However, generated code has differences that prevent one from immediately concluding that vectorization is a direct cause of the slowdown.
Comparing the versions (all compiled with "-O2" with compiler version 13.1.1 20130313), I see:

SSE2, no-vec: 2 elements per loop iteration: unroll by 2, SSE2 scalar operations
AVX, no-vec: 2 elements per loop iteration: unroll by 2, AVX scalar operations
SSE2, vec: 8 elements per loop iteration: 4 pairs of 2-wide operations
AVX, vec: 16 elements per loop iteration: 4 sets of 4-wide operations

Curiously, the scalar versions use different addressing modes than the vector versions. I don't know what these are called, but the scalar codes use the form "8+b(%rax)", while the vector versions use the form "16+b(,%rax,8)".

So we have differences in packed vs scalar instructions, number of operations per loop, and addressing mode. These need to be controlled more precisely to understand what is going on.

I tried applying a "#pragma unroll (8)" directive to all the cases. This did cause the scalar cases to unroll to 8 elements per loop iteration, with no significant change in performance (perhaps a tad higher?). The compiler complained about the directive in the cases with vectorization, but the unrolling appears to have doubled in those cases, with no significant change in performance (perhaps a fraction of a percent lower). The addition of this unroll directive did not change the addressing modes, which remained different for the two cases.

It seems to me extremely unlikely that the vectorized arithmetic instructions make any difference -- the performance difference is almost certainly due to the change in the number of loads (due to the wider vectorized loads) making subtle differences in the behaviour of the hardware prefetch engines.

I re-ran the 8 test cases with hardware prefetch disabled, and saw that the performance varied quite a bit, mostly dependent on the number of elements processed per loop iteration, but with some strange outliers, so no obvious conclusions...

-----------------------------------------------------------------------------------------------
Vector   Unroll   ISA            Triad MB/s          Description
-----------------------------------------------------------------------------------------------
yes       --      SSE2             6712.1       8 updates per loop iteration (4 2-wide vector)
yes       x8      SSE2             6703.9       16 updates per loop iteration (8 2-wide vector)
yes       --      AVX              6616.4       16 updates per loop iteration (4 4-wide vector)
yes       x8      AVX              6549.3       32 updates per loop iteration (8 4-wide vector)

no        x8      SSE2             5306.2       8 updates per loop iteration (scalar)
no        x8      AVX              5217.4       8 updates per loop iteration (scalar)

no        --      SSE2             4449.4       2 updates per loop iteration (scalar)
no        --      AVX              4394.0       2 updates per loop iteration (scalar)
-----------------------------------------------------------------------------------------------

QIAOMIN_Q_ · ‎08-31-2013

Thanks to Dr. Bandwidth for digging so deeply into this. I will investigate the assembly differences between no-vec + no-streaming-store and vec + no-streaming-store further, because the latter combination gives lower performance, and I will compare the VTune events between these two sets of compilation options.

Thanks again for your time.

QIAOMIN_Q_ · ‎09-01-2013

As to the point that 'vectorization & no streaming stores is slower than -no-vec and no streaming stores':

After discussing this with some colleagues, I was told that, based on the results on WSM, streaming stores or vectorization hurt more on SNB, and that this is likely an SNB-EP uncore-specific issue. Since this loop is memory bound, no vectorization speedup is expected on Sandy Bridge. The problem does not appear on IVB or on some SNB-EP systems. As I showed above, on the 4-core i7-2600K, vectorization with streaming stores gives the best performance.

Qiao

Zara_g_ · ‎09-05-2013

Interesting, it seems this problem is common. Can someone share their experience with single-threaded STREAM on the E5-26xx? I have seen many forum threads complaining that only one fourth of the peak memory bandwidth can be achieved when using just one core, and I don't know what the limiting factor is for one core in this package either.

McCalpinJohn · ‎09-05-2013

Single threaded STREAM is definitely concurrency-limited on the Xeon E5-26xx series.

What does this mean?

Reaching maximum bandwidth requires that the memory "pipeline" be filled with requests. Since the (idle) memory latency on my Xeon E5-2680 processors is about 77 ns and the peak memory bandwidth is 51.2 GB/s, filling the pipeline requires 77 ns * 51.2 GB/s = 3942 Bytes of outstanding requests. This corresponds to 61.6 cache misses outstanding at each point in time.

The Sandy Bridge core has 10 Line Fill Buffers to handle L1 cache misses, so in the absence of hardware L2 prefetches, you would expect to get a maximum of 10 cache lines * 64 Bytes/cache line = 640 Bytes every 77 ns. This is about 8.3 GB/s --- about 1/6 of the peak bandwidth of the chip.

Fortunately, all of Intel's processors include some degree of hardware L2 prefetching, so a core can have more than 10 cache line transfers in flight. The number of additional transactions in flight is difficult to determine directly, but the performance impact is clearly measurable. For a simple test with only memory reads (no stores), I managed to get a sustained bandwidth of about 17.5 GB/s using one core. Assuming that the average latency remains about 77 ns, this corresponds to 21 cache line transfers in flight at all times. Getting this much extra traffic from the L2 hardware prefetchers required some code rearrangement so that the processor was reading from multiple independent 4 KiB pages at the same time. Each 4 KiB page is tracked separately by the L2 hardware prefetcher, so accessing more pages allows more L2 prefetches to be generated.

In the end, all transfers into the L1 cache have to go through the L1 Line Fill Buffers. The way that L2 prefetches help bandwidth is to bring the data into the L3 and/or L2 cache so that the L1 Line Fill Buffers are occupied for a shorter amount of time. Decreasing the average latency increases the throughput available from a fixed number of buffers. If the data comes from (or goes directly to) DRAM, the buffer needs to be occupied for an amount of time similar to the latency (77 ns in this case), so the sustainable bandwidth is
10 buffers * 64 Bytes/buffer / 77 ns = 8.3 GB/s
If many of the L1 misses are satisfied in the L2 or L3 cache, the average time that the Line Fill Buffers are occupied will decrease. As an example, a 50% reduction in occupancy gives a sustained bandwidth of
10 buffers * 64 Bytes/buffer / 38.5 ns = 16.6 GB/s
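The concurrency estimates above are applications of Little's Law (bytes in flight = latency * bandwidth). A small sketch of the two calculations, with hypothetical function names and assuming 64-byte cache lines:

```c
#include <assert.h>

/* Little's Law: bytes in flight = latency * bandwidth.
 * With latency in ns and bandwidth in GB/s (1e9 bytes/s) the factors of
 * 1e9 cancel, so ns * GB/s gives bytes directly. */
double lines_in_flight(double latency_ns, double bandwidth_gbs, double line_bytes)
{
    assert(line_bytes > 0.0);
    return latency_ns * bandwidth_gbs / line_bytes;
}

/* Bandwidth (GB/s) sustainable by a fixed number of buffers when each one
 * is held for occupancy_ns per cache-line transfer. */
double buffer_limited_gbs(int buffers, double line_bytes, double occupancy_ns)
{
    assert(buffers > 0 && occupancy_ns > 0.0);
    return buffers * line_bytes / occupancy_ns;
}
```

With these, `buffer_limited_gbs(10, 64.0, 77.0)` is about 8.3 GB/s, `buffer_limited_gbs(10, 64.0, 38.5)` is about 16.6 GB/s, and `lines_in_flight(77.0, 17.5, 64.0)` is about 21 lines.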

This brings us back to STREAM and streaming stores. With "ordinary" stores, the L2 hardware prefetcher can fetch lines in advance and reduce the time that the Line Fill Buffers are occupied, thus increasing sustained bandwidth. On the other hand, with streaming (cache-bypassing) stores, the Line Fill Buffer entries for the stores are occupied for the full time required to pass the data to the DRAM controller. In this case, the *loads* can be accelerated by hardware prefetching, but the stores cannot, so you get some speedup, but not as much as you would get if both loads and stores were accelerated.

The details are somewhat different on AMD processors because streaming stores are handled by a different set of buffers. I have not been tracking the newer AMD processors, but in the Family 10h processors (Barcelona, Shanghai, Istanbul, MagnyCours), each core had 8 buffers to handle L1 cache misses (either loads or "ordinary" stores) and 4 additional buffers to handle streaming stores. The extra available buffers made streaming stores effective in more cases than we see with the Intel processors.

Of course, when using many cores, the extra memory traffic required for "ordinary" stores (reading the data into the L1 cache before overwriting it) makes the streaming stores more valuable (since they do not read the data into the cache before overwriting it -- except on a few platforms, such as the Xeon E7, where the reads happen anyway).

QIAOMIN_Q_ · ‎09-05-2013

Thanks John

On my own Core i7-2600K (4-core Sandy Bridge), I get the highest memory bandwidth, nearly 18 GB/s, with streaming stores in a single thread; the chip has a maximum memory bandwidth of about 22 GB/s. Thanks for your explanations in response to my question.

My only remaining question concerns the following scenario. Suppose all loads in a large loop have unit-stride memory access, as in STREAM, and at the beginning of this large block-copy loop the compiler automatically inserts (or we manually insert) loads to data on the next adjacent 4 KB pages (TLB priming) to get more prefetches in flight. In that case, the line fill buffers may not be the limit, given the quick LFB hits when the prefetchers bring in multiple cache lines simultaneously. I think that in such an achievable scenario the only limit would be the roughly 60 load buffer entries?

I see that AVX offers better cache bandwidth, especially for the L1 cache, and that on SNB the peak bandwidths for L2 and L1 are 251.7 GB/s and 108.54 GB/s respectively. I also know that the memory bandwidth reported by PCM can sometimes be double the number reported by STREAM. Thanks in advance for pointing out any flaws in my speculation; I may have overlooked something.

Qiao

McCalpinJohn · ‎09-06-2013

18 GB/s for streaming stores using a single core is excellent. I have not gotten around to trying that case yet, but it seems clear that the Line Fill Buffer occupancy for streaming stores has to be significantly lower than the DRAM read latency to enable this. That does make sense -- the read latency is a round trip, while the write latency is only one way.

The Core i7-2600K looks like about the same configuration as my Xeon E3-1270 (4-core, 3.4 GHz, 2 channels of DDR3-1333).
On that Xeon E3, the open page memory read latency is lower than in my 2-socket Xeon E5-2680 nodes because it is not necessary to wait for the coherence response from the other socket before using the data. I measured 53.4 ns for open page read latency and about 67.1 ns for closed-page read latency. The difference of 13.7 ns is reasonable enough -- the extra time to open a page (without conflicts) should be about 12 ns. For the 1.5 ns major clock of DDR3-1333, a typical timing for page open is 8-9 cycles, which is 12.0-13.5 ns -- very close to what I observed.

Using the 53.4 ns open page read latency as a Line Fill Buffer occupancy estimate for streaming stores and a peak bandwidth of 21.33 GB/s gives a concurrency requirement of 1139 Bytes, or 17.8 cache lines. But the occupancy is almost certainly lower than the full read latency. Taking the 18 GB/s you observed and assuming the concurrency is about 10 cache lines allows us to calculate an average occupancy of 35.6 ns, which is certainly plausible -- about 18 ns less than the open page read latency. Of course it is entirely possible that in the streaming store case the core can hand off the streaming stores and their data to a larger pool of buffers in the memory controller, thus keeping the occupancy of the CPU Line Fill Buffers down to smaller values. It would be interesting to repeat this test case on a system with higher latency and higher bandwidth to see if the streaming store bandwidth gets even higher, or if it is close to its concurrency-limited value at the 18 GB/s observed....
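The occupancy estimate is the same buffer calculation run in reverse, and the page-open time is a cycle count multiplied by the DRAM clock period. A minimal sketch under those assumptions (hypothetical helper names, 64-byte lines, 10 buffers always busy):

```c
#include <assert.h>

/* Average buffer occupancy (ns) implied by an observed bandwidth (GB/s),
 * assuming a fixed number of cache-line buffers are always busy. */
double implied_occupancy_ns(int buffers, double line_bytes, double observed_gbs)
{
    assert(buffers > 0 && observed_gbs > 0.0);
    return buffers * line_bytes / observed_gbs;
}

/* Time (ns) for a DRAM timing parameter given in clock cycles. */
double dram_cycles_to_ns(double cycles, double clock_ns)
{
    assert(cycles >= 0.0 && clock_ns > 0.0);
    return cycles * clock_ns;
}
```

Here `implied_occupancy_ns(10, 64.0, 18.0)` is about 35.6 ns, and `dram_cycles_to_ns(8.0, 1.5)` and `dram_cycles_to_ns(9.0, 1.5)` give the 12.0 ns and 13.5 ns bounds for the page-open time.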

QIAOMIN_Q_ · ‎09-06-2013

Hello John,

Thanks for your in-depth comments. I am now curious about cacheable 'ordinary' loads. I know that in some cases streaming stores are line-fill-buffer bound, and there should be less of a limit for cacheable 'ordinary' demand loads. As I said above, with good TLB priming and multiple software prefetch instructions in a large block-copy loop, the LFBs would not be a problem, because effective software prefetching leaves fewer data cache misses. So, considering that on SNB the peak bandwidths for L2 and L1 are 251.7 GB/s and 108.54 GB/s respectively, maybe the limiting factors for reaching peak memory bandwidth would be the memory controller's capacity and then the load buffer entries? In the case of the Core i7-4770K (Haswell), which can reach 37.3 GB/s peak bandwidth with a 2.333 GHz dual-channel memory subsystem, the maximum ideal read bandwidth would still be limited only by the memory controller's capacity, which I think is only about one third of the L1's peak bandwidth.

Thanks,

Qiao

Zara_g_ · ‎09-07-2013

Hello Qiao and John,

Could you explain more about why non-temporal stores give a performance gain on the Core i7 but a degradation on the Xeon E5? As you said, they have the same microarchitecture and similar CPU frequencies. The only difference, as John said, is that the Xeon E5 has a bigger LLC, which may not suffer additional snoop traffic, especially with streaming stores. So the E5 sounds like it should behave better than the i7 in the non-temporal case. It seems John did not mention this above, or maybe I'm wrong.

McCalpinJohn · ‎09-08-2013

In general, streaming stores will help when the DRAM utilization is very high, since they reduce the required read traffic.
That is why they are beneficial for a single thread on a system with only two memory channels, but detrimental for a single thread on a system with four memory channels (per socket).

The details are more complex, and some of the details are not clear to me yet. The results below show that the Xeon E3 is a lot faster than the Xeon E5 when running STREAM with streaming stores using a single thread. The Xeon E3 is 75% faster. The largest part of this is probably simply the latency difference -- the open page (read) latency on Xeon E5 is 42% higher than on the Xeon E3, so a concurrency-limited benchmark should be about that much faster on the lower-latency system. The clock frequency difference of 10% might account for a bit more of the difference -- the Xeon E5 sustained bandwidth is definitely a function of the clock frequency (though the difference is not terribly large when comparing 2.7 GHz and 3.1 GHz).

The simpler "uncore" of the Xeon E3 could lead to significantly lower Line Fill Buffer occupancy for streaming stores. The Xeon E5 has a much more complex ring structure to navigate in order to hand off the streaming stores from the core buffers to the memory controllers, so the occupancy might differ by a larger factor than the memory (read) latency.

Comparing STREAM Triad results with and without streaming stores on my Xeon E3-1270 (3.4 GHz, quad-core, single socket, 2 channels of DDR3-1333) and on my Xeon E5-2680 (2.7 GHz typically running at 3.1 GHz Turbo, 8-core, two-socket, 4 channels of DDR3-1600 per socket) shows this:

System          With streaming stores    Without streaming stores
Xeon E3-1270    18.1 GB/s (85%)          13.1 GB/s (61%, 82%)
Xeon E5-2680    10.3 GB/s (20%)          13.7 GB/s (27%, 36%)

The values in parentheses are the percent of peak DRAM bandwidth sustained for a single socket.
Where there are two numbers, the first is based on the bandwidth reported by STREAM and the second is the actual DRAM utilization taking into account the additional read traffic required by store misses (when not using streaming stores).
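The percentages can be reproduced from the channel configuration. The sketch below assumes 64-bit (8-byte) DDR3 channels and the 4/3 Triad traffic factor for ordinary stores; the function names are hypothetical.

```c
#include <assert.h>

/* Peak DRAM bandwidth (GB/s) for a number of 64-bit (8-byte) DDR3 channels
 * at a given transfer rate in MT/s, e.g. 1600 for DDR3-1600. */
double peak_dram_gbs(int channels, double mega_transfers_per_s)
{
    assert(channels > 0 && mega_transfers_per_s > 0.0);
    return channels * 8.0 * mega_transfers_per_s / 1000.0;
}

/* Percent of peak for a STREAM Triad rate. With ordinary stores, Triad's
 * 2 loads + 1 store become 4 line transfers per 3 counted (factor 4/3);
 * with streaming stores the reported rate is the DRAM traffic. */
double triad_percent_of_peak(double reported_gbs, double peak_gbs,
                             int streaming_stores)
{
    assert(peak_gbs > 0.0);
    double traffic = streaming_stores ? reported_gbs
                                      : reported_gbs * 4.0 / 3.0;
    return 100.0 * traffic / peak_gbs;
}
```

For example, `peak_dram_gbs(4, 1600.0)` gives 51.2 GB/s for one Xeon E5-2680 socket, and `triad_percent_of_peak(13.7, 51.2, 0)` gives roughly 36%, the second number in that table cell.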

QIAOMIN_Q_ · ‎09-08-2013

Some additions to the topic: on Sandy Bridge-EP, each core supports only up to 10 outstanding L1 D-cache misses and 16 total outstanding L2 misses, and the LLC seems to have a peak bandwidth of only about 30 GB/s per core in Turbo mode, although the L2 can have a theoretical peak bandwidth of 100 GB/s (or maybe more). There is also more to consider than the outstanding misses we have been discussing: the read bandwidth (demand requests and software/hardware prefetches) should be taken into account as well.

There is also a limited number of open requests per core. A single core of the E5 in particular sees a longer ring bus and more LLC slices; in the single-threaded scenario there may not be additional snooping traffic or non-local accesses in the other seven LLC slices.

McCalpinJohn · ‎09-10-2013

I get about 17.4 GB/s for a loop of two 256-bit AVX streaming stores using a single thread on my Xeon E3-1270 (Sandy Bridge, dual-channel DDR3, quad-core, 3.4 GHz, Turbo disabled). This corresponds to an average occupancy of ~37 ns if I assume 10 buffers. In contrast, my Xeon E5-2680 (Sandy Bridge EP, quad-channel DDR3, 8-core, 3.1 GHz turbo) delivers about 8.8 GB/s.

There are some interesting features in the performance profile on both of these systems as I vary the size of the loop. If I cut the loop back to 152 double-precision elements (19 cache lines), the reported bandwidth approximately doubles, with values in the range of 31 GB/s to 35 GB/s. Since these values are impossible for the DRAM to sustain, it is apparent that there is an additional set of buffers between the core and the DRAMs that can absorb multiple writes to the same addresses. Increasing the loop length to 160 elements (20 cache lines) causes the reported performance to fall back to plausible levels (less than 16 GB/s).

This is a good reminder of how hard it can be to define "latency" and "bandwidth" for stores, since the memory system only has to provide the appearance of storing the data to memory, it does not actually have to put it there.