Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

SB-E memory read bandwidth limitation?

DLake1
New Contributor I
This is something I was talking about on the C++ forum. I have an i7 3820 at 4.3GHz with 16GB quad channel 2133MHz RAM, and I've been doing single threaded benchmarking with my own inline assembly.

If one 64-bit memory channel were able to transfer 64 bits on every cycle of the 2133MHz DDR clock it would be transferring 15.9GB/s, and hey, that's right where my write bandwidth is at! Now if one core were to store 64 bits per clock cycle it would be transferring 34.4GB/s, so obviously the RAM is the limit because the CPU is only transferring 64 bits per cycle. This must be because the load and store ports can only transfer 64 bits per cycle and there's only one store port per core, so that's what's limiting write bandwidth. But there are two load ports, so I should get about 31.8GB/s read, yet I'm about 10GB/s short. So where's the limit? <--- Here's the question.

These are some things I discovered using inline assembly for benchmarking:

1. temporal stores are faster for copying memory, probably because caching the (slower) writes interferes with the reads less
2. using 1 xmm register is fastest for copying and writing
3. use prefetcht2 for copying
4. non-temporal stores are faster for just writing
5. use prefetcht0 for reading
6. use all available xmm registers for reading
7. building for 64 bit is slightly faster because there are more SSE registers available

One last thing: AIDA says my memory latency is 53.6ns.
Patrick_F_Intel1
Employee

Hello CommanderLake,

It sounds like you are trying to write a memcpy, or at least doing a lot of the same work involved in a memcpy. If you aren't working on a memcpy then a lot of these comments are maybe not so useful.

For performance reasons, a memcpy can be broken down into a pure read of the source array and a pure write (like a memset) of the dest array. Or at least that is how I broke it down.

If the data is not in L1 already, then the 'pure read' can be profiled with a touch of each cache line. So just loading one integer from each 64 bytes will read the data at about as fast as it can be read. This helps me see how fast one cpu can do the 'read' side of the copy operation... where the limiting factor is just how fast a cache line can be moved around. Usually a simple C loop is sufficient to get within about 10% of theoretical (manufacturer guaranteed not to exceed) bandwidth. But I have assembly code to do this just so I get less variation in performance when using different compilers or platforms. I usually measure performance in terms of bytes/clocktick (and you have to be careful with the 'just read 1 integer per cacheline' approach in the case where the cacheline is already in L1... then you are only moving the 4 bytes, not the whole cacheline).
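In rough C, the read-side probe is just something like this (a sketch, not my actual code; it assumes 64-byte cachelines and an array much bigger than the caches):

#include <stddef.h>

// Read-side probe: load one int per 64-byte cache line, so each iteration
// forces a new line in from memory. Returning the sum keeps the compiler
// from optimizing the loop away.
long long touch_read(const int *src, size_t bytes) {
    long long sum = 0;
    for (size_t i = 0; i < bytes / sizeof(int); i += 64 / sizeof(int))
        sum += src[i];
    return sum;
}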

Prefetching can help performance if the data is not already in cache. If the data was already in cache or if the prefetch is done too early (so that the data gets kicked out of cache before it is used) then prefetching can slow things down.

For the 'pure write' side of the 'copy', we can simulate a memset operation. One can approach peak bandwidth with a C loop where you modify 1 integer per cacheline. This operation will do the implicit read-for-ownership and then the write, so the actual memory moved is twice the size of the dest array.
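And the write-side probe is the mirror image (again just a sketch):

#include <stddef.h>

// Write-side probe (memset-like): store one int per cache line. Each store
// causes an implicit read-for-ownership plus the eventual writeback, so
// roughly 2x the array size crosses the memory bus.
void touch_write(int *dst, size_t bytes) {
    for (size_t i = 0; i < bytes / sizeof(int); i += 64 / sizeof(int))
        dst[i] = 1;
}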

The non-temporal stores can avoid the read-for-ownership if you are writing a full cacheline. If you are not writing full cachelines then the non-temporal store is much slower. A memcpy done with non-temporal stores only does the read of the source array and a write of the dest array. Without the non-temporal store, a memcpy reads the source AND the dest (via a read for ownership (RFO)) and then does the write, so the actual memory moved is 3x the size of the source array.

Knowing when to use non-temporal stores is hard. I'm not sure what happens if the cacheline is already in cache... it seems like on some systems the cacheline was first kicked out and then the non-temporal store was done, which is actually slower. I'm not sure if current cpus still have this behavior. Also, if the dest array is going to be needed soon then doing non-temporal stores can actually slow things down (since the dest array is evicted from cache by the non-temporal stores). For this reason I only used non-temporal stores when the size of the memcpy was larger than half of the last level cache.
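For illustration only (this is a sketch, not the code from that memcpy work), a streaming-store copy path looks roughly like this; note the alignment and size assumptions in the comments:

#include <emmintrin.h>   // SSE2 intrinsics, including _mm_stream_si128
#include <stddef.h>

// Non-temporal copy path: stream full 64-byte lines so no read-for-ownership
// of the destination is needed. Assumes 16-byte-aligned pointers and a size
// that is a multiple of 64; intended only for blocks bigger than roughly
// half the last-level cache.
void copy_stream(void *dst, const void *src, size_t bytes) {
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; i += 4) {
        _mm_stream_si128(d + i,     _mm_load_si128(s + i));
        _mm_stream_si128(d + i + 1, _mm_load_si128(s + i + 1));
        _mm_stream_si128(d + i + 2, _mm_load_si128(s + i + 2));
        _mm_stream_si128(d + i + 3, _mm_load_si128(s + i + 3));
    }
    _mm_sfence();   // make the streaming stores globally visible before returning
}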

If you know you are going to be doing specific sizes in specific scenarios then it is easier to write an optimized memcpy.

When I was working on a memcpy years ago, we looked at the frequency, size and the time for each memcpy operation in a variety of applications. By far the most important sizes in terms of frequency were sizes < 64 bytes, and if you look at the time taken, the smaller sizes are even more important. So all the easier memcpy stuff (moving big blocks as fast as possible) isn't usually important for real applications; it is the small sizes that matter.

That is why you see the different code paths based on size in the Intel compiler memcpy. There is (or at least there used to be) one code path for sizes < 64 bytes, maybe another code path for sizes < 256 bytes, and another for sizes < half of the last level cache. There is a cost to using xmm registers... or at least there used to be a penalty for unaligned memory accesses. The 'aligning' of the maybe-not-aligned source or dest data to xmm 16 byte boundaries could be expensive... too expensive for a small memcpy. This 'importance of small sizes' also means that prefetching and non-temporal stores generally aren't worth doing for small sizes.

So the rules you list above can be helpful... in the right cases. It all just depends on how one is using the data being moved around, how you are going to use the memory after it is copied, where the data was before it was copied, and how frequently you are moving that size of data...

Hopefully this is helpful...

Pat

DLake1
New Contributor I

I just want to know why I can't achieve the max theoretical read speed with large transfers when I can easily achieve the full write bandwidth of 64 bits per cycle of my 2133MHz RAM. I use 2x 512MB (536870912 byte) aligned arrays and this assembly for the read test:

__asm{
	mov rax, data0;
	mov r8, 2097152;
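	// 2097152 iterations x 256 bytes per iteration = 536870912 bytes (the whole 512MB array)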
	loop_read:
	prefetcht0 256[rax];
	prefetcht0 272[rax];
	prefetcht0 288[rax];
	prefetcht0 304[rax];
	prefetcht0 320[rax];
	prefetcht0 336[rax];
	prefetcht0 352[rax];
	prefetcht0 368[rax];
	prefetcht0 384[rax];
	prefetcht0 400[rax];
	prefetcht0 416[rax];
	prefetcht0 432[rax];
	prefetcht0 448[rax];
	prefetcht0 464[rax];
	prefetcht0 480[rax];
	prefetcht0 496[rax];
	movdqa xmm0, 0[rax];
	movdqa xmm1, 16[rax];
	movdqa xmm2, 32[rax];
	movdqa xmm3, 48[rax];
	movdqa xmm4, 64[rax];
	movdqa xmm5, 80[rax];
	movdqa xmm6, 96[rax];
	movdqa xmm7, 112[rax];
	movdqa xmm8, 128[rax];
	movdqa xmm9, 144[rax];
	movdqa xmm10, 160[rax];
	movdqa xmm11, 176[rax];
	movdqa xmm12, 192[rax];
	movdqa xmm13, 208[rax];
	movdqa xmm14, 224[rax];
	movdqa xmm15, 240[rax];
	add rax, 256;
	dec r8;
	jnz loop_read;
	loop_read_end:
}

If you could make it faster that would be amazing!

Patrick_F_Intel1
Employee

So how many bytes/clocktick do you get with the above code?

Is turbo enabled? Do you loop multiple times over the assembly code? Do you have good timers?... I assume you do...

Pat

McCalpinJohn
Honored Contributor III

If the BIOS configured your memory subsystem correctly, then the four DRAM channels will be interleaved on cache-line boundaries, giving a peak memory bandwidth of 68.26 GB/s for contiguous accesses.   If this is the case, getting exactly 1/4 of this for the write-only benchmark is just a coincidence.

Single threaded memory bandwidth on your system will be concurrency-limited.  With a latency of 53.6 ns and a bandwidth of 68.26 GB/s, you will need 3659 Bytes (rounds up to 58 cache lines) "in flight" at all times.   Each core of your system can only support 10 outstanding L1 Data Cache misses, which is clearly not close to the required concurrency.
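(For reference, the concurrency requirement is just Little's Law applied to the numbers above:

    required concurrency = latency x bandwidth = 53.6 ns x 68.26 GB/s ~= 3659 Bytes ~= 58 cache lines

so 10 outstanding L1 Data Cache misses by themselves would only sustain about 10 x 64 Bytes / 53.6 ns ~= 12 GB/s; anything beyond that has to come from the hardware prefetchers.)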

For reads, the L2 hardware prefetchers will bring the data into the L3 (and sometimes into the L2), which will reduce the effective latency.  On a 2-socket Xeon E5 (Sandy Bridge), my most aggressive single-threaded code was able to get read bandwidth of about 18 GB/s, with an "effective concurrency" of about 21 cache lines.   Getting this level of performance required accessing multiple 4 KiB pages concurrently, as I describe in a series of blog postings at http://sites.utexas.edu/jdm4372/2010/11/, with some additional results from Xeon E3 (Sandy Bridge) and Xeon E5 (Sandy Bridge EP) in an Intel forum discussion at https://software.intel.com/en-us/forums/topic/480004.

Temporal stores behave like reads from a concurrency perspective, since each store that misses in the cache hierarchy reads the cache line from memory before overwriting it.   At high bandwidth levels (e.g., more than 50% utilization), these extra reads (usually referred to as "write allocates") will get in the way of reads and slow things down, but at lower bandwidth levels they are usually no problem.

Non-temporal stores are much harder to understand from a concurrency perspective because there is no obvious way to estimate how long the buffers used by these non-temporal stores will be occupied for each transaction.  As discussed in a number of other forum topics, it appears that the "client" chips hand off the non-temporal stores to the memory controller quite quickly, so the 10 Line Fill Buffers support a very high store bandwidth.  On the "server" chips, the non-temporal stores have lower throughput, which suggests (but certainly does not prove) that they are held in the Line Fill Buffers for a longer period of time for each transaction.  In any case, for ordinary ("temporal") stores, the L2 hardware prefetchers can bring the cache line in early so that the "write allocate" can complete faster, but there is no comparable mechanism for non-temporal stores -- each 64 Bytes of non-temporal stores occupies a Line Fill Buffer until it is able to hand it off to the memory controller.  Some of this is discussed in the forum topic at https://software.intel.com/en-us/forums/topic/456184.

In that same forum topic (456184), I also note that *scalar* code is faster than SSE code for some memory-bound kernels -- quoting an 11%-12% improvement for the STREAM Triad kernel.  For the STREAM Copy kernel the improvement was smaller -- I don't have the numbers in front of me, but I seem to recall a boost in the 6%-8% range.   My current hypothesis is that scalar loads ramp up the L1 hardware prefetcher faster than vector loads (since there are more of them), and that this helps ramp up the L2 prefetcher faster.  The initial ramp is important for performance because the L2 prefetchers stop and start at 4KiB page boundaries, and the time required to transfer a 4 KiB page is pretty similar to the latency for the first access.  On your system, 4096 Bytes at 68.26 GB/s is 60 ns -- only about 12% longer than the initial latency.

Finally, I should note that even if a core could support a much larger amount of concurrency, at some point you will run into cache bandwidth limitations.   Although I don't understand the details of the performance limitations, on my Xeon E5 (Sandy Bridge EP) processors, I have been able to construct benchmarks that deliver about 42 Bytes/cycle (7/8 of the peak bandwidth of 48 Bytes/cycle) for L1-contained data, but this drops to 14 Bytes per cycle for L2-contained data (7/8 of 1/2 of the peak bandwidth of 32 Bytes/cycle), and drops further to 8 Bytes/cycle for L3-contained data (1/4 of the peak bandwidth of 32 Bytes/cycle).   For your 4.3 GHz system, 8 Bytes/cycle is 34.4 GB/s, or about 1/2 of the total peak DRAM bandwidth. 

I have never tested a Sandy Bridge E system ("server" uncore in a single socket), but comparing your system to my Xeon E5-2670 (Sandy Bridge EP), you have 20% lower memory latency and 33% higher bandwidth.  My best single-threaded values for STREAM Copy have been in the 13.65 GB/s range -- compiled with "icc -O3 -ffreestanding -no-vec stream.c -o stream_uni_novec".    I would expect the Sandy Bridge E system to be at least 20% faster (due to reduced latency) with the same binary -- somewhere in the 16.4 GB/s range.  This is counting explicitly requested reads (8.2 GB/s) plus explicitly requested writes (8.2 GB/s) and ignoring the additional 8.2 GB/s of reads that are required to bring the target cache lines into the cache.  See http://www.cs.virginia.edu/stream/ref.html#counting for a discussion of how STREAM counts traffic, and http://sites.utexas.edu/jdm4372/2013/01/05/ for an explanation of why STREAM uses "decimal" millions instead of "binary" millions in reporting memory bandwidth.

DLake1
New Contributor I
I don't like walls of text but I'll try my best. Thanks for all the info, that's a mighty brain you have! So what assembly will go fastest for me with a large amount of data? A code sample tells a thousand words. I P/Invoke the methods in a dll from a C# program in a loop which times it with Stopwatch and displays the effective performance, and I verify the actual bandwidth with VTune. The highest read I can get is about 21.5GB/s and 17GB/s write as measured with VTune, but my C# program says 20GB/s read and 16GB/s write.
DLake1
New Contributor I

I keep channel and rank interleaving off btw; it's slightly slower with either of them on.

What will be the next architecture to have a significant memory access performance increase?

McCalpinJohn
Honored Contributor III

Since you are getting more than 17.06 GB/s on reads you must be using more than one channel at least part of the time.  If you are using the default small page size, then each 4KiB page can be expected to be on a random channel and you will get some small amount of overlap on the page transitions.   It takes a lot of extra effort to control what is happening if you disable channel and rank interleaving -- at the very least you need to use large pages, obtain the physical address for each large page (via /proc/pid/pagemap on Linux), and then work out which channel and rank each of those 2 MiB pages maps onto.  My BIOS does not provide options for turning off interleave -- if it did then I would have some interesting experiments to try....
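If anyone wants to try that on Linux, the pagemap lookup is roughly the following (a sketch with minimal error handling; recent kernels require root or CAP_SYS_ADMIN to expose the physical frame numbers):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: translate a virtual address to a physical address by reading
 * /proc/self/pagemap (one 64-bit entry per page). Returns 0 on failure
 * or if the page is not present. */
uint64_t virt_to_phys(const void *vaddr) {
    long psize = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) return 0;
    uint64_t entry = 0;
    off_t off = ((uintptr_t)vaddr / psize) * sizeof(entry);
    ssize_t got = pread(fd, &entry, sizeof(entry), off);
    close(fd);
    if (got != sizeof(entry) || !(entry & (1ULL << 63))) return 0;  /* not present */
    uint64_t pfn = entry & ((1ULL << 55) - 1);                      /* bits 0-54 = page frame number */
    return pfn * (uint64_t)psize + ((uintptr_t)vaddr % psize);
}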

My fastest "ReadOnly" results on the Sandy Bridge EP (about 18.5 GB/s) came from a combination of techniques, including using large pages, software prefetching (between 4-8 lines ahead) and reading from two streams concurrently.  The system was configured with one dual-rank DDR3/1600 DIMM per channel and both channel interleaving and rank interleaving were enabled.  In the sequence of experiments described at http://sites.utexas.edu/jdm4372/2010/11/, this was "Version010", and the source code is available at https://utexas.box.com/ReadOnly-tarfile.

It is not going to be easy getting significantly higher bandwidth from a single thread.  As I noted previously, I am only getting 8 Bytes/cycle from the L3 cache (using STREAM Triad, which reads two arrays and writes to a third array), which would be 34.4 GB/s at the 4.3 GHz of your system.  You are already at over 1/2 of this rate, so even if the latency decreases, the concurrency increases, and the DRAM bandwidth increases, you will have limited room for improvements.  There may be more upside for read-only kernels since you would not need to use any of the cache bandwidth for handling cast-outs of dirty victim lines.

On the other hand you should be able to get big improvements in bandwidth right now by using multiple threads.  I routinely get 38 GB/s (19 GB/s read + 19 GB/s write) for STREAM Copy on each Xeon E5-2680 socket.  This is ~75% of the peak DRAM bandwidth.  Intel's "Memory Latency Checker" shows that >90% utilization is possible for pure reads, and close to 90% is possible for combinations of reads and writes that update one of the input arguments (instead of using streaming stores).   Within a single chip the overhead for coordinating threads is very low -- I measured 0.5 microseconds for an OpenMP barrier and 1.0 microseconds for an OpenMP "parallel for" on a single socket of my Xeon E5-2680 systems -- so the parallel approach should be effective for even fairly small transfers to/from memory.

Future processors may include high-bandwidth on-package or "stacked" memory, but I don't expect significant changes in single-thread sustained bandwidth. Single thread bandwidth might go up slightly on some processors (Haswell has doubled the bandwidth of the L1 and L2, for example, which might help in the systems with 4 DRAM channels), but it is likely to go down on other processors (e.g., Knights Landing will have slower cores than your current Sandy Bridge E -- lots more aggregate bandwidth - especially from the on-package eDRAM, but it will almost certainly have lower single-thread bandwidth than a Sandy Bridge/Ivy Bridge/Haswell processor with 4 DRAM channels).

The systems with the highest single-thread memory bandwidth have traditionally been the vector supercomputers, but the single-socket Intel processors are certainly closing in on some of those traditional numbers.  Looking at the STREAM website (http://www.cs.virginia.edu/stream/), I see that my Sandy Bridge EP result of 15.4 GB/s (STREAM Triad using 1 core) already passes the (1995 era) Cray T94 result of 13.9 GB/s, but we have not yet caught up with the (2001 era) NEC SX-6 at 31.9 GB/s.  (The newer 2008-era SX-9 series systems have 8 times the bandwidth per processor of the SX-6, but I have not gotten any STREAM benchmark submissions since the SX-7.  The SX-9 HPC Challenge benchmark submission does not appear to include any single-threaded STREAM results: http://icl.cs.utk.edu/hpcc/).

Bernard
Valued Contributor I

@John

Do you know whether the SB memory controller can dynamically schedule work onto Port 2 and Port 3, for example using both ports for loads, or one port for a load and one for a store, at CPU-cycle granularity?

DLake1
New Contributor I

Measuring with VTune I got an average read of about 24GiB/s, with peaks reaching 25.5GiB/s, by reading 2x 512MiB arrays into all 16 xmm registers with lots of non-temporal prefetching, using this asm:

__asm{
	mov rax, data0;
	mov rdx, data1;
	mov r8, datasize;
	shr r8, 8;
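	// r8 = datasize / 256; each iteration reads 256 bytes from each of the two arrays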
	loop_read:
	prefetchnta 256[rax];
	prefetchnta 256[rdx];
	prefetchnta 288[rax];
	prefetchnta 288[rdx];
	prefetchnta 320[rax];
	prefetchnta 320[rdx];
	prefetchnta 352[rax];
	prefetchnta 352[rdx];
	prefetchnta 384[rax];
	prefetchnta 384[rdx];
	prefetchnta 416[rax];
	prefetchnta 416[rdx];
	prefetchnta 448[rax];
	prefetchnta 448[rdx];
	prefetchnta 480[rax];
	prefetchnta 480[rdx];
	movdqa xmm0, 0[rax];
	movdqa xmm0, 0[rdx];
	movdqa xmm1, 16[rax];
	movdqa xmm1, 16[rdx];
	movdqa xmm2, 32[rax];
	movdqa xmm2, 32[rdx];
	movdqa xmm3, 48[rax];
	movdqa xmm3, 48[rdx];
	movdqa xmm4, 64[rax];
	movdqa xmm4, 64[rdx];
	movdqa xmm5, 80[rax];
	movdqa xmm5, 80[rdx];
	movdqa xmm6, 96[rax];
	movdqa xmm6, 96[rdx];
	movdqa xmm7, 112[rax];
	movdqa xmm7, 112[rdx];
	movdqa xmm8, 128[rax];
	movdqa xmm8, 128[rdx];
	movdqa xmm9, 144[rax];
	movdqa xmm9, 144[rdx];
	movdqa xmm10, 160[rax];
	movdqa xmm10, 160[rdx];
	movdqa xmm11, 176[rax];
	movdqa xmm11, 176[rdx];
	movdqa xmm12, 192[rax];
	movdqa xmm12, 192[rdx];
	movdqa xmm13, 208[rax];
	movdqa xmm13, 208[rdx];
	movdqa xmm14, 224[rax];
	movdqa xmm14, 224[rdx];
	movdqa xmm15, 240[rax];
	movdqa xmm15, 240[rdx];
	add rax, 256;
	add rdx, 256;
	dec r8;
	jnz loop_read;
	loop_read_end:
}

How do I use larger pages? I'm not really familiar with pages.

Patrick_F_Intel1
Employee

Aren't you supposed to just be prefetching 1 addr per cacheline?

DLake1
New Contributor I

Oh, I thought it prefetched 8 bytes, but now that I look closer it does prefetch a minimum of 32 bytes: http://x86.renejeschke.de/html/file_module_x86_id_252.html

Seems obvious now thanks for pointing that out.

Doesn't seem to make any difference anyway.

Patrick_F_Intel1
Employee

I'm 99% sure that on all recent Intel cpus a prefetch will fetch 64 bytes (a cacheline).

McCalpinJohn
Honored Contributor III

The 24 GB/s read bandwidth is pretty close to what I would expect to be the practical limit.  The best I could get was 18.5 GB/s on a Xeon E5-2680.  24 GB/s is 30% faster, which is on the high end of the expected range of 20% faster (memory latency) to 33% faster (peak memory bandwidth).

Prefetches are definitely fetching 64 Bytes into the L1 cache, but they can trigger the hardware prefetchers to fetch additional lines into the L2 and/or L3 caches.

If you are running on Windows I don't know how to enable large pages.  On Linux systems large pages can be enabled a couple of different ways depending on the version.  You can request large pages as an option to "shmget()" or as an option to "mmap()".  Recent Linux versions have "transparent large page support", which (if enabled) automatically creates large pages and uses them for sufficiently large allocations.
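On Linux the mmap() route looks roughly like this (a sketch; it assumes huge pages have already been reserved, e.g. via /proc/sys/vm/nr_hugepages, and the buffer size is illustrative):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

/* Sketch: allocate a 512 MB buffer backed by explicit huge pages. */
int main(void) {
    size_t size = 512UL * 1024 * 1024;
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    /* ... run the bandwidth kernel against buf ... */
    munmap(buf, size);
    return 0;
}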

If you are running with channel interleave disabled, then large pages will probably decrease your performance (by limiting accesses to a single DRAM channel for each 2 MiB range of addresses).  When accessing two independent arrays I would expect strong variations in performance depending on whether the arrays are mapped to the same DRAM channel or different channels.  With default 4KiB pages the same thing is happening, but the conflicts will be for much shorter periods of time --- i.e., at 24 GB/s it only takes 160 ns to load 4 KiB of data, after which your virtual to physical address translation will change to another randomly chosen channel.   

For L1-contained data it might be worthwhile to look at interleaving loads and stores to make it easier for the hardware to issue to the different ports.  (The out-of-order capability will also enable this, as long as the reorder buffer & reservation stations are not filled with reads before any writes are found in the instruction stream.)   For data in L3 or memory it won't make any difference -- there are so many cycles with nothing happening that efficiency of instruction issue is completely irrelevant.

Another item for L1-contained data is that if you want 2 reads plus 1 write per cycle you have to use AVX loads and stores instead of SSE loads and stores.  The Sandy Bridge core only has 2 address generation ports, and 2 reads + 1 write needs 3 addresses.   The 256-bit AVX loads and stores take twice as long to move twice as much data, but only require 3 addresses every 2 cycles (instead of 3 addresses every cycle).  Once you get outside of the L1 cache this does not make any difference any more, since there are plenty of stall cycles to use for address generation. 
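For concreteness, here is a sketch of that 2-loads-plus-1-store pattern written with AVX intrinsics instead of assembly (array names and the alignment assumptions are illustrative):

#include <immintrin.h>   // AVX intrinsics
#include <stddef.h>

// With 256-bit AVX, each iteration needs 3 addresses for 96 bytes of traffic,
// versus 3 addresses for only 48 bytes with 128-bit SSE. Assumes 32-byte-aligned
// arrays and n a multiple of 8.
void add_avx(float *c, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);              // load 1
        __m256 vb = _mm256_load_ps(b + i);              // load 2
        _mm256_store_ps(c + i, _mm256_add_ps(va, vb));  // store 1
    }
}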

Bernard
Valued Contributor I

@CommanderLake

If you are interested I can test my memory copying routines where main copying loop is written in inline assembly. I have Core i7 Haswell and Core i5 Haswell.

Bernard
Valued Contributor I

iliyapolak wrote:

@CommanderLake

If you are interested I can test my memory copying routines where main copying loop is written in inline assembly. I have Core i7 Haswell and Core i5 Haswell.

I meant testing for maximum bandwidth. Of course the results will probably be different.

DLake1
New Contributor I

Turns out I already had large pages turned on in Windows. I tried committing the space for the buffers with VirtualAlloc with large pages, but it made little to no difference, if anything slightly slower. Channel and rank interleaving also make little to no difference in any case, but I will keep them on auto (on).
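For reference, the allocation I tried was roughly the following (a sketch; the size and error handling are illustrative, and it needs the "Lock pages in memory" privilege):

#include <windows.h>
#include <stdio.h>

// Sketch: commit a 512 MB buffer with large pages on Windows.
int main(void) {
    SIZE_T large = GetLargePageMinimum();            // typically 2 MB on x64
    if (large == 0) { printf("large pages not supported\n"); return 1; }
    SIZE_T size = (SIZE_T)512 * 1024 * 1024;
    size = (size + large - 1) & ~(large - 1);        // round up to a large-page multiple
    void *buf = VirtualAlloc(NULL, size,
                             MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                             PAGE_READWRITE);
    if (!buf) { printf("VirtualAlloc failed: %lu\n", GetLastError()); return 1; }
    // ... run the bandwidth test against buf ...
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}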

What's your Haswell bandwidth then Iliya?

Bernard
Valued Contributor I
>>>What's your Haswell bandwidth then Iliya?>>>

I have not yet tested the memory bandwidth. I plan to do it very soon and post the results.
DLake1
New Contributor I

For starters, vmovdqu should be slightly faster for writing to memory than movdqu for 128-bit xmm registers on an AVX-compatible CPU.

DLake1
New Contributor I

When I tried the compiler intrinsics I found them to be slower, because the compiler messes around with the code and may slow it down, so I switched to __asm, which is much faster: you get to tell the CPU precisely which instructions to execute, so you have complete control over it.

Try this for memory copies. I know it's not entirely optimal, but the memory copies in my program were mostly large and multiples of 16:

bool memcopy(void* __restrict dest, const void* __restrict src, int size){
	if (size % 16 == 0){
		auto psize = size;
		__asm{
			mov eax, src;
			mov edx, dest;
			mov ecx, psize;
			shr ecx, 4;
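			// ecx = size / 16 = number of 16-byte blocks to copy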
			loop_copy:
			prefetcht0 16[eax];
			vmovdqu xmm0, 0[eax];
			vmovdqu 0[edx], xmm0;
			add eax, 16;
			add edx, 16;
			dec ecx;
			jnz loop_copy;
			loop_copy_end:
		}
	}
	else memcpy(dest, src, size);
	return true;
}

That assembly is for x86 only, by the way; for x64, rax, rbx, rcx, rdx and r8 to r15 can be used.

For small copies that fit in the CPU cache, comment out the prefetch instruction.

What's the assembly you tried? I'd like to compare the fast and slow assembly.

Bernard
Valued Contributor I

>>>Please give me some hints on how 128bit transfers can be 1GB/s slower than 32bit GP register(s). It really bothers me and I don't know where to look for answer>>>

This is a very interesting question. From the hardware point of view, there is a fixed number of load/store ports on the receiving end of memory read/write operations. I suppose that when the memory transfer operates on 64-bit GP registers or on SSE registers, the same 64-bit memory operation per channel is carried out by the Memory Controller. Haswell can sustain 2 256-bit loads and 1 256-bit store per cycle, and I think the total bandwidth per CPU cycle probably depends on the number of memory transactions and their corresponding uops.

 
