Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

SB-E memory read bandwidth limitation?

DLake1
New Contributor I
This is something I was talking about on the C++ forum. I have an i7-3820 at 4.3GHz with 16GB of quad-channel 2133MHz RAM, and I've been doing single-threaded benchmarking with my own inline assembly. If one 64-bit memory channel were able to transfer 64 bits on every cycle of the 2133MHz DDR clock it would be transferring 15.9GB/s, and hey, that's right where my write bandwidth is! Now if one core were to store 64 bits per clock cycle it would be transferring 34.4GB/s, so obviously the RAM is the limit, because the CPU is only transferring 64 bits per cycle. This must be because the load and store ports can only transfer 64 bits per cycle, and there's only one store port per core, so that's what's limiting write bandwidth. But there are two load ports, so I should get about 31.8GB/s read, yet I'm about 10GB/s short. So where's the limit? <--- Here's the question.

These are some things I discovered while benchmarking with inline assembly:

1. Temporal stores are faster for copying memory, probably because caching the (slower) writes interferes with the reads less.
2. Using one xmm register is fastest for copying and writing.
3. Use prefetcht2 for copying.
4. Non-temporal stores are faster when only writing (not copying).
5. Use prefetcht0 for reading.
6. Use all available xmm registers for reading.
7. Building for 64-bit is slightly faster because more SSE registers are available.

One last thing: AIDA says my memory latency is 53.6ns.
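To make the setup concrete, here is roughly the shape of the read and write kernels being discussed, written with SSE2 intrinsics instead of my actual inline assembly (a simplified sketch: the prefetch distance, block size and the timing/allocation code around it are placeholders, not my real benchmark):

#include <emmintrin.h>   // SSE2: _mm_load_si128, _mm_add_epi64, _mm_stream_si128
#include <xmmintrin.h>   // _mm_prefetch, _mm_sfence
#include <cstddef>
#include <cstdint>

// Read kernel: sweep 'bytes' of 16-byte-aligned data with SSE loads.
// The running sum stops the compiler from deleting the loads.
int read_bench(const uint8_t* src, std::size_t bytes){
	__m128i acc = _mm_setzero_si128();
	for(std::size_t i = 0; i < bytes; i += 64){
		_mm_prefetch(reinterpret_cast<const char*>(src + i + 512), _MM_HINT_T0);
		acc = _mm_add_epi64(acc, _mm_load_si128(reinterpret_cast<const __m128i*>(src + i)));
		acc = _mm_add_epi64(acc, _mm_load_si128(reinterpret_cast<const __m128i*>(src + i + 16)));
		acc = _mm_add_epi64(acc, _mm_load_si128(reinterpret_cast<const __m128i*>(src + i + 32)));
		acc = _mm_add_epi64(acc, _mm_load_si128(reinterpret_cast<const __m128i*>(src + i + 48)));
	}
	return _mm_cvtsi128_si32(acc);
}

// Write kernel: fill 'bytes' with non-temporal stores so the written lines
// bypass the cache, then fence before the timer is stopped.
void write_bench(uint8_t* dst, std::size_t bytes){
	const __m128i zero = _mm_setzero_si128();
	for(std::size_t i = 0; i < bytes; i += 64){
		_mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), zero);
		_mm_stream_si128(reinterpret_cast<__m128i*>(dst + i + 16), zero);
		_mm_stream_si128(reinterpret_cast<__m128i*>(dst + i + 32), zero);
		_mm_stream_si128(reinterpret_cast<__m128i*>(dst + i + 48), zero);
	}
	_mm_sfence();
}

Time each function over a buffer much larger than the 10MB L3 (e.g. 256MB obtained from _mm_malloc with 16-byte alignment) and divide bytes moved by elapsed seconds to get GB/s.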
Bernard
Valued Contributor I

iliyapolak wrote:

>>>Please give me some hints on how 128bit transfers can be 1GB/s slower than 32bit GP register(s). It really bothers me and I don't know where to look for answer>>>

This is a very interesting question. From the hardware point of view there is a fixed number of load/store ports on the receiving end of memory read/write operations. I suppose that whether the transfer uses 64-bit GP registers or SSE registers, the Memory Controller carries out the same 64-bit operation per channel. Haswell can sustain two 256-bit loads and one 256-bit store per cycle, and I think the total bandwidth per CPU cycle probably depends on the number of memory transactions and their corresponding uops entering the load and store buffers.
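To put rough numbers on the port widths (a back-of-the-envelope sketch; the port widths are the published figures for Sandy Bridge(-E) and Haswell, and the 4.3GHz clock is just DLake1's value reused for comparison):

#include <cstdio>

int main(){
	const double core_ghz = 4.3;     // example clock, reused for both cores
	const double ddr_mts  = 2133e6;  // DDR3-2133: 2133 MT/s per 64-bit channel

	// One 64-bit DRAM channel moves 8 bytes per transfer.
	const double channel_gbs = ddr_mts * 8 / 1e9;      // ~17.1 GB/s (~15.9 GiB/s)

	// Sandy Bridge(-E): two 16-byte load ports, one 16-byte store port.
	const double snb_load_gbs  = core_ghz * 2 * 16;    // ~137.6 GB/s from L1
	const double snb_store_gbs = core_ghz * 1 * 16;    // ~68.8 GB/s into L1

	// Haswell: two 32-byte loads and one 32-byte store per cycle.
	const double hsw_load_gbs  = core_ghz * 2 * 32;
	const double hsw_store_gbs = core_ghz * 1 * 32;

	std::printf("channel %.1f GB/s | SNB L1 %.1f/%.1f GB/s | HSW L1 %.1f/%.1f GB/s\n",
	            channel_gbs, snb_load_gbs, snb_store_gbs,
	            hsw_load_gbs, hsw_store_gbs);
	return 0;
}

The L1 port bandwidth is an order of magnitude above what one DRAM channel can deliver, so whichever register width is used, a single-threaded streaming test is limited further out in the memory hierarchy rather than at the load/store ports.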

 

McCalpinJohn
Honored Contributor III

There have been many observations of scalar code generating higher single-threaded memory bandwidth than vector code.  See, as one example, the discussion at https://software.intel.com/en-us/forums/topic/516265.

I have not done as much testing on the Haswell EP yet, but initial STREAM benchmark results show that SSE code delivers slightly higher bandwidth than AVX code when using a single thread.  For a "typical" single-threaded STREAM benchmark run on a Xeon E5-2690 v3 system, I saw:

Kernel     AVX2 GB/s    SSE4.1 GB/s    SSE advantage
Copy        17.417        18.072          +3.8%
Scale       19.598        18.121          +8.2%
Add         18.626        18.752          -0.7%
Triad       18.719        18.880          -0.8%

On the Sandy Bridge EP systems I also saw that the scalar code was faster than the vector code, but this is not the case on the Haswell EP -- the scalar code is about 30% slower.

Although Haswell EP is very new and I have not done enough experiments yet, it appears that there are two reasons for these differences:

  1. Non-temporal stores:
    1. These are relatively slow on Sandy Bridge, so performance is improved by switching to ordinary (allocating) stores and letting the hardware prefetcher bring the lines into the cache early.
    2. Haswell EP appears to execute streaming stores very efficiently.
  2. Hardware Prefetching:
    1. The L1 hardware prefetcher is triggered by sequences of loads to ascending addresses.  Since the processor has to execute two 128-bit SSE loads for each 256-bit AVX load, it is clear that the L1 hardware prefetcher can identify an ascending sequence of contiguous addresses more quickly with SSE instructions.
    2. The reason that starting the prefetches quickly matters is that the hardware prefetchers stop at 4 KiB page boundaries (with one exception introduced in the Ivy Bridge processors).  Bandwidths are now so high that the latency of loading the first cache line in the page is about the same as the time required to load the entire 4KiB page, so you need to start the prefetches very very quickly to get good bandwidth utilization.
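For readers who have not looked at the benchmark source, the four STREAM kernels are just simple loops over arrays sized to be much larger than the last-level cache. A stripped-down sketch (array length illustrative; the real benchmark adds timing and verification) looks like this:

#include <cstddef>

const std::size_t N = 10000000;   // 80 MB per array -- well beyond any L3
static double a[N], b[N], c[N];

void stream_copy()          { for (std::size_t j = 0; j < N; ++j) c[j] = a[j]; }
void stream_scale(double s) { for (std::size_t j = 0; j < N; ++j) b[j] = s * c[j]; }
void stream_add()           { for (std::size_t j = 0; j < N; ++j) c[j] = a[j] + b[j]; }
void stream_triad(double s) { for (std::size_t j = 0; j < N; ++j) a[j] = b[j] + s * c[j]; }

The AVX2 and SSE4.1 columns above would typically come from building the same source with different code-generation switches (for example -xCORE-AVX2 versus -xSSE4.1 with the Intel compiler), so the comparison isolates the instruction width rather than the algorithm.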
DLake1
New Contributor I

I've been working on my memcopy again; it's faster for very small copies because it minimizes the overhead:

void RC::memcopy( void* dest, void const* src, int size ){
	auto remainder = size % 16;	// bytes left over after the 16-byte blocks
	auto psize = size;
	__asm{
		mov		ecx, psize;		// ecx = total byte count
		mov		eax, src;		// eax = source pointer
		mov		edx, dest;		// edx = destination pointer
		cmp		ecx, 15;
		jle		loop_copy_end;		// fewer than 16 bytes: skip the block loop
		sub		ecx, remainder;
		shr		ecx, 4;			// ecx = number of 16-byte blocks
loop_copy:
		prefetcht0 16[eax];		// hint the next source bytes into L1
		movdqu	xmm0, [eax];		// load 16 bytes (unaligned)
		movdqu	[edx], xmm0;		// store 16 bytes (unaligned)
		add		eax, 16;
		add		edx, 16;
		dec		ecx;
		jnz		loop_copy;
loop_copy_end:
		mov		ecx, remainder;		// copy the trailing 0-15 bytes one at a time
		cmp		ecx, 0;
		jle		loop_copyremainder_end;
		push	ebx;			// bl is used as a scratch byte register
loop_copyremainder:
		mov		bl, [eax];
		mov		[edx], bl;
		add		eax, 1;
		add		edx, 1;
		dec		ecx;
		jnz		loop_copyremainder;
		pop		ebx;
loop_copyremainder_end:
	}
	return;
}

Again, this is x86-only (32-bit) assembly.

Hey, this is strange: in my bandwidth test program, unaligned copies of aligned memory are slightly faster than aligned copies. How can that be?
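For anyone who wants to reproduce that comparison without inline assembly, the two variants boil down to something like this with intrinsics (a sketch, assuming the size is a multiple of 16 and, for the aligned version, 16-byte-aligned pointers):

#include <emmintrin.h>
#include <cstddef>
#include <cstdint>

// Unaligned SSE copy (compiles to movdqu loads and stores).
void copy_unaligned(uint8_t* dst, const uint8_t* src, std::size_t bytes){
	for(std::size_t i = 0; i < bytes; i += 16){
		__m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
		_mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), v);
	}
}

// Aligned SSE copy (movdqa); faults if either pointer is not 16-byte aligned.
void copy_aligned(uint8_t* dst, const uint8_t* src, std::size_t bytes){
	for(std::size_t i = 0; i < bytes; i += 16){
		__m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i));
		_mm_store_si128(reinterpret_cast<__m128i*>(dst + i), v);
	}
}

On Sandy Bridge and later, movdqu on data that happens to be aligned is supposed to cost the same as movdqa, so I would have expected the difference to be within measurement noise.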

Patrick_F_Intel1
Employee

It looks like you have a code path where pop ebx could be executed without a matching push ebx.

DLake1
New Contributor I

Thanks for pointing that out.

I assume the pop should go before the loop end label?

DLake1
New Contributor I
Georgi, have you tried movntdq for the writes? Also, does using just one xmm register make any difference?
Bernard
Valued Contributor I
>>>Somewhere I spotted the info of Haswell being able to move 64bytes in one tact, this triggered my greed-for-speed on and my wish was to exploit this by using an YMM register for all transfers 4/8/16/32 bytes, my dummy logic was that no matter how long the match is YMM is the solution, even now I cannot see my 'fault', it is purely intuitive to me to use highest order of available register>>>

I suppose that the 72-entry load buffer and the 42-entry store buffer will accumulate the unrolled mov uops, probably at cache-line granularity (64 bytes), or possibly at 32 bytes because the bandwidth per AGU is 32 bytes/cycle. As far as I understand, a single Memory Controller channel will transfer one 64-bit memory transaction per cycle, which means that in the case of a 4x unrolled loop addressed through a 64-bit GP register, 4 cycles will be needed to transfer the 64-byte transaction into the load buffer.
Bernard
Valued Contributor I

>>>I suppose that 72-entry Load buffer and 42-entry store buffer will accumulate unrolled mov uops >>>

I am not sure whether the load and store buffers accumulate the mov uops' operands (the data itself), or only the uops, with the uop targets being sent to the AGU units.

Hope that someone can explain that.

McCalpinJohn
Honored Contributor III

In general terms, the purpose of the core's load buffers is to track the program order of loads (especially those that miss in the L1 data cache), hold the mapping of load uops to physical register targets, and to hold the mapping between the load uops and the Line Fill Buffers that are servicing the corresponding cache misses.   In recent Intel processors there are only 10 Line Fill Buffers, but there can be many load misses to the same cache line.   This information provides control over where the incoming data is sent (and when it is sent), but the load buffers don't need to actually touch the incoming read data.

Most of the same issues apply to stores.  Stores have data, of course, but this does not have to be co-located with the buffers used for tracking operations.  Stored data is generally written to an internal combining buffer before being written to the L1 data cache.  The details depend on the L1 data cache banking scheme used (which varies across processor families), but combining the writes can reduce the load on the L1 data cache and is especially useful if the L1 data cache is protected by ECC across bit fields that are larger than the data provided by a single store instruction.

On recent Intel processors non-temporal stores use the Line Fill Buffers to collect the data from multiple stores to the same aligned 64-Byte (cache line) block.  Full 64-Byte transfers provide the highest efficiency at the memory controller, which can simply write 64-Byte-aligned blocks to DRAM, but which must perform a read/modify/write operation on smaller chunks of data.  (This is required when using ECC, but it is also required by all recent DRAM implementations, which only support 64-Byte transfers: 8 bytes wide and 8 "beats" long.)

The really ugly part of the load and store buffers has to do with ensuring that the memory ordering model is observed even when responses arrive out of order, and with getting store-to-load forwarding to work correctly and efficiently for as many cases as practical.  It is clear from the Intel documentation that there are lots of special cases -- look for "forwarding" in the Intel Optimization Reference Manual and you will be able to see how this functionality has gotten more complex over time.  (The addition of wider stores with SSE and AVX has significantly increased the number of different cases that need to be addressed.)
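To make the full-cache-line point concrete, here is a minimal sketch (not code from any manual) of writing one complete 64-Byte line with non-temporal stores, so the Line Fill Buffer can be handed to the memory controller as a single block:

#include <emmintrin.h>   // _mm_stream_si128
#include <xmmintrin.h>   // _mm_sfence
#include <cstdint>

// Fill one 64-Byte-aligned cache line with four 16-Byte non-temporal stores.
// All four stores land in the same Line Fill Buffer, so the memory controller
// receives a full 64-Byte block and never has to read the line first.
void stream_fill_line(uint8_t* line, __m128i v){   // 'line' must be 64-Byte aligned
	_mm_stream_si128(reinterpret_cast<__m128i*>(line +  0), v);
	_mm_stream_si128(reinterpret_cast<__m128i*>(line + 16), v);
	_mm_stream_si128(reinterpret_cast<__m128i*>(line + 32), v);
	_mm_stream_si128(reinterpret_cast<__m128i*>(line + 48), v);
}
// After the last line of a buffer, _mm_sfence() orders the weakly-ordered
// streaming stores ahead of any subsequent ordinary stores.

Writing only part of a line with streaming stores forfeits that benefit: the partially filled buffer eventually has to be flushed as smaller chunks, which is exactly the read/modify/write case described above.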

Bernard
Valued Contributor I

@John

Thank you very much for your explanation. I was looking for that information.
