iliyapolak wrote:
>>>Please give me some hints on how 128-bit transfers can be 1 GB/s slower than 32-bit GP register(s). It really bothers me and I don't know where to look for an answer.>>>
This is a very interesting question. From the hardware point of view there is a fixed number of store/load ports on the receiving end of memory read/write operations. I suppose that when a memory transfer operates on 64-bit GP registers or on SSE registers, the same 64-bit memory operation per channel is carried out by the Memory Controller. Haswell can sustain 2 256-bit loads and 1 256-bit store per cycle, and I think the total bandwidth per CPU cycle probably depends on the number of memory transactions and their corresponding uops entering the load and store buffers.
There have been many observations of scalar code generating higher single-threaded memory bandwidth than vector code. See, as one example, the discussion at https://software.intel.com/en-us/forums/topic/516265.
I have not done as much testing on the Haswell EP yet, but initial STREAM benchmark results show that SSE code delivers slightly higher bandwidth than AVX code when using a single thread. For a "typical" single-threaded STREAM benchmark run on a Xeon E5-2690 v3 system, I saw:
Kernel   AVX2 (GB/s)   SSE4.1 (GB/s)   SSE advantage
Copy     17.417        18.072          +3.8%
Scale    19.598        18.121          +8.2%
Add      18.626        18.752          -0.7%
Triad    18.719        18.880          -0.8%
On Sandy Bridge EP systems I also saw that scalar code was faster than vector code, but this is not the case on Haswell EP, where the scalar code is about 30% slower.
Although Haswell EP is very new and I have not done enough experiments yet, it appears that there are two reasons for these differences:
- Non-temporal stores:
- These are relatively slow on Sandy Bridge, so performance is improved by switching to ordinary (allocating) stores and letting the hardware prefetcher bring the lines into the cache early.
- Haswell EP appears to execute streaming stores very efficiently.
- Hardware Prefetching:
- The L1 hardware prefetcher is triggered by sequences of loads to ascending addresses. Since the processor has to execute two 128-bit SSE loads for each 256-bit AVX load, it is clear that the L1 hardware prefetcher can identify an ascending sequence of contiguous addresses more quickly with SSE instructions.
- The reason that starting the prefetches quickly matters is that the hardware prefetchers stop at 4 KiB page boundaries (with one exception introduced in the Ivy Bridge processors). Bandwidths are now so high that the latency of loading the first cache line in the page is about the same as the time required to load the entire 4 KiB page, so you need to start the prefetches very quickly to get good bandwidth utilization.
I've been working on my memcopy again; it's faster for very small copies because it minimizes the overhead:
```cpp
void RC::memcopy( void* dest, void const* src, int size )
{
    auto remainder = size % 16;
    auto psize = size;
    __asm {
        mov ecx, psize
        mov eax, src
        mov edx, dest
        cmp ecx, 15
        jle loop_copy_end
        sub ecx, remainder
        shr ecx, 4
    loop_copy:
        prefetcht0 16[eax]
        movdqu xmm0, [eax]
        movdqu [edx], xmm0
        add eax, 16
        add edx, 16
        dec ecx
        jnz loop_copy
    loop_copy_end:
        mov ecx, remainder
        cmp ecx, 0
        jle loop_copyremainder_end
        push ebx
    loop_copyremainder:
        mov bl, [eax]
        mov [edx], bl
        add eax, 1
        add edx, 1
        dec ecx
        jnz loop_copyremainder
        pop ebx
    loop_copyremainder_end:
    }
    return;
}
```
Again this is x86 only assembly.
Hey, this is strange: using my bandwidth test program, unaligned copies with aligned memory are slightly faster than aligned copies?
It looks like you have a code path where pop ebx could be executed without a matching push ebx.
Thanks for pointing that out.
I assume the pop should go before the loop end label?
>>>I suppose that the 72-entry load buffer and 42-entry store buffer will accumulate unrolled mov uops>>>
I am not sure whether the load and store buffers accumulate the mov uops' operands (the data itself) or only the uops themselves, with the uop targets sent to the AGUs.
I hope that someone can explain that.
In general terms, the purpose of the core's load buffers is to track the program order of loads (especially those that miss in the L1 data cache), hold the mapping of load uops to physical register targets, and to hold the mapping between the load uops and the Line Fill Buffers that are servicing the corresponding cache misses. In recent Intel processors there are only 10 Line Fill Buffers, but there can be many load misses to the same cache line. This information provides control over where the incoming data is sent (and when it is sent), but the load buffers don't need to actually touch the incoming read data.
Most of the same issues apply to stores. Stores have data, of course, but this does not have to be co-located with the buffers used for tracking operations. Stored data is generally written to an internal combining buffer before being written to the L1 data cache. The details depend on the L1 data cache banking scheme used (which varies across processor families), but combining the writes can reduce the load on the L1 data cache and is especially useful if the L1 data cache is protected by ECC across bit fields that are larger than the data provided by a single store instruction.
On recent Intel processors non-temporal stores use the Line Fill Buffers to collect the data from multiple stores to the same aligned 64-Byte (cache line) block. Full 64-Byte transfers provide the highest efficiency at the memory controller, which can simply write 64-Byte-aligned blocks to DRAM, but which must perform a read/modify/write operation on smaller chunks of data. (This is required when using ECC, but it is also required by all recent DRAM implementations, which only support 64-Byte transfers: 8 Bytes wide and 8 "beats" long.)
The really ugly part of the load and store buffers has to do with ensuring that the memory ordering model is observed even when responses arrive out of order, and with getting store-to-load forwarding to work correctly and efficiently for as many cases as practical. It is clear from the Intel documentation that there are lots of special cases -- look for "forwarding" in the Intel Optimization Reference Manual and you will see how this functionality has grown more complex over time. (The addition of wider stores with SSE and AVX has significantly increased the number of different cases that need to be addressed.)
@John
Thank you very much for your explanation. I was looking for that information.