roberto_g_1
Beginner

Best pattern for memcpy using AVX2 registers and intrinsics

Hello, I have to quickly copy 512 bytes of memory on a 4th-generation i7 core in Visual C++ (using the Intel compiler), and I want to avoid the call to memcpy since everything is aligned to 64 bytes.

For that, I am using 16 _mm256_load_si256 intrinsic operations (on ymm0-15) followed by 16 _mm256_stream_si256 operations (same ymm registers). Alternatively, I use 16 _mm256_store_si256 operations in place of the latter.
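As a minimal sketch, the two variants described above could look like the following (function names are illustrative, not from the post; src and dst are assumed 64-byte aligned; the target attribute is GCC/Clang syntax, while ICC/MSVC would use compiler flags instead):

```c
#include <immintrin.h>

/* Read 512 bytes through 16 ymm-sized values, then write them back.
 * copy512_store uses regular stores, copy512_stream non-temporal ones. */
__attribute__((target("avx2")))
void copy512_store(void *dst, const void *src)
{
    const __m256i *s = (const __m256i *)src;
    __m256i *d = (__m256i *)dst;
    __m256i r[16];                        /* the compiler maps these to ymm0-15 */
    for (int i = 0; i < 16; ++i)
        r[i] = _mm256_load_si256(s + i);  /* 16 aligned 256-bit loads */
    for (int i = 0; i < 16; ++i)
        _mm256_store_si256(d + i, r[i]);  /* 16 aligned 256-bit stores */
}

__attribute__((target("avx2")))
void copy512_stream(void *dst, const void *src)
{
    const __m256i *s = (const __m256i *)src;
    __m256i *d = (__m256i *)dst;
    __m256i r[16];
    for (int i = 0; i < 16; ++i)
        r[i] = _mm256_load_si256(s + i);
    for (int i = 0; i < 16; ++i)
        _mm256_stream_si256(d + i, r[i]); /* non-temporal: bypasses the cache */
    _mm_sfence();                         /* order NT stores before later loads */
}
```

Note the _mm_sfence() after the streaming variant: non-temporal stores are weakly ordered, so a fence is needed before other code reads the destination.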

Using VTune, I noted a non-negligible difference in performance depending on whether the above 32 instructions are interleaved or not. I tried several interleaving patterns, getting different performance (in any case, faster than memcpy).

Question: What is the best pattern for interleaving loads/stores at this point? 

Thanks
-Roberto

Vladimir_Sedach
New Contributor I

Roberto,

Whether to use store or stream instructions depends on the location and number of your 512-byte blocks. Use stream if the total amount is big, say > size(L2 cache) / 2.

You can't control instruction interleaving with ICC without inline assembly. The compiler chooses the order it thinks is best.

roberto_g_1
Beginner

Thanks Vladimir for the reply; I am in the case < size(L2 cache) / 2.

Actually, looking at the assembly, when I use _stream_ the Intel compiler (v16) does not reorder the loads/stores, whereas it does when I use _store_, probably to combine write operations in the cache. However, in the latter case, the reordering is not the same for different interleaving patterns. This is puzzling me... a common feature is that the last half of the stores are performed together at the end; instead, the first half of the stores can vary in their reordering.

Cheers
-Roberto

andysem
New Contributor III

I would suggest you also try Agner Fog's asmlib (see the Subroutine library section here: http://www.agner.org/optimize/).

As for the most efficient order of instructions, I believe it would depend on the particular CPU architecture.

 

Vladimir_Sedach
New Contributor I

Roberto,
I suspect the 512-byte blocks are not adjacent in memory. In this case the CPU can't auto-prefetch the blocks. You may want to do it yourself with
_mm_prefetch(), e.g., prefetch the next block before reading the current one.
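A sketch of this suggestion (the function and parameter names are illustrative, not from the thread): while copying block i, touch the cache lines of block i+1 with _mm_prefetch.

```c
#include <immintrin.h>
#include <string.h>

/* Copy n non-adjacent 512-byte blocks into dst, prefetching the next
 * block's cache lines before copying the current one. */
void copy_blocks(unsigned char *dst, unsigned char **blocks, int n)
{
    for (int i = 0; i < n; ++i) {
        if (i + 1 < n) {
            /* 512 bytes = 8 cache lines of 64 bytes */
            for (int line = 0; line < 8; ++line)
                _mm_prefetch((const char *)blocks[i + 1] + 64 * line,
                             _MM_HINT_T0);
        }
        /* stand-in for the ymm-register copy discussed in the thread */
        memcpy(dst + 512 * (size_t)i, blocks[i], 512);
    }
}
```

_MM_HINT_T0 requests the data in all cache levels; _MM_HINT_T1/_MM_HINT_T2/_MM_HINT_NTA are the alternatives, depending on how soon and how often the block is reused.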

You are not the only one who is puzzled. The compiler is puzzled, too -- it doesn't know how to cope with the store/stream mix :)

roberto_g_1
Beginner

Hello Andy,

I already did that, and my solution is faster because I can exploit the knowledge that the 512 bytes are consecutive and guaranteed 64-byte aligned. Also, I need to reuse the contents of registers ymm0-ymm15 soon afterwards, which is the reason why I am not using memcpy. I tried __intel_fast_memcpy when it replaces memcpy: it is inlined, but it uses the 128-bit xmm registers, whereas I use the 256-bit ymm registers.

I did not expect that copying from memory to 16 registers and then back to memory would be so subtle...

-Roberto

andysem wrote:

I would suggest you also try Agner Fog's asmlib (see the Subroutine library section here: http://www.agner.org/optimize/).

As for the most efficient order of instructions, I believe it would depend on the particular CPU architecture

roberto_g_1
Beginner

Actually I have already used _mm_prefetch with success. This loads/stores reordering is the icing on the cake, but it makes a difference within the currently optimised code, to my surprise... and I cannot try all 16! possible combinations :)

As I said, the _stream_ ops are not reordered. As for the _store_ ops, I did some experiments with 12 registers (instead of 16). On the left I put the order of the intrinsics as I wrote them, and on the right the reordering by the Intel compiler, where L denotes a load and S a store:

LLLLLL LLLLLL SSSSSS SSSSSS -> LLLLLL SSSSSS LLLLLL SSSSSS
LSLSLS LSLSLS LSLSLS LSLSLS -> LLLSSS LLLSSS LLLLLL SSSSSS
LLLLSS SLLSLL LSLSLS SSSLSS -> LLLLSS LLSSSS LLLLLL SSSSSS
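For illustration, the fully interleaved pattern in the second row (LS pairs) corresponds to source code like this sketch (illustrative function name; GCC/Clang target syntax); as the table shows, the compiler may still reschedule the emitted instructions:

```c
#include <immintrin.h>

/* Source written as LS LS LS ... : each 256-bit load is immediately
 * followed by its store. src and dst are assumed 64-byte aligned. */
__attribute__((target("avx2")))
void copy512_ls_interleaved(void *dst, const void *src)
{
    const __m256i *s = (const __m256i *)src;
    __m256i *d = (__m256i *)dst;
    for (int i = 0; i < 16; ++i) {
        __m256i r = _mm256_load_si256(s + i); /* L */
        _mm256_store_si256(d + i, r);         /* S */
    }
}
```

The other rows of the table (e.g. LLLLLL...SSSSSS) would simply split this into a separate load loop and store loop, as in the earlier variants.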

Looking at the VTune results, the CPI and memory-bound figures vary, for example. The CPU time from one order to another can differ by 25-30%... I copied long batches of blocks to get these figures, and the percentage is non-negligible because the rest of my copy function is highly optimised for my setting.

Thanks
-Roberto

Vladimir Sedach wrote:

Roberto,
I suspect the 512 byte blocks are not adjacent in memory. In this case CPU can't auto-prefetch the blocks. You may want to do it yourself with _mm_prefetch(). E.g., prefetch next block before reading current one.

You are not the only one who is puzzled. The compiler is puzzled. too -- it doesn't know how to cope with store/stream mix )

Vladimir_Sedach
New Contributor I

Roberto,

Prefetching is useless in the case of consecutive reads. I think it is completely obsolete for modern CPUs.
Streams are slower than stores if you just copy, say, 100K. In your case they could be OK because of the large amount of calculation between them.
You could try to interleave streams not just with loads, but also with calculations:
load
load

stream
calc

stream
calc
Perhaps the CPU can't execute too many streams in a row quickly -- it has to remember too many addresses to do that.


Why don't you measure the time in your code without VTune? Use __rdtscp() or QueryPerformanceCounter() (which perhaps in turn calls rdtscp).
It would be interesting to compare the results! __rdtscp() is reliable if you measure long intervals.
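A sketch of the __rdtscp() approach (using the GCC/Clang intrinsic from <x86intrin.h>; the helper names are illustrative):

```c
#include <x86intrin.h>
#include <stdint.h>

/* __rdtscp reads the time-stamp counter and waits for all prior
 * instructions to retire, making back-to-back measurements more
 * dependable than plain rdtsc. As noted, it is only reliable over
 * long intervals. */
uint64_t tsc_now(void)
{
    unsigned aux;        /* receives IA32_TSC_AUX; unused here */
    return __rdtscp(&aux);
}

/* Time a long-running region, returning elapsed TSC cycles. */
uint64_t time_region(void (*fn)(void))
{
    uint64_t t0 = tsc_now();
    fn();
    return tsc_now() - t0;
}
```

On Windows, QueryPerformanceCounter()/QueryPerformanceFrequency() would be the portable alternative, with the timer source chosen by the OS.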


 

roberto_g_1
Beginner

Vladimir, my code performs essentially just loads and stores: it is a sort of multicast memcpy, where the same source buffer is replicated over several destination buffers. I will try __rdtscp() or QueryPerformanceCounter(); thanks for the suggestion.

Cheers
-Roberto

 

Vladimir_Sedach
New Contributor I

Roberto,
What are the size and number of the output buffers? Are they allocated once and then used once, or multiple times?
I can test it on my 4th-gen i7.

 

roberto_g_1
Beginner

You can safely assume 512 bytes per buffer. There are 4 of them, allocated once by the driver and read multiple times. You can safely assume that they occupy a chunk B of 2048 consecutive bytes.

The host is in charge of filling them when asked by the driver through a callback: it reads 512 bytes from the source (a big malloc'ed array A of several megabytes) and copies them four times to fill the 2048 consecutive bytes in B. So I implemented the host so that it reads the 512 bytes from A into the 16 ymm registers and immediately writes them to the first portion of B. This is the expensive part. Then, three more times, it suffices to write the content of the registers to fill the rest of B. This cost is small compared to the first part.

If I perform memcpy 4 times it is clearly worse, as my approach avoids reading the same 512 bytes from A 4 times. This is why I cannot reuse memcpy. For the sake of completeness, 4 vectorized for loops perform the same, as the compiler is smart enough to avoid reading from A four times.
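A sketch of this scheme (illustrative name; GCC/Clang target syntax): the 512 bytes of A are loaded into 16 ymm values once, then stored four times to fill the 2048-byte chunk B.

```c
#include <immintrin.h>

/* Multicast copy: read the 512-byte source once through 16 ymm values,
 * then write it 4 times to fill the 2048-byte destination chunk.
 * Both pointers are assumed 64-byte aligned, as stated in the thread. */
__attribute__((target("avx2")))
void multicast512(void *B, const void *A)
{
    const __m256i *s = (const __m256i *)A;
    __m256i *d = (__m256i *)B;
    __m256i r[16];
    for (int i = 0; i < 16; ++i)
        r[i] = _mm256_load_si256(s + i);      /* read A once */
    for (int copy = 0; copy < 4; ++copy)       /* write it 4 times into B */
        for (int i = 0; i < 16; ++i)
            _mm256_store_si256(d + 16 * copy + i, r[i]);
}
```

This is exactly the saving over calling memcpy four times: one pass over A instead of four, at the cost of keeping all 16 ymm registers live.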

Thanks
-Roberto

 

Vladimir_Sedach
New Contributor I

Streams are not appropriate in this case. The order of reads/writes is most likely negligible. It could be that VTune is getting wrong results.
Does it measure host, host/driver communication, and driver handling time separately?

roberto_g_1
Beginner

VTune is measuring them separately, so I can get figures for the buffer filling task only. Since it has to be quick (so as not to slow down the DPC queue), I am monitoring that part.

Vladimir_Sedach
New Contributor I

Since the buffer-filling task is interrupted after each call by the driver, it is hard to say whether the processor's auto-prefetching is working for the input buffer.
It's also not clear whether the output buffer resides constantly in the L1 cache. It depends on what the driver and the OS are doing. What's your opinion on these things --
auto-prefetching and the L1 cache? It is hard to reproduce on my computer.
If both mechanisms are working, the time of the host task should not depend on such things as the order of in/out instructions
(provided you are not using stream instructions, which can only slow down execution considerably).

 

Vladimir_Sedach
New Contributor I

Could you test the host function with the driver doing nothing? Would it change the time?
Does the time of the host function execution include the time of driver-host communication? That is expensive.

Bernard
Black Belt

>>>QueryPerformanceCounter() (which perhaps in turn calls rdtscp).>>>

I suppose that QueryPerformanceCounter accesses the HPET timer in kernel mode on some versions of Windows and the TSC on others.

 

roberto_g_1
Beginner

I gave some thought to my examples in the previous posts, and I do not know if the following explains the difference in performance among the different orderings.

The scenario is the following:
- The loads and stores are performed inside a driver callback.
- The driver runs in kernel mode, while the callback function I wrote probably runs in user mode.
- The driver is intensively handling a device (a network board), causing lots of hardware interrupts.
- The computation is memory bound and the CPI rate is low, as I only have load and store operations, making my callback code highly interruptible.
- Context switching due to DPCs with higher priority than my callback is probably expensive, as I use the ymm registers.

Hence, different loads/stores orderings in my callback give different context-switching costs. Does that make any sense?

Cheers
-R

 

Bernard
Black Belt

A higher-priority DPC can be one at clock and/or IPI level, but it is unknown (from your post) at what frequency they occur on your system. As for your callback routine: if it runs at passive level, then any code at DPC level and above can preempt it. Maybe this is the situation.

Vladimir_Sedach
New Contributor I

Roberto,

Context switching due to DPCs with higher priority than my callback is probably expensive as I use ymm registers
Hence, different loads/stores orderings in my callback give different context switching costs. Does it make any sense?

Loads/stores have nothing to do with ring 0 <-> ring 3 context switching.
Context switching occurs when the driver calls the callback and again on return from your callback.
You can considerably decrease its cost (and improve overall throughput) by handling, say,
10..100 512-byte chunks on each call instead of only one. Also consider moving the 1->4 chunk replication into the driver.

You may also consider increasing priority of the callback's process/thread with something like:

  SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
to avoid long delays caused by the switching.

JWong19
Beginner

I have a few points for your consideration.

- if the I/O buffer is cacheable, make a local copy only when you actually use it (e.g. for calculation)

- it sounds like your goal is to minimize the total number of CPU cycles (copy + calculations) rather than (copy alone)

- though optimizing local copy performance is a basic step, you had better measure the combined copy + calculation time when the cache is a concern

roberto_g_1
Beginner

@iliyapolak: it could be; I am not an expert on DPCs. The driver issuing my callbacks is managing intense streaming for an Ethernet board, and there are many interrupts intermixed with my callbacks.

 

@Vladimir: My guess is that it is quite probable that a run of loads/stores will be interrupted. Unfortunately I do not have access to the source code of the driver, so I cannot control its priority (but I suppose it is realtime) or the size of the requested transfer: my callback must obey what it asks.

 

@Jeremy: In some of the variations that I tried, I am already using part of the buffer as a local cache (e.g. https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers?wapkw=intel...)

 

I have very few calculations to perform during the callback: one easily-predictable branch and a few local variable increments. So 99% of the code is loads/stores for copying. Sometimes, when I change a few things, the compiler uses some extra ymm registers, and so it saves them at the beginning and restores them at the end of the callback (looking at its assembly).

 

In general, in this scenario, the cost of the optimized code is roughly proportional to the number of bytes read. I tried to force nounroll on the loops, but the situation worsens, as I disrupt the pipeline, these loops being too short. All this is to say that high-performance memory copy of small buffers (≤ 4Kb) is less trivial than I initially thought…
