Re: Merging data from two aligned SIMD loads to match the alignment of a destination buffer

HD86 · ‎03-03-2021

The Intel optimization manual, in section 15.16.3.1 "SIMD Heuristics to implement Memcpy()", suggests a technique for matching the alignment of the source data to the alignment of the destination buffer in each iteration: merge data from two aligned 16-byte chunks into one 16-byte chunk with the PALIGNR instruction. It then says that it is inefficient to try the same thing with 32-byte chunks, because the 256-bit VPALIGNR works within each 128-bit lane separately and can't be used to directly stitch together data from two 32-byte chunks.
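To make the 16-byte scheme concrete, here is a rough sketch of how I understand it. This is my own illustration, not the code from the manual. The names `copy_palignr` and `SRC_MISALIGN` are made up, and the misalignment is fixed at compile time because PALIGNR takes its shift count as an immediate. The function attribute assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (PALIGNR) */

/* Hypothetical: the source pointer sits this many bytes past a 16-byte
   boundary (1..15). PALIGNR needs the count as a compile-time constant. */
#define SRC_MISALIGN 5

/* Copies nchunks * 16 bytes to a 16-byte-aligned dst using only aligned
   loads from src. Assumes src % 16 == SRC_MISALIGN. Note that it touches
   the whole aligned chunks around the source range, so it reads a few
   bytes before src and past the end of the copied range. */
__attribute__((target("ssse3")))
void copy_palignr(uint8_t *dst, const uint8_t *src, size_t nchunks)
{
    /* Round src down to the enclosing 16-byte boundary. */
    const __m128i *s = (const __m128i *)(src - SRC_MISALIGN);
    __m128i prev = _mm_load_si128(s);

    for (size_t i = 0; i < nchunks; i++) {
        __m128i next = _mm_load_si128(s + i + 1);
        /* Treat next:prev as one 32-byte value, shift right by
           SRC_MISALIGN bytes, and keep the low 16 bytes. */
        __m128i out = _mm_alignr_epi8(next, prev, SRC_MISALIGN);
        _mm_store_si128((__m128i *)(dst + 16 * i), out);
        prev = next;  /* reuse this load in the next iteration */
    }
}
```

Each iteration costs one aligned load, one PALIGNR and one aligned store. A real implementation would need a separate loop (or a jump table) per possible misalignment.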

However, AVX-512 offers an alternative to VPALIGNR: the VPERM instruction. It can merge data from two 32-byte chunks or two 64-byte chunks in one instruction, and based on the throughput and latency numbers given in the Intel intrinsics guide, it seems to be fast. Should I use this instruction to align the source data in each iteration in the same manner described in the manual?

I know that REP MOVSB should be used nowadays for an implementation of memcpy, but this question is relevant in other cases, e.g. I may need to load data from one buffer, modify the data in some way, then store it in another buffer.

PrasanthD_intel · ‎03-05-2021

Hi Hani,

Thanks for reaching out to us.

We are working on it and will get back to you.

Regards

Prasanth

Varsha_M_Intel · ‎03-05-2021

Hi,

What is your usage model? Are you coding using intrinsics or asm?

Why is VPERM specific to AVX512?

HD86 · ‎03-05-2021

I am using intrinsics, but I also need to know the answer in case assembly code is used.

Why is VPERM specific to AVX512?

Because that is how it is listed in the Intel intrinsics guide, e.g. "_mm256_permutex2var_epi32" requires the CPUID flags AVX512VL and AVX512F. Of course, I am talking here about the form of VPERM that takes two source operands, because that is the one that can merge data from two chunks in one instruction.

The question is which of the following two alternatives is more efficient in each iteration:

1. Load an aligned chunk and merge its data with the previously loaded aligned chunk with one VPERM instruction.

2. Load an unaligned chunk in each iteration, i.e. do not care about alignment in the source buffer.
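To be concrete, the two alternatives could look roughly like this with 64-byte chunks. This is only a sketch under some assumptions: the function names `copy_merge_permute` and `copy_unaligned` are mine, the destination is 64-byte aligned, and the source misalignment is a multiple of 4 bytes so that the dword form of the two-source permute is enough (an arbitrary byte offset would need a byte-granular permute instead). The function attribute assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* Alternative 1: aligned loads only, merged with a two-source permute.
   Note that it loads the whole aligned chunk following the copied range,
   so it reads past the end of the source data. */
__attribute__((target("avx512f")))
void copy_merge_permute(uint32_t *dst, const uint32_t *src, size_t nchunks)
{
    /* k = how many dwords src lies past the previous 64-byte boundary. */
    unsigned k = (unsigned)(((uintptr_t)src & 63) / 4);
    const uint32_t *base = src - k;  /* 64-byte aligned */

    /* Index i selects dword i + k of the 32-dword pair prev:next.
       Indices 0..15 pick from the first operand, 16..31 from the second. */
    __m512i idx = _mm512_add_epi32(
        _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                          8, 9, 10, 11, 12, 13, 14, 15),
        _mm512_set1_epi32((int)k));

    __m512i prev = _mm512_load_si512(base);
    for (size_t i = 0; i < nchunks; i++) {
        __m512i next = _mm512_load_si512(base + 16 * (i + 1));
        __m512i out = _mm512_permutex2var_epi32(prev, idx, next);
        _mm512_store_si512(dst + 16 * i, out);
        prev = next;  /* reuse this load in the next iteration */
    }
}

/* Alternative 2: one unaligned load per iteration, no merging. */
__attribute__((target("avx512f")))
void copy_unaligned(uint32_t *dst, const uint32_t *src, size_t nchunks)
{
    for (size_t i = 0; i < nchunks; i++) {
        __m512i v = _mm512_loadu_si512(src + 16 * i);
        _mm512_store_si512(dst + 16 * i, v);
    }
}
```

Both versions issue one load and one store per iteration. The first adds a permute but never issues a load that crosses a cache line, while the second has no extra instruction but its loads straddle two cache lines whenever the source is misaligned. Which one wins is exactly what I am asking.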