I am in the process of writing a 2D/planar image data resampler, and the current highest-performance approach to the 2D convolution needs to shift a long float32 vector, held across several AVX (ymm or zmm) registers, by a fixed number of float32 (4-byte) lanes. The shift distance equals the image enlargement ratio, so for integer ratios I need shifts by 2, 3, 4, 5, 6 float32 lanes across a long run of AVX registers. Shifting by 8 or 16 is just a renaming of registers.
The fastest code I currently have for the shift by 4 float32 uses the rarely seen AVX intrinsic
_mm256_permute2f128_ps (the vperm2f128 instruction). It allows shifting and exchanging 128-bit halves between 2 ymm registers.
So the current shift by 4 float32 across 7 ymm registers looks like
my_ymm0 = _mm256_permute2f128_ps(my_ymm0, my_ymm1, 33);
my_ymm1 = _mm256_permute2f128_ps(my_ymm1, my_ymm2, 33);
my_ymm2 = _mm256_permute2f128_ps(my_ymm2, my_ymm3, 33);
my_ymm3 = _mm256_permute2f128_ps(my_ymm3, my_ymm4, 33);
my_ymm4 = _mm256_permute2f128_ps(my_ymm4, my_ymm5, 33);
my_ymm5 = _mm256_permute2f128_ps(my_ymm5, my_ymm6, 33);
my_ymm6 = _mm256_permute2f128_ps(my_ymm6, my_ymm6, 49);
my_ymm6 = _mm256_insertf128_ps(my_ymm6, *(__m128*)(pfProc + 56), 1);
The first 4 floats of the result are stored out beforehand, then the shift runs and 4 new floats are loaded into the freed tail. With a large number of FMA instructions, this multi-sample convolution approach reaches about 30..35% of the theoretical FMA performance of a 4-core Intel 9th-generation CPU with 4 FMA512 units.
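For reference, the chain above can be sanity-checked in isolation. The load/store framing below is a hypothetical stand-in for the real convolution loop (only the eight shuffle lines come from the snippet above; `_mm_loadu_ps` is used instead of the raw `__m128*` cast so the tail load does not require 16-byte alignment):

```c
#include <immintrin.h>
#include <assert.h>

/* Shift seven ymm registers (floats 0..55 of the processing buffer) left by
   4 float32 lanes and refill the vacated top 4 lanes from pfProc + 56.
   The gcc/clang target attribute lets this compile without a global -mavx. */
__attribute__((target("avx")))
static void shift7_by4(const float *pfProc, float *out /* 56 floats */) {
    __m256 my_ymm0 = _mm256_loadu_ps(pfProc + 0);
    __m256 my_ymm1 = _mm256_loadu_ps(pfProc + 8);
    __m256 my_ymm2 = _mm256_loadu_ps(pfProc + 16);
    __m256 my_ymm3 = _mm256_loadu_ps(pfProc + 24);
    __m256 my_ymm4 = _mm256_loadu_ps(pfProc + 32);
    __m256 my_ymm5 = _mm256_loadu_ps(pfProc + 40);
    __m256 my_ymm6 = _mm256_loadu_ps(pfProc + 48);

    my_ymm0 = _mm256_permute2f128_ps(my_ymm0, my_ymm1, 33);
    my_ymm1 = _mm256_permute2f128_ps(my_ymm1, my_ymm2, 33);
    my_ymm2 = _mm256_permute2f128_ps(my_ymm2, my_ymm3, 33);
    my_ymm3 = _mm256_permute2f128_ps(my_ymm3, my_ymm4, 33);
    my_ymm4 = _mm256_permute2f128_ps(my_ymm4, my_ymm5, 33);
    my_ymm5 = _mm256_permute2f128_ps(my_ymm5, my_ymm6, 33);
    my_ymm6 = _mm256_permute2f128_ps(my_ymm6, my_ymm6, 49);
    my_ymm6 = _mm256_insertf128_ps(my_ymm6, _mm_loadu_ps(pfProc + 56), 1);

    _mm256_storeu_ps(out + 0,  my_ymm0);
    _mm256_storeu_ps(out + 8,  my_ymm1);
    _mm256_storeu_ps(out + 16, my_ymm2);
    _mm256_storeu_ps(out + 24, my_ymm3);
    _mm256_storeu_ps(out + 32, my_ymm4);
    _mm256_storeu_ps(out + 40, my_ymm5);
    _mm256_storeu_ps(out + 48, my_ymm6);
}
```

Feeding it floats 0..63 should give 4..59 on output, i.e. the whole 56-float window moved left by 4 lanes.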
Unfortunately I cannot find a single instruction for the other shift distances in any instruction set before AVX512VL.
The approach I have found so far for pre-AVX512VL CPUs, for the shift by 2 floats, uses a permute to rotate each ymm internally and a blend to carry the rotated-out floats over to the neighbouring ymm:
const __m256i my_ymm8_main_circ = _mm256_set_epi32(1, 0, 7, 6, 5, 4, 3, 2); // main circulating const
my_ymm2 = _mm256_permutevar8x32_ps(my_ymm2, my_ymm8_main_circ); // circulate by 2 ps to the left
my_ymm3 = _mm256_permutevar8x32_ps(my_ymm3, my_ymm8_main_circ); // circulate by 2 ps to the left
my_ymm2 = _mm256_blend_ps(my_ymm2, my_ymm3, 192); // copy higher 2 floats
my_ymm4 = _mm256_permutevar8x32_ps(my_ymm4, my_ymm8_main_circ); // circulate by 2 ps to the left
my_ymm3 = _mm256_blend_ps(my_ymm3, my_ymm4, 192); // copy higher 2 floats
my_ymm5 = _mm256_permutevar8x32_ps(my_ymm5, my_ymm8_main_circ); // circulate by 2 ps to the left
my_ymm4 = _mm256_blend_ps(my_ymm4, my_ymm5, 192); // copy higher 2 floats
my_ymm6 = _mm256_permutevar8x32_ps(my_ymm6, my_ymm8_main_circ); // circulate by 2 ps to the left
my_ymm5 = _mm256_blend_ps(my_ymm5, my_ymm6, 192); // copy higher 2 floats
my_ymm7 = _mm256_permutevar8x32_ps(my_ymm7, my_ymm8_main_circ); // circulate by 2 ps to the left
my_ymm6 = _mm256_blend_ps(my_ymm6, my_ymm7, 192); // copy higher 2 floats
It works visibly slower, and the permute needs a whole 256-bit ymm register as its control operand just to hold a few permutation control bits. That control vector occupies a register that could otherwise hold data, and it causes the compiler to load/store the constant, slowing execution down.
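One AVX2 alternative that avoids the register-resident control vector (a sketch, with hypothetical names, not code from my resampler): _mm256_alignr_epi8 shifts within each 128-bit lane, so on its own it cannot cross the lane boundary, but if its "high" source is first rearranged to [lo_high128 : hi_low128] with _mm256_permute2f128_ps, the per-lane alignr becomes a true shift of the concatenated [lo:hi] pair by N float32 lanes (N = 1..3). That is two shuffle instructions per register pair, with all control bits in immediates:

```c
#include <immintrin.h>
#include <assert.h>

/* Shift the concatenated [lo:hi] pair left by N float32 lanes, N = 1..3.
   vperm2f128 builds [lo_high128 : hi_low128]; vpalignr then pulls the
   missing bytes across each 128-bit lane boundary.  N must be a
   compile-time constant because vpalignr takes an immediate count. */
#define SHIFT_LEFT_PS(lo, hi, N)                                          \
    _mm256_castsi256_ps(_mm256_alignr_epi8(                               \
        _mm256_castps_si256(_mm256_permute2f128_ps((lo), (hi), 0x21)),    \
        _mm256_castps_si256(lo), 4 * (N)))

/* Demo wrapper with plain-float arguments; the gcc/clang target attribute
   lets this compile without a global -mavx2 flag. */
__attribute__((target("avx2")))
static void demo_shift_left2(const float in[16], float out[8]) {
    __m256 lo = _mm256_loadu_ps(in);
    __m256 hi = _mm256_loadu_ps(in + 8);
    _mm256_storeu_ps(out, SHIFT_LEFT_PS(lo, hi, 2));
}
```

For the shift by 2 this replaces each permutevar8x32 + blend pair with vperm2f128 + vpalignr and frees the register that held the permute control, at the cost of keeping both shuffles on the critical path.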
The other approach is to store the long vector from the 5..7 ymm registers out (to the L1d cache) and load it back from a shifted address. But this looks even slower because of the bulk memory read/write, and I cannot be sure the traffic will not travel through the whole cache hierarchy, making the memory controller throttle the main load/store streams on a multi-core CPU.
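A sketch of that store/reload fallback (names and framing hypothetical): a private, aligned stack buffer should stay hot in L1d and is unlikely to disturb the main streaming traffic, but every reload straddles data stored only a few cycles earlier, so store-forwarding stalls are the probable reason this variant measures slower than the in-register shuffles:

```c
#include <immintrin.h>
#include <assert.h>

/* Spill 7 ymm worth of data to an aligned stack buffer, then reload each
   ymm from an address shifted by shift_ps float32 lanes.  Lanes shifted in
   past the end of the buffer read the zero padding. */
__attribute__((target("avx")))
static void shift_regs_via_buffer(const float *src56, float *dst56,
                                  int shift_ps /* 1..7 */) {
    _Alignas(64) float buf[64] = {0};
    for (int r = 0; r < 7; ++r)          /* 7 aligned stores */
        _mm256_store_ps(buf + 8 * r, _mm256_loadu_ps(src56 + 8 * r));
    for (int r = 0; r < 7; ++r)          /* 7 unaligned, shifted reloads */
        _mm256_storeu_ps(dst56 + 8 * r,
                         _mm256_loadu_ps(buf + 8 * r + shift_ps));
}
```

The one advantage it has is that shift_ps can be a runtime value, whereas all the shuffle-based variants bake the shift distance into immediates or control constants.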
With the AVX512VL instruction set I see the 2-source permute intrinsic
_mm256_permutex2var_ps (and its 512-bit form _mm512_permutex2var_ps; the control operand is a full vector of indices in a register, not an immediate).
It can replace _mm256_permute2f128_ps for any number of float32 lanes taken from the first and from the second source register, so it shifts by any number of float32 lanes between registers in 1 instruction.
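A minimal sketch of that AVX-512VL version (hypothetical wrapper names): index lanes 0..7 select from the first source and 8..15 from the second, so the index vector {2,3,...,9} shifts the concatenated [lo:hi] pair left by exactly 2 float32 lanes, and any other shift distance only changes the index constant, which can live in one register for the whole loop:

```c
#include <immintrin.h>
#include <assert.h>

/* One vpermt2ps per register replaces the two-instruction pairs above.
   Requires AVX512F + AVX512VL; the gcc/clang target attribute lets the
   file compile without global -mavx512vl flags. */
__attribute__((target("avx512f,avx512vl")))
static void shift_left2_avx512(const float in[16], float out[8]) {
    const __m256i idx = _mm256_set_epi32(9, 8, 7, 6, 5, 4, 3, 2);
    __m256 r = _mm256_permutex2var_ps(_mm256_loadu_ps(in), idx,
                                      _mm256_loadu_ps(in + 8));
    _mm256_storeu_ps(out, r);
}
```

Unlike _mm256_permute2f128_ps, the control here is a register operand rather than an immediate, but it is a loop invariant, so it costs one register rather than a reload per iteration.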
But AVX512VL-capable CPUs are not in common use among end users even in 2021. Maybe they will be in 10 years, but not now.
Do other approaches exist for non-AVX512VL-capable CPUs to shift long float32 vectors across a number of ymm (and maybe zmm) registers faster, with the step being any number of float32 lanes?