
Shifting long vectors between AVX registers (on pre-AVX512VL CPUs)


I am writing a resampler for 2D/planar image data. The fastest 2D convolution approach I have so far needs to shift a long float32 vector, held across several AVX registers (ymm or zmm), by a fixed number of float32 (4-byte) values. The shift distance equals the image enlargement ratio, so for integer ratios I need to shift by 2, 3, 4, 5 or 6 float32 values across a long run of AVX registers. Shifting by 8 (ymm) or 16 (zmm) is just a renaming of registers.

The fastest code I currently have for a shift by 4 float32 values uses the rarely used AVX instruction _mm256_permute2f128_ps, which can shift and exchange 128-bit halves between 2 ymm registers.

So the current shift by 4 float32 values across 7 ymm registers looks like this:

_mm_store_ps(pfProc, _mm256_castps256_ps128(my_ymm0));
my_ymm0 = _mm256_permute2f128_ps(my_ymm0, my_ymm1, 33);
my_ymm1 = _mm256_permute2f128_ps(my_ymm1, my_ymm2, 33);
my_ymm2 = _mm256_permute2f128_ps(my_ymm2, my_ymm3, 33);
my_ymm3 = _mm256_permute2f128_ps(my_ymm3, my_ymm4, 33);
my_ymm4 = _mm256_permute2f128_ps(my_ymm4, my_ymm5, 33);
my_ymm5 = _mm256_permute2f128_ps(my_ymm5, my_ymm6, 33);
my_ymm6 = _mm256_permute2f128_ps(my_ymm6, my_ymm6, 49);
my_ymm6 = _mm256_insertf128_ps(my_ymm6, *(__m128*)(pfProc + 56), 1);

It first stores the 4 result floats, then shifts, and finally loads 4 new floats into the end. With a large number of FMA instructions, this multi-sample convolution approach reaches about 30..35% of the theoretical FMA performance of a 4-core 9th-generation Intel CPU with 4 FMA512 units.
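To make the lane movement easier to check, here is a minimal sketch of the same shift-by-4 chain written as a loop over a buffer. The function name shift4_chain, the buffer-based interface, the limit of 16 registers and the GCC/Clang target attribute are assumptions made for illustration; they are not part of the code above, and in a real kernel the registers would stay resident instead of being loaded and stored.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: shifts n*8 floats (n ymm registers) left by 4 floats
   and appends 4 new floats. The target attribute is GCC/Clang-specific and
   only avoids having to pass -mavx on the command line. */
__attribute__((target("avx")))
void shift4_chain(float *buf, int n, const float *next4)
{
    assert(n >= 1 && n <= 16);
    __m256 r[16];

    for (int i = 0; i < n; ++i)
        r[i] = _mm256_loadu_ps(buf + 8 * i);

    /* imm 0x21 (= 33): low lane <- high lane of r[i],
                        high lane <- low lane of r[i + 1] */
    for (int i = 0; i + 1 < n; ++i)
        r[i] = _mm256_permute2f128_ps(r[i], r[i + 1], 0x21);

    /* Last register: imm 0x31 (= 49) moves its high lane down, then the
       4 incoming floats are inserted into the high lane. */
    r[n - 1] = _mm256_permute2f128_ps(r[n - 1], r[n - 1], 0x31);
    r[n - 1] = _mm256_insertf128_ps(r[n - 1], _mm_loadu_ps(next4), 1);

    for (int i = 0; i < n; ++i)
        _mm256_storeu_ps(buf + 8 * i, r[i]);
}
```

Called with n = 7 on a buffer holding 0..55 and next4 holding 56..59, every element of the buffer should end up 4 larger than before.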

Unfortunately, I cannot find a single instruction that shifts by any other number of floats in the instruction sets before AVX512VL.

The approach I have found so far for shifting by 2 floats on pre-AVX512VL CPUs uses a permute to rotate the floats inside each ymm register and a blend to transfer the shifted-out floats into the neighbouring ymm register:

const register __m256i my_ymm8_main_circ = _mm256_set_epi32(1, 0, 7, 6, 5, 4, 3, 2); // main circulating const

my_ymm2 = _mm256_permutevar8x32_ps(my_ymm2, my_ymm8_main_circ); // circulate by 2 ps to the left
my_ymm3 = _mm256_permutevar8x32_ps(my_ymm3, my_ymm8_main_circ); // circulate by 2 ps to the left
my_ymm2 = _mm256_blend_ps(my_ymm2, my_ymm3, 192); // copy higher 2 floats

my_ymm4 = _mm256_permutevar8x32_ps(my_ymm4, my_ymm8_main_circ); // circulate by 2 ps to the left
my_ymm3 = _mm256_blend_ps(my_ymm3, my_ymm4, 192); // copy higher 2 floats

my_ymm5 = _mm256_permutevar8x32_ps(my_ymm5, my_ymm8_main_circ); // circulate by 2 ps to the left
my_ymm4 = _mm256_blend_ps(my_ymm4, my_ymm5, 192); // copy higher 2 floats

my_ymm6 = _mm256_permutevar8x32_ps(my_ymm6, my_ymm8_main_circ); // circulate by 2 ps to the left
my_ymm5 = _mm256_blend_ps(my_ymm5, my_ymm6, 192); // copy higher 2 floats

my_ymm7 = _mm256_permutevar8x32_ps(my_ymm7, my_ymm8_main_circ); // circulate by 2 ps to the left
my_ymm6 = _mm256_blend_ps(my_ymm6, my_ymm7, 192); // copy higher 2 floats

It is visibly slower, and it needs a full 256-bit index vector (occupying a whole ymm register) just to supply a few control bits for the permutation. That leaves one ymm register less for data and can make the compiler load/store the index vector, which slows down execution.
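The rotate-and-blend pattern can be checked in isolation with a small sketch like the following. The function name shift2_chain and the in/out buffer interface are hypothetical, and the target attribute is GCC/Clang-specific; the index constant and the blend mask are the same as in the code above.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: reads n*8 floats from in and writes (n-1)*8 floats
   to out so that out[i] == in[i + 2]. The last input register only
   supplies the two floats carried into its predecessor. */
__attribute__((target("avx2")))
void shift2_chain(const float *in, float *out, int n)
{
    assert(n >= 2);
    /* element i of the result takes source element (i + 2) % 8 */
    const __m256i rot2 = _mm256_set_epi32(1, 0, 7, 6, 5, 4, 3, 2);

    __m256 cur = _mm256_permutevar8x32_ps(_mm256_loadu_ps(in), rot2);
    for (int i = 0; i + 1 < n; ++i) {
        __m256 nxt = _mm256_permutevar8x32_ps(
            _mm256_loadu_ps(in + 8 * (i + 1)), rot2);
        /* mask 0xC0 (= 192): elements 6 and 7 are taken from nxt */
        _mm256_storeu_ps(out + 8 * i, _mm256_blend_ps(cur, nxt, 0xC0));
        cur = nxt; /* each register is rotated only once */
    }
}
```

This costs one permute plus one blend per register, compared with a single permute2f128 per register for the shift by 4.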

Another approach is to store the long vector from 5..7 ymm registers to memory (L1d cache) and load it back from a shifted address. But it appears even slower because of the bulk memory read/write, and I cannot be sure that the data will not travel through the whole cache hierarchy and make the memory controller slow down the main memory load/store streams on a multi-core CPU.
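For reference, the store-and-reload variant can be sketched as below. The function name shift_via_scratch, the scratch buffer size and the in/out interface are assumptions for illustration, and the target attribute is GCC/Clang-specific. The sketch only shows the data movement; it says nothing about how the cache hierarchy behaves.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: in holds (n + 1) * 8 floats (n data registers plus
   one register of incoming data); out receives n * 8 floats with
   out[i] == in[i + k], for a shift distance k of 0..8 floats. */
__attribute__((target("avx")))
void shift_via_scratch(const float *in, float *out, int n, int k)
{
    assert(n >= 1 && n <= 15 && k >= 0 && k <= 8);
    _Alignas(32) float scratch[16 * 8];
    __m256 r[16];

    for (int i = 0; i <= n; ++i)      /* stand-in for register-resident data */
        r[i] = _mm256_loadu_ps(in + 8 * i);

    for (int i = 0; i <= n; ++i)      /* spill everything with aligned stores */
        _mm256_store_ps(scratch + 8 * i, r[i]);

    for (int i = 0; i < n; ++i)       /* reload k floats further along */
        r[i] = _mm256_loadu_ps(scratch + 8 * i + k);

    for (int i = 0; i < n; ++i)
        _mm256_storeu_ps(out + 8 * i, r[i]);
}
```

Each reload with k not equal to 0 or 8 straddles two of the earlier 32-byte stores. On many cores such a load cannot be served by store forwarding, which may be one reason this variant measures slower than the in-register shifts.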


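With the AVX512VL instruction set I see the two-source permutation instruction _mm512_permutex2var_ps (it also needs a full zmm index vector to select the float32 values for the permutation; the zmm form itself requires only AVX512F, while the ymm form _mm256_permutex2var_ps requires AVX512VL).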

It can replace _mm256_permute2f128_ps for any number of float32 values taken from the first and the second zmm register, so it allows a shift by any number of float32 values between registers in 1 instruction.
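A minimal sketch of how such a shift could look with the zmm form follows. The function names, the buffer interface and the runtime CPU check are hypothetical; the target attribute and __builtin_cpu_supports are GCC/Clang-specific. The only assumption about the instruction itself is its documented index format: bits 3:0 of each index pick the element and bit 4 picks the second source.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: out[i] = in[i + k] for i = 0..15, where in holds
   two consecutive zmm registers (32 floats) and k is 0..16. */
__attribute__((target("avx512f")))
static void shift_k_zmm(const float *in, float *out, int k)
{
    const __m512i lane = _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                          7, 6, 5, 4, 3, 2, 1, 0);
    /* index i + k: values 0..15 select from a, values 16..31 select from b */
    const __m512i idx = _mm512_add_epi32(lane, _mm512_set1_epi32(k));

    __m512 a = _mm512_loadu_ps(in);
    __m512 b = _mm512_loadu_ps(in + 16);
    _mm512_storeu_ps(out, _mm512_permutex2var_ps(a, idx, b));
}

/* Returns 1 if the shift was performed, 0 if the CPU lacks AVX512F. */
int shift_k_zmm_checked(const float *in, float *out, int k)
{
    assert(k >= 0 && k <= 16);
    if (!__builtin_cpu_supports("avx512f"))
        return 0;
    shift_k_zmm(in, out, k);
    return 1;
}
```

In a chain over many registers the index vector would be built once and reused for every register pair, so the per-register cost is a single permute.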

But AVX512VL CPUs are still not common among end users, even in 2021. Maybe in 10 years they will be, but not now.

Are there other approaches for non-AVX512VL CPUs that shift long float32 vectors faster across a number of ymm (and maybe zmm) registers, by any step measured in float32 values?
