Intel® ISA Extensions

Looking for the smartest way to insert a DWORD into an AVX register

Elmar
Beginner

Hi all,

I'm looking for the smartest (= fastest) way to insert a DWORD into an AVX register.

Here is what I found so far:

AVX vinsertps doesn't work, because it zeroes the upper 128 bits of the destination and its immediate can't address the upper 128 bits anyway.

AVX vpinsrd doesn't work for the same reason and (truly sad, unless the docs are wrong) hasn't been promoted to 256 bits in AVX2, even though the immediate has room to encode an insert position in a 256-bit vector as well.

There are lots of multi-instruction workarounds I can think of, but I was hoping the Intel engineers have a smart trick for this basic operation that I overlooked.

Thanks,

Elmar

8 Replies
SergeyKostrov
Valued Contributor II
Did you consider the _mm256_broadcastd_epi32 intrinsic function?

    /*
     * Scalar to 128/256-bit vector broadcast operations.
     */
    extern __m256i __ICL_INTRINCC _mm256_broadcastd_epi32( __m128i );

Elmar
Beginner

Hi Sergey,

thanks, but vpbroadcastd fills the entire vector. I want to insert a single dword at a given location (like vpinsrd), and I want to do that fast, without consuming an extra temporary register (combining a vpbroadcastd with a vpblendd, for example, is a workaround that needs an extra register).
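For readers who want to see the broadcast-plus-blend workaround spelled out, here is a minimal sketch with intrinsics. The helper name insert_dword5_avx2 and the fixed position 5 are made up for illustration, the target attribute assumes GCC or Clang, and the blend mask must be a compile-time constant. It needs AVX2 and, as noted, one scratch register.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread): overwrite dword 5 of an
 * 8-dword vector with x via broadcast + blend. Requires AVX2. */
__attribute__((target("avx2")))
void insert_dword5_avx2(int32_t v[8], int32_t x)
{
    assert(v != NULL);
    __m256i vec = _mm256_loadu_si256((const __m256i *)v);
    /* Scratch register with x in all eight dwords
     * (typically compiled to vmovd + vpbroadcastd). */
    __m256i tmp = _mm256_set1_epi32(x);
    /* vpblendd: bit i of the immediate set -> take dword i from tmp. */
    vec = _mm256_blend_epi32(vec, tmp, 1 << 5);
    _mm256_storeu_si256((__m256i *)v, vec);
}
```

The load and store are only there so the sketch can be called from ordinary C code; in a real kernel the vector would stay in a register.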

CU,

Elmar

SergeyKostrov
Valued Contributor II
What about these two intrinsic functions:

    ...
    extern __m256i __ICL_INTRINCC _mm256_set_epi32( int, int, int, int, int, int, int, int );
    ...

and

    ...
    extern __m256i __ICL_INTRINCC _mm256_setr_epi32( int, int, int, int, int, int, int, int );
    ...

Examples of application for _mm256_set_epi32 could look like:

    ...
    __m256i v1 = _mm256_set_epi32( 0, 77, 0, 0, 0, 0, 0, 0 );

or

    __m256i v2 = _mm256_set_epi32( 0, 0, 0, 0, 0, 0, 77, 0 );
    ...
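As a side note on argument order, the sketch below stores both variants to memory so the element positions can be checked. The helper names build_with_set and build_with_setr are hypothetical, and the target attribute assumes GCC or Clang. Note that both intrinsics construct a complete new vector; they do not modify one element of an existing register.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: _mm256_set_epi32 lists elements from the
 * highest (7) down to the lowest (0). */
__attribute__((target("avx")))
void build_with_set(int32_t out[8])
{
    assert(out != NULL);
    __m256i v = _mm256_set_epi32(0, 77, 0, 0, 0, 0, 0, 0);
    _mm256_storeu_si256((__m256i *)out, v);   /* 77 lands in out[6] */
}

/* Hypothetical helper: _mm256_setr_epi32 lists elements in memory
 * order, from the lowest (0) up to the highest (7). */
__attribute__((target("avx")))
void build_with_setr(int32_t out[8])
{
    assert(out != NULL);
    __m256i v = _mm256_setr_epi32(0, 77, 0, 0, 0, 0, 0, 0);
    _mm256_storeu_si256((__m256i *)out, v);   /* 77 lands in out[1] */
}
```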
andysem
New Contributor III

@Sergey Kostrov: These are multi-instruction constructs, which basically come down to broadcasts or moves plus shuffles. And the OP seems to want to insert a single dword into an existing register that is already filled with data.

I don't think there is a single instruction for this. But depending on the surrounding code and actual requirements you could use vshufps on the two registers (the one with the old contents and the other with the loaded dword). The downside is that the lower and upper halves of ymm are shuffled the same way, so you'll have to insert two values this way. This can be mitigated by copying the half to be preserved from the original data register to the dword register first (see vperm2f128). But you save a mask register that would be needed in case of blend.

bronxzv
New Contributor II

Elmar wrote:
AVX vinsertps doesn't work because it clears the upper 128bits and the immediate value can't address the upper 128bits anyway

Simply use vinsertps followed by vinsertf128; this is the fastest available option AFAIK. I use it, for example, in my legacy AVX generic gather path, detailed here: http://software.intel.com/en-us/comment/reply/285867/1740679
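One possible reading of this suggestion, applied to a register that already holds data, is sketched below. The post does not spell out the exact sequence, so the vextractf128 step for the upper lane is an assumption, as are the helper names, the fixed positions 2 and 5, and the GCC/Clang target attribute.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: overwrite float 2 (low 128-bit lane) with x. */
__attribute__((target("avx")))
void insert_float2_low(float v[8], float x)
{
    assert(v != NULL);
    __m256 vec = _mm256_loadu_ps(v);
    __m128 lo  = _mm256_castps256_ps128(vec);     /* low half, no instruction */
    /* vinsertps: immediate bits 5:4 = destination slot 2, source slot 0. */
    lo  = _mm_insert_ps(lo, _mm_set_ss(x), 2 << 4);
    vec = _mm256_insertf128_ps(vec, lo, 0);       /* vinsertf128 into low lane */
    _mm256_storeu_ps(v, vec);
}

/* Hypothetical helper: overwrite float 5 (slot 1 of the high lane) with x. */
__attribute__((target("avx")))
void insert_float5_high(float v[8], float x)
{
    assert(v != NULL);
    __m256 vec = _mm256_loadu_ps(v);
    __m128 hi  = _mm256_extractf128_ps(vec, 1);   /* vextractf128 (assumed step) */
    hi  = _mm_insert_ps(hi, _mm_set_ss(x), 1 << 4);
    vec = _mm256_insertf128_ps(vec, hi, 1);       /* vinsertf128 into high lane */
    _mm256_storeu_ps(v, vec);
}
```

In this form the low-lane case takes two instructions and the high-lane case three. When a vector is assembled from scratch, as in a gather, the extract is not needed.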

 

Elmar
Beginner

andysem wrote:

I don't think there is a single instruction for this. But depending on the surrounding code and actual requirements you could use vshufps on the two registers (the one with the old contents and the other with the loaded dword). The downside is that the lower and upper halves of ymm are shuffled the same way, so you'll have to insert two values this way. This can be mitigated by copying the half to be preserved from the original data register to the dword register first (see vperm2f128). But you save a mask register that would be needed in case of blend.

But vshufps can only insert a QWORD, since two adjacent DWORDs must come from the same operand, no?

For inserting a DWORD, I currently use vinsertps or vpermilps to place the DWORD at the right spot in an unused register, and then vblendps to move the DWORD into the target register (note that vblendps takes an immediate blend mask, not a mask register). If the DWORD has to cross a lane, I need a third instruction for the cross-lane shuffle.
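A rough sketch of that sequence with intrinsics is shown below. The helper names and the fixed positions 2 and 5 are made up for illustration, vperm2f128 is only one assumed choice for the cross-lane step, and the target attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: overwrite float 2 (low lane) with x. */
__attribute__((target("avx")))
void blend_float2(float v[8], float x)
{
    assert(v != NULL);
    __m256 vec = _mm256_loadu_ps(v);
    /* x sits in slot 0 of a scratch register; its upper lane is undefined,
     * which is fine because the blend below never reads it. */
    __m256 tmp = _mm256_castps128_ps256(_mm_set_ss(x));
    tmp = _mm256_permute_ps(tmp, 0x00);           /* vpermilps: slot 0 -> every slot of each lane */
    vec = _mm256_blend_ps(vec, tmp, 1 << 2);      /* vblendps with an immediate mask */
    _mm256_storeu_ps(v, vec);
}

/* Hypothetical helper: overwrite float 5 (high lane) with x.
 * Needs one extra cross-lane instruction. */
__attribute__((target("avx")))
void blend_float5(float v[8], float x)
{
    assert(v != NULL);
    __m256 vec = _mm256_loadu_ps(v);
    __m256 tmp = _mm256_castps128_ps256(_mm_set_ss(x));
    tmp = _mm256_permute2f128_ps(tmp, tmp, 0x00); /* vperm2f128: low lane -> both lanes */
    tmp = _mm256_permute_ps(tmp, 0x00);           /* vpermilps within each lane */
    vec = _mm256_blend_ps(vec, tmp, 1 << 5);
    _mm256_storeu_ps(v, vec);
}
```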

I had hoped that Intel engineers would immediately fire the optimal solution at me (in terms of false dependencies, latency etc.), but it seems that they are busy (hopefully cleaning up the AVX2 manual #319433-014, because that's full of bugs ;-))...

Thanks,

Elmar

SergeyKostrov
Valued Contributor II
Elmar, I did a verification, and with these intrinsics:

>>...
>>__m256i v1 = _mm256_set_epi32( 0, 77, 0, 0, 0, 0, 0, 0 );
>>or
>>__m256i v2 = _mm256_setr_epi32( 0, 0, 0, 0, 0, 0, 77, 0 );
>>...

a performance impact is possible, and an implementation of similar functionality with native instructions could be faster. Please do a performance evaluation if you decide to use these two intrinsic functions.

andysem
New Contributor III

> But vshufps can only insert a QWORD, since two adjacent DWORDs must come from the same operand, no?

You're right, sorry for the confusion. It seems inserts and blends are the way to go.
