How can I clear the upper 128 bits of m2:
__m256i m2 = _mm256_set1_epi32(2);
__m128i m1 = _mm_set1_epi32(1);
m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2));
m2 = _mm256_castsi128_si256(m1);
These don't work -- Intel’s documentation for the _mm256_castsi128_si256 intrinsic says that “the upper bits of the resulting vector are undefined”.
At the same time I can easily do it in assembly:
VMOVDQA xmm2, xmm2
VMOVDQA xmm2, xmm1
Of course, I'd rather not use _mm256_insertf128_si256().
---
Do you mean like
#ifdef __AVX__
_mm256_zeroupper();
#endif
I use this to enable gcc to compile SSE intrinsics. icc translates SSE intrinsics to AVX-128 and suppresses the zeroupper(), so maybe this is confusing. I haven't checked what MSVC does.
---
Tim,
I want to zero the higher 128-bit part of ONE variable only.
In other words, I'm trying to mimic VMOVDQA xmm2, xmm1 (which clears the upper half of ymm2), preferably with one intrinsic.
---
m2 = _mm256_permute2x128_si256(m2, m2, 0x80);
---
andysem:
_mm256_permute2x128_si256 is even slower than _mm256_insertf128_si256.
Actually, I want the compiler to use VMOVDQA.
m2 = _mm256_castsi128_si256(m1);
mostly uses VMOVDQA and clears the upper half, but I can't rely on that
until it is explicitly stated by compiler developers.
---
Tim,
Sorry for the off-topic question.
Do you have any problems with conditional compilation of AVX intrinsics when predefined __AVX__ is used?
---
Vladimir Sedach wrote:
Actually, I want the compiler to use VMOVDQA.
m2 = _mm256_castsi128_si256(m1);
mostly uses VMOVDQA and clears the upper half, but I can't rely on that
until it is explicitly stated by compiler developers.
_mm256_castsi128_si256 is basically a no-op and is only there to perform type casts. In conjunction with other intrinsics it can be completely elided from the resulting code.
What you request is actually a new intrinsic, which might not be a bad idea.
---
andysem,
Yes, it could be an intrinsic.
Though I would rather allow implicit conversions (with zero extension if needed) between vector types that differ only in element number:
__m128d md1;
__m256d md2 = md1; //zero extension
md1 = md2; //truncation
After all, this is already done with byte "arrays":
unsigned char c;
unsigned int i = c; //zero extension to 4-byte array
c = i; //truncation
---
GCC 4.8 recognizes this pattern:
__m256i ymm = _mm256_inserti128_si256(_mm256_setzero_si256(), xmm, 0);
and emits a "vmovdqa xmmN, xmmN" instruction that clears the upper lane. It could potentially optimize this instruction away as well when it knows the original xmm value was produced by a VEX-encoded instruction (which is almost always the case), but it doesn't. I think this is as close as you can get to hand-written assembly.
---
This question is still relevant. Now we have the same problem zero-extending 256-bit vectors to 512 bits.
_mm512_castsi256_si512 works most of the time, but an optimizing compiler can break it because the upper part is officially undefined, and in rare cases it actually does. I have seen some nasty errors because of this.
---
Can you use _mm512_maskz_mov_epi64 and similar intrinsics with a constant mask?
---
Did you consider using e.g.
VPOR xmmN, xmmN, xmmN
to zero the upper halves of YMM registers and
VPOR ymmN, ymmN, ymmN
to zero the upper halves of ZMM registers (using suitable intrinsics, perhaps combined with the effectively no-op "casting" ones discussed above)?
---
You can't do VPOR + casts with intrinsics, because the casts don't guarantee any particular contents in the upper bits. They may be zero by coincidence, or they may be something else.
---
I don't think the casts will alter the value in the registers at all. They merely provide a means to re-interpret layout and/or size of the registers.
---
The casts give the compiler a license to cheat with the upper bits. For example:
_mm512_castsi256_si512(_mm256_or_si256(_mm512_castsi512_si256(mm), _mm512_castsi512_si256(mm)))
Since upper bits are undefined and _mm256_or_si256 doesn't change the lower bits, the compiler is allowed to completely eliminate this code and directly use mm.