Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.

How to clear the upper 128 bits of __m256 value?

Vladimir_Sedach
New Contributor I
1,673 Views

How can I clear the upper 128 bits of m2:
__m256i    m2 = _mm256_set1_epi32(2);
__m128i    m1 = _mm_set1_epi32(1);

m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2));
m2 = _mm256_castsi128_si256(m1);
Neither of these works: Intel’s documentation for the _mm256_castsi128_si256 intrinsic says that “the upper bits of the resulting vector are undefined”.
At the same time I can easily do it in assembly:
VMOVDQA xmm2, xmm2
VMOVDQA xmm2, xmm1

Of course, I'd rather not use _mm256_insertf128_si256().
 
14 Replies
TimP
Honored Contributor III

Do you mean like

#ifdef __AVX__
_mm256_zeroupper();
#endif

This is for use when compiling SSE intrinsics with gcc. icc translates SSE intrinsics to AVX-128 and suppresses the zeroupper(), so maybe this is confusing. I haven't checked what MSVC does.

Vladimir_Sedach
New Contributor I

Tim,

I want to zero the higher 128-bit part of ONE variable only.
In other words, I want to figure out how to mimic VMOVDQA xmm2, xmm1 (which clears the upper half of ymm2), preferably with a single intrinsic.

andysem
New Contributor III

m2 = _mm256_permute2x128_si256(m2, m2, 0x80);
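
For reference, here is a minimal sketch of the permute approach. In the immediate of _mm256_permute2x128_si256, bits [1:0] select the source of the low lane and bit 7 zeroes the high lane, so 0x80 keeps the low 128 bits and clears the upper 128. The helper names and the TARGET_AVX2 macro are made up for this illustration, and the attribute is a GCC/Clang-specific assumption that lets the file build without -mavx2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang function attribute, so no -mavx2 flag is required.
   With MSVC, or with -mavx2 on the command line, it can be dropped. */
#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: keep the low 128 bits of v, zero the upper 128.
   imm8 = 0x80: bits [1:0] = 0 copy the low lane of the first operand
   into the low lane of the result; bit 7 = 1 zeroes the high lane. */
TARGET_AVX2 static __m256i clear_upper_permute(__m256i v)
{
    return _mm256_permute2x128_si256(v, v, 0x80);
}

/* Returns 1 if the result is {2,2,2,2,0,0,0,0}, otherwise 0. */
TARGET_AVX2 int clear_upper_permute_ok(void)
{
    int32_t out[8];
    __m256i m2 = _mm256_set1_epi32(2);

    _mm256_storeu_si256((__m256i *)out, clear_upper_permute(m2));
    for (int i = 0; i < 4; ++i)
        if (out[i] != 2) return 0;
    for (int i = 4; i < 8; ++i)
        if (out[i] != 0) return 0;
    return 1;
}
```

This is correct by the documented semantics of the intrinsic, but it compiles to a lane-crossing permute rather than a plain register move.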

Vladimir_Sedach
New Contributor I

andysem:
_mm256_permute2x128_si256 is even slower than _mm256_insertf128_si256.
Actually, I want the compiler to use VMOVDQA.

m2 = _mm256_castsi128_si256(m1);
mostly uses VMOVDQA and clears the upper half, but I can't rely on that
until it is explicitly stated by compiler developers.

Bernard
Valued Contributor I

Tim

Sorry for off topic question.

Do you have any problems with conditional compilation of AVX intrinsics when predefined __AVX__ is used?

andysem
New Contributor III

Vladimir Sedach wrote:

Actually, I want the compiler to use VMOVDQA.

m2 = _mm256_castsi128_si256(m1);
mostly uses VMOVDQA and clears the upper half, but I can't rely on that
until it is explicitly stated by compiler developers.

_mm256_castsi128_si256 is basically a no-op and is only there to perform type casts. In conjunction with other intrinsics it can be completely elided from the resulting code.

What you request is actually a new intrinsic, which might not be a bad idea.
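
An intrinsic of this kind does appear in newer compilers' <immintrin.h>: _mm256_zextsi128_si256 (with 512-bit siblings such as _mm512_zextsi256_si512) is documented as zero-extending, unlike the cast. Below is a minimal sketch, assuming a compiler recent enough to ship it; older releases may not. The function name and the TARGET_AVX2 macro are hypothetical.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang function attribute, so no -mavx2 flag is required. */
#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Returns 1 if zero-extending {1,1,1,1} gives {1,1,1,1,0,0,0,0}. */
TARGET_AVX2 int zext_128_to_256_ok(void)
{
    int32_t out[8];
    __m128i m1 = _mm_set1_epi32(1);

    /* Unlike _mm256_castsi128_si256, the upper 128 bits are defined as zero. */
    __m256i m2 = _mm256_zextsi128_si256(m1);

    _mm256_storeu_si256((__m256i *)out, m2);
    for (int i = 0; i < 4; ++i)
        if (out[i] != 1) return 0;
    for (int i = 4; i < 8; ++i)
        if (out[i] != 0) return 0;
    return 1;
}
```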

Vladimir_Sedach
New Contributor I

andysem,

Yes, it could be an intrinsic.
Though I would rather allow implicit conversions (with zero extension if needed) between vector types that differ only in element count:


__m128d md1;
__m256d md2 = md1; //zero extension
md1 = md2; //truncation 

after all, it is already being done with byte "arrays":


unsigned char c;
unsigned int i = c; //zero extension to 4-byte array
c = i; //truncation 

andysem
New Contributor III

GCC 4.8 recognizes this pattern:

__m256i ymm = _mm256_inserti128_si256(_mm256_setzero_si256(), xmm, 0);

and inserts a "vmovdqa xmmN, xmmN" instruction that clears the upper lane. It could potentially optimize away this instruction as well if it is known that the original xmm was filled with a VEX-encoded instruction (which is almost always the case), but it doesn't do that. I think this is as close as you can get to hand-written assembly.
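
The pattern above can be wrapped in a small helper. This is only a sketch: the function names and the TARGET_AVX2 macro are hypothetical, and which instruction the compiler actually emits depends on the compiler and its version. The result is well defined either way, because every bit comes from a defined source.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang function attribute, so no -mavx2 flag is required. */
#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: put x in the low lane of an all-zero 256-bit vector. */
TARGET_AVX2 static __m256i zero_extend_insert(__m128i x)
{
    return _mm256_inserti128_si256(_mm256_setzero_si256(), x, 0);
}

/* Returns 1 if the result is {1,1,1,1,0,0,0,0}, otherwise 0. */
TARGET_AVX2 int zero_extend_insert_ok(void)
{
    int32_t out[8];
    __m128i xmm = _mm_set1_epi32(1);

    _mm256_storeu_si256((__m256i *)out, zero_extend_insert(xmm));
    for (int i = 0; i < 4; ++i)
        if (out[i] != 1) return 0;
    for (int i = 4; i < 8; ++i)
        if (out[i] != 0) return 0;
    return 1;
}
```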
 
AFog0
Beginner

This question is still relevant. Now we have the same problem zero-extending 256-bit vectors to 512 bits.

_mm512_castsi256_si512 works most of the time, but an optimizing compiler can mess it up because the upper part is officially undefined. And it actually does in rare cases. I have seen some nasty errors because of this.

andysem
New Contributor III

Can you use _mm512_maskz_mov_epi64 and similar intrinsics with a constant mask?
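
A minimal sketch of that idea, assuming AVX-512F is available: a zero-masking move with the constant mask 0x0F keeps the four low 64-bit elements and zeroes the four high ones, so the undefined upper half produced by the cast never reaches the result. The helper names and the TARGET_AVX512F macro are hypothetical.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang function attribute, so no -mavx512f flag is required. */
#if defined(__GNUC__)
#define TARGET_AVX512F __attribute__((target("avx512f")))
#else
#define TARGET_AVX512F
#endif

/* Hypothetical helper: zero-extend a 256-bit vector to 512 bits.
   Mask bits 0..3 select elements 0..3 (the low 256 bits); elements 4..7
   have their mask bit clear and are therefore written as zero. */
TARGET_AVX512F static __m512i zero_extend_256_to_512(__m256i v)
{
    return _mm512_maskz_mov_epi64((__mmask8)0x0F, _mm512_castsi256_si512(v));
}

/* Returns 1 if the result is {3,3,3,3,0,0,0,0}, otherwise 0. */
TARGET_AVX512F int zero_extend_256_to_512_ok(void)
{
    int64_t out[8];
    __m256i v = _mm256_set1_epi64x(3);

    _mm512_storeu_si512(out, zero_extend_256_to_512(v));
    for (int i = 0; i < 4; ++i)
        if (out[i] != 3) return 0;
    for (int i = 4; i < 8; ++i)
        if (out[i] != 0) return 0;
    return 1;
}
```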

Beulich__Jan
Beginner

Did you consider using e.g.

VPOR xmmN, xmmN, xmmN

to zero the upper halves of YMM registers and

VPOR ymmN, ymmN, ymmN

to zero the upper halves of ZMM registers (using suitable intrinsics, perhaps combined with the effectively no-op "casting" ones discussed above)?

andysem
New Contributor III

You can't do VPOR + casts with intrinsics because the casts don't guarantee any particular contents in the upper bits. It may be zero by coincidence, it may be something else.

Beulich__Jan
Beginner

I don't think the casts will alter the value in the registers at all. They merely provide a means to re-interpret layout and/or size of the registers.

andysem
New Contributor III

The casts give the compiler a license to cheat with the upper bits. For example:

 

_mm512_castsi256_si512(_mm256_or_si256(_mm512_castsi512_si256(mm), _mm512_castsi512_si256(mm)))

 

Since upper bits are undefined and _mm256_or_si256 doesn't change the lower bits, the compiler is allowed to completely eliminate this code and directly use mm.
