How can I clear the upper 128 bits of m2:
__m256i m2 = _mm256_set1_epi32(2);
__m128i m1 = _mm_set1_epi32(1);
m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2));
m2 = _mm256_castsi128_si256(m1);
These don't work -- Intel’s documentation for the _mm256_castsi128_si256 intrinsic says that “the upper bits of the resulting vector are undefined”.
At the same time I can easily do it in assembly:
VMOVDQA xmm2, xmm2
VMOVDQA xmm2, xmm1
Of course, I'd rather not use _mm256_insertf128_si256().
---
Do you mean like
#ifdef __AVX__
_mm256_zeroupper();
#endif
I use this to enable gcc to compile SSE intrinsics. icc translates SSE intrinsics to AVX-128 and suppresses the zeroupper(), so maybe this is confusing. I haven't checked what MSVC does.
---
Tim,
I want to zero the higher 128-bit part of ONE variable only.
In other words, I'm trying to mimic VMOVDQA xmm2, xmm1 (which clears the upper half of ymm2), preferably with one intrinsic.
---
m2 = _mm256_permute2x128_si256(m2, m2, 0x80);
---
andysem:
_mm256_permute2x128_si256 is even slower than _mm256_insertf128_si256.
Actually, I want the compiler to use VMOVDQA.
m2 = _mm256_castsi128_si256(m1);
mostly uses VMOVDQA and clears the upper half, but I can't rely on that
until it is explicitly stated by compiler developers.
---
Tim,
Sorry for the off-topic question.
Do you have any problems with conditional compilation of AVX intrinsics when predefined __AVX__ is used?
---
Vladimir Sedach wrote:
Actually, I want the compiler to use VMOVDQA.
m2 = _mm256_castsi128_si256(m1);
mostly uses VMOVDQA and clears the upper half, but I can't rely on that
until it is explicitly stated by compiler developers.
_mm256_castsi128_si256 is basically a no-op and is only there to perform type casts. In conjunction with other intrinsics it can be completely elided from the resulting code.
What you request is actually a new intrinsic, which might not be a bad idea.
---
andysem,
Yes, it could be an intrinsic.
Though I would rather allow implicit conversions (with zero extension if needed) between vector types that differ only in element number:
__m128d md1;
__m256d md2 = md1; //zero extension
md1 = md2; //truncation
After all, this is already done with byte "arrays":
unsigned char c;
unsigned int i = c; //zero extension to 4-byte array
c = i; //truncation
---
GCC 4.8 recognizes this pattern:
__m256i ymm = _mm256_inserti128_si256(_mm256_setzero_si256(), xmm, 0);
and emits a "vmovdqa xmmN, xmmN" instruction that clears the upper lane. It could potentially optimize this instruction away as well when it knows the original xmm value was produced by a VEX-encoded instruction (which is almost always the case), but it doesn't. I think this is as close as you can get to hand-written assembly.
---
This question is still relevant. Now we have the same problem zero-extending 256-bit vectors to 512 bits.
_mm512_castsi256_si512 works most of the time, but an optimizing compiler can break it because the upper part is officially undefined, and in rare cases it actually does. I have seen some nasty errors because of this.
---
Can you use _mm512_maskz_mov_epi64 and similar intrinsics with a constant mask?
---
Did you consider using e.g.
VPOR xmmN, xmmN, xmmN
to zero the upper halves of YMM registers and
VPOR ymmN, ymmN, ymmN
to zero the upper halves of ZMM registers (using suitable intrinsics, perhaps combined with the effectively no-op "casting" ones discussed above)?
---
You can't do VPOR + casts with intrinsics, because the casts don't guarantee any particular contents in the upper bits. They may be zero by coincidence, or they may be something else.
---
I don't think the casts will alter the value in the registers at all. They merely provide a means to re-interpret layout and/or size of the registers.
---
The casts give the compiler a license to cheat with the upper bits. For example:
_mm512_castsi256_si512(_mm256_or_si256(_mm512_castsi512_si256(mm), _mm512_castsi512_si256(mm)))
Since upper bits are undefined and _mm256_or_si256 doesn't change the lower bits, the compiler is allowed to completely eliminate this code and directly use mm.