Hi,
I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using Linux version):
1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.
2. The signature of _mm256_undefined_si256() shows the return type as __m256; it should be __m256i.
3. Some instruction descriptions are inconsistent: e.g. _mm_adds_epi8 describes the operation in terms of SignedSaturate, while _mm256_adds_epi16 uses SaturateToSignedWord. The same applies to operations with unsigned saturation, and the vector elements are also described differently. A more consistent description would be nice.
4. _mm_alignr_epi8 has two descriptions.
5. I'm not sure the _mm_ceil_pd signature and description are correct. It says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?
I didn't read through all the instructions, so there may be more issues; I'll post if I find anything else.
PS: This is not a bug per se, but some instructions are missing Latency & Throughput information. This mostly affects newer instructions, but the information is useful and I hope it will be added.
Shift clarification:
When describing shift operations, it would help to clarify the shift amount by using the word "byte" or "bit", as in the documentation at https://software.intel.com/en-us/node/524238.
For example:
__m128i _mm_srli_epi32 (__m128i a, int imm8)
...
Shift packed 32-bit integers in a right by imm8 bytes while shifting in zeros, and store the results in dst.
Thank you and good day.
Groarke, Philippe wrote:
__m128i _mm_srli_epi32 (__m128i a, int imm8)
...
Shift packed 32-bit integers in a right by imm8 bytes while shifting in zeros, and store the results in dst.
_mm_srli_epi32 operates on bits. By default, shift and rotate operations work in terms of bits; _mm_srli_si128/_mm_slli_si128 and their equivalents for larger vectors are the exceptions.
Wrong return type for the `rdtsc` intrinsic
I've found an issue with the description of the `rdtsc` intrinsic.
This instruction reads the CPU's timestamp counter, a 64-bit monotonically increasing unsigned integer. However, the docs say the return type is a signed `__int64`. Both GCC and Clang expose this intrinsic as returning an unsigned long long, not a signed one. I believe this should be fixed.
Hi,
The AVX512-VNNI instruction vpdpbusd doesn't show up in the guide, as far as I can see.
The Guide shows various permute*epi8 intrinsics as available in "AVX512_VBMI + AVX512VL", but aren't they only in VBMI?
Kearney, Jim wrote:
The Guide shows various permute*epi8 intrinsics as available in "AVX512_VBMI + AVX512VL", but aren't they only in VBMI?
Instructions that operate on 128- or 256-bit vectors require AVX-512VL in addition to AVX-512VBMI. 512-bit vectors require only AVX-512VBMI.
andysem wrote:
Kearney, Jim wrote:
The Guide shows various permute*epi8 intrinsics as available in "AVX512_VBMI + AVX512VL", but aren't they only in VBMI?
Instructions that operate on 128 or 256-bit vectors require AVX-512VL in addition to AVX-512VBMI. 512-bit vectors only require AVX-512VBMI.
Sorry, I phrased that poorly.
I meant that the presentation in the Guide acts as though "+" means "or". If one enables only AVX-512VL under Technologies, the _mm_ and _mm256_ variants are shown as available. I would expect to have to check AVX-512VBMI as well ("and").
Hi,
Does _mm_loadl_epi64 (movq xmm, m64) have alignment requirements? The documentation suggests that _mm_load_si128 has a 16-byte alignment requirement, while _mm_loadu_si128 has none, but as far as I can tell it doesn't mention the alignment requirements for _mm_loadl_epi64.
Thanks!
Armstrong, Brian wrote:
Does _mm_loadl_epi64 (movq xmm, m64) have alignment requirements?
SDM Volume 1, Section 4.1.1 "Alignment of Words, Doublewords, Quadwords, and Double Quadwords" summarizes memory alignment requirements for most instructions. In particular, it says that words, doublewords and quadwords need not be aligned.
I don't think the currently defined intrinsics for SGX (_encls_u32, _enclu_u32, _enclv_u32) are appropriate. The leafs differ so much that they should really be considered different instructions. No other instructions with defined intrinsics have this issue:
- Different leafs do completely different things
- Different leafs have completely different operands
- Different leafs may have different hardware support (in terms of CPU features)
The operation is of course documented in the SDM, but I've summarized all the leafs here: https://github.com/fortanix/rust-sgx/issues/15#issuecomment-447738899
For _mm256_testc_si256 and other similar intrinsics, the Intrinsics Guide lists the corresponding instruction as vptest and the CPUID flag as AVX. vptest is an AVX2 instruction. I'm assuming that for AVX targets the compiler should translate the intrinsic to either the vtestps or vtestpd instruction, but those instructions are not listed.
For _mm_cvtpd_ps it would be useful to say in the description and pseudo-code that the upper half of the resulting register is filled with zero. For _mm_cvtps_pd it would be nice to mention in the description that it converts only the lower 2 elements of the vector.
It's been a while since the last update of the Intrinsics Guide; when will the next update be?
The description of tpause has a few problems:
- The asm syntax is odd, with spurious commas.
- There is no statement about what the result of the intrinsic is. The implementation shows that it is rflags.cf, but that should be stated explicitly.
That's good feedback, thank you James; I'll update accordingly.
andysem, I try to batch several updates together and coordinate with any new intrinsics being announced; I'm still looking for the right date to do a release.
The _rdtsc intrinsic has a return type of __int64 (signed integer), which is incorrect, since the value is read from the CPU's timestamp counter, an unsigned integer.
Existing C compilers (such as gcc) already return an unsigned long long type.
This is an issue for Rust developers implementing Intel's intrinsics, because we use the Intrinsics Guide as a reference for the return types and parameters of the intrinsics.
Intrinsics Guide Data Version: 3.4.4
It looks as if, for the two non-masked operations _mm512_popcnt_epi8() and _mm512_popcnt_epi16(), the POPCNT() operator is missing in the pseudo-code.
Re-wording proposal for _mm512_bitshuffle_epi64_mask()
Per 64-bit element in b and its 8 associated 8-bit elements in c: gather 8 bits from the 64-bit element of b at the bit positions selected by the 8 8-bit elements of c, and store the result in the associated byte of mask k. There are 8 such operations done in parallel per 64-bit lane, each producing one byte of k.
The pseudo-code of all _mm512_dpbusd_epi32() variants looks incorrect.
The for-loop is defined with j, but operator[] uses expressions in i.
I suspect the expressions should read
tmp2 := a.byte[4*j+1] * b.byte[4*j+1] // i.e. everything indexed by j, and 4*j for b as well
For the two non-masked operations _mm512_popcnt_epi32() and _mm512_popcnt_epi64(), the POPCNT() operator is missing in the pseudo-code.
Hi, I found a mistake in the Operation description of every addsub, fmaddsub, subadd and fmsubadd instruction.
The IF-statement inside the loop reads IF (j % 1 == 0) but it should be IF (j % 2 == 0).
Hi,
There is a small typo in _mm256_hsub_ps: it says "Horizontally add adjacent pairs [...]" where it should really say subtract.
Thanks!