Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.

Bugs in Intrinsics Guide

andysem
New Contributor III
26,575 Views

Hi,

I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using Linux version):

1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It sould probably be sized accordingly.

2. __m256 _mm256_undefined_si256 () should return __m256i.

3. In some instructions description, like _mm_adds_epi8, the operation is described in terms of SignedSaturate while, e.g. _mm256_adds_epi16 is described with SaturateToSignedWord. This applies to other operations with unsigned saturation as well. Also, the vector elements are described differently. More consistent description would be nice.

4. _mm_alignr_epi8 has two descriptions.

5. I'm not sure _mm_ceil_pd signature and description is correct. It says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?

I didn't read all instructions so there may be more issues. I'll post if I find anything else.

PS: This is not a bug per se but some instructions are missing the Latency & Throughput information. This mostly relates to newer instructions but still this info is useful and I hope it will be added.

0 Kudos
220 Replies
firelight
Beginner
2,698 Views

Shift clarification:

When describing shift operations, it would help clarify the shift amount by using the word "byte" or "bit" like in the documentation https://software.intel.com/en-us/node/524238.

For example:

__m128i _mm_srli_epi32 (__m128i aint imm8)

...

Shift packed 32-bit integers in a right by imm8 bytes while shifting in zeros, and store the results in dst.

Thank you and good day.

0 Kudos
andysem
New Contributor III
2,698 Views

Groarke, Philippe wrote:

__m128i _mm_srli_epi32 (__m128i a, int imm8)

...

Shift packed 32-bit integers in a right by imm8 bytes while shifting in zeros, and store the results in dst.

 _mm_srli_epi32 operates on bits. By default, shift and rotate operations work in terms of bits. _mm_srli_si128/_mm_slli_si128 and equivalents for larger vectors are exceptions.

 

0 Kudos
Gabriel_M_1
Beginner
2,698 Views

Wrong return type in the `rdtsc` intrinsic

I've found an issue with the description of the `rdtsc` intrinsic.

This instruction is used to read the value of the CPU's timestamp counter, a 64-bit monotonically-increasing unsigned integer. However, the docs say the return type is a signed `__int64`. Both GCC and Clang properly expose this intrinsic as returning an unsigned long long, not a signed one. I believe this should be fixed.

0 Kudos
Eden_S_Intel
Employee
2,698 Views

Hi,

The AVX512-VNNI instructions vpdpbusd doesn't show in the guide as far as i see.

0 Kudos
Kearney__Jim
Beginner
2,698 Views

The Guide shows various permute*epi8 intrinsics as available in "AVX512_VBMI + AVX512VL", but aren't they only in VBMI?

0 Kudos
andysem
New Contributor III
2,698 Views

Kearney, Jim wrote:

The Guide shows various permute*epi8 intrinsics as available in "AVX512_VBMI + AVX512VL", but aren't they only in VBMI?

Instructions that operate on 128 or 256-bit vectors require AVX-512VL in addition to AVX-512VBMI. 512-bit vectors only require AVX-512VBMI.

0 Kudos
Kearney__Jim
Beginner
2,698 Views

andysem wrote:

Quote:

Kearney, Jim wrote:

 

The Guide shows various permute*epi8 intrinsics as available in "AVX512_VBMI + AVX512VL", but aren't they only in VBMI?

Instructions that operate on 128 or 256-bit vectors require AVX-512VL in addition to AVX-512VBMI. 512-bit vectors only require AVX-512VBMI.

Sorry, I phrased that poorly.

I meant, the presentation in the Guide acts as though "+" means "or".  If one enables only AVX-512VL under Technologies, the _mm_ and _mm256_ variants are shown as available.  I would expect to have to check AVX-512_VBMI as well ("and").

 

0 Kudos
Armstrong__Brian
Beginner
2,698 Views

Hi,

Does _mm_loadl_epi64 (movq xmm, m64) have alignment requirements? The documentation seems to suggest that _mm_load_si128 does have a 16-byte alignment requirement, while _mm_loadu_si128 does not have any requirements. It doesn't mention the alignment requirements for _mm_loadl_epi64 as far as I can tell.

Thanks!

0 Kudos
andysem
New Contributor III
2,698 Views

Armstrong, Brian wrote:

Does _mm_loadl_epi64 (movq xmm, m64) have alignment requirements?

SDM Volume 1, Section 4.1.1 Alignment of Words, Doublewords, Quadwords, and Double Quadwords summarizes memory alignment requirements for most instructions. In particular, it says that words, double words and quadwords need not be aligned.

 

0 Kudos
Jethro_B_
Innovator
2,698 Views

I don't think the currently-defined intrinsics for SGX (_encls_u32, _enclu_u32, _enclv_u32) are appropriate. The leafs are so different and should really be considered different instructions. No other instructions that have defined intrinsics have this issue.

  • Different leafs do completely different things
  • Different leafs have completely different operands
  • Different leafs may have different hardware support (in terms of CPU features)

The operation is of course documented in the SDM, but I've summarized all the leafs here https://github.com/fortanix/rust-sgx/issues/15#issuecomment-447738899

0 Kudos
andysem
New Contributor III
2,695 Views

For _mm256_testc_si256 and other similar intrinsics, the Intrinsics Guide lists the corresponding instruction as vptest and CPUID flags as AVX. vptest is an AVX2 instruction. I'm assuming, for AVX targets the compiler should translate the intrinsic to either vtestps or vtestpd instruction, but those instructions are not listed.

0 Kudos
andysem
New Contributor III
2,697 Views

For _mm_cvtpd_ps it would be useful to say in the description and pseudo-code that the upper half of the resulting register is filled with zero. For _mm_cvtps_pd it would be nice to mention in the description that it converts only the lower 2 elements of the vector.

 

There's been a while since the last update of Intrinsics Guide, when will be a new update?

 

0 Kudos
James_C_Intel2
Employee
2,775 Views

The description of tpause has a few problems

  1. the asm syntax is weird, with spurious commas
  2. there is no statement about what the result of the intrinsic is. The implementation shows that it is rflags.cf, but you should say that explicitly.
0 Kudos
Patrick_K_Intel
Employee
2,775 Views

That's good feedback, thank you James, I'll update accordingly.

andysem, I try to batch several updates together and coordinate with any new intrinsics being announced, I'm still looking for the right date to do a release.

0 Kudos
Gabriel_M_1
Beginner
2,775 Views

The _rdtsc intrinsic has a return type of __int64 (signed integer), which is incorrect, since the time is is read from the CPU's timestamp counter, an unsigned integer.

Existing C compilers (such as gcc) already return an unsigned long long type.

This is an issue for Rust developers when implementing Intel's intrinsics, because we use the intrinsic guide as a reference for the return type and parameters of the intrinsics. 

0 Kudos
Stefan_M_Intel1
Employee
2,775 Views

Intrinsics Guide Data Version: 3.4.4

Looks as if for the two non-masked operations _mm512_popcnt_epi8() and _mm512_popcnt_epi16(), the POPCNT() operator is missing in the pseudo-code.

0 Kudos
Stefan_M_Intel1
Employee
2,775 Views

Re-wording proposal for _mm512_bitshuffle_epi64_mask()

Per 64-bit element in b and its 8 associated 8-bit elements in c: Gather 8 bits from 64-bit element in b at bit positions controlled by the 8 8-bit elements of c, and store the result in the associated Byte of mask k. There are 8 such operations done in parallel per 64-bit lane, each producing one Byte in k.

0 Kudos
Stefan_M_Intel1
Employee
2,775 Views

Pseudo code of all _mm512_dpbusd_epi32() looks incorrect.

The for-loop is defined with j, but operator[] is using expressions in i.

I suspect that the expressions shall read

tmp2 := a.byte[4*j+1] * b.byte[4*j+1]  // i.e. all in j and also for b it shall be 4*j

0 Kudos
Stefan_M_Intel1
Employee
2,775 Views

For the two non-masked operations _mm512_popcnt_epi32() and _mm512_popcnt_epi64(), the POPCNT() operator is missing in the pseudo-code.

0 Kudos
Akhundzhanov__Alexan
2,775 Views

Hi, I found a mistake in Operation description for every addsub, fmaddsub, subadd, fmsubadd instructions.

The IF-statement inside loop looks like IF (j % 1 == 0)  but it is IF (j % 2 == 0).

0 Kudos
Lopez__Emilio
Beginner
2,775 Views

Hi,

There is a small typo in _mm256_hsub_ps as it says "Horizontally add adjacent pairs [...]" where really it should say subtract

Thanks!

0 Kudos
Reply