Intel® ISA Extensions

Bugs in Intrinsics Guide

andysem
New Contributor III

Hi,

I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using the Linux version):

1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.

2. __m256 _mm256_undefined_si256 () should return __m256i.

3. In some instruction descriptions, such as _mm_adds_epi8, the operation is described in terms of SignedSaturate, while others, e.g. _mm256_adds_epi16, are described with SaturateToSignedWord. The same applies to the operations with unsigned saturation. The vector elements are also described differently. A more consistent description would be nice.

4. _mm_alignr_epi8 has two descriptions.

5. I'm not sure the _mm_ceil_pd signature and description are correct. It says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?

I didn't read all instructions so there may be more issues. I'll post if I find anything else.

PS: This is not a bug per se, but some instructions are missing the Latency & Throughput information. This mostly affects newer instructions, but the information is still useful and I hope it will be added.

Philip_T_
Beginner

In v3.3.3, _mm_madd_epi16 claims (in the description and operation sections) to saturate the result of the addition. But the description of PMADDWD in the Software Developer's Manual doesn't say any saturation occurs, and actually says it will wrap (when the 16-bit inputs are all 0x8000, which I think is the only case where saturation/wrapping could possibly matter). Some test code confirms that it does wrap, not saturate, so it looks like a bug in the Intrinsics Guide.

(Same applies to all the other intrinsics for PMADDWD/VPMADDWD.)
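The test described above can be sketched in C roughly as follows. This is a minimal sketch, not the poster's actual test code; the function name `madd_epi16_corner_case` is hypothetical, and it assumes an x86 target with SSE2.

```c
#include <emmintrin.h>  /* SSE2: _mm_set1_epi16, _mm_madd_epi16, _mm_cvtsi128_si32 */
#include <stdint.h>

/* Hypothetical helper: returns the lowest 32-bit lane of _mm_madd_epi16
   when every 16-bit input is 0x8000 (-32768). This is the one case where
   wrapping and saturating would give different results. */
int32_t madd_epi16_corner_case(void)
{
    __m128i v = _mm_set1_epi16(INT16_MIN);  /* all eight lanes = 0x8000 */
    __m128i r = _mm_madd_epi16(v, v);       /* each 32-bit lane: 2^30 + 2^30 = 2^31 */
    return _mm_cvtsi128_si32(r);            /* extract the lowest 32-bit lane */
}
```

A wrapping addition yields 0x80000000 (INT32_MIN) here, whereas a saturating one would yield 0x7FFFFFFF, so comparing the return value against INT32_MIN distinguishes the two behaviours.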

Patrick_K_Intel
Employee

Thanks, I've corrected that as well.

Both the 8-bit immediate and pmaddwd issues should be corrected in data version 3.3.4. I don't believe it's live yet; sometimes it takes the web ops people a little while to publish the changes.

bronxzv
New Contributor II

Patrick Konsor (Intel) wrote:
Thanks for reporting this issue. I have updated the documentation around immediate parameters to clarify this better.

Now I see the changes online, neat!

Hans_P_Intel
Employee

It looks like _mm512_set1_pd is only marked as AVX512F although it is available since IMCI.

Jeremias_M_
Beginner

Hi,

I was trying to use the function _mm512_set1_epi32 on KNC, but I received the following error:

On the remote process, dlopen() failed. The error message sent back from the sink is /var/volatile/tmp/coi_procs/1/4087/load_lib/icpcoutMmwX7Q: undefined symbol: _mm512_maskz_sllv_epi32
offload error: cannot load library to the device 0 (error code 20)
On the sink, dlopen() returned NULL. The result of dlerror() is "/var/volatile/tmp/coi_procs/1/4087/load_lib/icpcoutMmwX7Q: undefined symbol: _mm512_maskz_sllv_epi32"

I believe that this function is only available in AVX-512.

On the other hand, the function _mm512_set1_epi32 is available on KNC.

Thanks.

Jonathan_R_1
Beginner

The documentation for _mm_hsub_ps is wrong: the order of the operands in each pair-wise subtraction is reversed.

Given SSE registers A and B (elements listed from highest to lowest):

hsub(A, B) = [B[2] - B[3], B[0] - B[1], A[2] - A[3], A[0] - A[1]]
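The corrected operand order can be written out as a scalar model in C. This is only an illustrative sketch: `hsub_ps_model` is a hypothetical name, plain arrays stand in for the SSE registers, and index 0 is assumed to be the lowest element.

```c
/* Hypothetical scalar model of the corrected _mm_hsub_ps ordering.
   Index 0 is the lowest element of each 4-float vector. In every pair,
   the lower-indexed element is the minuend. */
void hsub_ps_model(const float a[4], const float b[4], float dst[4])
{
    dst[0] = a[0] - a[1];
    dst[1] = a[2] - a[3];
    dst[2] = b[0] - b[1];
    dst[3] = b[2] - b[3];
}
```

For example, with a = {10, 1, 20, 2} and b = {30, 3, 40, 4}, the model produces {9, 18, 27, 36}.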

Patrick_K_Intel
Employee

Thanks, I've corrected the floating-point hsub intrinsics.

Kenny_S_
Beginner

The documentation for the SHA-1 instructions is wrong in several places.

In several places, the shift operation (<<) is written where a rotate (<<<) is supposed to be, such as in _mm_sha1rnds4_epu32:

A[1] := f(B, C, D) + (A << 5) + W[0] + K;
B[1] := A;
C[1] := B << 30;
D[1] := C;
E[1] := D;

FOR i = 1 to 3
  A[i+1] := f(B, C, D) + (A << 5) + W + E + K;
  B[i+1] := A;
  C[i+1] := B << 30;
  D[i+1] := C;
  E[i+1] := D;
ENDFOR;

All of those << should be <<<.
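To illustrate why the distinction matters, here is a small C sketch contrasting a 32-bit left rotate with a plain left shift. The helper names `rotl32` and `shl32` are hypothetical and are not taken from the Intrinsics Guide.

```c
#include <stdint.h>

/* Hypothetical helper: rotate left (the <<< of the SHA-1 pseudocode).
   Bits shifted out at the top re-enter at the bottom. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;                                  /* keep the count in 0..31 */
    return (x << n) | (x >> ((32 - n) & 31)); /* the & 31 avoids a shift by 32 */
}

/* Hypothetical helper: plain left shift (<<). Bits shifted out are lost. */
uint32_t shl32(uint32_t x, unsigned n)
{
    return x << (n & 31);
}
```

With x = 0x80000001 and n = 30, the rotate gives 0x60000000 while the shift gives 0x40000000, because the shift discards the top bit instead of wrapping it around.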

Patrick_K_Intel
Employee

Thank you, I will correct these.

CFR
New Contributor II

Could someone please check on the following?  It looks like too big of an error to not have been already mentioned and fixed, but in the current version 3.3.8 of the Intrinsics Guide...

I believe there is an error in the description for _mm512_srlv_epi32 (and similar intrinsics). The "operation" says that the shift is based on count[i+4:i], but according to what I see in the instruction extension manual, the behavior of SDE (KNL and SKX in particular), and a test program on KNC, the shift is based on the whole value of the count field. It appears that a shift of >31 results in the bits being shifted off the end (result field is zero), not a shift by zero (result field unchanged). The instruction extension manual probably also needs an update, as the EVEX description is not specific about what happens if a shift count is greater than 31 (it is specific for VEX.128, VEX.256, etc.).

I believe there are also errors in the operation sections for other sizes (_mm256_srlv_epi32) and other related instructions (_mm512_sllv_epi32).
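The two interpretations can be contrasted with a scalar model of a single 32-bit element. This is a sketch under stated assumptions: both function names are hypothetical, and the first function encodes the behaviour reported in this post rather than anything measured here.

```c
#include <stdint.h>

/* Hypothetical model of one element of a variable logical right shift,
   using the whole count value as reported above: counts above 31 give 0.
   The explicit branch is needed because shifting a 32-bit value by 32
   or more is undefined behaviour in C. */
uint32_t srlv_lane_full_count(uint32_t a, uint32_t count)
{
    return (count > 31) ? 0u : (a >> count);
}

/* Hypothetical model of the reading criticized above, where only the
   low five bits of the count (count[4:0]) are used. */
uint32_t srlv_lane_low5_count(uint32_t a, uint32_t count)
{
    return a >> (count & 31);
}
```

The two models agree for counts 0 through 31 and diverge from 32 onwards: for a count of 32, the first returns 0 and the second returns the input unchanged.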

 

Patrick_K_Intel
Employee

Thanks for reporting this, I have corrected this information.

Suf_I_
Beginner

 

Hi,

Wouldn't it be clearer in the Intrinsics Guide documentation if "const" were added to the immediate parameter of the (SSE) shuffle functions? For example:

__m128 _mm_shuffle_ps (__m128 a, __m128 b, const unsigned int imm8)

instead of 

__m128 _mm_shuffle_ps (__m128 a, __m128 b, unsigned int imm8)

Thank you

andysem
New Contributor III

It wouldn't change anything from the user's perspective. Parameter constness has no effect on the caller. Even if this is just a documentation change, it doesn't actually communicate the fact that an immediate constant (or, more generally, a constant expression) is required.
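A short C sketch of this point, using a hypothetical function `low_two_bits` rather than a real intrinsic: a top-level const on a by-value parameter does not change the function's type, and callers can still pass a run-time value.

```c
/* Hypothetical example function, not an intrinsic. In C, a top-level
   qualifier on a parameter is ignored when comparing function types,
   so these two declarations are compatible and declare the same function. */
unsigned int low_two_bits(unsigned int imm8);
unsigned int low_two_bits(const unsigned int imm8);

/* Inside the definition, const only stops the body from modifying its
   own copy of the argument. */
unsigned int low_two_bits(const unsigned int imm8)
{
    return imm8 & 3u;
}

/* The caller passes a run-time value without complaint, so the const
   does not communicate that a compile-time constant is required. */
unsigned int call_with_runtime_value(unsigned int runtime_value)
{
    return low_two_bits(runtime_value);
}
```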

 

andysem
New Contributor III

There are a few bugs in the description of the ADX intrinsics.

1. The _addcarry_* intrinsics generate the classic adc instruction, not adcx. As such, they affect both CF and OF and use CF for the carry. The adc instruction was available long before the ADX extension (no cpuid feature required).

2. _addcarryx_* can generate either adcx or adox, at the compiler's choice. As such, it uses either CF or OF for the carry. These intrinsics require the ADX cpuid feature.

3. _subborrow_* indeed generates sbb, which is the counterpart of adc. This instruction is also available in classic IA-32 and does not require the ADX cpuid flag.

4. The SDM defines _addcarry_* and _subborrow_* intrinsics for different integer sizes, from 8 to 64 bits, while the Intrinsics Guide only describes the 32- and 64-bit ones.
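As an illustration of how the carry is chained through _addcarry_u32, here is a minimal C sketch that adds two 64-bit values as pairs of 32-bit limbs. The wrapper name `add64_via_addcarry` is hypothetical; the sketch assumes an x86 compiler whose immintrin.h provides _addcarry_u32 without extra target flags, and it makes no claim about which instruction a given compiler actually emits.

```c
#include <immintrin.h>  /* _addcarry_u32 */
#include <stdint.h>

/* Hypothetical wrapper: 64-bit addition built from two 32-bit additions,
   with the carry out of the low limb fed into the high limb. */
uint64_t add64_via_addcarry(uint64_t x, uint64_t y, unsigned char *carry_out)
{
    unsigned int lo, hi;
    /* Low limbs: no incoming carry. */
    unsigned char c = _addcarry_u32(0, (unsigned int)x, (unsigned int)y, &lo);
    /* High limbs: consume the carry produced by the low limbs. */
    c = _addcarry_u32(c, (unsigned int)(x >> 32), (unsigned int)(y >> 32), &hi);
    *carry_out = c;  /* carry out of the full 64-bit addition */
    return ((uint64_t)hi << 32) | lo;
}
```

Adding 1 to 0xFFFFFFFF exercises the carry between limbs, and adding 1 to UINT64_MAX exercises the final carry out.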

 

Patrick_K_Intel
Employee

Thanks, I will make those corrections.

On point 4, I believe the SDM is in error here, the compiler only defines 32 and 64-bit versions; I will confirm this internally.

andysem
New Contributor III

> On point 4, I believe the SDM is in error here, the compiler only defines 32 and 64-bit versions; I will confirm this internally.

Since the instruction does support 8 and 16-bit operands, it might be a good idea to add those intrinsics to the compiler if not already there.

 

CFR
New Contributor II

Is _mm256_s[rl]li_si256 misnamed? In the other versions, _epi16, _epi32 and _epi64 all seem to indicate the "pocket". It seems to me that they should be _mm256_s[rl]li_si128 (or maybe epi128?), since the shift "pocket" is 128 bits and not 256. This would make them consistent with the convention and have the intrinsic name reflect the actual operation. (I also note that _mm256_bs[rl]li_epi128 seems to be the same underlying instruction, and it is _epi128.)
 

Vladimir_Sedach
New Contributor I

_mm256_bslli_epi128 was added as a synonym of _mm256_slli_si256 because of the misleading name of the latter.
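The per-lane behaviour behind the naming discussion can be modelled in scalar C. This is an illustrative sketch only: `bslli_epi128_model` is a hypothetical name, byte arrays stand in for the 256-bit register with index 0 as the lowest byte, and shift amounts above 15 are assumed to clear each lane.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a byte-wise left shift applied separately
   to each 128-bit lane of a 256-bit vector. Bytes never cross from the
   low lane into the high lane. src and dst must not overlap. */
void bslli_epi128_model(const uint8_t src[32], uint8_t dst[32], unsigned imm)
{
    if (imm > 16) {
        imm = 16;               /* assumed: large counts zero the whole lane */
    }
    memset(dst, 0, 32);         /* vacated bytes are filled with zeros */
    for (unsigned lane = 0; lane < 2; ++lane) {
        for (unsigned i = imm; i < 16; ++i) {
            dst[lane * 16 + i] = src[lane * 16 + i - imm];
        }
    }
}
```

With a shift of one byte, byte 16 of the result is zero rather than a copy of source byte 15, which is what a true 256-bit shift would produce.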

CFR
New Contributor II

Missing mask ops (kand, kmov, knot, etc...) for other than __mmask16?

I think (I can never really be sure I've checked every last corner of the system ;^)) that the Intrinsics Guide (and Parallel Studio XE2016 beta) are missing any mask operations for sizes other than __mmask16. I did find 8-, 32-, and 64-bit versions described in the instruction extensions manual (https://software.intel.com/en-us/intel-architecture-instruction-set-extensions-programming-reference, chapter 6), so I believe they really are supposed to exist.

Without these, doing conditional stuff on anything other than 32-bit "pockets" is really tough.

 

Patrick_K_Intel
Employee

There are 8, 32, and 64-bit instructions for mask operations, but the compiler only defines intrinsics for 16-bit.

CFR
New Contributor II

The word from the compiler folks is that they're not going to provide intrinsics for 8/32/64. If you just use the __mmask8/32/64 types, the compiler will try to implement things as actual k-instructions. I've played around with it a bit and it mostly works. In many cases the compiler will keep everything in k-registers. In other cases it starts moving things back and forth (unnecessarily) between k and normal registers.
