
andysem · ‎01-30-2013

Hi,

I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using Linux version):

1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.

2. __m256 _mm256_undefined_si256 () should return __m256i.

3. In some instruction descriptions, like _mm_adds_epi8, the operation is described in terms of SignedSaturate while, e.g., _mm256_adds_epi16 is described with SaturateToSignedWord. This applies to other operations with unsigned saturation as well. Also, the vector elements are described differently. A more consistent description would be nice.

4. _mm_alignr_epi8 has two descriptions.

5. I'm not sure the _mm_ceil_pd signature and description are correct. It says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?

I didn't read all instructions so there may be more issues. I'll post if I find anything else.

PS: This is not a bug per se, but some instructions are missing the Latency & Throughput information. This mostly relates to newer instructions, but the info is still useful and I hope it will be added.

Philip_T_ · ‎01-13-2015

In v3.3.3, _mm_madd_epi16 claims (in the description and operation sections) to saturate the result of the addition. But the description of PMADDWD in the Software Developer's Manual doesn't say any saturation occurs, and actually says it will wrap (when the 16-bit inputs are all 0x8000, which I think is the only case where saturation/wrapping could possibly matter). Some test code confirms that it does wrap, not saturate, so it looks like a bug in the Intrinsics Guide.

(Same applies to all the other intrinsics for PMADDWD/VPMADDWD.)
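The wrap-versus-saturate difference described above can be sketched for a single 32-bit output lane in plain C. This is only a scalar model of the behaviour reported in the post, not the intrinsic itself; the function names madd_lane_wrap and madd_lane_saturate are made up for illustration.

```c
#include <stdint.h>

/* Scalar model of one 32-bit output lane of PMADDWD as described in the
   post: two signed 16x16 products are added and the sum wraps. */
int32_t madd_lane_wrap(int16_t a0, int16_t b0, int16_t a1, int16_t b1)
{
    int64_t sum = (int64_t)a0 * b0 + (int64_t)a1 * b1;
    if (sum > INT32_MAX)            /* only happens when all four inputs are 0x8000 */
        sum -= (int64_t)1 << 32;    /* wrap around modulo 2^32 */
    return (int32_t)sum;
}

/* What the Intrinsics Guide v3.3.3 text implied: clamp instead of wrap. */
int32_t madd_lane_saturate(int16_t a0, int16_t b0, int16_t a1, int16_t b1)
{
    int64_t sum = (int64_t)a0 * b0 + (int64_t)a1 * b1;
    return sum > INT32_MAX ? INT32_MAX : (int32_t)sum;
}
```

The two models differ only for the all-0x8000 input: the wrapping version yields INT32_MIN, the saturating one INT32_MAX.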

Patrick_K_Intel · ‎01-13-2015

Thanks, I've corrected that as well.

Both the 8-bit immediate and pmaddwd issues should be corrected in data version 3.3.4. I don't believe it's live yet; sometimes it takes the web ops people a little while to publish the changes.

bronxzv · ‎01-15-2015

Patrick Konsor (Intel) wrote:
Thanks for reporting this issue. I have updated the documentation around immediate parameters to clarify this better.

now I see the changes online, neat!

Hans_P_Intel · ‎01-20-2015

It looks like _mm512_set1_pd is only marked as AVX512F although it is available since IMCI.

Jeremias_M_ · ‎01-21-2015

Hi,

I was trying to use the function _mm512_set1_epi32 on KNC, but I received the following error:

On the remote process, dlopen() failed. The error message sent back from the sink is /var/volatile/tmp/coi_procs/1/4087/load_lib/icpcoutMmwX7Q: undefined symbol: _mm512_maskz_sllv_epi32
offload error: cannot load library to the device 0 (error code 20)
On the sink, dlopen() returned NULL. The result of dlerror() is "/var/volatile/tmp/coi_procs/1/4087/load_lib/icpcoutMmwX7Q: undefined symbol: _mm512_maskz_sllv_epi32"

I believe that this function is only available in AVX-512.

On the other hand, the function _mm512_set1_epi32 is available on KNC.

Thanks.

Jonathan_R_1 · ‎02-02-2015

The documentation for _mm_hsub_ps is wrong; the order of the operands in each pair-wise subtraction is reversed.

Given SSE registers A and B,

hsub(A, B) = [B[2] - B[3], B[0] - B[1], A[2] - A[3], A[0] - A[1]]
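As a sketch of the operand order stated above, here is a scalar model in C. It assumes the formula lists elements from highest (index 3) to lowest (index 0), so element 0 of the result is A[0] - A[1]. The name hsub_ps_model is hypothetical; this is not the intrinsic.

```c
/* Scalar model of the horizontal subtract described in the post.
   Arrays are indexed from the lowest element (index 0) upward. */
void hsub_ps_model(const float a[4], const float b[4], float dst[4])
{
    dst[0] = a[0] - a[1];   /* even element minus odd element, not the reverse */
    dst[1] = a[2] - a[3];
    dst[2] = b[0] - b[1];
    dst[3] = b[2] - b[3];
}
```

With a = {1, 2, 3, 5} and b = {10, 20, 30, 50}, the model gives {-1, -2, -10, -20}; the reversed operand order would flip every sign.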

Patrick_K_Intel · ‎02-02-2015

Thanks, I've corrected the floating-point hsub intrinsics.

Kenny_S_ · ‎02-18-2015

The documentation for the SHA-1 instructions is wrong in several places.

Several times, the shift operation (<<) is written where rotate (<<<) is supposed to be. Such as in _mm_sha1rnds4_epu32:

A[1] := f(B, C, D) + (A << 5) + W[0] + K;
B[1] := A;
C[1] := B << 30;
D[1] := C;
E[1] := D;

FOR i = 1 to 3
  A[i+1] := f(B, C, D) + (A << 5) + W + E + K;
  B[i+1] := A;
  C[i+1] := B << 30;
  D[i+1] := C;
  E[i+1] := D;
ENDFOR;

All of those << should be <<<.
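To make the difference concrete, here is a small C sketch contrasting a 32-bit left shift with a 32-bit left rotate. The helper names shl32 and rotl32 are illustrative only and are not taken from any Intel document.

```c
#include <stdint.h>

/* Plain left shift: bits shifted past bit 31 are lost (n must be < 32). */
uint32_t shl32(uint32_t x, unsigned n)
{
    return x << n;
}

/* Left rotate: bits shifted past bit 31 re-enter at bit 0. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;                                   /* keep the count in 0..31 */
    return n ? (x << n) | (x >> (32 - n)) : x; /* avoid an undefined shift by 32 */
}
```

For x = 0x80000001 and n = 5, the shift gives 0x00000020 while the rotate gives 0x00000030, so the two notations are not interchangeable in the SHA-1 round description.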

Patrick_K_Intel · ‎02-20-2015

Thank you, I will correct these.

CFR · ‎06-15-2015

Could someone please check on the following? It looks like too big of an error to not have been already mentioned and fixed, but in the current version 3.3.8 of the Intrinsics Guide...

I believe there is an error in the description for _mm512_srlv_epi32 (and similar intrinsics). The "operation" says that the shift is based on count[i+4:i] but according to what I see in the instruction extension manual, and behavior of SDE (KNL and SKX in particular) and a test program on KNC, it looks like the shift is based on the whole value of the count field. It appears that a shift of >31 results in the bits being shifted off the end (result field is zero), not a shift by zero (result field unchanged). The instruction extension manual probably also needs an update as the EVEX description is not specific as to what happens if a shift is greater than 32 (it is specific for VEX.128, VEX.256, etc...)

I believe there are also errors in the operation sections for other sizes (_mm256_srlv_epi32) and other related instructions (_mm512_sllv_epi32).
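The two readings of the shift count can be compared with a scalar model of one 32-bit lane. This is a sketch of the behaviour reported above (counts above 31 produce zero) next to the behaviour the "operation" text implied (only the low five bits of the count are used). Both function names are hypothetical.

```c
#include <stdint.h>

/* Behaviour reported in the post: the whole count is used, and any
   count above 31 shifts every bit out. */
uint32_t srlv_lane_full_count(uint32_t a, uint32_t count)
{
    return count > 31 ? 0u : a >> count;
}

/* Behaviour implied by "count[i+4:i]": only the low 5 bits matter,
   so a count of 32 acts like a shift by 0. */
uint32_t srlv_lane_low5_count(uint32_t a, uint32_t count)
{
    return a >> (count & 31u);
}
```

The models agree for counts 0 through 31 and diverge from 32 upward: with a = 0xF0 and count = 32 the first returns 0 and the second returns 0xF0.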

Patrick_K_Intel · ‎06-16-2015

Thanks for reporting this, I have corrected this information.

Suf_I_ · ‎06-19-2015

Hi,

Wouldn't it be clearer in the Intrinsics Guide documentation if a "const" were added for the immediate value of the shuffle functions (SSE)? For example:

__m128 _mm_shuffle_ps (__m128 a, __m128 b, const unsigned int imm8)

instead of

__m128 _mm_shuffle_ps (__m128 a, __m128 b, unsigned int imm8)

Thank you

andysem · ‎06-19-2015

It wouldn't change anything from the user's perspective. Parameter constness has no effect on the caller. Even if this is just a documentation change, it doesn't actually communicate the fact that an immediate constant (or, more generally, a constant expression) is required.
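A short C sketch of that point, using an ordinary function (the name scale is made up for illustration): a top-level const on a by-value parameter does not change the function's type, and it does not stop the caller from passing a runtime value.

```c
/* Declaration without const on the parameter. */
int scale(int x, unsigned int factor);

/* Definition with const: a compatible declaration of the same function.
   The qualifier only prevents the body from modifying its local copy. */
int scale(int x, const unsigned int factor)
{
    return x * (int)factor;
}
```

A caller can still write scale(2, n) with n read at run time, so const alone cannot express "must be an immediate".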

andysem · ‎06-24-2015

There are a few bugs in the description of the ADX intrinsics.

1. _addcarry_* intrinsics generate the classic adc instruction, not adcx. As such, they affect both CF and OF and use CF for carry. The adc instruction was available long before the ADX extension (no cpuid feature required).

2. _addcarryx_* can generate either adcx or adox, at the compiler's choice. As such, it uses either CF or OF for carry. These intrinsics require the ADX cpuid feature.

3. _subborrow_* indeed generates sbb, which is the counterpart of adc. This instruction is also available in classic IA-32 and does not require the ADX cpuid flag.

4. The SDM defines _addcarry_* and _subborrow_* intrinsics for different integer sizes, from 8 to 64 bit, while the Intrinsics Guide only describes the 32 and 64-bit ones.
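For readers less familiar with the add-with-carry intrinsics discussed above, here is a scalar C model of what a single 32-bit step computes. The parameter order (carry in, a, b, pointer to result) mirrors the documented _addcarry_u32 signature, but the function name addcarry_u32_model is hypothetical, and the model says nothing about which flag (CF or OF) carries the bit, which is the actual subject of points 1 and 2.

```c
#include <stdint.h>

/* Scalar model of one add-with-carry step: *out = a + b + carry_in,
   return value = carry out of bit 31. */
unsigned char addcarry_u32_model(unsigned char c_in, uint32_t a, uint32_t b,
                                 uint32_t *out)
{
    uint64_t sum = (uint64_t)a + b + (c_in ? 1u : 0u);
    *out = (uint32_t)sum;               /* low 32 bits */
    return (unsigned char)(sum >> 32);  /* 0 or 1 */
}
```

Chaining two such steps, feeding the first return value into the second call, adds two 64-bit numbers held as 32-bit halves.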

Patrick_K_Intel · ‎06-24-2015

Thanks, I will make those corrections.

On point 4, I believe the SDM is in error here; the compiler only defines 32 and 64-bit versions. I will confirm this internally.

andysem · ‎06-24-2015

> On point 4, I believe the SDM is in error here; the compiler only defines 32 and 64-bit versions. I will confirm this internally.

Since the instruction does support 8 and 16-bit operands, it might be a good idea to add those intrinsics to the compiler if not already there.

CFR · ‎07-09-2015

Is _mm256_s[rl]li_si256 misnamed? In the other versions, _epi16, _epi32, and _epi64 all seem to indicate the "pocket". It seems to me that they should be _mm256_s[rl]li_si128 (or maybe epi128?) since the shift "pocket" is 128 bits and not 256. This would make them consistent with the convention and make the intrinsic name reflect the actual operation. (I also note that _mm256_bs[rl]li_epi128 seems to be the same underlying instruction and it is _epi128.)

Vladimir_Sedach · ‎07-09-2015

_mm256_bslli_epi128 was added as a synonym of _mm256_slli_si256 because of the misleading name of the latter.
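The reason the si256 name misleads can be seen in a scalar model of the byte shift: each 128-bit half is shifted on its own, and no byte crosses from the low half into the high half. The sketch below treats the 256-bit value as 32 bytes, lowest byte first; the name bslli_epi128_model is hypothetical.

```c
#include <stdint.h>

/* Scalar model of a left byte shift applied independently to each
   128-bit lane of a 256-bit value. Shift counts above 15 clear the lane. */
void bslli_epi128_model(const uint8_t src[32], unsigned imm8, uint8_t dst[32])
{
    unsigned shift = imm8 > 15 ? 16 : imm8;
    for (unsigned lane = 0; lane < 2; ++lane) {
        for (unsigned i = 0; i < 16; ++i) {
            /* bytes shifted in at the bottom of each lane are zero */
            dst[lane * 16 + i] = (i >= shift) ? src[lane * 16 + i - shift] : 0;
        }
    }
}
```

With a shift of one byte, dst[16] is zero rather than src[15], which is what a true 256-bit shift would have produced.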

CFR · ‎07-13-2015

Missing mask ops (kand, kmov, knot, etc...) for other than __mmask16?

I think (can never really be sure I've checked every last corner of the system ;^)) that the intrinsic guide (and Parallel Studio XE2016 beta) are missing any mask operations for sizes other than __mmask16. I did find 8, 32, and 64 bit versions described in the instruction extensions manual (https://software.intel.com/en-us/intel-architecture-instruction-set-extensions-programming-reference, chapter 6) so I believe they really are supposed to exist.

Without these, doing conditional stuff on anything other than 32 bit "pockets" is really tough.

Patrick_K_Intel · ‎07-17-2015

There are 8, 32, and 64-bit instructions for mask operations, but the compiler only defines intrinsics for 16-bit.

CFR · ‎07-20-2015

The word from the compiler folks is that they're not going to provide intrinsics for 8/32/64. If you just use the __mmask8/32/64 types then the compiler will try to implement things as actual k-instructions. I've played around with it a bit and it mostly works. In many cases the compiler will keep everything in k-registers. In other cases it starts moving things back and forth (unnecessarily) between k and normal registers.
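The approach described above amounts to using ordinary C operators on the mask values. A minimal sketch, assuming the mask types behave like plain unsigned integers of the matching width; the typedefs mask8 and mask64 and the helper names are stand-ins made up here so the example compiles without AVX-512 headers.

```c
#include <stdint.h>

typedef uint8_t  mask8;    /* stand-in for an 8-bit mask type */
typedef uint64_t mask64;   /* stand-in for a 64-bit mask type */

/* NOT of an 8-bit mask (what a knot-style operation would compute). */
mask8 mask8_not(mask8 m)
{
    return (mask8)~m;
}

/* AND of two 64-bit masks (kand-style). */
mask64 mask64_and(mask64 a, mask64 b)
{
    return a & b;
}

/* (NOT a) AND b on 64-bit masks (kandn-style). */
mask64 mask64_andn(mask64 a, mask64 b)
{
    return ~a & b;
}
```

Whether a compiler turns such expressions into k-register instructions or moves the values through general-purpose registers is up to its code generation, as the post notes.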

Bugs in Intrinsics Guide