New Contributor III

Bugs in Intrinsics Guide

Hi,

I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using the Linux version):

1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.

2. _mm256_undefined_si256 () is listed as returning __m256; it should return __m256i.

3. In some instruction descriptions, like _mm_adds_epi8, the operation is described in terms of SignedSaturate, while others, e.g. _mm256_adds_epi16, are described with SaturateToSignedWord. This applies to operations with unsigned saturation as well. Also, the vector elements are described differently. A more consistent description would be nice.

4. _mm_alignr_epi8 has two descriptions.

5. I'm not sure the _mm_ceil_pd signature and description are correct. The guide says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?

I didn't read all instructions so there may be more issues. I'll post if I find anything else.

PS: This is not a bug per se but some instructions are missing the Latency & Throughput information. This mostly relates to newer instructions but still this info is useful and I hope it will be added.

0 Kudos
216 Replies
Beginner

Is _mm256_s[rl]li_si256 misnamed? In the other versions, _epi16, _epi32, and _epi64 all seem to indicate the "pocket". It seems to me that they should be _mm256_s[rl]li_si128 (or maybe _epi128?) since the shift "pocket" is 128 bits and not 256. This would make them consistent with the convention and make the intrinsic name reflect the actual operation. (I also note that _mm256_bs[rl]li_epi128 seems to be the same underlying instruction, and it uses _epi128.)

0 Kudos
New Contributor I

_mm256_bslli_epi128 was added as a synonym of _mm256_slli_si256 because of the misleading name of the latter.
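
To make the naming point concrete, here is a minimal scalar sketch in portable C of how the 256-bit byte shift behaves: each 128-bit lane is shifted on its own, which is why the _epi128 suffix describes it better. The function name model_bslli_epi128 and the byte-array representation are illustrative assumptions, not part of any Intel header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of the 256-bit byte shift left
   (_mm256_slli_si256 / _mm256_bslli_epi128). Each 128-bit lane is shifted
   independently, so no byte moves from the low lane into the high lane. */
void model_bslli_epi128(const uint8_t src[32], uint8_t dst[32], unsigned count)
{
    assert(src != dst);              /* the model does not support in-place use */
    if (count > 16)
        count = 16;                  /* counts above 15 clear the whole lane */
    memset(dst, 0, 32);
    for (int lane = 0; lane < 2; lane++)
        for (unsigned i = count; i < 16; i++)
            dst[lane * 16 + i] = src[lane * 16 + i - count];
}
```

With a count of 4, byte 16 of the result is zero under this model. A shift across the full 256 bits would instead have moved source byte 12 into that position.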

0 Kudos
Beginner

Missing mask ops (kand, kmov, knot, etc...) for other than __mmask16?

I think (I can never really be sure I've checked every last corner of the system ;^)) that the Intrinsics Guide (and Parallel Studio XE 2016 beta) are missing any mask operations for sizes other than __mmask16. I did find 8-, 32-, and 64-bit versions described in the instruction set extensions manual (https://software.intel.com/en-us/intel-architecture-instruction-set-extensions-programming-reference, chapter 6), so I believe they really are supposed to exist.

Without these, doing conditional stuff on anything other than 32-bit "pockets" is really tough.

0 Kudos

There are 8, 32, and 64-bit instructions for mask operations, but the compiler only defines intrinsics for 16-bit.

0 Kudos
Beginner

The word from the compiler folks is that they're not going to provide intrinsics for 8/32/64. If you just use the __mmask8/32/64 types, the compiler will try to implement things as actual k-instructions. I've played around with it a bit and it mostly works. In many cases the compiler keeps everything in k-registers. In other cases it starts moving things back and forth (unnecessarily) between k-registers and general-purpose registers.

0 Kudos
New Contributor III

Not all k-instructions have C/C++ equivalents, like kandn. You could try to emulate it with ~ and & operators and hope the compiler is smart enough to recognize the pattern, but I would prefer to have an intrinsic for that.
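
A minimal sketch of that emulation in plain C, assuming ordinary unsigned integers stand in for the mask types. The names mask8, mask32, mask64 and the kandn* helpers are made up for illustration; whether a given compiler turns these expressions into KANDN instructions is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Plain typedefs standing in for the compiler's __mmask8/32/64 (assumption). */
typedef uint8_t  mask8;
typedef uint32_t mask32;
typedef uint64_t mask64;

/* KANDN computes (NOT a) AND b. */
mask8 kandn8(mask8 a, mask8 b)
{
    /* ~a is computed as an int after integer promotion; ANDing with b
       clears the upper bits again, and the cast makes the narrowing explicit. */
    return (mask8)(~a & b);
}

mask32 kandn32(mask32 a, mask32 b)
{
    return ~a & b;
}

mask64 kandn64(mask64 a, mask64 b)
{
    return ~a & b;
}
```

The 8-bit case is the one to watch: a bare ~a on an 8-bit mask is an int with the high bits set, so comparing it against another mask without the AND or a cast gives surprising results.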

8/32/64-bit k-registers are essential for AVX-512BW/DQ; they should be properly supported in compilers, IMHO.

0 Kudos
Employee

The _mm_fmaddsub_ps and _mm_fmsubadd_ps functions have the same content in the operation section yet different descriptions. Is that intended?

Thanks.

0 Kudos

Arthur,

You're correct; it looks like the fmsubadd operations were incorrect. I have corrected them, and the update should appear shortly.
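
For readers comparing the two, a minimal scalar sketch of the intended difference, assuming four single-precision elements held in plain arrays. The model_* names are hypothetical, and the model uses a separate multiply and add, so it may differ from a real fused operation in the last bit.

```c
#include <assert.h>
#include <stddef.h>

/* fmaddsub: even-indexed elements compute a*b - c, odd-indexed a*b + c. */
void model_fmaddsub_ps(const float a[4], const float b[4],
                       const float c[4], float dst[4])
{
    for (size_t j = 0; j < 4; j++)
        dst[j] = (j % 2 == 0) ? a[j] * b[j] - c[j]
                              : a[j] * b[j] + c[j];
}

/* fmsubadd: the mirror image, even-indexed a*b + c, odd-indexed a*b - c. */
void model_fmsubadd_ps(const float a[4], const float b[4],
                       const float c[4], float dst[4])
{
    for (size_t j = 0; j < 4; j++)
        dst[j] = (j % 2 == 0) ? a[j] * b[j] + c[j]
                              : a[j] * b[j] - c[j];
}
```

The two operation sections should therefore differ only in which parity of element index gets the subtraction.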

0 Kudos
Beginner

When are you planning to add instruction latencies for Broadwell? Thanks!

0 Kudos
Beginner

The Intrinsics Guide description of _mm_testc_si128 says it NOTs the 2nd operand, but that is wrong. Assuming the intrinsic is supposed to take its args in the same order as the PTEST instruction, the docs are wrong and the actual behaviour is correct.

This was discovered during discussion at http://stackoverflow.com/questions/32072169/could-i-compare-to-zero-register-in-avx-correctly/320730...

The Intel instruction reference manual says:

(V)PTEST (128-bit version)

	IF (SRC[127:0] BITWISE AND DEST[127:0] = 0)
	    THEN ZF ← 1;
	    ELSE ZF ← 0;
	IF (SRC[127:0] BITWISE AND NOT DEST[127:0] = 0)
	    THEN CF ← 1;
	    ELSE CF ← 0;

where DEST is the first argument. The manual is correct.

The intrinsics guide description for int _mm_testc_si128 (__m128i a, __m128i b) says:

    IF (a[127:0] AND NOT b[127:0] == 0)
        CF := 1
    ELSE
        CF := 0
    FI

This is the reverse of the instruction generated from the intrinsic. I guess the doc writers got mixed up by the ref manual putting NOT DEST second. The other ptest intrinsics (like _mm_test_mix_ones_zeros) have the same bug in their descriptions.
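
A minimal scalar sketch of the flag logic as the reference manual states it, with the first intrinsic argument playing the role of DEST. The u128 struct and the model_* names are illustrative assumptions; the real intrinsics take __m128i.

```c
#include <assert.h>
#include <stdint.h>

/* A 128-bit value as two 64-bit halves; a stand-in for __m128i. */
typedef struct { uint64_t lo, hi; } u128;

/* ZF: set when (a AND b) is all zeros. */
int model_testz_si128(u128 a, u128 b)
{
    return ((a.lo & b.lo) | (a.hi & b.hi)) == 0;
}

/* CF: set when ((NOT a) AND b) is all zeros.
   The FIRST operand is the one that gets complemented. */
int model_testc_si128(u128 a, u128 b)
{
    return ((~a.lo & b.lo) | (~a.hi & b.hi)) == 0;
}

/* Returns 1 only when both ZF and CF are clear. */
int model_test_mix_ones_zeros(u128 a, u128 b)
{
    return !model_testz_si128(a, b) && !model_testc_si128(a, b);
}
```

Swapping the arguments changes the answer: with a = 0xFF and b = 0x0F in the low half, model_testc_si128(a, b) is 1 because every bit set in b is also set in a, while model_testc_si128(b, a) is 0.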

0 Kudos
New Contributor III

In the description of the _mm_multishift_epi64_epi8 intrinsic there is this line:

tmp8 := b[q+((ctrl+k) & 63)]

However, k is not mentioned anywhere else. I believe it should be l.
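
A minimal scalar sketch of one 64-bit element of the operation, written with l as the inner bit index to match the suggested correction. The function name model_multishift_qword is hypothetical, and the sketch ignores masking and the loop over elements.

```c
#include <assert.h>
#include <stdint.h>

/* One 64-bit element of the multishift operation: result byte i is the
   8-bit field of `data` that starts at bit (ctrl_i & 63) and wraps around
   past bit 63, where ctrl_i is byte i of `ctrl`. */
uint64_t model_multishift_qword(uint64_t ctrl, uint64_t data)
{
    uint64_t dst = 0;
    for (int i = 0; i < 8; i++) {
        unsigned start = (unsigned)(ctrl >> (8 * i)) & 63u;
        uint64_t byte = 0;
        for (unsigned l = 0; l < 8; l++) {
            /* tmp8[l] := b[q + ((ctrl + l) & 63)] */
            uint64_t bit = (data >> ((start + l) & 63u)) & 1u;
            byte |= bit << l;
        }
        dst |= byte << (8 * i);
    }
    return dst;
}
```

Control bytes 0, 8, 16, ..., 56 reproduce the data element unchanged, and a control byte of 60 picks up bits 60..63 followed by bits 0..3, which shows the wraparound.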

0 Kudos
New Contributor III

The _mm_test_all_ones intrinsic description is not accurate. It says it performs a complement (i.e. XOR) of the xmm and 0xFFFFFFFF (which is a 32-bit constant, presumably). The actual behavior is different and amounts to returning 1 if the provided xmm contains 1 in all bits and 0 otherwise.

0 Kudos
Beginner

I think the following (minor) errors exist in the guide (3.3.11):

1) _mm512_sad_epu8: says it produces "four unsigned 16-bit integers".  I think that should say "eight unsigned 16-bit integers".

2) _mm256_mpsadbw_epu8: i := imm8[2]*32 should be a_offset:=imm8[2]*32
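
On the first point, a minimal scalar sketch of why the count is eight: a 512-bit input holds eight groups of eight bytes, and each group yields one sum. The function name model_sad_epu8_512 and the array representation are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over each group of 8 unsigned bytes.
   64 input bytes give 8 sums, one per 64-bit element of the result. */
void model_sad_epu8_512(const uint8_t a[64], const uint8_t b[64],
                        uint64_t dst[8])
{
    for (int j = 0; j < 8; j++) {
        uint64_t sum = 0;
        for (int i = 0; i < 8; i++)
            sum += (uint64_t)abs((int)a[8 * j + i] - (int)b[8 * j + i]);
        dst[j] = sum;   /* at most 8 * 255 = 2040, so it fits in 16 bits */
    }
}
```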

0 Kudos
Beginner

Let me say up front that I can't imagine doing low-level code without the Intrinsics Guide, but there are a few enhancements I'd like to see (not in any particular order):

1) more complete information on latency and throughput

2) some sort of separate file containing the intrinsic, instruction family (AVX, SSE4.1, AVX-512VBMI, etc...) and the latency and throughput.  If I had such information I could more easily take my source code and annotate it with the information for each intrinsic I use; thus helping me keep track of architecture dependencies and performance.

3) a version of the guide that I could download and use locally when I don't have good network connection.

4) some sort of relational search or maybe regex search (e.g. __m512i and add, ^__m512i[.]*_add_). Mostly I just want to be able to narrow down the results. Examples are when I only care about a particular size of operation (__m256 adds) or want to search on information that exists in the intrinsic name plus text in the description (__m512i permutes that work across lanes using 32-bit integers).
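
As a rough illustration of the regex idea in item 4, here is a minimal sketch that filters a list of signatures with POSIX regex.h (so it assumes a POSIX system). The count_matches helper and the sample_signatures array are hypothetical; the guide itself offers no such interface. Note that in POSIX extended syntax [.]* matches only literal dots, so the sketch uses " .*" to span the text between the return type and _add_.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* A few signatures as they appear in the guide, used as sample input. */
const char *const sample_signatures[] = {
    "__m512i _mm512_add_epi32 (__m512i a, __m512i b)",
    "__m512i _mm512_add_epi64 (__m512i a, __m512i b)",
    "__m256i _mm256_add_epi32 (__m256i a, __m256i b)",
    "__m512i _mm512_permutexvar_epi32 (__m512i idx, __m512i a)",
};

/* Counts the entries that match the POSIX extended regex `pattern`;
   returns -1 when the pattern does not compile. */
int count_matches(const char *pattern, const char *const *signatures, size_t n)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int hits = 0;
    for (size_t i = 0; i < n; i++)
        if (regexec(&re, signatures[i], 0, NULL, 0) == 0)
            hits++;
    regfree(&re);
    return hits;
}
```

Calling count_matches("^__m512i .*_add_", sample_signatures, 4) narrows the list to the two 512-bit adds.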

Just some thoughts.

0 Kudos

Thanks for reporting these issues. I've resolved several of them, and the new update should be posted shortly.

I'll also get started on a larger update to add some additional features that have been requested, both publicly and internally.

0 Kudos
New Contributor III

Patrick, the new description of _mm_test_all_ones is still incorrect. The pseudocode contains:

IF (a[127:0] AND NOT 0xFFFFFFFF == 0)

First, instead of 0xFFFFFFFF, which is a 32-bit constant, tmp should be used. Second, this condition will always return true regardless of the value of a. The correct condition should be:

IF (tmp[127:0] AND NOT a[127:0] == 0)

 

0 Kudos
Beginner

Hi, intrinsic functions for vmovsd are listed under AVX-512, although it is actually an AVX instruction. Best, André

0 Kudos

The description of _mm512_fmadd233_ps on KNC seems to be wrong. It does not match the description in the EAS.

0 Kudos
Employee

Descriptions for _mm_test_all_zeros, _mm_test_mix_ones_zeros, and all testc intrinsics are still bad (they are not technically wrong, but they are badly misleading.)

As far as I can tell, they all assume that "a AND NOT b"  means "(~a) & b", because that's the way e.g. _mm_andnot_si128 works. But that's not a natural reading of the phrase and that's not how 99% of people would understand it. Probably best to spell this out explicitly in each case.

This also relates to andysem's comment above. "a AND NOT 0xFFFFFFFF" always evaluates to zero under natural interpretation of "AND NOT". Instead _mm_test_all_ones computes "IF ((NOT a) AND 0xFFFFFFFF == 0)" (or, in other words, simply IF ((NOT a) == 0) ).
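
A minimal scalar sketch of the two readings, assuming a 128-bit value modelled as two 64-bit halves. The u128 type and all function names are hypothetical and exist only to show why the natural reading makes the all-ones test vacuous.

```c
#include <assert.h>
#include <stdint.h>

/* A 128-bit value as two 64-bit halves; a stand-in for __m128i. */
typedef struct { uint64_t lo, hi; } u128;

/* Natural reading of "a AND NOT b": a & (~b). */
int zero_natural_reading(u128 a, u128 b)
{
    return ((a.lo & ~b.lo) | (a.hi & ~b.hi)) == 0;
}

/* andnot-style reading, as in _mm_andnot_si128: (~a) & b. */
int zero_andnot_reading(u128 a, u128 b)
{
    return ((~a.lo & b.lo) | (~a.hi & b.hi)) == 0;
}

/* All-ones test: 1 only when every bit of a is set. */
int model_test_all_ones(u128 a)
{
    const u128 ones = { UINT64_MAX, UINT64_MAX };
    return zero_andnot_reading(a, ones);
}
```

Under the natural reading with an all-ones second operand, the result is 1 for every input, including zero. The andnot-style reading gives 1 only for the all-ones input, which matches the actual behaviour described above.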

0 Kudos

Thanks all for the feedback. I've submitted an update to resolve these issues.

0 Kudos