I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using the Linux version):
1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.
2. __m256 _mm256_undefined_si256 () should return __m256i.
3. In some instruction descriptions, like _mm_adds_epi8, the operation is described in terms of SignedSaturate while others, e.g. _mm256_adds_epi16, are described with SaturateToSignedWord. This applies to other operations with unsigned saturation as well. Also, the vector elements are described differently. A more consistent description would be nice.
4. _mm_alignr_epi8 has two descriptions.
5. I'm not sure the _mm_ceil_pd signature and description are correct. It says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?
I didn't read all instructions so there may be more issues. I'll post if I find anything else.
PS: This is not a bug per se but some instructions are missing the Latency & Throughput information. This mostly relates to newer instructions but still this info is useful and I hope it will be added.
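For reference on item 3, here is a minimal scalar sketch of what signed saturation means for one lane, whichever name the guide uses for it. The helper names below are made up for illustration and are not the guide's pseudocode functions.

```c
#include <stdint.h>

/* Clamp a wider intermediate result to the signed 8-bit range.
   Hypothetical helper name, chosen for this sketch only. */
int8_t saturate_to_int8(int32_t x)
{
    if (x > INT8_MAX) return INT8_MAX;
    if (x < INT8_MIN) return INT8_MIN;
    return (int8_t)x;
}

/* Same idea for the signed 16-bit range. */
int16_t saturate_to_int16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* One lane of a signed saturating add on 8-bit elements. */
int8_t adds_i8(int8_t a, int8_t b)
{
    return saturate_to_int8((int32_t)a + (int32_t)b);
}

/* One lane of a signed saturating add on 16-bit elements. */
int16_t adds_i16(int16_t a, int16_t b)
{
    return saturate_to_int16((int32_t)a + (int32_t)b);
}
```

Both widths follow the same pattern (widen, add, clamp), which is why a single naming scheme in the descriptions would be easier to read.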
Is _mm256_s[rl]li_si256 misnamed? In the other versions, _epi16, _epi32 and _epi64 all seem to indicate the "pocket". It seems to me that they should be _mm256_s[rl]li_si128 (or maybe epi128?) since the shift "pocket" is 128 bits and not 256. This would make them consistent with the convention and make the intrinsic name reflect the actual operation. (I also note that _mm256_bs[rl]li_epi128 seems to be the same underlying instruction, and it is _epi128.)
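To make the 128-bit "pocket" concrete, here is a rough scalar model of the left byte shift, assuming the behaviour is as described above (each 16-byte half shifted on its own, zeros shifted in). The function name is hypothetical and the 256-bit value is modeled as a plain 32-byte array.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a per-lane left byte shift on a 256-bit value.
   src and dst must not overlap. */
void lane_shift_left_bytes(const uint8_t src[32], uint8_t dst[32], unsigned count)
{
    if (count > 16) count = 16;          /* a count above 15 clears the lane */
    memset(dst, 0, 32);
    for (int lane = 0; lane < 2; ++lane) {
        const uint8_t *s = src + 16 * lane;
        uint8_t *d = dst + 16 * lane;
        /* Byte i of the lane moves to byte i + count; bytes pushed past
           position 15 are dropped and never reach the other lane. */
        for (unsigned i = 0; i + count < 16; ++i)
            d[i + count] = s[i];
    }
}
```

With a count of 4, the top four bytes of the low lane are discarded rather than carried into the high lane, which is the reason the _si256 suffix looks misleading.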
Missing mask ops (kand, kmov, knot, etc...) for other than __mmask16?
I think (can never really be sure I've checked every last corner of the system ;^)) that the intrinsic guide (and Parallel Studio XE2016 beta) are missing any mask operations for sizes other than __mmask16. I did find 8, 32, and 64 bit versions described in the instruction extensions manual (https://software.intel.com/en-us/intel-architecture-instruction-set-extensions-programming-reference, chapter 6) so I believe they really are supposed to exist.
Without these, doing conditional stuff on anything other than 32 bit "pockets" is really tough.
The word from the compiler folks is that they're not going to provide intrinsics for 8/32/64. If you just use the __mmask8/32/64 types then the compiler will try to implement things as actual k-instructions. I've played around with it a bit and it mostly works. In many cases the compiler will keep everything in k-registers. In other cases it starts moving things back and forth (unnecessarily) between k and normal registers.
Not all k-instructions have C/C++ equivalents, like kandn. You could try to emulate it with ~ and & operators and hope the compiler is smart enough to recognize the pattern, but I would prefer to have an intrinsic for that.
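A minimal sketch of that emulation, assuming the mask types are plain unsigned integers of the matching width. The typedefs and function names below are hypothetical stand-ins so the snippet builds without any AVX-512 headers; whether a given compiler turns this pattern into a single kandn is not guaranteed.

```c
#include <stdint.h>

/* Hypothetical stand-ins for __mmask8/16/32/64. */
typedef uint8_t  mask8;
typedef uint16_t mask16;
typedef uint32_t mask32;
typedef uint64_t mask64;

/* KANDN computes (NOT a) AND b. For the narrow types the operands are
   promoted to int before ~ is applied, so the result is cast back down. */
mask8  kandn8(mask8 a, mask8 b)    { return (mask8)(~a & b); }
mask16 kandn16(mask16 a, mask16 b) { return (mask16)(~a & b); }
mask32 kandn32(mask32 a, mask32 b) { return ~a & b; }
mask64 kandn64(mask64 a, mask64 b) { return ~a & b; }
```

The explicit casts in the 8- and 16-bit versions are the easy part to get wrong by hand, which is one more argument for having real intrinsics.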
8/32/64-bit k-registers are essential for AVX-512BW/DQ, they should be properly supported in compilers, IMHO.
The _mm_fmaddsub_ps and _mm_fmsubadd_ps functions have the same content in the operation section yet different descriptions. Is that what it is supposed to be?
The intrinsics guide description of _mm_testc_si128 says it NOTs the 2nd operand, but it's wrong. Assuming the intrinsic is supposed to take its args in the same order as the PTEST instruction, then the docs are wrong and the actual behaviour is correct.
This was discovered during discussion at http://stackoverflow.com/questions/32072169/could-i-compare-to-zero-register-in-avx-correctly/320730...
The Intel insn ref manual says:
(V)PTEST (128-bit version)
IF (SRC[127:0] BITWISE AND DEST[127:0] = 0) THEN ZF ← 1; ELSE ZF ← 0;
IF (SRC[127:0] BITWISE AND NOT DEST[127:0] = 0) THEN CF ← 1; ELSE CF ← 0;
Where DEST is the first argument. It is correct.
The intrinsics guide description for int _mm_testc_si128 (__m128i a, __m128i b) says:
IF (a[127:0] AND NOT b[127:0] == 0)
    CF := 1
ELSE
    CF := 0
This is the reverse of the instruction generated from the intrinsic. I guess the doc writers got mixed up by the ref manual putting NOT DEST second. The other ptest intrinsics (like _mm_test_mix_ones_zeros) have the same bug in their descriptions.
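Here is a small scalar model of the flag computation as the instruction reference describes it, with a being the first (DEST) operand. The u128 struct and the function names are hypothetical, chosen only so the sketch compiles without SSE4.1 support.

```c
#include <stdint.h>

/* A 128-bit value modeled as two 64-bit halves (hypothetical type). */
typedef struct { uint64_t lo, hi; } u128;

/* ZF: 1 when (a AND b) is all zeros. Models what a testz-style
   intrinsic returns. */
int ptest_zf(u128 a, u128 b)
{
    return ((a.lo & b.lo) | (a.hi & b.hi)) == 0;
}

/* CF: 1 when ((NOT a) AND b) is all zeros, i.e. the NOT applies to the
   first operand. Models what a testc-style intrinsic returns. */
int ptest_cf(u128 a, u128 b)
{
    return ((~a.lo & b.lo) | (~a.hi & b.hi)) == 0;
}

/* 1 when both ZF and CF are clear. Models the testnzc /
   test_mix_ones_zeros case. */
int ptest_nzc(u128 a, u128 b)
{
    return !ptest_zf(a, b) && !ptest_cf(a, b);
}
```

Swapping the arguments of ptest_cf changes the result whenever one operand is a strict subset of the other's bits, so the operand order in the description does matter.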
In the description of the _mm_multishift_epi64_epi8 intrinsic there is this line:
:= b[q+((ctrl+k) & 63)]
However, k is not mentioned anywhere else. I believe it should be l.
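Assuming the inner loop index is indeed l, one 64-bit element of the operation can be modeled as below. The function names are hypothetical; ctrl stands for the 64-bit element holding the eight control bytes and data for the corresponding source element.

```c
#include <stdint.h>

/* Rotate a 64-bit value right by n bits (n taken modulo 64). */
uint64_t rotr64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x >> n) | (x << (64 - n)) : x;
}

/* Bit-by-bit model following the pseudocode: bit l of output byte j
   is bit ((start + l) & 63) of data, where start is the low 6 bits
   of control byte j. */
uint64_t multishift_qword(uint64_t ctrl, uint64_t data)
{
    uint64_t result = 0;
    for (int j = 0; j < 8; ++j) {
        unsigned start = (unsigned)(ctrl >> (8 * j)) & 63;
        uint64_t byte = 0;
        for (unsigned l = 0; l < 8; ++l)
            byte |= ((data >> ((start + l) & 63)) & 1u) << l;
        result |= byte << (8 * j);
    }
    return result;
}

/* Equivalent formulation: each output byte is the low byte of data
   rotated right by the control value. */
uint64_t multishift_qword_rot(uint64_t ctrl, uint64_t data)
{
    uint64_t result = 0;
    for (int j = 0; j < 8; ++j) {
        unsigned start = (unsigned)(ctrl >> (8 * j)) & 63;
        result |= (rotr64(data, start) & 0xFF) << (8 * j);
    }
    return result;
}
```

The two functions agree only if the index inside the brackets is the same variable the inner loop iterates over, which supports reading k as a typo for l.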
The _mm_test_all_ones intrinsic description is not accurate. It says it performs a complement (i.e. XOR) of the xmm and 0xFFFFFFFF (which is a 32-bit constant, presumably). The actual behavior is different and amounts to returning 1 if the provided xmm contains 1 in all bits and 0 otherwise.
I think the following (minor) errors exist in the guide (3.3.11):
1) _mm512_sad_epu8: says it produces "four unsigned 16-bit integers". I think that should say "eight unsigned 16-bit integers".
2) _mm256_mpsadbw_epu8: "i := imm8*32" should be "a_offset := imm8*32"
Let me say up front that I can't imagine doing low level code without the Intrinsic Guide, but there are a few enhancements I'd like to see (not in any particular order):
1) more complete information on latency and throughput
2) some sort of separate file containing the intrinsic, instruction family (AVX, SSE4.1, AVX-512VBMI, etc...) and the latency and throughput. If I had such information I could more easily take my source code and annotate it with the information for each intrinsic I use; thus helping me keep track of architecture dependencies and performance.
3) a version of the guide that I could download and use locally when I don't have good network connection.
4) some sort of relational search or maybe regex search (i.e. __m512i and add, ^__m512i.*_add_). Mostly I just want to be able to narrow down the results. Examples are when I only care about a particular size operation (__m256 adds) or searching on information that exists in the intrinsic name or in the description text (__m512i permutes that work across lanes using 32-bit integers).
Just some thoughts.
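As a rough idea of what the regex search in item 4 could do, here is a sketch that filters a list of intrinsic signatures with a POSIX extended regular expression. It assumes a POSIX system (regex.h); the function name is hypothetical and the signature strings are just sample input, not an export from the guide.

```c
#include <regex.h>
#include <stddef.h>

/* Count how many signature strings match a POSIX extended regex.
   Returns -1 if the pattern does not compile. */
int count_matches(const char *pattern, const char *const *signatures, size_t n)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int hits = 0;
    for (size_t i = 0; i < n; ++i) {
        /* regexec returns 0 on a match */
        if (regexec(&re, signatures[i], 0, NULL, 0) == 0)
            ++hits;
    }
    regfree(&re);
    return hits;
}
```

Given sample lines such as "__m512i _mm512_add_epi32 (__m512i a, __m512i b)" and "__m512 _mm512_add_ps (__m512 a, __m512 b)", a pattern anchored on the return type like ^__m512i.*_add_ would keep only the first one, while a plain _add_ search would keep both.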
Thanks for reporting these issues. I've resolved several of them, and the new update should be posted shortly.
I'll also get started on a larger update to add some additional features that have been requested, both publicly and internally.
Patrick, the new description of _mm_test_all_ones is still incorrect. The pseudocode contains:
IF (a[127:0] AND NOT 0xFFFFFFFF == 0)
First, instead of 0xFFFFFFFF, which is a 32-bit constant, tmp should be used. Second, this condition will always return true regardless of the value of a. The correct condition should be:
IF (tmp[127:0] AND NOT a[127:0] == 0)
Descriptions for _mm_test_all_zeros, _mm_test_mix_ones_zeros, and all testc intrinsics are still bad (they are not technically wrong, but they are badly misleading.)
As far as I can tell, they all assume that "a AND NOT b" means "(~a) & b", because that's the way e.g. _mm_andnot_si128 works. But that's not a natural reading of the phrase and that's not how 99% of people would understand it. Probably best to spell this out explicitly in each case.
This also relates to andysem's comment above. "a AND NOT 0xFFFFFFFF" always evaluates to zero under natural interpretation of "AND NOT". Instead _mm_test_all_ones computes "IF ((NOT a) AND 0xFFFFFFFF == 0)" (or, in other words, simply IF ((NOT a) == 0) ).
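The two readings of "a AND NOT b" can be compared directly. The sketch below works on a single 64-bit value for brevity (the 128-bit case would apply the same test to both halves), and the function names are made up for illustration.

```c
#include <stdint.h>

/* Natural reading of "a AND NOT b": a & (~b). */
uint64_t and_not_natural(uint64_t a, uint64_t b)
{
    return a & ~b;
}

/* Reading that matches how _mm_andnot_si128 is described: (~a) & b. */
uint64_t and_not_first_negated(uint64_t a, uint64_t b)
{
    return ~a & b;
}

/* All-ones test built on the second reading: (~a) & all_ones == 0,
   which reduces to (~a) == 0. */
int all_ones(uint64_t a)
{
    return and_not_first_negated(a, UINT64_MAX) == 0;
}
```

Under the natural reading, and_not_natural(a, UINT64_MAX) is zero for every a, so a test written that way could never distinguish inputs; only the second reading gives a meaningful all-ones check.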