Bugs in Intrinsics Guide - Page 10

andysem · ‎01-30-2013

Hi,

I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using Linux version):

1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.

2. __m256 _mm256_undefined_si256 () should return __m256i.

3. In some instruction descriptions, like _mm_adds_epi8, the operation is described in terms of SignedSaturate, while others, e.g. _mm256_adds_epi16, are described with SaturateToSignedWord. This applies to other operations with unsigned saturation as well. The vector elements are also described differently. A more consistent description would be nice.

4. _mm_alignr_epi8 has two descriptions.

5. I'm not sure the _mm_ceil_pd signature and description are correct. It says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?
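
On item 3: the two spellings appear to name the same kind of clamp, applied at different element widths. A minimal scalar sketch in C of the signed-byte case, assuming the usual saturating-add semantics; `saturate_to_signed_byte` and `adds_epi8_ref` are hypothetical names for illustration, not identifiers from the guide:

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a wider intermediate to the signed 8-bit range [-128, 127]. */
static int8_t saturate_to_signed_byte(int32_t x) {
    if (x > INT8_MAX) return INT8_MAX;
    if (x < INT8_MIN) return INT8_MIN;
    return (int8_t)x;
}

/* Scalar model (hypothetical helper) of a lane-wise saturating add over n signed bytes. */
void adds_epi8_ref(int8_t *dst, const int8_t *a, const int8_t *b, int n) {
    assert(dst && a && b && n >= 0);
    for (int i = 0; i < n; i++) {
        /* Add in 32 bits so the sum cannot overflow before it is clamped. */
        dst[i] = saturate_to_signed_byte((int32_t)a[i] + (int32_t)b[i]);
    }
}
```

A word-sized variant would differ only in the clamp bounds (INT16_MIN and INT16_MAX), which is why one naming scheme across the descriptions would be easier to read.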

I didn't read all instructions so there may be more issues. I'll post if I find anything else.

PS: This is not a bug per se but some instructions are missing the Latency & Throughput information. This mostly relates to newer instructions but still this info is useful and I hope it will be added.

Stefan_M_Intel1 · ‎10-07-2019

The descriptions of _mm_set_epi8(), _mm256_set_epi8(), _mm512_set_epi8(), and _mm512_set_epi16() all say "reverse order", which is incorrect.

Stefan_M_Intel1 · ‎10-07-2019

Pseudo code of all nine *_dpbusd_epi32() still looks incorrect, since the 4* for operand b is missing.

tmp1 := a.byte[4*j] * b.byte[4*j]
tmp2 := a.byte[4*j+1] * b.byte[4*j+1]
tmp3 := a.byte[4*j+2] * b.byte[4*j+2]
tmp4 := a.byte[4*j+3] * b.byte[4*j+3]

Guobing_C_Intel · ‎10-23-2019

There seems to be a description error in all the VNNI intrinsics:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=2557,4351,2195,2198,2204&avx512techs=AVX512_VNNI

For example:

__m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)

The description is shown below. None of the b.byte[...] indices correspond to the matching a.byte index. For example, tmp1 should be a.byte[4*j] * b.byte[4*j], not a.byte[4*j] * b.byte[j].

FOR j := 0 to 3
	tmp1 := a.byte[4*j] * b.byte[j]
	tmp2 := a.byte[4*j+1] * b.byte[j+1]
	tmp3 := a.byte[4*j+2] * b.byte[j+2]
	tmp4 := a.byte[4*j+3] * b.byte[j+3]
	dst.dword[j] := src.dword[j] + tmp1 + tmp2 + tmp3 + tmp4
ENDFOR
dst[MAX:128] := 0

The same error happens for all other VNNI intrinsic descriptions.
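
For comparison, here is a scalar sketch in C of the indexing the two reports above ask for, with both operands addressed by 4*j+k. It assumes the 128-bit form, unsigned bytes in a, signed bytes in b, and a wrapping (non-saturating) accumulate; `dpbusd_epi32_ref` is a hypothetical name, not an intrinsic:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model (hypothetical helper) of one 128-bit dpbusd step:
 * each 32-bit lane j accumulates four byte products taken from bytes 4*j..4*j+3
 * of BOTH a (unsigned) and b (signed). */
void dpbusd_epi32_ref(int32_t dst[4], const int32_t src[4],
                      const uint8_t a[16], const int8_t b[16]) {
    assert(dst && src && a && b);
    for (int j = 0; j < 4; j++) {
        int32_t sum = 0;
        for (int k = 0; k < 4; k++) {
            /* Same index on both operands; each product fits in 16 bits. */
            sum += (int32_t)a[4 * j + k] * (int32_t)b[4 * j + k];
        }
        /* Add through unsigned arithmetic so the accumulate wraps instead of overflowing. */
        dst[j] = (int32_t)((uint32_t)src[j] + (uint32_t)sum);
    }
}
```

With the b index written as j+k instead of 4*j+k, lane 1 would read b bytes 1..4 rather than 4..7, which is the mismatch described in the posts.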

Dobratz__Glenn · ‎10-30-2019

This issue applies to: _mm_xor_epi32, _mm256_xor_epi32, _mm_xor_epi64, and _mm256_xor_epi64

The intrinsics guide shows these functions as generating vpord/vporq instead of vpxord/vpxorq.

The wrong instruction appears both on the right side of the function name in the summary view and in the internal detailed descriptions.

Dobratz__Glenn · ‎11-21-2019

The description of the _ktestc_maskXX instructions in the online Intrinsics Guide disagrees with the Intel Architecture Software Developer Manual.

The Intrinsics Guide says that the function returns true if the NAND of the operands is all ones.

The Architecture Manual states that the 'CF' flag is lit if the result of the NAND operation is all zeros.

Presumably the Architecture Manual is correct, since the NAND producing an all ones result is meaningless.
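
A scalar sketch in C of the carry-flag rule as the Architecture Manual describes it for KTESTW, assuming the operation is (NOT a) AND b, which the post above calls NAND. `ktestc_mask16_ref` and `ktestc_demo` are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when (NOT a) AND b has no bits set, modelling CF after KTESTW; 0 otherwise. */
int ktestc_mask16_ref(uint16_t a, uint16_t b) {
    uint16_t t = (uint16_t)(~a & b);
    return t == 0;
}

/* Usage sketch: the result is 1 exactly when every bit set in b is also set in a. */
void ktestc_demo(void) {
    assert(ktestc_mask16_ref(0xFFFF, 0x1234) == 1); /* NOT a is zero, so the AND is zero */
    assert(ktestc_mask16_ref(0x00F0, 0x0030) == 1); /* b is a subset of a */
    assert(ktestc_mask16_ref(0x0000, 0x0001) == 0); /* bit 0 of b survives the AND */
}
```

Under this reading an "all ones" test would only succeed for a == 0 and b == 0xFFFF, which supports the view that the all-zeros wording is the intended one.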

Wegner__Zach · ‎12-16-2019

The second-to-last line of the code for _mm512_set_epi8 is this: dst[511:503] := e63 The indices should be 511:504, not 511:503.
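
The arithmetic behind that correction, as a small C sketch: element k of width w occupies bits [w*k + w - 1 : w*k]. `element_bit_range` is a hypothetical helper, not part of any Intel header:

```c
#include <assert.h>

/* Computes the inclusive bit range [hi:lo] of element `index` when elements are `width` bits wide. */
void element_bit_range(unsigned index, unsigned width, unsigned *hi, unsigned *lo) {
    assert(hi && lo && width > 0);
    *lo = index * width;
    *hi = *lo + width - 1;
}
```

For e63 in _mm512_set_epi8 (index 63, width 8) this gives lo = 504 and hi = 511, so a range starting at 503 would span nine bits.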

Vamsi_S_Intel · ‎01-14-2020

All the dot product instructions (AVX512_BF16) operating on BF16 use an incorrect offset for the second source operand.

For example, in _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b):

dst := src
FOR j := 0 to 15
	dst.fp32[j] += make_fp32(a.bf16[2*j+1]) * make_fp32(b.bf16[j+1])
	dst.fp32[j] += make_fp32(a.bf16[2*j+0]) * make_fp32(b.bf16[j+0])
ENDFOR
dst[MAX:512] := 0

All the offsets into the "b" source should use the same 2*j offset as the pair referenced in the "a" source.
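
A scalar sketch in C of the indexing requested here, with b addressed by 2*j just like a. It assumes IEEE-754 32-bit float and treats a BF16 value as the upper 16 bits of a float; it ignores whatever rounding and denormal handling the real instruction applies. `bf16_to_fp32` and `dpbf16_ps_ref` are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Widen a BF16 bit pattern to float by placing it in the upper 16 bits. */
static float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Scalar model (hypothetical helper): lane j accumulates the BF16 pair 2*j, 2*j+1 of both inputs. */
void dpbf16_ps_ref(float *dst, const float *src,
                   const uint16_t *a, const uint16_t *b, int lanes) {
    assert(dst && src && a && b && lanes >= 0);
    for (int j = 0; j < lanes; j++) {
        float acc = src[j];
        acc += bf16_to_fp32(a[2 * j + 1]) * bf16_to_fp32(b[2 * j + 1]);
        acc += bf16_to_fp32(a[2 * j + 0]) * bf16_to_fp32(b[2 * j + 0]);
        dst[j] = acc;
    }
}
```

Calling it with lanes = 16 corresponds to the 512-bit form quoted above.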

Wegner__Zach · ‎01-29-2020

The instruction listing for every 256-bit gather in AVX2 (example: _mm256_i64gather_epi64) uses the wrong operand. The instructions are shown like "vpgatherqq ymm, vm64x, ymm", always with vm32x/vm64x for the second operand. These should be vm32y/vm64y, assuming the y suffix is YMM VSIB and x is XMM VSIB.

Patrick_K_Intel · ‎03-19-2020

The 3.5.0 update to Intrinsics Guide is live, and should address all the issues reported above.

Fogle__Miles · ‎03-19-2020

I noticed two issues, one major and one minor.

The major issue is that the instruction set checkboxes no longer filter the list, although the category checkboxes still do. I get the same result on both the latest Firefox and the latest Chrome. These worked a few weeks ago, so I think it's probably the recent update that broke them.

The minor issue is that the performance table for _mm_div_ps and _mm_div_ss lists the latency on Ivy Bridge as "14-Nov". I'd like to take this as a cheeky joke about how the div instruction is so slow it only executes once a year, but I'm guessing it's just a spreadsheet formatting bug.

gratton__bob · ‎03-23-2020

Eh! Guys, the website is broken!
Filtering does not work anymore under 'Technologies'. If you click, say, SSE2 to filter ONLY SSE2 instructions, the list does not update and is filled with ALL intrinsics! This is really annoying because one cannot search for a specific instruction group.

It still works under 'Categories' though.

Thanks for fixing this ASAP.

Cox__Steven · ‎03-24-2020

I'm not a web dev, but I think I see the issue in the searchIntrinsic function:

function searchIntrinsic(e) {
    var b = false;
    if (techs.length) {
        if ($.inArray(e.tech, techs) != -1 || (e.alsoKNC && $.inArray("KNC", techs) != -1)) {
            b = true
        }
    }
    if (othertechs.length && e.tech == "Other") {
        for (var c = 0; c < othertechs.length; c++) {
            if ($.inArray(othertechs[c], e.cpuids) != -1) {
                b = true
            }
        }
    }
    if (avx512techs.length && e.tech == "AVX_512") {
        for (var c = 0; c < avx512techs.length; c++) {
            if ($.inArray(avx512techs[c], e.cpuids) != -1) {
                b = true
            }
        }
        if ($.inArray("AVX512VL", avx512techs) == -1) {
            if ($.inArray("AVX512VL", e.cpuids) != -1) {}
        }
    }
    if (cats.length) {
        var f = false;
        for (var c = 0; c < cats.length; c++) {
            if ($.inArray(cats[c], e.categories) != -1) {
                f = true
            }
        }
        if (!f) {
            return false
        }
    }
    if (search_text.length != 0) {
        var a = search_text.split("*");
        var d = 0;
        var g = 0;
        for (var c = 0; c < a.length; c++) {
            g = e._text.indexOf(a[c], d);
            if (g < 0) {
                return false
            }
            d = g
        }
    }
    return true
}

We're setting b but never using it. I inserted the following before line 25 of the above snippet, and tech filtering worked for me again:

    if (!b) {
        return false
    }

nemequ · ‎04-06-2020

The pseudo-code for _mm512_slli_epi64 shows that it only uses 8 bits of the imm8 argument (imm8[7:0]), but that doesn't seem accurate. If that were true, I would expect _mm512_srli_epi64(a, 1066) to have the same result as _mm512_srli_epi64(a, 42) (1066 & 255 == 42), but compilers will just zero the register (see https://godbolt.org/z/2thNFh).

If I provide the count as a command-line argument so the compiler can't know the value, the result for any value > 63 is all zeros. Here is a quick test:

#include <immintrin.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>

__m512i foo(__m512i bar, unsigned int j) {
    return _mm512_srli_epi64(bar, j);
}

int main(int argc, char** argv) {
  if (argc < 2) return 1;  /* the shift count is read from argv[1] */
  __m512i v = _mm512_set1_epi64(~UINT64_C(0));
  __m512i r = _mm512_srli_epi64(v, (unsigned int) atoi(argv[1]));
  /* cast so the argument matches %llx regardless of how uint64_t is defined */
  printf("0x%llx\n", (unsigned long long) ((uint64_t*) &r)[0]);

  return 0;
}

That makes sense to me, since _mm512_srl_epi64 says it uses count[63:0].
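
The two readings being compared can be written as a scalar sketch in C. It assumes the per-element rule that any count above 63 clears the element, and models "only imm8[7:0] is used" as masking the count first; `srl64_ref`, `srl64_low8_ref`, and `srl64_demo` are hypothetical names, not intrinsics:

```c
#include <assert.h>
#include <stdint.h>

/* Logical right shift of one 64-bit element; counts above 63 clear the element. */
uint64_t srl64_ref(uint64_t x, uint64_t count) {
    return count > 63 ? 0 : x >> count;
}

/* The same shift if only the low 8 bits of the count were kept first. */
uint64_t srl64_low8_ref(uint64_t x, uint64_t count) {
    return srl64_ref(x, count & 0xFF);
}

/* Usage sketch for the count 1066 discussed above. */
void srl64_demo(void) {
    uint64_t ones = ~UINT64_C(0);
    assert((1066 & 255) == 42);
    assert(srl64_ref(ones, 1066) == 0);                        /* full count: cleared */
    assert(srl64_low8_ref(ones, 1066) == srl64_ref(ones, 42)); /* truncated count: shift by 42 */
}
```

The two models disagree only for counts above 255, so any count that actually fits in 8 bits behaves the same either way.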

andysem · ‎04-07-2020

@Evan N. The second argument of `_mm512_srli_epi64` must be an 8-bit immediate. This is a precondition, meaning that the behavior is undefined if it is not met. This follows from the `vpsrlq` instruction, to which the intrinsic corresponds. A good compiler would issue a compile-time error if you specify a runtime value or an out-of-bounds constant.

albert__tomas · ‎04-21-2020

It looks like there may be a typo in the latency for these instructions on Icelake:

_mm256_lddqu_si256

_mm256_loadu_si256

as it says the latency for the instruction is 7, when on pretty much all other processors (including AMD) the latency is ~1.

twest820 · ‎05-17-2020

Hi, it looks like the Intrinsics Guide indicates a dependency on the AVX-512F CPUID flag only for F instructions and VL instruction variants. However, sections 15.2.1, 15.3, and 15.4 of the arch manual (Intel 64 and IA-32 Architectures Software Developer's Manual, volume 1) require software to check the F flag before checking the ER, PF, CD, DQ, BW, or VL flags.

Am I correct in thinking the Intrinsics Guide is missing a few thousand F dependencies and that this is maybe an incompletely implemented workaround for the way the guide's AVX-512 group checkboxes work? The guide also seems to be missing the required OSXSAVE check for AVX, AVX2, and AVX-512.

Figure 15-5 of the manual refers to table 2-2, which I presume is a typo for table 15-2, and figures 15-4 and 15-5 appear to misspell OSXSAVE as OXSAVE. So the current manual probably isn't 100% correct either. I suspect 15.3 also needs updating for IFMA52, VPOPCNTDQ, BF16, BITALG, VBMI, VBMI2, VNNI, and VP2INTERSECT, since those instruction groups presumably also require checking the OSXSAVE, F, and (at 128 and 256 bit width) VL flags. 4FMAPS and 4VNNIW are also missing but might fit better in 15.2.1.
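
The check order described in this post can be sketched in C as pure logic over raw register values. It assumes the caller has already read CPUID leaf 1 ECX, XCR0 (via XGETBV with ECX = 0), and CPUID leaf 7 subleaf 0 EBX; the macro names and `avx512_feature_usable` are hypothetical, and only features reported in leaf 7 EBX are covered:

```c
#include <assert.h>
#include <stdint.h>

#define CPUID1_ECX_OSXSAVE  (UINT32_C(1) << 27)
#define CPUID7_EBX_AVX512F  (UINT32_C(1) << 16)
#define CPUID7_EBX_AVX512DQ (UINT32_C(1) << 17)
#define CPUID7_EBX_AVX512BW (UINT32_C(1) << 30)
#define CPUID7_EBX_AVX512VL (UINT32_C(1) << 31)
/* XCR0 bits 1, 2, 5, 6, 7: SSE, AVX, opmask, ZMM_Hi256, Hi16_ZMM state. */
#define XCR0_AVX512_STATE   UINT64_C(0xE6)

/* Returns 1 only if OSXSAVE, the OS-enabled state, AVX-512F, and the requested flag are all present. */
int avx512_feature_usable(uint32_t cpuid1_ecx, uint64_t xcr0,
                          uint32_t cpuid7_ebx, uint32_t feature_bit) {
    assert(feature_bit != 0);
    if (!(cpuid1_ecx & CPUID1_ECX_OSXSAVE)) return 0;              /* XGETBV not usable */
    if ((xcr0 & XCR0_AVX512_STATE) != XCR0_AVX512_STATE) return 0; /* OS does not save ZMM state */
    if (!(cpuid7_ebx & CPUID7_EBX_AVX512F)) return 0;              /* F comes before any other flag */
    return (cpuid7_ebx & feature_bit) == feature_bit;
}
```

A 128-bit or 256-bit BW intrinsic would call this twice, once with CPUID7_EBX_AVX512BW and once with CPUID7_EBX_AVX512VL, which is the dependency chain the guide's per-intrinsic CPUID list seems to leave implicit.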

Roland_S_Intel · ‎05-19-2020

albert, tomas wrote:
It looks like there may be a typo in the latency for these instructions on Icelake [] as it says the latency for the instruction is 7

Yes, that is incorrect. You can find the correct latency numbers here: https://software.intel.com/content/www/us/en/develop/download/10th-generation-intel-core-processor-instruction-throughput-and-latency-docs.html . The Intrinsics Guide will be updated to match that.

Roland_S_Intel · ‎05-19-2020

Matthias Kretz wrote:
There's a bug either in ICC or the documentation. Consider https://godbolt.org/g/LYJjM2. The documentation for _mm_mask_mov_ps says "dst[MAX:128] := 0". The comments in the test case expect this behavior.

I don't think the test case shows this. It doesn't capture what _mm_mask_mov_ps does with the upper bits, because it tries to read them with _mm512_castps128_ps512, which is documented to leave the upper bits undefined. I also don't think there is any way to get at the dst[MAX:128] bits of a __m128 variable, so what _mm_mask_mov_ps does to the upper bits is irrelevant.

andysem · ‎05-21-2020

Roland S. (Intel) wrote:
You can find the correct latency numbers here: https://software.intel.com/content/www/us/en/develop/download/10th-gener... .

That link doesn't seem to work.

Rivera__Joao · ‎05-22-2020

Hi,

The website for the Intrinsics Guide seems to be broken (it's stuck "loading" the intrinsics). I'm using Chrome 78. I'm not a web developer, but I think the problem is in the perf.json and perf2.json files, which cannot be executed as JavaScript files (due to the .json extension, I think). This is the error message when inspecting the website:

Refused to execute script from 'https://software.intel.com/sites/landingpage/IntrinsicsGuide/files/perf2.json' because its MIME type ('application/json') is not executable, and strict MIME type checking is enabled.

I think it can be solved by changing the extension of the files (perf and perf2) to .js and changing the .html file accordingly as well.

Best.

Osiv__Oleksiy · ‎05-22-2020

Hey, the guide is not working at all today. I checked Chrome & Edge. Development console contains the following error:

Refused to execute script from 'https://software.intel.com/sites/landingpage/IntrinsicsGuide/files/perf.json' because its MIME type ('application/json') is not executable, and strict MIME type checking is enabled.