Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.

Bugs in Intrinsics Guide

andysem
New Contributor III
36,100 Views

Hi,

I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using the Linux version):

1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.

2. __m256 _mm256_undefined_si256 () should return __m256i.

3. In some instruction descriptions, such as _mm_adds_epi8, the operation is described in terms of SignedSaturate, while others, e.g. _mm256_adds_epi16, are described with SaturateToSignedWord. This applies to other operations with unsigned saturation as well. The vector elements are also described differently. A more consistent description would be nice.

4. _mm_alignr_epi8 has two descriptions.

5. I'm not sure the _mm_ceil_pd signature and description are correct. The guide says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?
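
For item 3, a scalar sketch of signed saturation for a single lane may help when comparing the two description styles. The function names below are made up for illustration and are not part of any Intel header; they only model the per-element behavior (add in a wider type, then clamp) that the SignedSaturate and SaturateToSignedWord wordings both appear to describe.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 8-bit lane of _mm_adds_epi8:
   add in a wider type, then clamp to the int8_t range. */
int8_t adds_epi8_lane(int8_t a, int8_t b) {
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}

/* Same idea for one 16-bit lane, as in _mm256_adds_epi16. */
int16_t adds_epi16_lane(int16_t a, int16_t b) {
    int sum = (int)a + (int)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

Since both widths reduce to the same clamp, a single description style parameterized by element width would cover them.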

I didn't read all instructions so there may be more issues. I'll post if I find anything else.

PS: This is not a bug per se but some instructions are missing the Latency & Throughput information. This mostly relates to newer instructions but still this info is useful and I hope it will be added.

0 Kudos
221 Replies
Stefan_M_Intel1
Employee
3,962 Views

The descriptions of _mm_set_epi8(), _mm256_set_epi8(), _mm512_set_epi8(), and _mm512_set_epi16() all say "reverse order", which is incorrect.

0 Kudos
Stefan_M_Intel1
Employee
3,962 Views

Pseudo code of all nine *_dpbusd_epi32() still looks incorrect, since the 4* for operand b is missing.

tmp1 := a.byte[4*j] * b.byte[4*j]
tmp2 := a.byte[4*j+1] * b.byte[4*j+1]
tmp3 := a.byte[4*j+2] * b.byte[4*j+2]
tmp4 := a.byte[4*j+3] * b.byte[4*j+3]

 

0 Kudos
Guobing_C_Intel
Employee
3,962 Views

There seems to be a description error in all the VNNI intrinsics:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=2557,4351,2195,2198,2204&avx512techs=AVX512_VNNI

For example:

__m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)

The description is shown below. None of the b.byte[...] indices match the corresponding a.byte indices; e.g., tmp1 should be a.byte[4*j] * b.byte[4*j] instead of a.byte[4*j] * b.byte[j].

FOR j := 0 to 3
	tmp1 := a.byte[4*j] * b.byte[j]
	tmp2 := a.byte[4*j+1] * b.byte[j+1]
	tmp3 := a.byte[4*j+2] * b.byte[j+2]
	tmp4 := a.byte[4*j+3] * b.byte[j+3]
	dst.dword[j] := src.dword[j] + tmp1 + tmp2 + tmp3 + tmp4
ENDFOR
dst[MAX:128] := 0

The same error happens for all other VNNI intrinsic descriptions.
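
To make the intended indexing concrete, here is a minimal scalar sketch of the 128-bit form, assuming (as I understand the instruction) that a holds unsigned bytes, b holds signed bytes, and the non-saturating variant wraps on overflow. The name dpbusd_epi32_ref is hypothetical and only serves as a reference model for the pseudocode.

```c
#include <stdint.h>

/* Hypothetical scalar model of the 128-bit form: 4 dword lanes, 16 bytes. */
void dpbusd_epi32_ref(int32_t dst[4], const int32_t src[4],
                      const uint8_t a[16], const int8_t b[16]) {
    for (int j = 0; j < 4; j++) {
        int32_t acc = 0;
        for (int k = 0; k < 4; k++) {
            /* Both operands use the same byte index 4*j + k. */
            acc += (int32_t)a[4 * j + k] * (int32_t)b[4 * j + k];
        }
        /* Wrap-around add, assumed for the non-saturating instruction. */
        dst[j] = (int32_t)((uint32_t)src[j] + (uint32_t)acc);
    }
}
```

With b indexed as b[j + k] instead, lane j would read bytes that belong to other lanes, which is the error being reported.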

0 Kudos
Dobratz__Glenn
3,962 Views

This issue applies to:  _mm_xor_epi32, _mm256_xor_epi32, _mm_xor_epi64, and _mm256_xor_epi64

The intrinsics guide shows these functions as generating vpord/vporq instead of vpxord/vpxorq.

The wrong instruction appears both on the right side of the function name in the summary view and in the internal detailed descriptions.

0 Kudos
Dobratz__Glenn
3,962 Views

The description of the _ktestc_maskXX instructions in the online Intrinsics Guide disagrees with the Intel Architecture Software Developer Manual.

The Intrinsics Guide says that the function returns true if the NAND of the operands is all ones.

The Architecture Manual states that the 'CF' flag is lit if the result of the NAND operation is all zeros.

Presumably the Architecture Manual is correct, since the NAND producing an all ones result is meaningless.
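
A small scalar model of the flag computation, as I read the manual, is sketched below. Two assumptions to verify: the operation is an AND-NOT (the first operand is inverted and then ANDed with the second) rather than a true NAND, and the intrinsic's a and b map to the operands in that order. The function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of the CF result for 16-bit masks:
   1 when (NOT a) AND b has no bits set, otherwise 0. */
unsigned char ktestc_mask16_ref(uint16_t a, uint16_t b) {
    return (uint16_t)(~a & b) == 0;
}

/* ZF-style companion: 1 when a AND b has no bits set. */
unsigned char ktestz_mask16_ref(uint16_t a, uint16_t b) {
    return (uint16_t)(a & b) == 0;
}
```

Under this model the "c" result is true exactly when every bit set in b is also set in a.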

0 Kudos
Wegner__Zach
Beginner
3,960 Views
The second-to-last line of the code for _mm512_set_epi8 is this: dst[511:503] := e63. The indices should be 511:504, not 511:503.
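
The expected range follows from simple arithmetic: 8-bit element i occupies bits 8*i+7 down to 8*i. A tiny sketch (the type and function names are made up for illustration):

```c
/* Hypothetical helper: bit range occupied by 8-bit element i. */
typedef struct { int hi; int lo; } bit_range;

bit_range epi8_bits(int i) {
    bit_range r = { 8 * i + 7, 8 * i };
    return r;
}
```

For i = 63 this gives hi = 511 and lo = 504, an 8-bit span, whereas 511:503 would be 9 bits wide.
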
0 Kudos
Vamsi_S_Intel
Employee
3,962 Views

All the dot product instructions (AVX512_BF16) operating on BF16 have an incorrect offset for the second source operand.

For example, in _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b):
 

dst := src
FOR j := 0 to 15
	dst.fp32[j] += make_fp32(a.bf16[2*j+1]) * make_fp32(b.bf16[j+1])
	dst.fp32[j] += make_fp32(a.bf16[2*j+0]) * make_fp32(b.bf16[j+0])
ENDFOR
dst[MAX:512] := 0

 

All the offsets into the "b" source should use the 2*j offset, matching the pair referenced in the "a" source.
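
A scalar sketch of the corrected indexing follows. It assumes bf16 is simply the upper 16 bits of an IEEE-754 binary32 value and ignores the instruction's exact rounding and denormal handling, so it only models which elements are paired. The names bf16_to_fp32 and dpbf16_ps_ref are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* bf16 is treated as the upper 16 bits of an IEEE-754 binary32 value. */
float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical scalar model: n fp32 lanes, 2*n bf16 elements per source. */
void dpbf16_ps_ref(float *dst, const float *src,
                   const uint16_t *a, const uint16_t *b, int n) {
    for (int j = 0; j < n; j++) {
        dst[j] = src[j];
        /* Both sources use the same pair of indices, 2*j+1 and 2*j. */
        dst[j] += bf16_to_fp32(a[2 * j + 1]) * bf16_to_fp32(b[2 * j + 1]);
        dst[j] += bf16_to_fp32(a[2 * j + 0]) * bf16_to_fp32(b[2 * j + 0]);
    }
}
```

For _mm512_dpbf16_ps, n would be 16, with 32 bf16 elements in each of a and b.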

0 Kudos
Wegner__Zach
Beginner
3,960 Views
The instruction listing for every 256-bit gather in AVX2 (example: _mm256_i64gather_epi64) uses the wrong operand. The instructions are shown like "vpgatherqq ymm, vm64x, ymm", always with vm32x/vm64x for the second operand. These should be vm32y/vm64y, assuming the y suffix is YMM VSIB and x is XMM VSIB.
0 Kudos
Patrick_K_Intel
Employee
3,962 Views

The 3.5.0 update to Intrinsics Guide is live, and should address all the issues reported above.

0 Kudos
Fogle__Miles
Beginner
3,960 Views

I noticed two issues, one major and one minor.

The major issue is that the instruction set checkboxes no longer filter the list, although the category checkboxes still do. I get the same result on the latest Firefox and the latest Chrome. These worked a few weeks ago, so I think it's probably the recent update that broke them.

The minor issue is that the performance table for _mm_div_ps and _mm_div_ss lists the latency on Ivy Bridge as "14-Nov". I'd like to take this as a cheeky joke about how the div instruction is so slow it only executes once a year, but I'm guessing it's just a spreadsheet formatting bug.

0 Kudos
gratton__bob
Beginner
3,962 Views

Hey guys, the website is broken!
Filtering under 'Technologies' does not work anymore. If you click, say, SSE2 to show ONLY SSE2 instructions, the list does not update and is still filled with ALL intrinsics! This is really annoying, because one cannot search within a specific instruction group.

It still works under 'Categories', though.

Thanks for fixing this ASAP.

0 Kudos
Cox__Steven
Beginner
4,029 Views

I'm not a web dev, but I think I see the issue in the searchIntrinsic function:

function searchIntrinsic(e) {
    var b = false;
    if (techs.length) {
        if ($.inArray(e.tech, techs) != -1 || (e.alsoKNC && $.inArray("KNC", techs) != -1)) {
            b = true
        }
    }
    if (othertechs.length && e.tech == "Other") {
        for (var c = 0; c < othertechs.length; c++) {
            if ($.inArray(othertechs[c], e.cpuids) != -1) {
                b = true
            }
        }
    }
    if (avx512techs.length && e.tech == "AVX_512") {
        for (var c = 0; c < avx512techs.length; c++) {
            if ($.inArray(avx512techs[c], e.cpuids) != -1) {
                b = true
            }
        }
        if ($.inArray("AVX512VL", avx512techs) == -1) {
            if ($.inArray("AVX512VL", e.cpuids) != -1) {}
        }
    }
    if (cats.length) {
        var f = false;
        for (var c = 0; c < cats.length; c++) {
            if ($.inArray(cats[c], e.categories) != -1) {
                f = true
            }
        }
        if (!f) {
            return false
        }
    }
    if (search_text.length != 0) {
        var a = search_text.split("*");
        var d = 0;
        var g = 0;
        for (var c = 0; c < a.length; c++) {
            g = e._text.indexOf(a[c], d);
            if (g < 0) {
                return false
            }
            d = g
        }
    }
    return true
}

We're setting b but never using it. I inserted the following before line 25 of the above snippet, and tech filtering worked for me again:

 

    if (!b) {
        return false
    }

 

0 Kudos
nemequ
New Contributor I
4,030 Views

The pseudo-code for _mm512_slli_epi64 shows that it only uses 8 bits of the imm8 argument (imm8[7:0]), but that doesn't seem accurate. If that were true I would expect _mm512_srli_epi64(a, 1066) to have the same result as _mm512_srli_epi64(a, 42) (1066 & 255 == 42), but compilers will just zero the register (see https://godbolt.org/z/2thNFh).

If I provide the count as a command-line argument, so the compiler can't know the value, the result for any value > 63 is all zeros. Here is a quick test:

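#include <immintrin.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Shift every 64-bit lane right by a count that is only known at run time. */
__m512i foo(__m512i bar, unsigned int j) {
    return _mm512_srli_epi64(bar, j);
}

int main(int argc, char** argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s <count>\n", argv[0]);
    return 1;
  }
  __m512i v = _mm512_set1_epi64(~UINT64_C(0));  /* all bits set */
  __m512i r = foo(v, (unsigned int) atoi(argv[1]));
  uint64_t lanes[8];
  _mm512_storeu_si512(lanes, r);  /* avoid reading the vector through a cast pointer */
  printf("0x%" PRIx64 "\n", lanes[0]);

  return 0;
}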

That makes sense to me, since _mm512_srl_epi64 says it uses count[63:0].
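
The behavior observed above can be modeled per lane in portable C. This is only a sketch of what the test shows (counts above 63 clear the lane rather than being reduced modulo 64 or truncated to a smaller field), and srl_epi64_lane is a hypothetical name, not an intrinsic.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 64-bit lane of a logical right shift:
   counts above 63 clear the lane instead of being reduced modulo 64. */
uint64_t srl_epi64_lane(uint64_t x, uint64_t count) {
    return count > 63 ? 0 : x >> count;
}
```

The explicit range check is also needed in plain C, because shifting a 64-bit value by 64 or more is undefined behavior.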

0 Kudos
andysem
New Contributor III
4,025 Views

@Evan N. The second argument of `_mm512_srli_epi64` must be an 8-bit immediate. This is a precondition, meaning that the behavior is undefined if it is not met. This follows from the `vpsrlq` instruction, to which the intrinsic corresponds. A good compiler would issue a compile-time error if you specify a runtime value or a constant out of bounds.

0 Kudos
albert__tomas
Beginner
4,025 Views

It looks like there may be a typo in the latency for these instructions on Icelake:

_mm256_lddqu_si256

_mm256_loadu_si256

The guide says the latency for these instructions is 7, whereas on pretty much all other processors (including AMD) the latency is ~1.

0 Kudos
twest820
Beginner
4,025 Views

Hi, it looks like the Intrinsics Guide indicates a dependency on the AVX-512F CPUID flag only for F instructions and VL instruction variants. However, sections 15.2.1, 15.3, and 15.4 of the architecture manual (Intel 64 and IA-32 Architectures Software Developer's Manual, volume 1) require software to check the F flag before checking the ER, PF, CD, DQ, BW, or VL flags.

Am I correct in thinking the Intrinsics Guide is missing a few thousand F dependencies and that this is maybe an incompletely implemented workaround for the way the guide's AVX-512 group checkboxes work? The guide also seems to be missing the required OSXSAVE check for AVX, AVX2, and AVX-512.

Figure 15-5 of the manual does refer to table 2-2, which I presume is a typo for table 15-2, and figures 15-4 and 15-5 appear to misspell OSXSAVE as OXSAVE. So the current manual probably isn't 100% correct either. I suspect 15.3 also needs updating for IFMA52, VPOPCNTDQ, BF16, BITALG, VBMI, VBMI2, VNNI, and VP2INTERSECT, since those instruction groups presumably also require checking the OSXSAVE, F, and (at 128- and 256-bit width) VL flags. 4FMAPS and 4VNNIW are also missing but might fit better in 15.2.1.
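
The check order discussed above can be sketched as a pure function over already-read register values, which keeps it independent of any particular CPUID or XGETBV wrapper. The bit positions are as I read them in the SDM and should be treated as assumptions to verify; the helper name and parameters are hypothetical, and it only covers extensions whose flag lives in CPUID.(EAX=7,ECX=0):EBX.

```c
#include <stdint.h>

/* Bit positions as I read them in the SDM; treat them as assumptions to verify. */
#define CPUID1_ECX_OSXSAVE  (1u << 27)  /* CPUID.1:ECX.OSXSAVE */
#define CPUID7_EBX_AVX512F  (1u << 16)  /* CPUID.(EAX=7,ECX=0):EBX.AVX512F */
#define CPUID7_EBX_AVX512VL (1u << 31)  /* CPUID.(EAX=7,ECX=0):EBX.AVX512VL */
#define XCR0_AVX512_STATE   0xE6u       /* SSE, AVX, opmask, ZMM_Hi256, Hi16_ZMM */

/* Hypothetical helper: is a 128/256-bit form of an AVX-512 extension usable?
   ext_bit is the extension's own flag in CPUID.(EAX=7,ECX=0):EBX (e.g. DQ, BW). */
int avx512_vl_ext_usable(uint32_t cpuid1_ecx, uint64_t xcr0,
                         uint32_t cpuid7_ebx, uint32_t ext_bit) {
    if (!(cpuid1_ecx & CPUID1_ECX_OSXSAVE)) return 0;               /* 1. OSXSAVE */
    if ((xcr0 & XCR0_AVX512_STATE) != XCR0_AVX512_STATE) return 0;  /* 2. OS state */
    if (!(cpuid7_ebx & CPUID7_EBX_AVX512F)) return 0;               /* 3. F first */
    if (!(cpuid7_ebx & CPUID7_EBX_AVX512VL)) return 0;              /* 4. VL */
    return (cpuid7_ebx & ext_bit) != 0;                             /* 5. extension */
}
```

A caller would read the three register values once at startup and then pass, for example, the DQ or BW flag as ext_bit.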

0 Kudos
Roland_S_Intel
Employee
4,032 Views

albert, tomas wrote:

It looks like there may be a typo in the latency for these instructions on Icelake [] as it says the latency for the instruction is 7

Yes that is incorrect. You can find the correct latency numbers here: https://software.intel.com/content/www/us/en/develop/download/10th-generation-intel-core-processor-instruction-throughput-and-latency-docs.html . The intrinsic guide will be updated to match that.

0 Kudos
Roland_S_Intel
Employee
4,032 Views

Matthias Kretz wrote:

There's a bug either in ICC or the documentation. Consider https://godbolt.org/g/LYJjM2. The documentation for _mm_mask_mov_ps says "dst[MAX:128] := 0". The comments in the test case expect this behavior.

I don't think the test case shows this. It doesn't capture what _mm_mask_mov_ps does with the upper bits, because it tries to read them with _mm512_castps128_ps512, which is documented to leave the upper bits undefined. And I don't think there is any way to get at the dst[MAX:128] bits of a __m128 variable. Therefore it is irrelevant what _mm_mask_mov_ps does to the upper bits.

0 Kudos
andysem
New Contributor III
4,032 Views

Roland S. (Intel) wrote:

You can find the correct latency numbers here: https://software.intel.com/content/www/us/en/develop/download/10th-gener... .

That link doesn't seem to work.

 

0 Kudos
Rivera__Joao
Beginner
4,032 Views

Hi,

The website for the Intrinsics Guide seems to be broken (it's stuck "loading" the intrinsics). I'm using Chrome 78. I'm not a web developer, but I think the problem is in the perf.json and perf2.json files, which cannot be executed as JavaScript files (due to the .json extension, I think). This is the error message when inspecting the website:

Refused to execute script from 'https://software.intel.com/sites/landingpage/IntrinsicsGuide/files/perf2.json' because its MIME type ('application/json') is not executable, and strict MIME type checking is enabled.

I think it can be solved by changing the extension of the files (perf and perf2) to .js and changing the .html file accordingly as well.

Best.

0 Kudos
Osiv__Oleksiy
Beginner
4,032 Views

Hey, the guide is not working at all today. I checked Chrome and Edge. The developer console shows the following error:

 

Refused to execute script from 'https://software.intel.com/sites/landingpage/IntrinsicsGuide/files/perf.json' because its MIME type ('application/json') is not executable, and strict MIME type checking is enabled.
 

0 Kudos