Hi,
I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using the Linux version):
1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.
2. __m256 _mm256_undefined_si256 () should return __m256i.
3. In some instruction descriptions, such as _mm_adds_epi8, the operation is described in terms of SignedSaturate, while others, e.g. _mm256_adds_epi16, are described with SaturateToSignedWord. This applies to other operations with unsigned saturation as well. The vector elements are also described differently. A more consistent description would be nice.
4. _mm_alignr_epi8 has two descriptions.
5. I'm not sure the _mm_ceil_pd signature and description are correct. The guide says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?
I didn't read all instructions so there may be more issues. I'll post if I find anything else.
PS: This is not a bug per se but some instructions are missing the Latency & Throughput information. This mostly relates to newer instructions but still this info is useful and I hope it will be added.
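To make item 3 concrete, here is a minimal scalar sketch of what both naming styles (SignedSaturate and SaturateToSignedByte/Word) appear to describe for a signed 8-bit saturating add. The helper names `saturate_i8` and `adds_epi8_model` are made up for illustration and are not part of any Intel header; this is a plain C model, not the guide's pseudocode.

```c
#include <stdint.h>

/* Hypothetical helper: clamp a wider intermediate to the signed 8-bit range. */
static int8_t saturate_i8(int v)
{
    if (v > INT8_MAX) return INT8_MAX;
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t) v;
}

/* Hypothetical scalar model of _mm_adds_epi8: 16 independent signed
   8-bit lanes, each added in a wider type and then saturated. */
static void adds_epi8_model(int8_t dst[16], const int8_t a[16], const int8_t b[16])
{
    for (int i = 0; i < 16; i++) {
        dst[i] = saturate_i8((int) a[i] + (int) b[i]);
    }
}
```

With this model, 100 + 100 clamps to 127 and -100 + -100 clamps to -128, while sums that fit in 8 bits are unchanged.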
The descriptions of _mm_set_epi8(), _mm256_set_epi8(), _mm512_set_epi8(), and _mm512_set_epi16() all say "reverse order", which is incorrect.
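As a rough illustration of the ordering in question: the set intrinsics take their arguments from the highest element down to the lowest, and it is the setr variants that use the reversed (memory-like) order. The sketch below models this with plain arrays; `set_epi8_model` and `setr_epi8_model` are hypothetical names, and `args[]` stands in for the 16 arguments in the order they are written in the call.

```c
#include <stdint.h>

/* args[0] is the first argument written in the call (e15),
   args[15] is the last one (e0). */
static void set_epi8_model(int8_t dst[16], const int8_t args[16])
{
    for (int i = 0; i < 16; i++) {
        dst[i] = args[15 - i];   /* last argument lands in byte 0 */
    }
}

/* The "r" variant: first argument lands in byte 0. */
static void setr_epi8_model(int8_t dst[16], const int8_t args[16])
{
    for (int i = 0; i < 16; i++) {
        dst[i] = args[i];
    }
}
```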
The pseudocode of all nine *_dpbusd_epi32() intrinsics still looks incorrect, since the 4* for operand b is missing.
tmp1 := a.byte[4*j] * b.byte[4*j]
tmp2 := a.byte[4*j+1] * b.byte[4*j+1]
tmp3 := a.byte[4*j+2] * b.byte[4*j+2]
tmp4 := a.byte[4*j+3] * b.byte[4*j+3]
There seems to be a description error in all the VNNI intrinsics.
For example:
__m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)
The description is shown below. None of the b.byte[????] indices correspond to the related a.byte indices; e.g. tmp1 should be a.byte[4*j] * b.byte[4*j] instead of a.byte[4*j] * b.byte[j].
FOR j := 0 to 3
    tmp1 := a.byte[4*j] * b.byte[j]
    tmp2 := a.byte[4*j+1] * b.byte[j+1]
    tmp3 := a.byte[4*j+2] * b.byte[j+2]
    tmp4 := a.byte[4*j+3] * b.byte[j+3]
    dst.dword[j] := src.dword[j] + tmp1 + tmp2 + tmp3 + tmp4
ENDFOR
dst[MAX:128] := 0
The same error happens for all other VNNI intrinsic descriptions.
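For reference, a scalar sketch of the indexing the two reports above are asking for, assuming the usual dpbusd semantics (unsigned bytes from a, signed bytes from b, sum added to the 32-bit accumulator without saturation). `dpbusd_model` is a hypothetical name and this is only a model of the 128-bit form, not Intel's pseudocode.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 128-bit dpbusd: 4 dword lanes,
   each consuming 4 unsigned bytes of a and the 4 signed bytes of b
   at the SAME byte positions (4*j + k for both operands). */
static void dpbusd_model(int32_t dst[4], const int32_t src[4],
                         const uint8_t a[16], const int8_t b[16])
{
    for (int j = 0; j < 4; j++) {
        int32_t sum = 0;
        for (int k = 0; k < 4; k++) {
            sum += (int32_t) a[4*j + k] * (int32_t) b[4*j + k];
        }
        /* Non-saturating add; go through unsigned to get wraparound
           without signed-overflow undefined behavior in C. */
        dst[j] = (int32_t) ((uint32_t) src[j] + (uint32_t) sum);
    }
}
```

With a filled with ones and b[i] = i, lane 1 accumulates 4+5+6+7, whereas the b.byte[j+k] indexing from the quoted description would give 1+2+3+4.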
This issue applies to: _mm_xor_epi32, _mm256_xor_epi32, _mm_xor_epi64, and _mm256_xor_epi64
The intrinsics guide shows these functions as generating vpord/vporq instead of vpxord/vpxorq.
The wrong instruction appears both on the right side of the function name in the summary view and in the internal detailed descriptions.
The description of the _ktestc_maskXX instructions in the online Intrinsics Guide disagrees with the Intel Architecture Software Developer Manual.
The Intrinsics Guide says that the function returns true if the NAND of the operands is all ones.
The Architecture Manual states that the 'CF' flag is lit if the result of the NAND operation is all zeros.
Presumably the Architecture Manual is correct, since the NAND producing an all ones result is meaningless.
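A small scalar sketch of the flag semantics may help here. As far as I can tell from the architecture manual, the carry result of KTEST comes from (NOT a) AND b being all zeros (an AND-NOT rather than a true NAND), and the zero result comes from a AND b being all zeros. The function names below are hypothetical models for the 16-bit mask case, not the real intrinsics.

```c
#include <stdint.h>

/* Hypothetical model of the CF result: 1 when (NOT a) AND b has no bits set. */
static unsigned char ktestc_mask16_model(uint16_t a, uint16_t b)
{
    return (uint16_t) (~a & b) == 0;
}

/* Hypothetical model of the ZF result: 1 when a AND b has no bits set. */
static unsigned char ktestz_mask16_model(uint16_t a, uint16_t b)
{
    return (uint16_t) (a & b) == 0;
}
```

Under this reading, the carry result is 1 exactly when every bit set in b is also set in a.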
All the dot product instructions (AVX512_BF16) operating on BF16 have an incorrect offset for the second source operand.
For example, in _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b):
dst := src
FOR j := 0 to 15
    dst.fp32[j] += make_fp32(a.bf16[2*j+1]) * make_fp32(b.bf16[j+1])
    dst.fp32[j] += make_fp32(a.bf16[2*j+0]) * make_fp32(b.bf16[j+0])
ENDFOR
dst[MAX:512] := 0
All the offsets into the "b" source should use the 2*j offset, matching the pair referenced in the "a" source.
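The intended pairing can be sketched in scalar C as follows. This assumes make_fp32 simply places the 16 BF16 bits in the upper half of a 32-bit float, and it ignores the instruction's special rounding and denormal handling, so treat it as an indexing illustration only. `bf16_to_fp32` and `dpbf16_model` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Assumed make_fp32 behavior: BF16 bits become the top 16 bits of an fp32. */
static float bf16_to_fp32(uint16_t h)
{
    uint32_t bits = (uint32_t) h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical scalar model: lane j consumes the BF16 pair (2*j, 2*j+1)
   from BOTH a and b. */
static void dpbf16_model(float *dst, const float *src,
                         const uint16_t *a, const uint16_t *b, int lanes)
{
    for (int j = 0; j < lanes; j++) {
        float acc = src[j];
        acc += bf16_to_fp32(a[2*j + 1]) * bf16_to_fp32(b[2*j + 1]);
        acc += bf16_to_fp32(a[2*j + 0]) * bf16_to_fp32(b[2*j + 0]);
        dst[j] = acc;
    }
}
```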
The 3.5.0 update to Intrinsics Guide is live, and should address all the issues reported above.
I noticed two issues, one major and one minor.
The major issue is that the instruction set checkboxes no longer filter the list, although the category checkboxes still do. I get the same result on both the latest Firefox and the latest Chrome. These worked a few weeks ago, so I think it's probably the recent update that broke them.
The minor issue is that the performance table for _mm_div_ps and _mm_div_ss lists the latency on Ivy Bridge as "14-Nov". I'd like to take this as a cheeky joke about how the div instruction is so slow it only executes once a year, but I'm guessing it's just a spreadsheet formatting bug.
Hey guys, the website is broken!
Filtering no longer works under 'Technologies'. If you click, say, SSE2 to show ONLY SSE2 instructions, the list does not update and still contains ALL intrinsics! This is really annoying, because one cannot search within a specific instruction group.
It still works under 'Categories', though.
Thanks for fixing this ASAP.
I'm not a web dev, but I think I see the issue in the searchIntrinsic function:
function searchIntrinsic(e) {
    var b = false;
    if (techs.length) {
        if ($.inArray(e.tech, techs) != -1 || (e.alsoKNC && $.inArray("KNC", techs) != -1)) {
            b = true
        }
    }
    if (othertechs.length && e.tech == "Other") {
        for (var c = 0; c < othertechs.length; c++) {
            if ($.inArray(othertechs[c], e.cpuids) != -1) {
                b = true
            }
        }
    }
    if (avx512techs.length && e.tech == "AVX_512") {
        for (var c = 0; c < avx512techs.length; c++) {
            if ($.inArray(avx512techs[c], e.cpuids) != -1) {
                b = true
            }
        }
        if ($.inArray("AVX512VL", avx512techs) == -1) {
            if ($.inArray("AVX512VL", e.cpuids) != -1) {}
        }
    }
    if (cats.length) {
        var f = false;
        for (var c = 0; c < cats.length; c++) {
            if ($.inArray(cats[c], e.categories) != -1) {
                f = true
            }
        }
        if (!f) {
            return false
        }
    }
    if (search_text.length != 0) {
        var a = search_text.split("*");
        var d = 0;
        var g = 0;
        for (var c = 0; c < a.length; c++) {
            g = e._text.indexOf(a[c], d);
            if (g < 0) {
                return false
            }
            d = g
        }
    }
    return true
}
We're setting b but never using it. I inserted the following before line 25 in the above snippet, and tech filtering worked for me again:
if (!b) { return false }
The pseudo-code for _mm512_slli_epi64 shows that it only uses 8 bits of the imm8 argument (imm8[7:0]), but that doesn't seem accurate. If that were true, I would expect _mm512_srli_epi64(a, 1066) to have the same result as _mm512_srli_epi64(a, 42) (since (1066 & 255) == 42), but compilers will just zero the register (see https://godbolt.org/z/2thNFh).
If I provide the count as a command-line argument so the compiler can't know the value, the result for any value > 63 is all zeros. Here is a quick test:
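#include <immintrin.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>

__m512i foo(__m512i bar, unsigned int j) {
    return _mm512_srli_epi64(bar, j);
}

int main(int argc, char** argv) {
    /* The shift count comes from the command line so the compiler cannot fold it. */
    if (argc < 2) return 1;
    __m512i v = _mm512_set1_epi64(~UINT64_C(0));
    __m512i r = _mm512_srli_epi64(v, (unsigned int) atoi(argv[1]));
    /* Print the lowest 64-bit lane; cast so the value matches %llx. */
    printf("0x%llx\n", (unsigned long long) ((uint64_t*) &r)[0]);
    return 0;
}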
That makes sense to me, since _mm512_srl_epi64 says it uses count[63:0].
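The observed behavior can be written down as a one-lane scalar sketch. This assumes the count is compared against 63 as a full-width value, which is what the test above suggests; `srl64_model` is a hypothetical helper, and the explicit branch is needed because shifting a 64-bit value by 64 or more is undefined in C.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 64-bit lane of a logical right shift:
   any count above 63 yields zero instead of wrapping the count. */
static uint64_t srl64_model(uint64_t x, uint64_t count)
{
    return count > 63 ? 0 : x >> count;
}
```

Under this model a count of 1066 gives zero, while truncating the count to 8 bits first (1066 & 255 = 42) would leave the low 22 bits set for an all-ones input.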
@Evan N. The second argument of `_mm512_srli_epi64` must be an 8-bit immediate. This is a precondition, meaning that behavior is undefined if it is not met. This follows from the `vpsrlq` instruction, to which the intrinsic corresponds. A good compiler would issue a compile-time error if you specify a runtime value or an out-of-bounds constant.
It looks like there may be a typo in the latency for these instructions on Icelake:
_mm256_lddqu_si256
_mm256_loadu_si256
as it says the latency for the instruction is 7, when on pretty much all other processors (including AMD) the latency is ~1.
Hi, it looks like the Intrinsics Guide indicates a dependency on the AVX-512F CPUID flag only for F instructions and VL instruction variants. However, sections 15.2.1, 15.3, and 15.4 of the architecture manual (Intel 64 and IA-32 Architectures Software Developer's Manual, volume 1) require software to check the F flag before checking the ER, PF, CD, DQ, BW, or VL flags.
Am I correct in thinking the Intrinsics Guide is missing a few thousand F dependencies, and that this is maybe an incompletely implemented workaround for the way the guide's AVX-512 group checkboxes work? The guide also seems to be missing the required OSXSAVE check for AVX, AVX2, and AVX-512.
Figure 15-5 of the manual refers to table 2-2, which I presume is a typo for table 15-2, and figures 15-4 and 15-5 appear to misspell OSXSAVE as OXSAVE. So the current manual probably isn't 100% correct either. I suspect 15.3 also needs updating for IFMA52, VPOPCNTDQ, BF16, BITALG, VBMI, VBMI2, VNNI, and VP2INTERSECT, since, presumably, those instruction groups also require checking the OSXSAVE, F, and (at 128- and 256-bit width) VL flags. 4FMAPS and 4VNNIW are also missing but might fit better in 15.2.1.
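The check order being discussed can be sketched as a pure function over register values that the caller has already read, so the example stays portable. The bit positions are the documented ones as I understand them (CPUID.1:ECX bit 27 for OSXSAVE; XCR0 bits 1, 2, 5, 6, 7 for SSE, AVX, opmask, ZMM_Hi256, Hi16_ZMM; CPUID.(EAX=7,ECX=0):EBX bit 16 for AVX512F). The macro and function names are hypothetical, and actually obtaining the values (CPUID, XGETBV) is left out.

```c
#include <stdint.h>

#define CPUID1_ECX_OSXSAVE   (UINT32_C(1) << 27)
#define CPUID7_EBX_AVX512F   (UINT32_C(1) << 16)
#define CPUID7_EBX_AVX512DQ  (UINT32_C(1) << 17)
#define CPUID7_EBX_AVX512BW  (UINT32_C(1) << 30)
#define CPUID7_EBX_AVX512VL  (UINT32_C(1) << 31)
#define XCR0_AVX512_STATE    UINT64_C(0xE6)  /* SSE, AVX, opmask, ZMM_Hi256, Hi16_ZMM */

/* Step 1: OSXSAVE, step 2: OS-enabled state in XCR0, step 3: the F flag. */
static int avx512f_usable(uint32_t cpuid1_ecx, uint64_t xcr0, uint32_t cpuid7_ebx)
{
    if (!(cpuid1_ecx & CPUID1_ECX_OSXSAVE)) return 0;
    if ((xcr0 & XCR0_AVX512_STATE) != XCR0_AVX512_STATE) return 0;
    return (cpuid7_ebx & CPUID7_EBX_AVX512F) != 0;
}

/* Extension flags (e.g. DQ, BW, VL) only count once F has been confirmed. */
static int avx512_ext_usable(uint32_t cpuid1_ecx, uint64_t xcr0,
                             uint32_t cpuid7_ebx, uint32_t ext_bits)
{
    if (!avx512f_usable(cpuid1_ecx, xcr0, cpuid7_ebx)) return 0;
    return (cpuid7_ebx & ext_bits) == ext_bits;
}
```

For a 128-bit or 256-bit DQ intrinsic, for example, the caller would pass `CPUID7_EBX_AVX512DQ | CPUID7_EBX_AVX512VL` as `ext_bits`.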
albert, tomas wrote: It looks like there may be a typo in the latency for these instructions on Icelake [] as it says the latency for the instruction is 7
Yes, that is incorrect. You can find the correct latency numbers here: https://software.intel.com/content/www/us/en/develop/download/10th-generation-intel-core-processor-instruction-throughput-and-latency-docs.html . The Intrinsics Guide will be updated to match.
Matthias Kretz wrote: There's a bug either in ICC or the documentation. Consider https://godbolt.org/g/LYJjM2. The documentation for _mm_mask_mov_ps says "dst[MAX:128] := 0". The comments in the test case expect this behavior.
I don't think the test case shows this. It doesn't capture what _mm_mask_mov_ps does with the upper bits, because it tries to read them with _mm512_castps128_ps512, which is documented to leave the upper bits undefined. And I don't think there is any way to get at the dst[MAX:128] bits of a __m128 variable, so it is irrelevant what _mm_mask_mov_ps does to the upper bits.
Roland S. (Intel) wrote: You can find the correct latency numbers here: https://software.intel.com/content/www/us/en/develop/download/10th-gener... .
That link doesn't seem to work.
Hi,
The website for the Intrinsics Guide seems to be broken (it's stuck "loading" the intrinsics). I'm using Chrome 78. I'm not a web developer, but I think the problem is that the perf.json and perf2.json files cannot be executed as JavaScript (due to the .json extension, I think). This is the error message shown when inspecting the website:
Refused to execute script from 'https://software.intel.com/sites/landingpage/IntrinsicsGuide/files/perf2.json' because its MIME type ('application/json') is not executable, and strict MIME type checking is enabled.
I think it can be solved by changing the extension of the two files (perf and perf2) to .js and updating the .html file accordingly.
Best.
Hey, the guide is not working at all today. I checked Chrome and Edge. The developer console shows the following error:
Refused to execute script from 'https://software.intel.com/sites/landingpage/IntrinsicsGuide/files/perf.json' because its MIME type ('application/json') is not executable, and strict MIME type checking is enabled.