andysem
New Contributor III

Bugs in Intrinsics Guide

Hi,

I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using Linux version):

1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.

2. __m256 _mm256_undefined_si256 () should return __m256i.

3. In some instruction descriptions, e.g. _mm_adds_epi8, the operation is described in terms of SignedSaturate, while e.g. _mm256_adds_epi16 is described with SaturateToSignedWord. The same applies to other operations with unsigned saturation. The vector elements are also described differently. A more consistent wording would be nice.

4. _mm_alignr_epi8 has two descriptions.

5. I'm not sure the _mm_ceil_pd signature and description are correct. They say the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?
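
For reference, a minimal sketch of what I'd expect the corrected signature and behaviour to be (my own example based on the SSE4.1 ROUNDPD semantics, not taken from the guide):

#include <smmintrin.h>

__m128d ceil_example(void)
{
    __m128d v = _mm_set_pd(1.25, -2.75);   /* lanes, low to high: { -2.75, 1.25 } */
    return _mm_ceil_pd(v);                 /* expected: { -2.0, 2.0 }, double-precision in and out */
}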

I haven't read through all the instructions, so there may be more issues. I'll post if I find anything else.

PS: This is not a bug per se, but some instructions are missing the Latency & Throughput information. This mostly affects newer instructions, but the info is still useful and I hope it will be added.

andysem
New Contributor III

By the way, are there any updates planned for the Intrinsics Guide? There have been a number of bug reports, and the performance info for Skylake is still missing.

 

Patrick_K_Intel
Employee

Thanks for the feedback. I've posted an update that addresses all the reported issues. This does not include performance info for Skylake, although I may add that in the future.

andysem
New Contributor III

Each of the _mm_storeu_si16/si32/si64 intrinsics is listed twice, in some cases with slightly different instructions.

Patrick_K_Intel
Employee

I have posted an update that includes updated latency/throughput data. This removes the pre-Sandy Bridge data and adds Broadwell, Skylake, and Knights Landing.

andysem
New Contributor III

Thank you Patrick, although I think the removal of Sandy Bridge and Nehalem is a bit premature. Those CPUs are still relevant.

Jakob__Wenzel
Beginner

I believe that the "_MM_CMPINT_NEQ" constant listed in various integer comparison operations should read _MM_CMPINT_NE. (At least this is what GCC, Clang, etc. implement)
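
A minimal sketch of the spelling that GCC and Clang accept (my own example, assuming AVX-512F headers):

#include <immintrin.h>

__mmask16 not_equal(__m512i a, __m512i b)
{
    /* GCC/Clang define _MM_CMPINT_NE; _MM_CMPINT_NEQ does not compile there */
    return _mm512_cmp_epi32_mask(a, b, _MM_CMPINT_NE);
}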

Steve_W_
Beginner

The guide significantly mislabels throughput in every intrinsic that lists it. Specifically, when the guide gives a throughput value, it is actually reporting reciprocal throughput. This is misreported consistently throughout the guide.

For example, the guide reports Skylake as having lower throughput for pmulhuw than Haswell or Broadwell. It's the opposite: Skylake's throughput is higher than the older architectures'. This mislabelling is repeated for about 100 other intrinsics.

Reporting reciprocal throughput is a good idea, since those values can be compared more easily with latency cycle counts. But the labels throughout the guide need to be updated to say "reciprocal throughput." I was even reorganizing my AVX code to minimize use of these apparently lower-throughput intrinsics!

Luckily I noticed the mismatch with Agner Fog's independent tables.

andysem
New Contributor III

Steve W. wrote:

Specifically, when the guide gives a throughput value, it is actually reporting reciprocal throughput.

This is a common convention in most x86-related materials; I suppose it's for historical reasons. See the "Intel® 64 and IA-32 Architectures Software Developer’s Manual", for example: in quite a few places it uses the "n-cycle throughput" wording, which actually means reciprocal throughput. Here's another resource: http://x86.renejeschke.de/html/file_module_x86_id_244.html. It also uses the term throughput to describe clock cycles.

I agree that the term is confusing and was probably a poor choice to begin with, but for anyone familiar with the domain it should be understandable.
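
To make the convention concrete (my own illustration, not from the guide): if the table lists a "throughput" of 0.5 for an instruction, a new instance can start every 0.5 cycles, so the actual throughput is 1 / 0.5 = 2 instructions per cycle; smaller listed values mean faster instructions.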

 

Jeffrey_H_Intel
Employee

_mm512_stream_si512 should have "Instruction: vmovntdqa m512, zmm" not "Instruction: vmovntdqa zmm, m512".  The destination is first and the source is second.  This instruction stores to memory from a register.

Edit: it should instead be  "Instruction: vmovntdq m512, zmm".  VMOVNTDQA is a load and VMOVNTDQ is a store.
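
To illustrate the direction of the operation (a store from a register to memory), a minimal sketch, assuming AVX-512F:

#include <immintrin.h>

void stream_store(__m512i *dst, __m512i v)
{
    /* Non-temporal store: zmm register -> memory, i.e. vmovntdq m512, zmm */
    _mm512_stream_si512(dst, v);
}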

Kyle_S_
Beginner

_mm_countbits_64 description says input is 32 bits, but should say 64 bits.

James_C_Intel2
Employee

The description of _xbegin in the intrinsics guide does not say what the result of the intrinsic is.

I believe that it returns -1 if execution is continuing inside a transaction, and the value of EAX on an abort, but none of that information is given here, and it should be!
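
For what it's worth, here is a minimal sketch of the pattern I have in mind (my own example; _XBEGIN_STARTED is defined as ~0u, i.e. the -1 mentioned above):

#include <immintrin.h>

void transaction_example(void)
{
    unsigned int status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        /* We are executing inside the transaction. */
        _xend();
    } else {
        /* The transaction aborted; status holds the abort code from EAX. */
    }
}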

Henk-Jan_L_
Beginner

I'm confused by the operation of _mm512_i32extscatter_epi32.

If you use conv=_MM_DOWNCONV_EPI32_NONE and hint=_MM_HINT_NONE, this intrinsic should be equivalent to _mm512_i32scatter_epi32.

For example, when using _MM_DOWNCONV_EPI32_UINT8, take j=15: then i=480, n=120, and addr[127:120] := UInt32ToUInt8(v1[511:480]). Are we really using 128-bit addresses? The operation of _mm512_i32scatter_epi32 makes a lot more sense. See below.

Can someone please explain how the operation of the _mm512_i32extscatter_epi32 should be read?

Regards Henk-Jan.

---
void _mm512_i32extscatter_epi32 (void * mv, __m512i index, __m512i v1, _MM_DOWNCONV_EPI32_ENUM conv, int scale, int hint)
Operation:

FOR j := 0 to 15
    addr := MEM[mv + index * scale]
    i := j*32
    CASE conv OF 
        _MM_DOWNCONV_EPI32_NONE: 
            addr[i+31:i] := v1[i+31:i]
        _MM_DOWNCONV_EPI32_UINT8: 
            n := j*8 
            addr[n+7:n] := UInt32ToUInt8(v1[i+31:i])
        _MM_DOWNCONV_EPI32_SINT8:
            n := j*8
            addr[n+7:n] := SInt32ToSInt8(v1[i+31:i])
        _MM_DOWNCONV_EPI32_UINT16:
            n := j*16 
            addr[n+15:n] := UInt32ToUInt16(v1[i+31:i]) 
        _MM_DOWNCONV_EPI32_SINT16: 
            n := j*16 
            addr[n+15:n] := SInt32ToSInt16(v1[i+31:i])
    ESAC 
ENDFOR 

---
void _mm512_i32scatter_epi32 (void* base_addr, __m512i vindex, __m512i a, int scale)
Operation: 

FOR j := 0 to 15
    i := j*32 
    MEM[base_addr + SignExtend(vindex[i+31:i])*scale] := a[i+31:i] 
ENDFOR
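
For comparison, here is a scalar C paraphrase of the plain _mm512_i32scatter_epi32 operation above (my own restatement, just to show the baseline that seems intuitive):

#include <stdint.h>

/* Each 32-bit lane of 'a' is stored to base_addr + sign_extend(vindex lane) * scale. */
void i32scatter_epi32_ref(void *base_addr, const int32_t vindex[16],
                          const int32_t a[16], int scale)
{
    for (int j = 0; j < 16; ++j) {
        char *p = (char *)base_addr + (int64_t)vindex[j] * scale;
        *(int32_t *)p = a[j];
    }
}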

 

Gregg_S_Intel
Employee

Selecting the AVX512_4FMAPS instruction set, one intrinsic is missing: _mm512_4fmadd_ps.

That intrinsic is in the tool; it is just missing from the filtered list AVX512_4FMAPS.

 

Eden_S_Intel
Employee

The entire set of instructions that operate on masks, such as kmov, kshift, kand and so on, is absent from the guide. I've had to use them several times, and their absence makes that harder.

andysem
New Contributor III

Eden S. (Intel) wrote:

The entire set of instructions that operate on masks, such as kmov, kshift, kand and so on, is absent from the guide.

The guide does describe some intrinsics like _mm512_kmov, _mm512_kand, etc., but they mostly deal with 16-bit masks and are not split out into a separate category, which would be useful for searching.
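
For example, the 16-bit forms that are present can be used like this (a minimal sketch, assuming AVX-512F):

#include <immintrin.h>

__mmask16 both_equal(__m512i a, __m512i b, __m512i c)
{
    __mmask16 k1 = _mm512_cmpeq_epi32_mask(a, b);
    __mmask16 k2 = _mm512_cmpeq_epi32_mask(a, c);
    return _mm512_kand(k1, k2);   /* 16-bit mask AND */
}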

 

Zvi_D_Intel
Employee

In this guide I see the description of __m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index).

The MS compiler (Visual Studio 2015) doesn't recognize such an intrinsic.
Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4 (Order Number: 325462-063US, July 2017) doesn't have any information on this intrinsic either.
What does this mean - do we have such an operation or not?
 

andysem
New Contributor III

Zvi Danovich (Intel) wrote:

In this guide I see the description of __m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index).

The MS compiler (Visual Studio 2015) doesn't recognize such an intrinsic.
Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4 (Order Number: 325462-063US, July 2017) doesn't have any information on this intrinsic either.
What does this mean - do we have such an operation or not?

gcc 7.2 does have this intrinsic. It translates into not just one instruction but a sequence of them, which is why it's not present in the SDM. Also, it is only available in 64-bit mode. My guess is that the MSVC version you use lacks this intrinsic, or you're compiling for 32-bit mode.
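
A minimal sketch of how it's used with gcc/clang (my own example, assuming a 64-bit target with AVX enabled):

#include <immintrin.h>

__m256i set_third_lane(__m256i v, long long x)
{
    /* Only available in 64-bit mode; the compiler emits an insert/blend
       sequence rather than a single instruction. */
    return _mm256_insert_epi64(v, x, 2);
}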

 

Kearney__Jim
Beginner

Do all the _mask_.._mask operations have switched mask names in their descriptions?  In these, there is an input mask k1 and a result mask (the unnamed return value, but called k in the Description and Operation).  It is written that they "store the results in mask vector k1 using zeromask k", which seems to me the opposite of their actual role.  For example:

__mmask8 _mm_mask_cmpeq_epi32_mask (__mmask8 k1, __m128i a, __m128i b)

...

Description

Compare packed 32-bit integers in a and b for equality, and store the results in mask vector k1 using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

Operation

FOR j := 0 to 3
  i := j*32
  IF k1[j]
    k[j] := ( a[i+31:i] == b[i+31:i] ) ? 1 : 0
  ELSE
    k[j] := 0
  FI
ENDFOR
k[MAX:4] := 0
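
For reference, here is how I read the intended semantics, with the mask roles the other way around (my own scalar paraphrase, not the guide's wording):

#include <immintrin.h>
#include <stdint.h>

/* k1 is the input mask; the return value carries the comparison results. */
__mmask8 mask_cmpeq_epi32_ref(__mmask8 k1, const int32_t a[4], const int32_t b[4])
{
    __mmask8 k = 0;
    for (int j = 0; j < 4; ++j) {
        if ((k1 >> j) & 1)
            k |= (__mmask8)((a[j] == b[j]) ? 1 : 0) << j;
        /* otherwise the result bit stays 0 */
    }
    return k;
}
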
Matthias_Kretz
New Contributor I

There's a bug either in ICC or the documentation. Consider https://godbolt.org/g/LYJjM2. The documentation for _mm_mask_mov_ps says "dst[MAX:128] := 0". The comments in the test case expect this behavior. However, the compiler translates it to a no-op. It is certainly correct that the vmovaps instruction has the zeroing behavior, though. But as the Compiler Explorer example shows, not every call to _mm_mask_mov_ps leads to a vmovaps instruction in the resulting binary.

This issue needs to be resolved for all variants of *_mask_mov_*.

andysem
New Contributor III

@Matthias Kretz

Your link doesn't show the problem. I've forced the compiler to generate code here: https://godbolt.org/g/7kgQtM. Note that gcc produces similar code. I tend to think that the compilers do not guarantee zeroing the upper bits of the vector register, leaving them "undefined". I think what you want is this: https://godbolt.org/g/5gvTVn.

 
