Description

andysem · ‎01-30-2013

Hi,

I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using Linux version):

1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It sould probably be sized accordingly.

2. __m256 _mm256_undefined_si256 () should return __m256i.

3. In some instructions description, like _mm_adds_epi8, the operation is described in terms of SignedSaturate while, e.g. _mm256_adds_epi16 is described with SaturateToSignedWord. This applies to other operations with unsigned saturation as well. Also, the vector elements are described differently. More consistent description would be nice.

4. _mm_alignr_epi8 has two descriptions.

5. I'm not sure _mm_ceil_pd signature and description is correct. It says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?

I didn't read all instructions so there may be more issues. I'll post if I find anything else.

PS: This is not a bug per se but some instructions are missing the Latency & Throughput information. This mostly relates to newer instructions but still this info is useful and I hope it will be added.

andysem · ‎03-03-2017

Specifically, when the guide gives a throughput value, it is actually reporting reciprocal throughput.

This is a common convention in most x86-related materials. I suppose, that's for historical reasons. You can see "Intel® 64 and IA-32 Architectures Software Developer’s Manual" for example, in quite a few places it uses the "n-cycle throughput" wording, which actually implies reciprocal throughput. Here's another resource: http://x86.renejeschke.de/html/file_module_x86_id_244.html. You can see it also uses the term throughput to describe the clock cycles.

I agree that the term is confusing and probably poorly chosen in the beginning. But for anyone familiar with the domain it should be understandable.

Jeffrey_H_Intel · ‎05-11-2017

_mm512_stream_si512 should have "Instruction: vmovntdqa m512, zmm" not "Instruction: vmovntdqa zmm, m512". The destination is first and the source is second. This instruction stores to memory from a register.

Edit: it should instead be "Instruction: vmovntdq m512, zmm". VMOVNTDQA is a load and VMOVNTDQ is a store.

Kyle_S_ · ‎08-16-2017

_mm_countbits_64 description says input is 32 bits, but should say 64 bits.

James_C_Intel2 · ‎08-22-2017

The description of _xbegin in the intrinsics guide does not say what the result of the intrinsic is.

I believe that it returns -1 if execution is continuing inside a transaction, and the value of EAX on an abort, but none of that information is given here, and it should be!

Henk-Jan_L_ · ‎09-07-2017

I'm confused by the operation of _mm512_i32extscatter_epi32.

If you use conv=_MM_DOWNCONV_EPI32_NONE and hint=_MM_HINT_NONE this intrinsic should be equal to _mm512_i32scatter_epi32.

For example, when using _MM_DOWNCONV_EPI32_UINT8, take j=15, then i=480 and n=120, and addr[127:120]:=UInt32ToUInt8(v1[511:480]), Are we really using 128-bit addresses? The operation of _mm512_i32scatter_epi32 does make a lot more sense. See below.

Can someone please explain how the operation of the _mm512_i32extscatter_epi32 should be read?

Regards Henk-Jan.

---
void _mm512_i32extscatter_epi32 (void * mv, __m512i index, __m512i v1, _MM_DOWNCONV_EPI32_ENUM conv, int scale, int hint)
Operation:

FOR j := 0 to 15
    addr := MEM[mv + index * scale]
    i := j*32
    CASE conv OF 
        _MM_DOWNCONV_EPI32_NONE: 
            addr[i+31:i] := v1[i+31:i]
        _MM_DOWNCONV_EPI32_UINT8: 
            n := j*8 
            addr[n+7:n] := UInt32ToUInt8(v1[i+31:i])
        _MM_DOWNCONV_EPI32_SINT8:
            n := j*8
            addr[n+7:n] := SInt32ToSInt8(v1[i+31:i])
        _MM_DOWNCONV_EPI32_UINT16:
            n := j*16 
            addr[n+15:n] := UInt32ToUInt16(v1[i+31:i]) 
        _MM_DOWNCONV_EPI32_SINT16: 
            n := j*16 
            addr[n+15:n] := SInt32ToSInt16(v1[n+15:n]) 
    ESAC 
ENDFOR

---
void _mm512_i32scatter_epi32 (void* base_addr, __m512i vindex, __m512i a, int scale)
Operation:

FOR j := 0 to 15
    i := j*32 
    MEM[base_addr + SignExtend(vindex[i+31:i])*scale] := a[i+31:i] 
ENDFOR

Gregg_S_Intel · ‎09-21-2017

Selecting AVX512_4FMAPS instruction set, one intrinsic is missing: _mm512_4fmadd_ps

That intrinsic is in the tool; it is just missing from the filtered list AVX512_4FMAPS.

Eden_S_Intel · ‎11-16-2017

The entire instruction sets which uses masks as input, such as kmov, kshift, kand and so on is absent from the guide. I had to use those several times and it's harder with their absence.

andysem · ‎11-17-2017

Eden S. (Intel) wrote:

The entire instruction sets which uses masks as input, such as kmov, kshift, kand and so on is absent from the guide.

The guide does describe some intrinsics like _mm512_kmov, _mm512_kand, etc. but they mostly deal with 16-bit masks and are not extracted to a separate category, which would be useful for searching.

Zvi_D_Intel · ‎12-19-2017

In this guide I see the description of __m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index).

The MS compiler (Visual Studio 2015) doesn't identify such an intrinsic.
Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4 (Order Number: 325462-063US, July 2017) doesn't have any information re this intrinsic name as well.
Whats does it mean - do we have such operation or not ?

andysem · ‎12-19-2017

Zvi Danovich (Intel) wrote:

In this guide I see the description of __m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index).

The MS compiler (Visual Studio 2015) doesn't identify such an intrinsic.
Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4 (Order Number: 325462-063US, July 2017) doesn't have any information re this intrinsic name as well.
Whats does it mean - do we have such operation or not ?

gcc 7.2 does have this intrinsic. The intrinsic translates into not just one instruction but a sequence of them, that's why it's not present in the SDM. Also, it is only present in 64-bit mode. My guess is that the MSVC version you use lacks that intrinsic or you're compiling for 32-bit mode.

Kearney__Jim · ‎03-05-2018

Do all the _mask_.._mask operations have switched mask names in their descriptions? In these, there is an input mask k1 and a result mask (the unnamed return value, but called k in the Description and Operation). It is written that they "store the results in mask vector k1 using zeromask k", which seems to me the opposite of their actual role. For example:

__mmask8 _mm_mask_cmpeq_epi32_mask (__mmask8 k1, __m128i a, __m128i b)

...

Description

Compare packed 32-bit integers in a and b for equality, and store the results in mask vector k1 using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

Operation

FOR j := 0 to 3

i := j*32

IF k1

k := ( a[i+31:i] == b[i+31:i] ) ? 1 : 0

ELSE

k := 0

FI

ENDFOR

k[MAX:4] := 0

Matthias_Kretz · ‎04-20-2018

There's a bug either in ICC or the documentation. Consider https://godbolt.org/g/LYJjM2. The documentation for _mm_mask_mov_ps says "dst[MAX:128] := 0". The comments in the test case expect this behavior. However, the compiler translates it to a no-op. It is certainly correct that the vmovaps instruction has the zeroing behavior, though. But as the Compiler Explorer example shows, not every call to _mm_mask_mov_ps leads to a vmovaps instruction in the resulting binary.

This issue needs to be resolved for all variants of *_mask_mov_*.

andysem · ‎04-20-2018

@Matthias Kretz

Your link doesn't show the problem. I've forced the compiler to generate code here: https://godbolt.org/g/7kgQtM. Note that gcc also produces the similar code. I tend to think that the compilers do not guarantee zeroing the upper bits of the vector register, leaving them "undefined". I think, what you want to do is this: https://godbolt.org/g/5gvTVn.

Matthias_Kretz · ‎04-20-2018

@andysem No, it seems you misunderstood the problem. What I want is https://godbolt.org/g/YCB3iD .

Well, to be clear. I don't want _mm_mask_mov_ps to always expand to vmovaps. I believe the documentation in the intrinsics guide is incorrect. Which is why I posted it on this thread and not as a compiler bug.

andysem · ‎04-20-2018

@Matthias Kretz

I think I understood you correctly. The vmovaps instruction is generated in the second link I gave, on Intel compiler. Gcc fails to recognize that the zero masking can be optimized to a simple "vmovaps %xmm1, %xmm1" and instead generates "vmovaps %zmm0, %zmm0{%k1}{z}", but that is a matter of QoI. The effect is still the same.

My main point is that I believe when you cast to a smaller vector (e.g. zmm to xmm), perform operations on that smaller part and then cast back, you shouldn't assume any particular contents in the higher bits. The compiler does not guarantee which CPU instructions it generates from intrinsics because it can use more context from the surrounding code and optimize better. For example, the vectorizer could optimize your loop and still use the full-width vectors. The Intrinsics Guide only gives a rough idea of which instructions might be involved, but you shouldn't expect it to be followed literally.

If you want particular content in the upper bits of a vector register, you should write the code so that it does operate on those bits and fill them accordingly. In your case it means operating on __m512 and not __m128. Or you could use inline assembler, of course, but that often precludes the compiler from doing optimizations that would be otherwise possible.

Matthias_Kretz · ‎04-20-2018

andysem wrote:
I think I understood you correctly.

That's a bold statement to make. You know better what I wanted to do than I do. :-)

Let me try again. I want to report that the documentation in the Intel Intrinsics Guide does not match the behavior ICC (and GCC) produces. That's all I wanted to do here in this forum. I was not looking for a solution to a problem you're trying to guess that I have. ;-)

I have solved my problem already. While solving it, I noticed the last line in the _mm_mask_mov_ps pseudo code, and thought that looks only 99% correct. Let's write an example that breaks it. And that's how this post happened.

andysem · ‎04-21-2018

I'm just saying that your original example is not correct, in my opinion. And that compiler behavior wrt. instructions choice for pretty much any intrinsic can be different from what is said in the Intrinsics Guide. I don't see that as an error in the Intrinsics Guide or in the compiler. That's all.

Matthias_Kretz · ‎04-23-2018

andysem wrote:
I'm just saying that your original example is not correct, in my opinion.

I believe ICC (and GCC) optimize correctly, in the Compiler Explorer example I provided. IIUC, you believe the same. I.e. you believe my code is wrong. My point all along, is not that the code is supposed to achieve anything other than contradict the documentation. Not to achieve the effect I wrote in the comments of the code. I didn't want to speak for the actual intent of Intel, when they designed the intrinsics. Which is why I left it open for them to decide whether they believe ICC is at fault here.

andysem wrote:
And that compiler behavior wrt. instructions choice for pretty much any intrinsic can be different from what is said in the Intrinsics Guide. I don't see that as an error in the Intrinsics Guide or in the compiler. That's all.

The Intrinsics Guide documents a specific "Operation" for the _mm_mask_mov_ps intrinsic. I believe what the Guide should say, is that this "Operation" is what vmovaps does. I.e. the logical "Operation" of _mm_mask_mov_ps is the "Operation" of vmovaps modulo zeroing of the high bits. Because the compiler should be free to replace the logical operation on the low bits with something equivalent that maybe doesn't zero the high bits.

Arthur_A_Intel · ‎06-25-2018

Hi I apologize if this was already requested, but is it possible to add the information about this instruction VPDPBUSD? You can find information already in the ISA extension manual: https://software.intel.com/content/dam/develop/external/us/en/documents-tps/architecture-instruction-set-extensions-programming-reference.pdf

Intel C/C++ Compiler Intrinsic Equivalent

VPDPBUSD __m128i _mm_dpbusd_epi32(__m128i, __m128i, __m128i);
VPDPBUSD __m128i _mm_mask_dpbusd_epi32(__m128i, __mmask8, __m128i, __m128i);
VPDPBUSD __m128i _mm_maskz_dpbusd_epi32(__mmask8, __m128i, __m128i, __m128i);
VPDPBUSD __m256i _mm256_dpbusd_epi32(__m256i, __m256i, __m256i);
VPDPBUSD __m256i _mm256_mask_dpbusd_epi32(__m256i, __mmask8, __m256i, __m256i);
VPDPBUSD __m256i _mm256_maskz_dpbusd_epi32(__mmask8, __m256i, __m256i, __m256i);
VPDPBUSD __m512i _mm512_dpbusd_epi32(__m512i, __m512i, __m512i);
VPDPBUSD __m512i _mm512_mask_dpbusd_epi32(__m512i, __mmask16, __m512i, __m512i);
VPDPBUSD __m512i _mm512_maskz_dpbusd_epi32(__mmask16, __m512i, __m512i, __m512i);

VPDPBUSDS __m128i _mm_dpbusds_epi32(__m128i, __m128i, __m128i);
VPDPBUSDS __m128i _mm_mask_dpbusds_epi32(__m128i, __mmask8, __m128i, __m128i);
VPDPBUSDS __m128i _mm_maskz_dpbusds_epi32(__mmask8, __m128i, __m128i, __m128i);
VPDPBUSDS __m256i _mm256_dpbusds_epi32(__m256i, __m256i, __m256i);
VPDPBUSDS __m256i _mm256_mask_dpbusds_epi32(__m256i, __mmask8, __m256i, __m256i);
VPDPBUSDS __m256i _mm256_maskz_dpbusds_epi32(__mmask8, __m256i, __m256i, __m256i);
VPDPBUSDS __m512i _mm512_dpbusds_epi32(__m512i, __m512i, __m512i);
VPDPBUSDS __m512i _mm512_mask_dpbusds_epi32(__m512i, __mmask16, __m512i, __m512i);
VPDPBUSDS __m512i _mm512_maskz_dpbusds_epi32(__mmask16, __m512i, __m512i, __m512i);

Thank you in advance.

Jin__Wz · ‎07-26-2018

Question about _BitScanForward, and its friends:

The operations described in Intrinsics Guide:

unsigned char _BitScanForward (unsigned __int32* index, unsigned __int32 mask)
tmp := 0
IF mask = 0
	dst := 0
ELSE
	DO WHILE ((tmp < 32) AND mask[tmp] = 0)
		tmp := tmp + 1
		index := tmp
		dst := 1
	OD
FI

Although not very clear, It seems to me that if mask==0, then *index will be left unchanged.

In the newest Intel® C++ Compiler 18.0 Developer Guide and Reference (also in version 15.0~17.0), it's described very clear:

unsigned char _BitScanForward(unsigned __int32 *p, unsigned __int32 b);
Sets *p to the bit index of the least significant set bit of b or leaves it unchanged if b is zero. The function returns a non-zero result when b is non-zero and returns zero when b is zero.

So this behavior is well defined (according to the documents).

However when I compile the following test code (using icc 18.0, option -O2 for brevity):

unsigned __int32 trailing_zeros(unsigned __int32 x)
{
	unsigned __int32 index = 32;
	_BitScanForward(&index, x);
	return index;
}

It's compiled into:

trailing_zeros(unsigned int):
  bsf eax, edi
  ret

Where is the initial value 32? If x==0, then eax is undefined, but it's returned directly!

Jin__Wz · ‎07-26-2018

I'm also very curious why _BitScanForward is documented like this.

As far as I know, the BSF instruction is still documented as "return undefined value when source is 0", so why made _BitScanForward different?

Bugs in Intrinsics Guide

...

Description

Operation