
andysem · ‎01-30-2013

Hi,

I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using Linux version):

1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.

2. __m256 _mm256_undefined_si256 () should return __m256i.

3. In some instruction descriptions, like _mm_adds_epi8, the operation is described in terms of SignedSaturate, while others, e.g. _mm256_adds_epi16, are described with SaturateToSignedWord. This applies to other operations with unsigned saturation as well. Also, the vector elements are described differently. A more consistent description would be nice.

4. _mm_alignr_epi8 has two descriptions.

5. I'm not sure the _mm_ceil_pd signature and description are correct. The guide says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?

I didn't read all instructions so there may be more issues. I'll post if I find anything else.

PS: This is not a bug per se but some instructions are missing the Latency & Throughput information. This mostly relates to newer instructions but still this info is useful and I hope it will be added.

andysem · ‎07-20-2015

Not all k-instructions have C/C++ equivalents, like kandn. You could try to emulate it with ~ and & operators and hope the compiler is smart enough to recognize the pattern, but I would prefer to have an intrinsic for that.

8/32/64-bit k-registers are essential for AVX-512BW/DQ, they should be properly supported in compilers, IMHO.
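
To make the "~ and &" emulation mentioned above concrete, here is a minimal sketch. It assumes a plain 16-bit integer as a stand-in for the compiler's mask type, and the helper name kandn16_emulated is made up for illustration; it is not an existing intrinsic. Whether a compiler actually folds this into a single kandn is not guaranteed.

```c
#include <stdint.h>

/* Stand-in for a 16-bit mask type such as __mmask16 (assumption). */
typedef uint16_t mask16;

/* Hypothetical helper: models the KANDN operation, (NOT a) AND b,
   on a 16-bit mask using ordinary C operators. */
static inline mask16 kandn16_emulated(mask16 a, mask16 b)
{
    /* ~a is computed on a promoted int, so cast the result back to 16 bits. */
    return (mask16)(~a & b);
}
```

For example, kandn16_emulated(0x00FF, 0xFFFF) keeps only the bits of the second operand that are clear in the first.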

Arthur_A_Intel · ‎07-24-2015

The _mm_fmaddsub_ps and _mm_fmsubadd_ps functions have the same content in the operation section yet different descriptions. Is that what it is supposed to be?

Thanks.

Patrick_K_Intel · ‎07-27-2015

Arthur,

You're correct, it looks like the fmsubadd operations were incorrect. I have corrected them, the update should appear shortly.
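
For reference, the difference between the two operations can be sketched as a scalar model. The function names are made up for illustration, and the model rounds the product separately, whereas the real instructions are fused (single rounding), so treat it as a description of the add/subtract pattern only: fmaddsub subtracts in even-indexed elements and adds in odd-indexed ones, and fmsubadd does the opposite.

```c
#include <stddef.h>

/* Hypothetical scalar model of the fmaddsub pattern:
   even elements compute a*b - c, odd elements compute a*b + c. */
static void fmaddsub_model(const float *a, const float *b, const float *c,
                           float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float p = a[i] * b[i];
        dst[i] = (i % 2 == 0) ? p - c[i] : p + c[i];
    }
}

/* Hypothetical scalar model of the fmsubadd pattern:
   even elements compute a*b + c, odd elements compute a*b - c. */
static void fmsubadd_model(const float *a, const float *b, const float *c,
                           float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float p = a[i] * b[i];
        dst[i] = (i % 2 == 0) ? p + c[i] : p - c[i];
    }
}
```

With identical operation pseudocode, the two intrinsics would be indistinguishable, which is why the descriptions and operations have to differ in exactly this alternation.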

Carlos_C_2 · ‎08-20-2015

When are you planning to add instruction latencies for Broadwell? Thanks!

Peter_Cordes · ‎08-28-2015

The intrinsics guide description of _mm_testc_si128 says it NOTs the 2nd operand, but it's wrong. Assuming the intrinsic is supposed to take its args in the same order as the PTEST instruction, then the docs are wrong and the actual behaviour is correct.

This was discovered during discussion at http://stackoverflow.com/questions/32072169/could-i-compare-to-zero-register-in-avx-correctly/32073056?noredirect=1#comment52132842_32073056

The Intel insn ref manual says:

(V)PTEST (128-bit version)

	IF (SRC[127:0] BITWISE AND DEST[127:0] = 0)
	    THEN ZF ← 1;
	    ELSE ZF ← 0;
	IF (SRC[127:0] BITWISE AND NOT DEST[127:0] = 0)
	    THEN CF ← 1;
	    ELSE CF ← 0;

Here DEST is the first argument. This pseudocode is correct.

The intrinsics guide description for int _mm_testc_si128 (__m128i a, __m128i b) says:

    IF (a[127:0] AND NOT b[127:0] == 0)
      CF := 1
    ELSE
        CF := 0
    FI

This is the reverse of the instruction generated from the intrinsic. I guess the doc writers got mixed up by the ref manual putting NOT DEST second. The other ptest intrinsics (like _mm_test_mix_ones_zeros) have the same bug in their descriptions.
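
The mismatch can be shown with a small scalar model. This is only a sketch: the vec128 struct is a made-up stand-in for __m128i, the function names are hypothetical, and it assumes the intrinsic's first argument maps to DEST and the second to SRC, as discussed above.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit register: two 64-bit halves. */
typedef struct { uint64_t lo, hi; } vec128;

/* ZF per the instruction reference: set when (SRC AND DEST) is all zero. */
static int ptest_zf(vec128 dest, vec128 src)
{
    return ((src.lo & dest.lo) | (src.hi & dest.hi)) == 0;
}

/* CF per the instruction reference: set when (SRC AND NOT DEST) is all zero. */
static int ptest_cf(vec128 dest, vec128 src)
{
    return ((src.lo & ~dest.lo) | (src.hi & ~dest.hi)) == 0;
}

/* Model of _mm_testc_si128(a, b), assuming a is DEST and b is SRC. */
static int testc_model(vec128 a, vec128 b)
{
    return ptest_cf(a, b);
}

/* The guide's wording read literally: (a AND (NOT b)) == 0. */
static int testc_as_documented(vec128 a, vec128 b)
{
    return ((a.lo & ~b.lo) | (a.hi & ~b.hi)) == 0;
}
```

With a set to all ones and b set to 0x0F in the low half, testc_model returns 1 (every bit set in b is also set in a), while the literal reading of the guide's pseudocode returns 0.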

andysem · ‎08-30-2015

In the description of the _mm_multishift_epi64_epi8 intrinsic there is this line:

tmp8 := b[q+((ctrl+k) & 63)]

However, k is not mentioned anywhere else. I believe it should be l.
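
To illustrate why the index has to be the inner loop variable, here is a scalar sketch of one 64-bit lane, written with l in place of k. The function name is made up, the lane base q is taken as 0, and it assumes a holds the control bytes and b the data, with each output byte gathering 8 consecutive bits of b (wrapping modulo 64) starting at the position given by the low 6 bits of the control byte.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 64-bit lane of the multishift operation.
   a: eight control bytes, b: 64 bits of source data. */
static uint64_t multishift_lane_model(uint64_t a, uint64_t b)
{
    uint64_t dst = 0;
    for (int j = 0; j < 8; j++) {
        /* Low 6 bits of control byte j select the starting bit in b. */
        unsigned ctrl = (unsigned)(a >> (8 * j)) & 63u;
        uint64_t tmp8 = 0;
        for (unsigned l = 0; l < 8; l++) {
            /* tmp8[l] := b[(ctrl + l) & 63] */
            uint64_t bit = (b >> ((ctrl + l) & 63u)) & 1u;
            tmp8 |= bit << l;
        }
        dst |= tmp8 << (8 * j);
    }
    return dst;
}
```

With control bytes 0, 8, 16, ..., 56 the lane is reproduced unchanged, and with all control bytes 0 every output byte is a copy of the lowest byte of b.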

andysem · ‎09-13-2015

The _mm_test_all_ones intrinsic description is not accurate. It says it performs a complement (i.e. XOR) of the xmm and 0xFFFFFFFF (which is a 32-bit constant, presumably). The actual behavior is different and amounts to returning 1 if the provided xmm contains 1 in all bits and 0 otherwise.

CFR · ‎09-26-2015

I think the following (minor) errors exist in the guide (3.3.11):

1) _mm512_sad_epu8: says it produces "four unsigned 16-bit integers". I think that should say "eight unsigned 16-bit integers".

2) _mm256_mpsadbw_epu8: i := imm8[2]*32 should be a_offset:=imm8[2]*32

CFR · ‎09-26-2015

Let me say up front that I can't imagine doing low level code without the Intrinsic Guide, but there are a few enhancements I'd like to see (not in any particular order):

1) more complete information on latency and throughput

2) some sort of separate file containing the intrinsic, instruction family (AVX, SSE4.1, AVX-512VBMI, etc...) and the latency and throughput. If I had such information I could more easily take my source code and annotate it with the information for each intrinsic I use; thus helping me keep track of architecture dependencies and performance.

3) a version of the guide that I could download and use locally when I don't have good network connection.

4) some sort of relational search or maybe regex search (e.g. "__m512i and add", or ^__m512i[.]*_add_). Mostly I just want to be able to narrow down the results. Examples are when I only care about a particular operation size (__m256 adds), or when the information I'm searching for is split between the intrinsic name and the description text (__m512i permutes that work across lanes using 32-bit integers).

Just some thoughts.

Patrick_K_Intel · ‎09-28-2015

Thanks for reporting these issues. I've resolved several of them, and the new update should be posted shortly.

I'll also get started on a larger update to add some additional features that have been requested, both publicly and internally.

andysem · ‎09-29-2015

Patrick, the new description of _mm_test_all_ones is still incorrect. The pseudocode contains:

IF (a[127:0] AND NOT 0xFFFFFFFF == 0)

First, instead of 0xFFFFFFFF, which is a 32-bit constant, tmp should be used. Second, this condition will always return true regardless of the value of a. The correct condition should be:

IF (tmp[127:0] AND NOT a[127:0] == 0)

Andre_M_1 · ‎10-23-2015

Hi, intrinsic functions for vmovsd are listed under AVX-512, although it is actually an AVX instruction. Best, André

Jingwei_Z_Intel · ‎10-28-2015

The description of _mm512_fmadd233_ps on KNC seems to be wrong. It does not match the description in the EAS.

EUGENE_K_Intel · ‎10-29-2015

Descriptions for _mm_test_all_zeros, _mm_test_mix_ones_zeros, and all testc intrinsics are still bad (they are not technically wrong, but they are badly misleading.)

As far as I can tell, they all assume that "a AND NOT b" means "(~a) & b", because that's the way e.g. _mm_andnot_si128 works. But that's not a natural reading of the phrase and that's not how 99% of people would understand it. Probably best to spell this out explicitly in each case.

This also relates to andysem's comment above. "a AND NOT 0xFFFFFFFF" always evaluates to zero under natural interpretation of "AND NOT". Instead _mm_test_all_ones computes "IF ((NOT a) AND 0xFFFFFFFF == 0)" (or, in other words, simply IF ((NOT a) == 0) ).
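
The two readings can be compared with a short scalar sketch. The bits128 struct and both function names are made up for illustration, and the 128-bit all-ones constant stands in for the tmp value mentioned earlier in the thread.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit register: two 64-bit halves. */
typedef struct { uint64_t lo, hi; } bits128;

static const bits128 all_ones = { UINT64_MAX, UINT64_MAX };

/* Natural reading of "a AND NOT ones": a & (~ones).
   ~ones is zero, so the condition holds for every a. */
static int test_all_ones_natural_reading(bits128 a)
{
    return ((a.lo & ~all_ones.lo) | (a.hi & ~all_ones.hi)) == 0;
}

/* andnot-style reading: (~a) & ones, which is zero only when a is all ones. */
static int test_all_ones_andnot_reading(bits128 a)
{
    return ((~a.lo & all_ones.lo) | (~a.hi & all_ones.hi)) == 0;
}
```

Only the second function distinguishes an all-ones input from anything else, which is the behaviour described above for _mm_test_all_ones.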

Patrick_K_Intel · ‎12-02-2015

Thanks all for the feedback. I've submitted an update to resolve these issues.

Vaclav_H_ · ‎01-18-2016

These intrinsics are listed twice:

      _mm_loadu_si16
      _mm_loadu_si32
      _mm_loadu_si64
      _mm_storeu_si16
      _mm_storeu_si32
      _mm_storeu_si64

One copy is missing CPUID flags and some details differ (e.g. the machine instruction for _mm_*_si16). Maybe the intention was to have two versions depending on CPUID flags, as is the case with, for example, _mm_prefetch or _mm512_cmplt_epi32_mask.

Best Regards,

Vaclav

P.S. By the way - big thanks for this guide!! It is far better than anything else I've seen so far.

Patrick_K_Intel · ‎01-19-2016

Yes, the intention is that these intrinsics can work on SSE-supporting systems using SSE instructions, but they will also work on non-SSE-supporting systems. It's up to the compiler how they are interpreted and which instructions are emitted.

andysem · ‎01-20-2016

All these intrinsics involve movd or movq to move the data to an xmm register, and SSE2 is required for that. I guess you could also use movss and reduce the requirement to SSE, but the requirement is still there. How can these intrinsics be implemented without SSE when their purpose is to initialize an xmm register?

Anyway, I think duplicating intrinsics is not the correct choice.

SkyLake · ‎02-02-2016

In this intrinsic:

__m128i _mm_mpsadbw_epu8 (__m128i a, __m128i b, const int imm8)

CPUID Flags: SSE4.1

In this section:

tmp[i+15:i] := ABS(a[k+7:k] - b[l+7:l]) + ABS(a[k+15:k+8] - b[l+15:l+8]) + ABS(a[k+23:k+16] - b[l+23:l+16]) + ABS(a[k+31:k+24] - b[l+31:l+24])

...

I think it should be tmp[i*2+15:i*2]. Am I wrong?
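
A scalar sketch may help with the indexing question. It is a model under stated assumptions, not the guide's text: the function name is made up, bit 2 of imm8 selects a 4-byte offset into a, bits 1:0 select a 4-byte block of b, and each of the eight results is a 16-bit sum of four absolute byte differences. If i advances by 8 bits per iteration (i = j*8), then result j, being 16 bits wide, starts at bit j*16, which is i*2.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model of the 128-bit mpsadbw operation.
   a, b: 16 source bytes each; dst: eight 16-bit sums. */
static void mpsadbw_model(const uint8_t a[16], const uint8_t b[16],
                          int imm8, uint16_t dst[8])
{
    int a_off = ((imm8 >> 2) & 1) * 4;  /* byte offset into a: 0 or 4 */
    int b_off = (imm8 & 3) * 4;         /* byte offset into b: 0, 4, 8 or 12 */

    for (int j = 0; j < 8; j++) {
        unsigned sum = 0;
        /* Four absolute differences against the fixed 4-byte block of b. */
        for (int t = 0; t < 4; t++)
            sum += (unsigned)abs((int)a[a_off + j + t] - (int)b[b_off + t]);
        dst[j] = (uint16_t)sum;  /* occupies bits [j*16+15 : j*16] of the result */
    }
}
```

Writing each sum to tmp[i+15:i] with i = j*8 would make consecutive 16-bit results overlap, which is what the i*2 indexing avoids.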

Hans_P_Intel · ‎02-08-2016

I found the issue behind comment #88 (reported 01/20/2015) is still present in the Intrinsics Guide 3.3.14 (1/12/2016).

andysem · ‎02-21-2016

For each F16C intrinsic, the timing info contains duplicated entries for different CPU architectures - with and without throughput.

Bugs in Intrinsics Guide