New Contributor III

Bugs in Intrinsics Guide

Hi,

I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using the Linux version):

1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.

2. __m256 _mm256_undefined_si256 () should return __m256i.

3. In some instruction descriptions, such as _mm_adds_epi8, the operation is described in terms of SignedSaturate, while e.g. _mm256_adds_epi16 is described with SaturateToSignedWord. This applies to other operations with unsigned saturation as well. The vector elements are also described differently. A more consistent description would be nice (both names describe the same kind of clamping; see the sketch after this list).

4. _mm_alignr_epi8 has two descriptions.

5. I'm not sure the _mm_ceil_pd signature and description are correct. They say the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?
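
For reference, a minimal scalar sketch (my own illustration, not guide text) of the clamping that both SignedSaturate and SaturateToSignedWord describe, here for the 16-bit case of _mm256_adds_epi16:

      #include <stdint.h>

      /* Clamp a widened sum into the signed 16-bit range, as
         _mm256_adds_epi16 does for each element pair. */
      static int16_t saturate_to_signed_word(int32_t v)
      {
          if (v > INT16_MAX) return INT16_MAX;
          if (v < INT16_MIN) return INT16_MIN;
          return (int16_t)v;
      }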

I haven't read through all the instructions, so there may be more issues. I'll post if I find anything else.

PS: This is not a bug per se, but some instructions are missing the Latency & Throughput information. This mostly affects newer instructions, but the info is useful and I hope it will be added.

Beginner

These intrinsics are listed twice:

      _mm_loadu_si16
      _mm_loadu_si32
      _mm_loadu_si64
      _mm_storeu_si16
      _mm_storeu_si32
      _mm_storeu_si64

One copy is missing CPUID flags, and some details differ (e.g. the machine instruction for _mm_*_si16). Maybe the intention was to have two versions depending on CPUID flags, as is the case with, for example, _mm_prefetch or _mm512_cmplt_epi32_mask.

Best Regards,

Vaclav

P.S. By the way - big thanks for this guide!! It is far better than anything else I've seen so far.

 


Yes, the intention is that these intrinsics work with SSE instructions on SSE-supporting systems, but they will also work on systems without SSE; it's up to the compiler how they are interpreted and what instructions are emitted.

New Contributor III

All these intrinsics involve movd or movq to move the data into an xmm register, and SSE2 is required for that. I guess you could also use movss and reduce the requirement to SSE, but the requirement is still there. How can these intrinsics be implemented without SSE when their purpose is to initialize an xmm register?
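
For what it's worth, here is a sketch of how such a load intrinsic can be written (assuming SSE2; the memcpy is just an unaligned-safe scalar load, and a compiler will normally fold the whole thing into a single movd):

      #include <immintrin.h>
      #include <string.h>

      /* _mm_loadu_si32-style load, assuming SSE2. */
      static __m128i loadu_si32_sketch(const void* mem_addr)
      {
          int v;
          memcpy(&v, mem_addr, sizeof(v)); /* unaligned 4-byte load */
          return _mm_cvtsi32_si128(v);     /* zero-extend into xmm  */
      }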

Anyway, I think duplicating intrinsics is not the correct choice.

Beginner

In this intrinsic:
__m128i _mm_mpsadbw_epu8 (__m128i a, __m128i b, const int imm8)

CPUID Flags: SSE4.1
 
...
In this part of the Operation section:
tmp[i+15:i] := ABS(a[k+7:k] - b[l+7:l]) + ABS(a[k+15:k+8] - b[l+15:l+8]) + ABS(a[k+23:k+16] - b[l+23:l+16]) + ABS(a[k+31:k+24] - b[l+31:l+24])
...
I think it should be tmp[i*2+15:i*2]. Am I wrong?
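
For reference, a scalar model (my own sketch) of the whole operation with the suggested i*2 indexing; each of the eight sums is 16 bits wide, so result lane j occupies tmp[j*16+15:j*16]:

      #include <stdint.h>
      #include <stdlib.h>

      /* Scalar sketch of _mm_mpsadbw_epu8: eight 16-bit SADs. */
      static void mpsadbw_ref(const uint8_t a[16], const uint8_t b[16],
                              int imm8, uint16_t tmp[8])
      {
          int a_off = ((imm8 >> 2) & 1) * 4; /* imm8[2]*32 bits   */
          int b_off = (imm8 & 3) * 4;        /* imm8[1:0]*32 bits */
          for (int j = 0; j < 8; ++j) {
              uint16_t sum = 0;
              for (int n = 0; n < 4; ++n)
                  sum += (uint16_t)abs(a[a_off + j + n] - b[b_off + n]);
              tmp[j] = sum; /* lane j = tmp[i*2+15:i*2], i = j*8 */
          }
      }
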
Employee

I found that the issue behind comment #88 (reported 01/20/2015) is still present in Intrinsics Guide 3.3.14 (1/12/2016).

New Contributor III

For each F16C intrinsic, the timing table contains duplicate entries for several CPU architectures: one entry with throughput values and one without.

Beginner

 

Version 3.3.14 (currently live on the site):

The vpermi2w / vpermt2w / vpermw intrinsics are categorized as "misc" rather than "swizzle". The other element sizes of vpermi2/vpermt2 and vpermb/d/q are correctly categorized as swizzles.

e.g.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX_512&text=permutex2var_epi16&...

Beginner

Hi,

First of all, I would like to thank you for this great tool. I often use it in my HPC class at university because it helps my students understand what is going on.

But I am curious: are there any efforts under way to add latencies and throughputs for newer processor generations like Broadwell or Skylake?

I'm asking because I have the impression that the latencies of VSQRTPD and VDIVPD have changed dramatically over time, and I would really like to know their current values on modern hardware.

Black Belt

The latencies and throughputs for most instructions are included in Appendix C of the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966, currently available at http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimizati....

Using this data, I recently posted some graphs of the relative throughput of scalar, 128-bit, and 256-bit VDIVPS and VDIVPD instructions for Core2, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, and Skylake (client) at https://software.intel.com/en-us/forums/intel-isa-extensions/topic/623366#comment-1866703

Beginner

Thank you for the links, John!

This definitely reinforces my suspicion that Intel has substantially tuned its division and square-root instructions over the years.

Beginner

 

I just discovered this great tool!

I have two feature requests:

1. List the category (used by the filter) in the detailed description of each item. "swizzle" vs. "convert" vs. "miscellaneous" can be tricky. If these were discoverable (other than by trying all of the checkboxes), users could limit results to "ones like this result".

2. Add additional filters for integer vs. floating point. Even better would be filters on various characteristics of the inputs and outputs: width of the packed values, signed/unsigned, etc.

 

Beginner

There is a typo in the operation description of the __m128i _mm_madd_epi16 and __m256i _mm256_madd_epi16 intrinsics.

st[i+31:i] should be dst[i+31:i], of course.
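
For context, a scalar sketch (mine, not guide text) of the corrected per-lane operation:

      #include <stdint.h>

      /* _mm_madd_epi16: each 32-bit lane of dst is the sum of two
         adjacent signed 16-bit products. */
      static void madd_epi16_ref(const int16_t a[8], const int16_t b[8],
                                 int32_t dst[4])
      {
          for (int j = 0; j < 4; ++j)
              dst[j] = (int32_t)a[2*j]   * b[2*j]
                     + (int32_t)a[2*j+1] * b[2*j+1];
      }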

Employee

This description talks about a “dst” operand which isn’t in the formal argument list, so something is wrong somewhere…

 

__m512i _mm512_mask_mullox_epi64 (__m512i src, __mmask8 k, __m512i a, __m512i b)

Synopsis

__m512i _mm512_mask_mullox_epi64 (__m512i src, __mmask8 k, __m512i a, __m512i b)
#include "immintrin.h"
CPUID Flags: AVX512F

Description

Multiplies elements in packed 64-bit integer vectors a and b together, storing the lower 64 bits of the result in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

Operation

FOR j := 0 to 7
      i := j*64
      IF k[j]
            dst[i+63:i] := a[i+63:i] * b[i+63:i]
      ELSE
            dst[i+63:i] := src[i+63:i]
      FI
ENDFOR
dst[MAX:512] := 0
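
If I understand the guide's conventions, "dst" denotes the intrinsic's return value rather than a formal parameter. A minimal usage sketch under that assumption (requires AVX512F and a compiler providing this intrinsic, which expands to an instruction sequence):

      #include <immintrin.h>

      /* The pseudocode's "dst" is simply the returned vector. */
      __m512i masked_mullox(__m512i src, __mmask8 k,
                            __m512i a, __m512i b)
      {
          return _mm512_mask_mullox_epi64(src, k, a, b);
      }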

 

Beginner

Hi,
I think I have found some "bugs" in the current online version (3.3.14) of the guide:

  • __m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
  • __m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale) :
    • Instruction: vgatherqps xmm, vm32x, xmm
      • vm32x should be vm64x
    • dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
      • vindex[i+63:i] should be vindex[m+63:m] (see the scalar sketch after this list)
  • __m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
  • __m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
  • __m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
    • Instruction: vgatherdpd xmm, vm64x, xmm
      • vm64x should be vm32x
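
To illustrate the index widths in the first item, a scalar model (my own sketch) of _mm_i64gather_ps: two 64-bit indices (m = j*64) produce two 32-bit elements (i = j*32), hence vindex[m+63:m]:

      #include <stdint.h>

      /* Scalar sketch of _mm_i64gather_ps: 64-bit indices select
         32-bit elements; the upper two result lanes are zeroed. */
      static void i64gather_ps_ref(const float* base, const int64_t idx[2],
                                   int scale, float dst[4])
      {
          for (int j = 0; j < 2; ++j)
              dst[j] = *(const float*)((const char*)base + idx[j] * scale);
          dst[2] = 0.0f;
          dst[3] = 0.0f;
      }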

Anyway, many thanks for this useful tool.

Beginner

I think there is an error in the _mm256_shuffle_epi8 intrinsic's operation. Currently it is:

dst[128+i+7:i] := a[128+index*8+7:128+index*8]

but I think it should be:

dst[128+i+7:128+i] := a[128+index*8+7:128+index*8]

For the _mm512_shuffle_epi8 intrinsic, I am not sure I understand the pseudocode correctly:

FOR j := 0 to 63
    i := j*8
    IF b[i+7] == 1
        dst[i+7:i] := 0
    ELSE
        index[3:0] := b[i+3:i]
        dst[i+7:i] := a[index*8+7:index*8]
    FI
ENDFOR
dst[MAX:512] := 0

It seems like only the first 128 bits of a can be shuffled?
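
For comparison, a scalar sketch (mine) of what the 256-bit version does with the corrected bit ranges; every selector byte indexes only within its own 128-bit lane:

      #include <stdint.h>

      /* _mm256_shuffle_epi8: each selector byte in b picks a byte
         from its own 16-byte lane of a, or zeroes the result when
         its high bit is set. */
      static void shuffle_epi8_ref(const uint8_t a[32], const uint8_t b[32],
                                   uint8_t dst[32])
      {
          for (int lane = 0; lane < 32; lane += 16)
              for (int j = 0; j < 16; ++j) {
                  uint8_t sel = b[lane + j];
                  dst[lane + j] = (sel & 0x80) ? 0 : a[lane + (sel & 0x0F)];
              }
      }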

Beginner

First of all - thanks so much for this guide, I have found it to be invaluable!

I think I found a small error in version 3.3.14 for _mm_sqrt_sd. The guide claims that:

__m128d _mm_sqrt_sd (__m128d a, __m128d b)

computes the sqrt of the lower double from a and copies the lower double from b to the upper double of the result. However, it actually seems to be the opposite: the sqrt is computed on the lower double of b, and the upper double of the result is copied from a. I am using clang on OS X. I don't have access to Windows or ICC, but for what it's worth, the MSDN documentation at https://msdn.microsoft.com/en-us/library/1994h1ay(v=vs.90).aspx seems to agree with me.
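
Here is a quick check that can be compiled to see the behavior; if it is as I describe, it should print lo=3 hi=42 (sqrt taken from b's low element, upper element taken from a):

      #include <immintrin.h>
      #include <stdio.h>

      int main(void)
      {
          __m128d a = _mm_set_pd(42.0, -1.0); /* hi=42, lo=-1 */
          __m128d b = _mm_set_pd(-1.0, 9.0);  /* hi=-1, lo=9  */
          double out[2];
          _mm_storeu_pd(out, _mm_sqrt_sd(a, b));
          printf("lo=%g hi=%g\n", out[0], out[1]);
          return 0;
      }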

Cheers,

Serge

Beginner

Thanks for the feedback, most of this will be addressed in the next release.

1. I'm not able to replicate this issue with maximizing the window on Linux. What distro are you using? What version of Java?

2. This will be resolved in the next release.

3. All the descriptions and operations have been updated for the next release, so they should now be much more consistent.

4. This will be resolved in the next release.

5. This will be resolved in the next release.

I have not added any additional latency and throughput data yet, but I may get to this soon.

Employee

Hi,

The description of _mm256_extractf128_si256 says "(composed of integer data)", which seems confusing given the "f" for float. It looks like _mm256_extracti128_si256 is the correct one for integer data, or am I missing something?

-Harry

New Contributor III

Harry V. (Intel) wrote:

Description of _mm256_extractf128_si256 states  (composed of integer data), which seems confusing given the F for float?  Looks like _mm256_extracti128_si256 is correct for integer data, or am I missing something?

There are two instructions: vextractf128 and vextracti128. The former is part of AVX and is generated by the _mm256_extractf128_* intrinsics; the latter was only added in AVX2 and is generated by _mm256_extracti128_si256. Both instructions have the same effect, and _mm256_extractf128_si256 is a convenient wrapper that allows moving between __m256i and __m128i even on systems lacking AVX2.
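
A sketch of typical usage (the __AVX2__ guard is my own addition):

      #include <immintrin.h>

      /* With AVX only, integer code can still split a __m256i using
         the float-named wrapper (vextractf128). With AVX2, the
         integer form (vextracti128) is available; results match. */
      __m128i high_lane_avx(__m256i v)
      {
          return _mm256_extractf128_si256(v, 1);
      }

      #ifdef __AVX2__
      __m128i high_lane_avx2(__m256i v)
      {
          return _mm256_extracti128_si256(v, 1);
      }
      #endif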

 
