Intel® ISA Extensions

Bugs in Intrinsics Guide

andysem
New Contributor III

Hi,

I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using the Linux version):

1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.

2. __m256 _mm256_undefined_si256 () should return __m256i (see the corrected prototypes after this list).

3. In the descriptions of some instructions, e.g. _mm_adds_epi8, the operation is described in terms of SignedSaturate, while _mm256_adds_epi16, for example, is described with SaturateToSignedWord. This applies to other operations with unsigned saturation as well. The vector elements are also described differently. A more consistent description would be nice.

4. _mm_alignr_epi8 has two descriptions.

5. I'm not sure the _mm_ceil_pd signature and description are correct. It says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision (again, see the prototypes below)?
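
For reference, here is how I would expect the prototypes from items 2 and 5 to read (my own reading of the ISA, not the guide's text):

__m256i _mm256_undefined_si256 (void);
__m128d _mm_ceil_pd (__m128d a);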

I didn't read through all the instructions, so there may be more issues. I'll post if I find anything else.

PS: This is not a bug per se, but some instructions are missing the Latency & Throughput information. This mostly relates to newer instructions, but the info is still useful and I hope it will be added.

Peter_Cordes
Beginner

Version 3.3.14 (currently live on the site):

The vpermi2w / vpermt2w / vpermw intrinsics are categorized as "misc", not "swizzle". The other element sizes of permi/t2 and vpermb/d/q are correctly categorized as shuffles.

e.g.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX_512&text=permutex2var_epi16&expand=3918

Sven_C_
Beginner

Hi,

First of all, I would like to thank you for this great tool. I often use it in my HPC class at university because it helps my students understand what is going on.

But I am curious: are there any efforts under way to add latencies and throughputs for newer processor generations like Broadwell or Skylake?

I'm asking because I have the impression that the latencies of VSQRTPD and VDIVPD have changed dramatically across generations, and I would really like to know their current values on modern hardware.

McCalpinJohn
Honored Contributor III

The latencies and throughputs for most instructions are included in Appendix C of the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966, currently available at http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html).

Using this data, I recently posted some graphs of the relative throughput of scalar, 128-bit, and 256-bit VDIVPS and VDIVPD instructions for Core2, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, and Skylake (client) at https://software.intel.com/en-us/forums/intel-isa-extensions/topic/623366#comment-1866703

Sven_C_
Beginner

Thank you for the links, John!

This definitely reinforces my suspicion that Intel has substantially retuned its division and square-root instructions over the past few generations.

Peter_B_2
Beginner

I just discovered this great tool!

I have two feature requests:

1. List the category (used by the filter) in the detailed description of each item. "swizzle" vs. "convert" vs. "miscellaneous" can be tricky. If these were discoverable (other than by trying all of the checkboxes), users could limit results to "ones like this result".

2. Add additional filters for integer vs. floating point. Even better would be filters on various characteristics of the inputs and outputs: width of the packed values, signed/unsigned, etc.

 

Gert-Jan
Beginner

There is a typo in the operation description of the __m128i _mm_madd_epi16 and __m256i _mm256_madd_epi16 intrinsics.

st[i+31:i] should be dst[i+31:i], of course.
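
For anyone who wants to sanity-check the intended operation, here is a tiny self-contained test (my own sketch, not from the guide):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128i a = _mm_set_epi16(0, 0, 0, 0, 0, 0, 3, 2);    /* low word pair: 2, 3 */
    __m128i b = _mm_set_epi16(0, 0, 0, 0, 0, 0, 10, 100); /* low word pair: 100, 10 */
    __m128i r = _mm_madd_epi16(a, b);
    /* dst[31:0] = 2*100 + 3*10 = 230 */
    printf("%d\n", _mm_cvtsi128_si32(r));
    return 0;
}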

James_C_Intel2
Employee

This description talks about a “dst” operand which isn’t in the formal argument list, so something is wrong somewhere…

 

__m512i _mm512_mask_mullox_epi64 (__m512i src, __mmask8 k, __m512i a, __m512i b)

Synopsis

__m512i _mm512_mask_mullox_epi64 (__m512i src, __mmask8 k, __m512i a, __m512i b)
#include "immintrin.h"
CPUID Flags: AVX512F

Description

Multiplies elements in packed 64-bit integer vectors a and b together, storing the lower 64 bits of the result in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

Operation

FOR j := 0 to 7
    i := j*64
    IF k[j]
        dst[i+63:i] := a[i+63:i] * b[i+63:i]
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
dst[MAX:512] := 0
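
(As far as I can tell, "dst" in these entries names the return value rather than a formal parameter. A minimal usage sketch under that assumption; note this is a sequence intrinsic, so it presumes a compiler that provides it, e.g. the Intel compiler:)

#include <immintrin.h>

/* "dst" is presumably the returned vector: lanes with k[j] set get the
   low 64 bits of a[j]*b[j]; the rest are copied from src. */
__m512i mask_mul_low64(__m512i src, __mmask8 k, __m512i a, __m512i b) {
    return _mm512_mask_mullox_epi64(src, k, a, b);
}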

 

Hugh_D_
Beginner

Hi,
I think I have found some "bugs" in the current online version (3.3.14) of the guide:

  • __m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
  • __m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale) :
    • Instruction: vgatherqps xmm, vm32x, xmm
      • vm32x should be vm64x (see the sketch after this list)
    • dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
      • vindex[i+63:i] should be vindex[m+63:m]
  • __m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
  • __m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
  • __m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
    • Instruction: vgatherdpd xmm, vm64x, xmm
      • vm64x should be vm32x
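
To illustrate the first point, here is a minimal sketch of my own (not from the guide) showing that the index vector of _mm_i64gather_ps holds two 64-bit elements, hence vm64x:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float data[8] = {0.f, 1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f};
    __m128i vindex = _mm_set_epi64x(5, 2);        /* two 64-bit indices */
    __m128 r = _mm_i64gather_ps(data, vindex, 4); /* scale = sizeof(float) */
    float out[4];
    _mm_storeu_ps(out, r);
    printf("%f %f\n", out[0], out[1]);            /* expect 2.0 5.0 */
    return 0;
}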

Anyway, many thanks for this useful tool.

Catree_C_
Beginner

I think there is an error in the operation of the _mm256_shuffle_epi8 intrinsic. Currently it is:

dst[128+i+7:i] := a[128+index*8+7:128+index*8]

but I think it should be:

dst[128+i+7:128+i] := a[128+index*8+7:128+index*8]

For the _mm512_shuffle_epi8 intrinsic, I am not sure I understand the pseudocode correctly:

FOR j := 0 to 63
    i := j*8
    IF b[i+7] == 1
        dst[i+7:i] := 0
    ELSE
        index[3:0] := b[i+3:i]
        dst[i+7:i] := a[index*8+7:index*8]
    FI
ENDFOR
dst[MAX:512] := 0

It seems like only the first 128 bits of a can be shuffled? (A quick test sketch follows.)
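
For what it's worth, a small test of my own (built with AVX2, using the 256-bit variant) suggests the shuffle operates within each 128-bit lane, so the pseudocode presumably needs a lane offset:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    unsigned char src[32], idx[32], out[32];
    for (int n = 0; n < 32; n++) { src[n] = (unsigned char)n; idx[n] = 0; }
    __m256i a = _mm256_loadu_si256((const __m256i *)src);
    __m256i b = _mm256_loadu_si256((const __m256i *)idx);
    _mm256_storeu_si256((__m256i *)out, _mm256_shuffle_epi8(a, b));
    /* Prints "0 16": index 0 selects byte 0 of each 128-bit lane,
       not byte 0 of the whole vector. */
    printf("%u %u\n", out[0], out[16]);
    return 0;
}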

Serge_M_
Beginner

First of all, thanks so much for this guide; I have found it to be invaluable!

I think I found a small error in version 3.3.14 for _mm_sqrt_sd. The guide claims that:

__m128d _mm_sqrt_sd (__m128d a, __m128d b)

computes the sqrt of the lower double from a and copies the lower double from b to the upper double of the result. However, it actually seems to be the opposite: the sqrt is computed from the lower double of b, and the upper double of the result is copied from a. I am using clang on OSX. I don't have access to Windows or ICC, but for what it's worth, the MSDN documentation at https://msdn.microsoft.com/en-us/library/1994h1ay(v=vs.90).aspx seems to agree with me.
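
A quick check along these lines (my own snippet; compiles with clang or gcc):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d a = _mm_set_pd(7.0, 16.0);  /* upper = 7.0, lower = 16.0 */
    __m128d b = _mm_set_pd(9.0, 25.0);  /* upper = 9.0, lower = 25.0 */
    double out[2];
    _mm_storeu_pd(out, _mm_sqrt_sd(a, b));
    /* Prints 5.000000 7.000000: sqrt of b's lower double in out[0],
       a's upper double copied to out[1]. */
    printf("%f %f\n", out[0], out[1]);
    return 0;
}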

Cheers,

Serge

Islam_A_
Beginner

Thanks for the feedback, most of this will be addressed in the next release.

1. I'm not able to replicate this issue with maximizing the window on Linux. What distro are you using? What version of Java?

2. This will be resolved in the next release.

3. All the descriptions and operations have been updated for the next release, so they should now be much more consistent.

4. This will be resolved in the next release.

5. This will be resolved in the next release.

I have not added any additional latency and throughput data yet, but I may get to this soon.

Harry_V_Intel
Employee

Hi,

The description of _mm256_extractf128_si256 says "(composed of integer data)", which seems confusing given the F for float. It looks like _mm256_extracti128_si256 is the correct one for integer data, or am I missing something?

-Harry

andysem
New Contributor III

Harry V. (Intel) wrote:

The description of _mm256_extractf128_si256 says "(composed of integer data)", which seems confusing given the F for float. It looks like _mm256_extracti128_si256 is the correct one for integer data, or am I missing something?

There are two instructions: vextractf128 and vextracti128. The former is part of AVX and is generated by _mm256_extractf128_*, while the latter was only added in AVX2 and is generated by _mm256_extracti128_si256. The effect of both instructions is the same, and _mm256_extractf128_si256 is a convenient wrapper that allows interaction between __m256i and __m128i even on systems lacking AVX2.
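
For example (my own illustration, assuming AVX/AVX2 headers):

#include <immintrin.h>

/* Both return the upper 128 bits of an integer vector; the "f" form
   requires only AVX, while the "i" form requires AVX2. */
__m128i upper_avx(__m256i v)  { return _mm256_extractf128_si256(v, 1); }
__m128i upper_avx2(__m256i v) { return _mm256_extracti128_si256(v, 1); }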

 

andysem
New Contributor III

By the way, are there any updates planned for the Intrinsics Guide? There have been a number of bug reports, and performance info for Skylake is still missing.

 

Patrick_K_Intel
Employee

Thanks for the feedback. I've posted an update that addresses all the reported issues. This does not include performance info for Skylake, although I may add that in the future.

andysem
New Contributor III

Each of the _mm_storeu_si16/si32/si64 intrinsics is listed twice, some of them with slightly different instructions.

Patrick_K_Intel
Employee

I have posted an update that includes updated latency/throughput data. This removes data for pre-Sandy Bridge architectures and adds Broadwell, Skylake, and Knights Landing.

andysem
New Contributor III

Thank you, Patrick, although I think the removal of Sandy Bridge and Nehalem is a bit premature. Those CPUs are still relevant.

Jakob__Wenzel
Beginner

I believe that the "_MM_CMPINT_NEQ" constant listed in various integer comparison operations should read _MM_CMPINT_NE. (At least this is what GCC, Clang, etc. implement)

Steve_W_
Beginner

The guide has a significant mislabelling of throughput in all intrinsics that list it. Specifically, when the guide gives a throughput value, it is actually reporting reciprocal throughput. This is consistently mislabelled throughout the guide.
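
(A reciprocal throughput of 0.5 cycles per instruction, for example, corresponds to a throughput of 2 instructions per cycle, so smaller numbers mean a faster instruction.)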

For example, the guide reports Skylake as having lower throughput for pmulhuw than Haswell or Broadwell. It's the opposite: Skylake's throughput is higher than on the older architectures. This mislabelling is repeated for about 100 other intrinsics.

Reporting reciprocal throughput is a good idea, since those values are more easily compared against latency in clock cycles. But the labels throughout the guide should be updated to say "reciprocal throughput." I was even reorganizing my AVX code to minimize use of these apparently lower-throughput instructions!

Luckily I noticed the mismatch against Agner Fog's independent tables.
