andysem
New Contributor III
3,546 Views

Bugs in Intrinsics Guide

Hi,

I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using Linux version):

1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.

2. __m256 _mm256_undefined_si256 () should return __m256i.

3. Some instruction descriptions are inconsistent. For example, _mm_adds_epi8 describes the operation in terms of SignedSaturate, while _mm256_adds_epi16 uses SaturateToSignedWord; the same applies to other operations with unsigned saturation. The vector elements are also described differently. A more consistent description would be nice.

4. _mm_alignr_epi8 has two descriptions.

5. I'm not sure the _mm_ceil_pd signature and description are correct. The description says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?

I didn't read all the instruction descriptions, so there may be more issues. I'll post if I find anything else.

PS: This is not a bug per se, but some instructions are missing the Latency & Throughput information. This mostly concerns newer instructions, but the info is still useful and I hope it will be added.

218 Replies
Peter_Cordes
Beginner
286 Views

 

Version 3.3.14 (currently live on the site):

The vpermi2w / vpermt2w / vpermw intrinsics are categorized as "misc", not "swizzle".  The other element-sizes of permi/t2 and vpermb/d/q are correctly categorized as shuffles.

e.g.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX_512&text=permutex2var_epi16&...

Sven_C_
Beginner
286 Views

Hi,

First of all I would like to thank you for this great tool. I often use it in my HPC class at university because it can help my students to understand what is going on.

But I am curious: are there any efforts underway to add latencies and throughputs for newer processor generations like Broadwell or Skylake?

I'm asking because I have the impression that the latencies for VSQRTPD and VDIVPD have changed dramatically over the generations, and I would really like to know their current values on modern hardware.

McCalpinJohn
Black Belt
286 Views

The latencies and throughputs for most instructions are included in Appendix C of the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966, currently available at http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimizati....

Using this data, I recently posted some graphs of the relative throughput of scalar, 128-bit, and 256-bit VDIVPS and VDIVPD instructions for Core2, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, and Skylake (client) at https://software.intel.com/en-us/forums/intel-isa-extensions/topic/623366#comment-1866703

Sven_C_
Beginner
286 Views

Thanks for the links, John!

This definitely reinforces my suspicion that Intel has really tuned its division and square-root instructions over the past generations.

Peter_B_2
Beginner
286 Views

 

I just discovered this great tool!

I have two feature requests:

1. List the category (used by the filter) in the detailed description of each item.  "swizzle" vs "convert" vs "miscellaneous" can be tricky.  If these were discoverable (other than by trying all of the checkboxes), then users could limit results to "ones like this result"

2. Add additional filters for integer vs. floating point.  Even better would be filter on various characteristics of input and output: width of packed value, signed/unsigned, etc.

 

Gert-Jan
Beginner
286 Views

There is a typo in the __m128i _mm_madd_epi16 and __m256i _mm256_madd_epi16 intrinsics operation description.

st[i+31:i] should of course be dst[i+31:i]

James_C_Intel2
Employee
286 Views

This description talks about a “dst” operand which isn’t in the formal argument list, so something is wrong somewhere…

 

__m512i _mm512_mask_mullox_epi64 (__m512i src, __mmask8 k, __m512i a, __m512i b)

Synopsis

__m512i _mm512_mask_mullox_epi64 (__m512i src, __mmask8 k, __m512i a, __m512i b)
#include "immintrin.h"
CPUID Flags: AVX512F

Description

Multiplies elements in packed 64-bit integer vectors a and b together, storing the lower 64 bits of the result in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

Operation

FOR j := 0 to 7
    i := j*64
    IF k
        dst[i+63:i] := a[i+63:i] * b[i+63:i]
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
dst[MAX:512] := 0

 

Hugh_D_
Beginner
286 Views

Hi,
I think I have found some "bugs" in the current online version (3.3.14) of the guide :

  • __m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
  • __m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale) :
    • Instruction: vgatherqps xmm, vm32x, xmm
      • vm32x should be vm64x
    • dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]
      • vindex[i+63:i] should be vindex[m+63:m]
  • __m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
  • __m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
  • __m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
    • Instruction: vgatherdpd xmm, vm64x, xmm
      • vm64x should be vm32x

Anyway, many thanks for this useful tool.

Catree_C_
Beginner
286 Views

I think there is an error for _mm256_shuffle_epi8 intrinsic instruction. Currently it is:

dst[128+i+7:i] := a[128+index*8+7:128+index*8]

but I think it should be:

dst[128+i+7:128+i] := a[128+index*8+7:128+index*8]

For _mm512_shuffle_epi8 intrinsic instruction, I am not sure to understand correctly the pseudo code:

FOR j := 0 to 63
    i := j*8
    IF b[i+7] == 1
        dst[i+7:i] := 0
    ELSE
        index[3:0] := b[i+3:i]
        dst[i+7:i] := a[index*8+7:index*8]
    FI
ENDFOR
dst[MAX:512] := 0

It seems like only the first 128 bits of a can be shuffled?

Serge_M_
Beginner
286 Views

First of all - thanks so much for this guide, I have found it to be invaluable!

I think I found a small error in version 3.3.14 for _mm_sqrt_sd. The guide claims that:

__m128d _mm_sqrt_sd (__m128d a, __m128d b)

computes the sqrt of the lower double from a and copies the lower double from b to the upper double of the result. However, it actually seems to be the opposite (the sqrt of the lower double from b is computed, and the upper double of the result comes from a). I am using clang on OSX. I don't have access to Windows or ICC, but for what it's worth, the MSDN documentation at https://msdn.microsoft.com/en-us/library/1994h1ay(v=vs.90).aspx seems to agree with me.

Cheers,

Serge

Islam_A_
Beginner
286 Views

Thanks for the feedback, most of this will be addressed in the next release.

1. I'm not able to replicate this issue with maximizing the window on Linux. What distro are you using? What version of Java?

2. This will be resolved in the next release.

3. All the descriptions and operations have been updated for the next release, so they should now be much more consistent.

4. This will be resolved in the next release.

5. This will be resolved in the next release.

I have not added any additional latency and throughput data yet, but I may get to this soon.


Harry_V_Intel
Employee
286 Views

Hi,

Description of _mm256_extractf128_si256 states  (composed of integer data), which seems confusing given the F for float?  Looks like _mm256_extracti128_si256 is correct for integer data, or am I missing something?

-Harry

andysem
New Contributor III
286 Views

Harry V. (Intel) wrote:

Description of _mm256_extractf128_si256 states  (composed of integer data), which seems confusing given the F for float?  Looks like _mm256_extracti128_si256 is correct for integer data, or am I missing something?

There are two instructions: vextractf128 and vextracti128. The former is part of AVX and is generated by _mm256_extractf128_*; the latter was added only in AVX2 and is generated by _mm256_extracti128_si256. The effect of both instructions is the same, and _mm256_extractf128_si256 is a convenient wrapper that allows interaction between __m256i and __m128i even on systems lacking AVX2.

 

andysem
New Contributor III
328 Views

By the way, are there any updates planned to the Intrinsics Guide? There were a number of bug reports and performance info for Skylake is still missing.

 

Patrick_K_Intel
Employee
328 Views

Thanks for the feedback. I've posted an update that addresses all the reported issues. This does not include performance info for Skylake, although I may add that in the future.

andysem
New Contributor III
328 Views

Each of the _mm_storeu_si16/si32/si64 intrinsics is listed twice, some of them with slightly different instructions.

Patrick_K_Intel
Employee
328 Views

I have posted an update that includes updated latency/throughput data. This removes data for pre-Sandy Bridge CPUs, and adds Broadwell, Skylake, and Knights Landing.

andysem
New Contributor III
328 Views

Thank you Patrick, although I think the removal of Sandy Bridge and Nehalem is a bit premature. Those CPUs are still relevant.

Jakob__Wenzel
Beginner
328 Views

I believe that the "_MM_CMPINT_NEQ" constant listed in various integer comparison operations should read _MM_CMPINT_NE. (At least this is what GCC, Clang, etc. implement)

Steve_W_
Beginner
328 Views

The guide has a significant mislabelling of throughput in all intrinsics that list it. Specifically, when the guide gives a throughput value, it is actually reporting reciprocal throughput. This is misreported consistently throughout the guide.

For example, the guide reports Skylake as having a lower throughput for pmulhuw than Haswell or Broadwell. It's the opposite: Skylake's throughput is higher than the older architectures'. This mislabelling is repeated for about 100 other intrinsics.

Reporting reciprocal throughput is a good idea, since those values can be compared more easily to latency clocks. But the labels throughout the guide must be updated to say "reciprocal throughput." I was even reorganizing my AVX code to minimize use of these apparently lower-throughput instructions!

Luckily I realized the mismatch with Agner Fog's independent tables.
