Hi,
I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using Linux version):
1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.
2. __m256 _mm256_undefined_si256 () should return __m256i, not __m256.
3. Some instruction descriptions are inconsistent. For example, _mm_adds_epi8 describes the operation in terms of SignedSaturate, while _mm256_adds_epi16 uses SaturateToSignedWord. This applies to other operations with unsigned saturation as well, and the vector elements are also described differently. More consistent descriptions would be nice.
4. _mm_alignr_epi8 has two descriptions.
5. I'm not sure the _mm_ceil_pd signature and description are correct. It says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?
I didn't read all the instructions, so there may be more issues. I'll post if I find anything else.
PS: This is not a bug per se, but some instructions are missing the Latency & Throughput information. This mostly relates to newer instructions, but this info is useful and I hope it will be added.
Version 3.3.14 (currently live on the site):
The vpermi2w / vpermt2w / vpermw intrinsics are categorized as "misc", not "swizzle". The other element sizes of permi/t2 and vpermb/d/q are correctly categorized as shuffles.
Hi,
First of all, I would like to thank you for this great tool. I often use it in my HPC class at university because it helps my students understand what is going on.
But I am curious: are there any efforts underway to add latencies and throughputs for newer processor generations like Broadwell or Skylake?
I'm asking because I have the impression that the latencies for VSQRTPD and VDIVPD have changed dramatically over time, and I would really like to know their current values on modern hardware.
The latencies and throughputs for most instructions are included in Appendix C of the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966, currently available at http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimizati....
Using this data, I recently posted some graphs of the relative throughput of scalar, 128-bit, and 256-bit VDIVPS and VDIVPD instructions for Core 2, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, and Skylake (client) at https://software.intel.com/en-us/forums/intel-isa-extensions/topic/623366#comment-1866703
Thanks for the links, John!
This definitely reinforces my suspicion that Intel has substantially tuned its division and square-root instructions over time.
I just discovered this great tool!
I have two feature requests:
1. List the category (used by the filter) in the detailed description of each item. "swizzle" vs. "convert" vs. "miscellaneous" can be tricky. If these were discoverable (other than by trying all of the checkboxes), then users could limit results to "ones like this result".
2. Add additional filters for integer vs. floating point. Even better would be filter on various characteristics of input and output: width of packed value, signed/unsigned, etc.
There is a typo in the operation description of the __m128i _mm_madd_epi16 and __m256i _mm256_madd_epi16 intrinsics:
st[i+31:i] should of course be dst[i+31:i].
This description talks about a “dst” operand which isn’t in the formal argument list, so something is wrong somewhere…
__m512i _mm512_mask_mullox_epi64 (__m512i src, __mmask8 k, __m512i a, __m512i b)
Synopsis
__m512i _mm512_mask_mullox_epi64 (__m512i src, __mmask8 k, __m512i a, __m512i b)
#include "immintrin.h"
CPUID Flags: AVX512F
Description
Multiplies elements in packed 64-bit integer vectors a and b together, storing the lower 64 bits of the result in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).
Operation
FOR j := 0 to 7
    i := j*64
    IF k[j]
        dst[i+63:i] := a[i+63:i] * b[i+63:i]
    ELSE
        dst[i+63:i] := src[i+63:i]
    FI
ENDFOR
dst[MAX:512] := 0
Hi,
I think I have found some "bugs" in the current online version (3.3.14) of the guide:

__m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)

__m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale):

Instruction: vgatherqps xmm, vm32x, xmm

vm32x should be vm64x


dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]

vindex[i+63:i] should be vindex[m+63:m]



__m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
 __m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
 __m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)

Instruction: vgatherdpd xmm, vm64x, xmm

vm64x should be vm32x


Anyway, many thanks for this useful tool.
I think there is an error in the operation for the _mm256_shuffle_epi8 intrinsic. Currently it is:
dst[128+i+7:i] := a[128+index*8+7:128+index*8]
but I think it should be:
dst[128+i+7:128+i] := a[128+index*8+7:128+index*8]
For the _mm512_shuffle_epi8 intrinsic, I am not sure I understand the pseudocode correctly:
FOR j := 0 to 63
    i := j*8
    IF b[i+7] == 1
        dst[i+7:i] := 0
    ELSE
        index[3:0] := b[i+3:i]
        dst[i+7:i] := a[index*8+7:index*8]
    FI
ENDFOR
dst[MAX:512] := 0
It seems like only the first 128 bits of a can be shuffled?
First of all, thanks so much for this guide; I have found it to be invaluable!
I think I found a small error in version 3.3.14 for _mm_sqrt_sd. The guide claims that:
__m128d _mm_sqrt_sd (__m128d a, __m128d b)
computes the sqrt of the lower double from a and copies the lower double from b to the upper double of the result. However, it actually seems to be the opposite (the lower double from a is copied, and the sqrt of the lower double from b is computed). I am using clang on OS X. I don't have access to Windows or ICC, but for what it's worth, the MSDN documentation at https://msdn.microsoft.com/en-us/library/1994h1ay(v=vs.90).aspx seems to agree with me.
Cheers,
Serge
Thanks for the feedback, most of this will be addressed in the next release.
1. I'm not able to replicate this issue with maximizing the window on Linux. What distro are you using? What version of Java?
2. This will be resolved in the next release.
3. All the descriptions and operations have been updated for the next release, so they should now be much more consistent.
4. This will be resolved in the next release.
5. This will be resolved in the next release.
I have not added any additional latency and throughput data yet, but I may get to this soon.
Hi,
The description of _mm256_extractf128_si256 says "(composed of integer data)", which seems confusing given the F for float. It looks like _mm256_extracti128_si256 is the correct one for integer data, or am I missing something?
Harry
Harry V. (Intel) wrote:
Description of _mm256_extractf128_si256 states (composed of integer data), which seems confusing given the F for float? Looks like _mm256_extracti128_si256 is correct for integer data, or am I missing something?
There are two instructions: vextractf128 and vextracti128. The former is part of AVX and is generated by _mm256_extractf128_*; the latter was only added in AVX2 and is generated by _mm256_extracti128_si256. The effect of both instructions is the same, and _mm256_extractf128_si256 is a convenient wrapper that allows interaction between __m256i and __m128i even on systems lacking AVX2.
By the way, are there any updates planned to the Intrinsics Guide? There were a number of bug reports and performance info for Skylake is still missing.
Thanks for the feedback. I've posted an update that addresses all the reported issues. This does not include performance info for Skylake, although I may add that in the future.
Each of the _mm_storeu_si16/si32/si64 intrinsics is listed twice, some with slightly different instructions.
I have posted an update that includes updated latency/throughput data. This removes data for pre-Sandy Bridge processors and adds Broadwell, Skylake, and Knights Landing.
Thank you, Patrick, although I think the removal of Sandy Bridge and Nehalem data is a bit premature. Those CPUs are still relevant.
I believe that the "_MM_CMPINT_NEQ" constant listed in various integer comparison operations should read _MM_CMPINT_NE (at least, this is what GCC, Clang, etc. implement).
The guide significantly mislabels throughput for all intrinsics that list it. Specifically, when the guide gives a throughput value, it is actually reporting reciprocal throughput. This is misreported consistently throughout the guide.
For example, the guide reports Skylake as having a lower throughput for pmulhuw than Haswell or Broadwell. It's the opposite: Skylake's throughput is higher than the older architectures'. This mislabelling is repeated for about 100 other intrinsics.
Reporting reciprocal throughput is a good idea, since those values can be compared directly to latency clocks, but the labels throughout the guide should be updated to say "reciprocal throughput". I was even reorganizing my AVX code to minimize use of these apparently lower-throughput instructions!
Luckily I noticed the mismatch with Agner Fog's independent tables.