Hi,
I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using the Linux version):
1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably be sized accordingly.
2. __m256 _mm256_undefined_si256 () should return __m256i.
3. In some instruction descriptions, such as _mm_adds_epi8, the operation is described in terms of SignedSaturate, while e.g. _mm256_adds_epi16 is described with SaturateToSignedWord. The same applies to other operations with unsigned saturation. The vector elements are also described differently. A more consistent description would be nice.
4. _mm_alignr_epi8 has two descriptions.
5. I'm not sure the _mm_ceil_pd signature and description are correct. It says the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision?
I didn't read all the instructions, so there may be more issues. I'll post if I find anything else.
PS: This is not a bug per se, but some instructions are missing the Latency & Throughput information. This mostly relates to newer instructions, but the info is still useful and I hope it will be added.
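Regarding item 5, a quick check seems to confirm that the intrinsic does take and return double-precision values (a sketch; check_ceil_pd is a made-up helper, and the target attribute assumes GCC/Clang with an SSE4.1-capable CPU at runtime):

```c
#include <smmintrin.h>   /* SSE4.1: _mm_ceil_pd */

/* _mm_ceil_pd takes and returns __m128d, i.e. double precision;
   each element is rounded up to an integral value. */
__attribute__((target("sse4.1")))
static void check_ceil_pd(double out[2]) {
    __m128d v = _mm_set_pd(2.5, -1.25);  /* high = 2.5, low = -1.25 */
    __m128d r = _mm_ceil_pd(v);
    _mm_storeu_pd(out, r);               /* out[0] = low, out[1] = high */
}
```

If the guide's signature really says single-precision, that is probably a copy/paste slip from _mm_ceil_ps.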
Version 3.3.14 (currently live on the site):
The vpermi2w / vpermt2w / vpermw intrinsics are categorized as "misc", not "swizzle". The other element-sizes of permi/t2 and vpermb/d/q are correctly categorized as shuffles.
e.g.
Hi,
First of all, I would like to thank you for this great tool. I often use it in my HPC class at university because it helps my students understand what is going on.
But I am curious: are there any efforts under way to add latencies and throughputs for newer processor generations like Broadwell or Skylake?
I'm asking because I have the impression that the latencies of VSQRTPD and VDIVPD have changed dramatically over time, and I would really like to know what their current values are on modern hardware.
The latencies and throughputs for most instructions are included in Appendix C of the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966, currently available at http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html).
Using this data, I recently posted some graphs of the relative throughput of scalar, 128-bit, and 256-bit VDIVPS and VDIVPD instructions for Core2, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, and Skylake (client) at https://software.intel.com/en-us/forums/intel-isa-extensions/topic/623366#comment-1866703
I just discovered this great tool!
I have two feature requests:
1. List the category (used by the filter) in the detailed description of each item. "swizzle" vs. "convert" vs. "miscellaneous" can be tricky. If these were discoverable (other than by trying all of the checkboxes), then users could limit results to "ones like this result".
2. Add additional filters for integer vs. floating point. Even better would be filter on various characteristics of input and output: width of packed value, signed/unsigned, etc.
This description talks about a “dst” operand which isn’t in the formal argument list, so something is wrong somewhere…
__m512i _mm512_mask_mullox_epi64 (__m512i src, __mmask8 k, __m512i a, __m512i b)
Synopsis
__m512i _mm512_mask_mullox_epi64 (__m512i src, __mmask8 k, __m512i a, __m512i b)
#include "immintrin.h"
CPUID Flags: AVX512F
Description
Multiplies elements in packed 64-bit integer vectors a and b together, storing the lower 64 bits of the result in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).
Operation
FOR j := 0 to 7
i := j*64
IF k[j]
dst[i+63:i] := a[i+63:i] * b[i+63:i]
ELSE
dst[i+63:i] := src[i+63:i]
FI
ENDFOR
dst[MAX:512] := 0
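For what it's worth, the "dst" in the pseudocode appears to be the intrinsic's return value rather than an argument. The masked semantics can be modeled in plain C like this (a sketch; mask_mullox_epi64 is a made-up scalar stand-in for the real intrinsic):

```c
#include <stdint.h>

/* Scalar model of _mm512_mask_mullox_epi64: "dst" in the guide's
   pseudocode is the value returned by the intrinsic. Lanes whose
   mask bit j is clear are copied from src instead of multiplied. */
static void mask_mullox_epi64(int64_t dst[8], const int64_t src[8],
                              uint8_t k, const int64_t a[8],
                              const int64_t b[8]) {
    for (int j = 0; j < 8; j++)
        dst[j] = ((k >> j) & 1) ? a[j] * b[j] : src[j];
}
```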
Hi,
I think I have found some "bugs" in the current online version (3.3.14) of the guide:

For __m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale) and __m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale):
- Instruction: vgatherqps xmm, vm32x, xmm. Here vm32x should be vm64x.
- In the operation, dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale] should use vindex[m+63:m] instead of vindex[i+63:i].

For __m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale), __m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale), and __m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale):
- Instruction: vgatherdpd xmm, vm64x, xmm. Here vm64x should be vm32x.
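A scalar model may make the indexing point clearer (a sketch; i64gather_ps_model is a made-up stand-in for _mm_i64gather_ps): the 64-bit indices step by m = j*64 bits while the 32-bit destination elements step by i = j*32 bits, so the operation needs two distinct index expressions:

```c
#include <stdint.h>

/* Scalar model of _mm_i64gather_ps: two 64-bit indices (bit offset
   m = j*64 in vindex) gather two 32-bit floats (bit offset i = j*32
   in dst); the upper two dst elements are zeroed. */
static void i64gather_ps_model(float dst[4], const float *base_addr,
                               const int64_t vindex[2], int scale) {
    for (int j = 0; j < 2; j++)
        dst[j] = *(const float *)((const char *)base_addr
                                  + vindex[j] * scale);
    dst[2] = 0.0f;
    dst[3] = 0.0f;
}
```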
Anyway, many thanks for this useful tool.
I think there is an error for _mm256_shuffle_epi8 intrinsic instruction. Currently it is:
dst[128+i+7:i] := a[128+index*8+7:128+index*8]
but I think it should be:
dst[128+i+7:128+i] := a[128+index*8+7:128+index*8]
For the _mm512_shuffle_epi8 intrinsic, I am not sure I understand the pseudocode correctly:
FOR j := 0 to 63
i := j*8
IF b[i+7] == 1
dst[i+7:i] := 0
ELSE
index[3:0] := b[i+3:i]
dst[i+7:i] := a[index*8+7:index*8]
FI
ENDFOR
dst[MAX:512] := 0
It seems like only the first 128 bits of a can be shuffled?
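If the instruction follows vpshufb semantics, each 128-bit lane shuffles only within itself, so the 512-bit pseudocode would need a lane offset on a (which the quoted version seems to omit). A scalar model of that reading (a sketch; shuffle_epi8_model is a made-up name):

```c
#include <stdint.h>

/* Scalar model of vpshufb-style shuffling, assuming each 128-bit
   lane (16 bytes) is shuffled only within itself. A set high bit
   in the control byte zeroes the destination byte; otherwise the
   low 4 bits select a byte from the same lane of a. */
static void shuffle_epi8_model(uint8_t dst[], const uint8_t a[],
                               const uint8_t b[], int nbytes) {
    for (int j = 0; j < nbytes; j++) {
        int lane = j / 16;                      /* 128-bit lane index */
        if (b[j] & 0x80)
            dst[j] = 0;                         /* zero the byte      */
        else
            dst[j] = a[lane * 16 + (b[j] & 0x0F)];
    }
}
```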
First of all - thanks so much for this guide, I have found it to be invaluable!
I think I found a small error in version 3.3.14 for _mm_sqrt_sd. The guide claims that:
__m128d _mm_sqrt_sd (__m128d a, __m128d b)
computes the sqrt of the lower double from a and copies the lower double from b to the upper double of the result. However, it actually seems to be the opposite (the lower double from a is copied, and the sqrt of the lower double from b is computed). I am using clang on OS X. I don't have access to Windows or ICC, but for what it's worth, the MSDN documentation at https://msdn.microsoft.com/en-us/library/1994h1ay(v=vs.90).aspx seems to agree with me.
Cheers,
Serge
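The operand order is easy to check on any SSE2 machine with a small test (a sketch; check_sqrt_sd is a made-up helper):

```c
#include <emmintrin.h>  /* SSE2: _mm_sqrt_sd */

/* If Serge's reading is right, the result's low element is
   sqrt(b.low) and its high element is copied from a.high. */
static void check_sqrt_sd(double out[2]) {
    __m128d a = _mm_set_pd(7.0, 5.0);  /* high = 7.0, low = 5.0 */
    __m128d b = _mm_set_pd(2.0, 9.0);  /* high = 2.0, low = 9.0 */
    __m128d r = _mm_sqrt_sd(a, b);
    _mm_storeu_pd(out, r);             /* out[0] = low, out[1] = high */
}
```

On clang and gcc this stores out[0] == 3.0 (the sqrt of b's low element) and out[1] == 7.0 (a's high element), which matches Serge's reading rather than the guide's.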
Thanks for the feedback, most of this will be addressed in the next release.
1. I'm not able to replicate this issue with maximizing the window on Linux. What distro are you using? What version of Java?
2. This will be resolved in the next release.
3. All the descriptions and operations have been updated for the next release, so they should now be much more consistent.
4. This will be resolved in the next release.
5. This will be resolved in the next release.
I have not added any additional latency and throughput data yet, but I may get to this soon.
Hi,
Description of _mm256_extractf128_si256 states (composed of integer data), which seems confusing given the F for float? Looks like _mm256_extracti128_si256 is correct for integer data, or am I missing something?
-Harry
Harry V. (Intel) wrote:
Description of _mm256_extractf128_si256 states (composed of integer data), which seems confusing given the F for float? Looks like _mm256_extracti128_si256 is correct for integer data, or am I missing something?
There are two instructions: vextractf128 and vextracti128. The former is part of AVX and is generated by _mm256_extractf128_* and the latter is only added in AVX2 and is generated by _mm256_extracti128_si256. The effect of both instructions is the same and _mm256_extractf128_si256 is a convenient wrapper to allow interaction between __m256i and __m128i even on systems lacking AVX2.
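For example (a sketch; demo_extract is a made-up helper, and the target attribute assumes GCC/Clang with an AVX-capable CPU at runtime):

```c
#include <immintrin.h>
#include <stdint.h>

/* Extracting the upper 128 bits of integer data with the AVX1
   "f" intrinsic: no AVX2 needed, since the bits are only moved,
   never interpreted as floating point. */
__attribute__((target("avx")))
static void demo_extract(int64_t out[2]) {
    __m256i v = _mm256_set_epi64x(4, 3, 2, 1);    /* lanes low->high: 1,2,3,4 */
    __m128i hi = _mm256_extractf128_si256(v, 1);  /* upper two lanes          */
    _mm_storeu_si128((__m128i *)out, hi);
}
```

With AVX2 available, _mm256_extracti128_si256 compiles to vextracti128 instead, with the same result.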
Thanks for the feedback. I've posted an update that addresses all the reported issues. This does not include performance info for Skylake, although I may add that in the future.
I have posted an update that includes updated latency/throughput data. This removes data for pre-Sandy Bridge processors, and adds Broadwell, Skylake, and Knights Landing.
I believe that the "_MM_CMPINT_NEQ" constant listed in various integer comparison operations should read _MM_CMPINT_NE. (At least this is what GCC, Clang, etc. implement)
The guide significantly mislabels throughput in every intrinsic that lists it. Specifically, when the guide gives a throughput value, it is actually reporting reciprocal throughput (cycles per instruction). This is consistently misreported throughout the guide.
For example, the guide reports Skylake as having a lower throughput for pmulhuw than Haswell or Broadwell. It's the opposite: Skylake's throughput is higher than on the older architectures. This mislabelling is repeated for about 100 other intrinsics.
Reporting reciprocal throughput is a good idea, since those values can be compared directly to latency cycle counts. But the labels throughout the guide should be updated to say "reciprocal throughput". I was even reorganizing my AVX code to minimize calls to the instructions that appeared to have lower throughput!
Luckily I noticed the mismatch with Agner Fog's independent tables.
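To spell the distinction out with made-up numbers (not the guide's actual data): a listed "throughput" of 0.5 means 0.5 cycles per instruction, so two such instructions can issue per cycle; the true throughput rate is the reciprocal of the listed value:

```c
/* The guide's "Throughput" column is reciprocal throughput:
   cycles per instruction, where LOWER is better. The rate in
   instructions per cycle is 1 / that value. */
static double instructions_per_cycle(double cycles_per_instruction) {
    return 1.0 / cycles_per_instruction;
}
```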
