There are two versions of the same instruction, for example vpaddw and paddw. Is there any performance gain if vpaddw is used instead of paddw (_mm_add_epi16)? Is there an intrinsic for vpaddw?
VPADDW (VEX.128 encoded version)
DEST[15:0] ← SRC1[15:0]+SRC2[15:0]
DEST[31:16] ← SRC1[31:16]+SRC2[31:16]
DEST[47:32] ← SRC1[47:32]+SRC2[47:32]
DEST[63:48] ← SRC1[63:48]+SRC2[63:48]
DEST[79:64] ← SRC1[79:64]+SRC2[79:64]
DEST[95:80] ← SRC1[95:80]+SRC2[95:80]
DEST[111:96] ← SRC1[111:96]+SRC2[111:96]
DEST[127:112] ← SRC1[127:112]+SRC2[127:112]
DEST[255:128] ← 0
PADDW (128-bit Legacy SSE version)
DEST[15:0] ← DEST[15:0]+SRC[15:0]
DEST[31:16] ← DEST[31:16]+SRC[31:16]
DEST[47:32] ← DEST[47:32]+SRC[47:32]
DEST[63:48] ← DEST[63:48]+SRC[63:48]
DEST[79:64] ← DEST[79:64]+SRC[79:64]
DEST[95:80] ← DEST[95:80]+SRC[95:80]
DEST[111:96] ← DEST[111:96]+SRC[111:96]
DEST[127:112] ← DEST[127:112]+SRC[127:112]
DEST[255:128] (Unmodified)
PADDW __m128i _mm_add_epi16 ( __m128i a, __m128i b)
Thanks
Performance depends not just on the instruction but also on the context in which it is used, so you need to try it in your application. The compiler can generate the AVX (VEX-encoded) instruction for the same source code if you compile with /arch:AVX.
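To make that concrete, here is a minimal sketch of the point above. The helper name add_epi16_x8 is hypothetical and only for illustration. The source uses the same _mm_add_epi16 intrinsic either way; as far as I know there is no separate 128-bit intrinsic for vpaddw. The compiler picks the encoding from the target options: typically paddw by default, and vpaddw when AVX code generation is enabled (/arch:AVX, or -mavx on GCC and Clang).

```c
#include <emmintrin.h>  /* SSE2 intrinsics, including _mm_add_epi16 */
#include <stdint.h>

/* Hypothetical helper (not from the original post): adds eight 16-bit
 * lanes of a and b and stores the eight sums in out.
 * Unaligned loads/stores are used so the arrays need no special alignment. */
static void add_epi16_x8(const int16_t *a, const int16_t *b, int16_t *out)
{
    __m128i va  = _mm_loadu_si128((const __m128i *)a);
    __m128i vb  = _mm_loadu_si128((const __m128i *)b);

    /* Same intrinsic for both encodings: the compiler emits legacy
     * paddw or VEX-encoded vpaddw depending on the build options. */
    __m128i sum = _mm_add_epi16(va, vb);

    _mm_storeu_si128((__m128i *)out, sum);
}
```

One way to check which form you got is to look at the generated assembly (for example with -S on GCC and Clang, or /FA with the Microsoft and Intel compilers on Windows). The VEX form takes two sources and a separate destination, as the pseudocode in the question shows, so it may save a register copy in some code. Whether that is measurable depends on the surrounding code.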