How to extract DWORD from upper half of 256-bit register?

levicki · ‎07-02-2013

Congratulations to Intel CPU instruction set engineers for managing to make YET ANOTHER non-orthogonal instruction set extension -- why PEXTRD/PINSRD (among many others) were not promoted to 256 bits in AVX2?

Any ideas/tricks to work around this engineering "oversight"?

SergeyKostrov · ‎07-03-2013

Igor, There are many intrinsic functions for extraction in immintrin.h header file ( search for all places where a word 'extract' is used ). If the instruction you've expected to see is missing why wouldn't you apply a workaround and use what is available now. I understood that you need to extract signed or unsigned 32-bit values from __m256i union: ... typedef union _MMINTRIN_TYPE(32) __m256i { #if !defined(_MSC_VER) /* * To support GNU compatible intialization with initializers list, * make first union member to be of int64 type. */ __int64 m256i_gcc_compatibility[4]; #endif __int8 m256i_i8[32]; __int16 m256i_i16[16]; __int32 m256i_i32[8]; __int64 m256i_i64[4]; unsigned __int8 m256i_u8[32]; unsigned __int16 m256i_u16[16]; unsigned __int32 m256i_u32[8]; unsigned __int64 m256i_u64[4]; } __m256i; ... Is that correct?

levicki · ‎07-03-2013

Sergey,

What I want is to extract arbitrary DWORD from say YMM0 register. For XMM0 register, the instruction for extracting DWORD 3 is PEXTRD eax, XMM0, 3 while there is no such instruction to extract DWORD 7 from YMM0.

Yes, I could use intrinsics, write __m256i val = _mm256_load_si256(mem) and then DWORD part = val.m256i_u32[7] but that does not translate to a single assembler instruction. You can understand my post as a complaint about non-orthogonality of AVX2 extensions.

SergeyKostrov · ‎07-03-2013

Hi Igor, >>...What I want is to extract arbitrary DWORD from say YMM0 register. For XMM0 register, the instruction for extracting >>DWORD 3 is PEXTRD eax, XMM0, 3 while there is no such instruction to extract DWORD 7 from YMM0. >> >>Yes, I could use intrinsics, write __m256i val = _mm256_load_si256(mem) and then DWORD part = val.m256i_u32[7] but >>that does not translate to a single assembler instruction. You can understand my post as a complaint about >>non-orthogonality of AVX2 extensions. Thanks for the clarification. I'll take a look at Instructions Set Manual and I'm surprised that such extraction is Not available.

levicki · ‎07-03-2013

You will notice that is not the only one missing instruction.

The whole AVX business reminds me of extending AX to EAX -- you get access to 32 bits (EAX), 16 bits (AX), but there is no cheap access to the upper 16-bit register half except through shifts and masks. Same with AVX, just instead of 32 and 16 it is 256 and 128.

Another part where they did not make instruction set orthogonal is parallel bit shift -- does not exist for words and bytes which in my opinion would be the most common use cases.

Final part of my complaint is that if they already decide not to implement VPEXTRD eax, ymm0, 7 they could at least document the fastest alternative with 2 or 3 instructions instead of having all of us guess and test.

SergeyKostrov · ‎07-03-2013

What about these two intrinsic functions? [ immintrin.h ( Intel version ) ] ... extern __m128i __ICL_INTRINCC _mm256_extractf128_si256( __m256i, const int ); ... extern __m128i __ICL_INTRINCC _mm256_extracti128_si256( __m256i, const int ); ... I think they almost what you need but still don't return a DWORD type. Note: Microsoft's version of immintrin.h doesn't have declaration for the 2nd function, that is _mm256_extracti128_si256.

bronxzv · ‎07-03-2013

Igor Levicki wrote:
why PEXTRD/PINSRD (among many others) were not promoted to 256 bits in AVX2?

to be consistent with the AVX2 philosophy for all promoted SSEn instructions (same behavior for both 128-bit lanes with no cross-lane dependency) 256-bit VPEXTRD will have to return 2 results in two detination GPRs which isn't possible with VEX encoding

Igor Levicki wrote:
Any ideas/tricks to work around this

extracts: depending on your use case a single VPERMD will do the trick (with proper indices in a register initialized out of your critical loop), you'll have your result in the low double word of the destination YMM, if you really need the result in a GPR the fastest sequence AFAIK is VEXTRACTI128 followed by VPEXTRD

inserts: for your insertions from a GPR I suggest to use a VPINSRD, VINSERTI128 sequence

levicki · ‎07-04-2013

bronxzv wrote:
to be consistent with the AVX2 philosophy for all promoted SSEn instructions (same behavior for both 128-bit lanes with no cross-lane dependency) 256-bit VPEXTRD will have to return 2 results in two detination GPRs which isn't possible with VEX encoding

But I disagree!

While for other instructions doing the same thing in lower and upper lane is essential, INSERT/EXTRACT instructions are a different thing alltogether -- they should not be promoted in the same way. Their purpose is scalar access to vector elements, not parallel processing so they should just be extended to allow access to all elements.

bronxzv wrote:
extracts: depending on your use case a single VPERMD will do the trick (with proper indices in a register initialized out of your critical loop), you'll have your result in the low double word of the destination YMM, if you really need the result in a GPR the fastest sequence AFAIK is VEXTRACTI128 followed by VPEXTRD

inserts: for your insertions from a GPR I suggest to use a VPINSRD, VINSERTI128 sequence

Yes, I figured that out but still it would be better if the set was made orthogonal to begin with. I see no good reason not to expand PEXTRD/PINSRD to allow indices from 4 to 7.

bronxzv · ‎07-04-2013

Igor Levicki wrote:
But I disagree!

the choice was probably done to simplify hardware design more than programmer's convenience, one can also argue that pack/unpack isn't convenient the way it was expanded to 256-bit or that 128-bit shifts aren't promoted to 256-bit shifts which isn't "orthogonal"

all in all I'll say that VPERMD is more convenient than legacy extracts since the element index can be set dynamically (ymm idx register) instead of statically (immediate value), it is incredibly useful for a lot of other use cases, I found a new use for it yesterday for example: dynamically specified broadcast, unlike native broadcast where the low element is replicated you can specify the index of the element to be replicated

levicki · ‎07-04-2013

I wonder... did you manage to get theoretical 50% speedup with AVX2 integer code compared to SSE2/SSSE3/SSE4.1 integer code?

I am seeing ~33% so far, this may well be caused by the "simplified hardware design" you mention.

bronxzv · ‎07-04-2013

Igor Levicki wrote:
I wonder... did you manage to get theoretical 50% speedup with AVX2 integer code compared to SSE2/SSSE3/SSE4.1 integer code?

actually the max theoretical speedup is 2x i.e. 100% (even more with new instructions like VPERMD) but I have no single test with only integer instructions so I can't report any real world values for integer only, the best speedup I measured with production code is 1.82x (82%) for mixed int and fp when comparing a SSE2 path with an AVX2 path (incl. FMA), note that this is for a single kernel with high L1D cache locality, not a full application

Igor Levicki wrote:
I am seeing ~33% so far, this may well be caused by the "simplified hardware design" you mention.

my "simplified design" remark was for the two fully distinct execution stacks with duplicated 128-bit execution units, it has nothing to do with any throughput limitation, your deceptive speedup may be due to incomplete vectorization (hint: you mentioned scalar inserts/extracts as important for you so I suppose they are used in some of your hotspots) or L2$/LLC$/memory bandwidth limitation (or both)

if you want better optimization advices I'll suggest to post code snippets of your hotspots

levicki · ‎07-04-2013

When I said 50% I actually meant 50% shorter execution time which would translate into 2x speedup. Sorry for confusion.

Attached is the code with simple test driver. My results are:

test_C : 6345.035 ms
test_SSE4.1 : 3944.771 ms
test_AVX2 : 2190.420 ms

Difference is 1.80x here too, but that difference gets smaller (1.51x) if you change pragma for SSE4.1 function and let compiler generate 128-bit SSE with VEX prefix and 3-operand syntax. However, that also exposes an issue with intrinsics and arch optimization -- compiler uses vpbroadcastb which is not in SSE4.1 set. I didn't bother to check whether speedup is due to vpbroadcastb use or due to VEX+3op but I personally doubt vpbroadcastb is that much faster. Also, there is a much more sinister issue with intrinsics -- if you don't specify arch compiler will generate plain SSE2/SSSE3 instructions for _mm256_set1_epi8() in the middle of AVX2+VEX+3op code causing severe performance penalty by state transitions.

The CPI for test_AVX2() is 0.345 out of theoretical 0.250. Not sure if it can get any better than that, but you are welcome to try.

Finally, I don't understand why compiler is avoiding aligned memory access in AVX2 code when memory is aligned -- it still uses vmovdqu. I think I will just go back to using pure assembler and living with a nightmare of maintaining two versions of ASM code for 32-bit and 64-bit rather then letting compiler do whatever it wants with intrinsics.

bronxzv · ‎07-04-2013

Igor Levicki wrote:
When I said 50% I actually meant 50% shorter execution time which would translate into 2x speedup. Sorry for confusion.

so the 33% you were mentioning stands for a x1.49 speedup as per this definition http://en.wikipedia.org/wiki/Speedup this looks pretty good already

Igor Levicki wrote:
Attached is the code with simple test driver. My results are:

test_C : 6345.035 ms
test_SSE4.1 : 3944.771 ms
test_AVX2 : 2190.420 ms

Difference is 1.80x here too,

a 1.80x speedup looks very good to me, there is maybe not much room for improvement, probably nothing obvious I suppose

Igor Levicki wrote:
but that difference gets smaller (1.51x) if you change pragma for SSE4.1 function and let compiler generate 128-bit SSE with VEX prefix and 3-operand syntax. However, that also exposes an issue with intrinsics and arch optimization -- compiler uses vpbroadcastb which is not in SSE4.1 set. I didn't bother to check whether speedup is due to vpbroadcastb use or due to VEX+3op but I personally doubt vpbroadcastb is that much faster. Also, there is a much more sinister issue with intrinsics -- if you don't specify arch compiler will generate plain SSE2/SSSE3 instructions for _mm256_set1_epi8() in the middle of AVX2+VEX+3op code causing severe performance penalty by state transitions.

The CPI for test_AVX2() is 0.345 out of theoretical 0.250. Not sure if it can get any better than that, but you are welcome to try.

this CPI looks indeed very good, so I suppose your optimizations are already well done

Igor Levicki wrote:
Finally, I don't understand why compiler is avoiding aligned memory access in AVX2 code when memory is aligned -- it still uses vmovdqu.

because the encoding is more compact AFAIK (so potentially slightly less uopcache/icache misses on a big application), besides second order effect like icache misses vmovdqu speed is exactly the same than vmovdqa, note that it is the same with vmovups preferred (by the Intel compiler) over vmovaps for fp code

andysem · ‎07-08-2013

> Finally, I don't understand why compiler is avoiding aligned memory access in AVX2 code when memory is aligned

AFAIK, in Sandy Bridge and later CPUs, movdqa and movdqu are equivalent, when memory is aligned. See Architecture Optimization Manual, Table C-12a. vmovdqa and vmovdqu are even closer as vmovdqa doesn't fail on unaligned memory. I think I even saw a recommendation to always use vmovdqu somewhere, but I can't remember the document now.

levicki · ‎07-08-2013

andysem wrote:
AFAIK, in Sandy Bridge and later CPUs, movdqa and movdqu are equivalent, when memory is aligned. See Architecture Optimization Manual, Table C-12a. vmovdqa and vmovdqu are even closer as vmovdqa doesn't fail on unaligned memory. I think I even saw a recommendation to always use vmovdqu somewhere, but I can't remember the document now.

Well, 14.0 beta on Linux seems to emit aligned loads for those constants. I guess we will never know what is right.

perfwise · ‎07-08-2013

There's no un-aligned penalty upon SB, IB, and HW (for 128-bit loads), so long as you're within the same cacheline. When you have a memory access that spans a cacheline or a page you take a significant hit in latency of ~5 and ~28 clks on that load. So.. as long as you don't span across cachelines or pages.. you're loads, whether aligned or unaligned in SSE/AVX.. will not take longer.

Perfwise

bronxzv · ‎07-08-2013

Igor Levicki wrote:
Well, 14.0 beta on Linux seems to emit aligned loads for those constants. I guess we will never know what is right.

as I posted above (sorry but my post was delayed by moderation for several days!) maybe the compiler use unaligned moves because the encoding is more compact (to be verified)

perfwise · ‎07-08-2013

Just an update.. upon HW in 256-bits there's no alignment penalty for loads which are mis-aligned from 256-bit alignment when using VMOVUPS.. but there's a penalty for spanning a cachline boundary and a page boundary.

Perfwise

levicki · ‎07-08-2013

But if you write const __m256i var = something; isnt the compiler free to align/order that value properly in read-only data segment?
Why would it ever need to use unaligned loads then when it can guarantee that the data will be properly aligned even without explicitly specifying __declspec(align(32))?

By the way, specifying alignment on __m256i variable doesn't force aligned loads in 13.1 update 5.

SergeyKostrov · ‎07-09-2013

This is a short follow up on Igor's test results: >>Attached is the code with simple test driver. My results are: >> >> test_C : 6345.035 ms >>test_SSE4.1 : 3944.771 ms >> test_AVX2 : 2190.420 ms Intel Core i7-3840QM ( 2.80 GHz ) Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 [ 64-bit Windows 7 Professional / 64-bit test ] test_C : 12904.534 ms test_SSE4.1 : 6502.829 ms [ 64-bit Windows 7 Professional / 32-bit test ] test_C : 12423.721 ms test_SSE4.1 : 7097.624 ms

Bernard · ‎07-10-2013

Regarding test_C function I wonder if removing a branch from within the loop and putting for_ loop inside the if_else block could lead to some execution speed up.