Congratulations to Intel's CPU instruction set engineers for managing to make YET ANOTHER non-orthogonal instruction set extension -- why were PEXTRD/PINSRD (among many others) not promoted to 256 bits in AVX2?
Any ideas/tricks to work around this engineering "oversight"?
Sergey,
What I want is to extract an arbitrary DWORD from, say, the YMM0 register. For the XMM0 register, the instruction to extract DWORD 3 is PEXTRD eax, XMM0, 3, while there is no such instruction to extract DWORD 7 from YMM0.
Yes, I could use intrinsics -- write __m256i val = _mm256_load_si256(mem) and then DWORD part = val.m256i_u32[7] -- but that does not translate to a single assembler instruction. You can read my post as a complaint about the non-orthogonality of the AVX2 extensions.
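For concreteness, a minimal sketch of that asymmetry as compilable code (the m256i_u32 union member is an MSVC/ICC-specific convenience; other compilers would need a store or a different idiom):

```c
#include <immintrin.h>
#include <stdint.h>

/* 128-bit case: maps to a single instruction, PEXTRD r32, xmm, 3 */
static inline uint32_t xmm_dword3(__m128i v)
{
    return (uint32_t)_mm_extract_epi32(v, 3);
}

/* 256-bit case: no single-instruction equivalent exists for element 7 */
static inline uint32_t ymm_dword7(const __m256i *mem)
{
    __m256i val = _mm256_load_si256(mem);  /* load the full YMM value          */
    return val.m256i_u32[7];               /* expands to several instructions  */
}
```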
You will notice that this is not the only missing instruction.
The whole AVX business reminds me of the extension of AX to EAX -- you get access to 32 bits (EAX) and to the low 16 bits (AX), but there is no cheap access to the upper 16-bit half of the register except through shifts and masks. Same with AVX, only with 256 and 128 bits instead of 32 and 16.
Another place where they did not make the instruction set orthogonal is the parallel bit shifts -- they do not exist for words and bytes, which in my opinion would be the most common use cases.
The final part of my complaint is that if they had already decided not to implement VPEXTRD eax, ymm0, 7, they could at least have documented the fastest two- or three-instruction alternative instead of leaving all of us to guess and test.
Igor Levicki wrote:
why were PEXTRD/PINSRD (among many others) not promoted to 256 bits in AVX2?
To be consistent with the AVX2 philosophy for all promoted SSEn instructions (same behavior in both 128-bit lanes, with no cross-lane dependency), a 256-bit VPEXTRD would have to return two results in two destination GPRs, which isn't possible with VEX encoding.
Igor Levicki wrote:
Any ideas/tricks to work around this
Extracts: depending on your use case, a single VPERMD will do the trick (with the proper indices in a register initialized outside your critical loop); you'll have the result in the low doubleword of the destination YMM. If you really need the result in a GPR, the fastest sequence AFAIK is VEXTRACTI128 followed by VPEXTRD.
Inserts: for insertions from a GPR I suggest a VPINSRD, VINSERTI128 sequence.
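For illustration, here is one way those sequences might look with intrinsics (a sketch only; element 7 is used as the example, and the insert shown adds a VEXTRACTI128 so the other elements of the upper lane are preserved):

```c
#include <immintrin.h>
#include <stdint.h>

/* VEXTRACTI128 + VPEXTRD: dword 7 of a YMM register into a GPR */
static inline uint32_t extract_dword7(__m256i v)
{
    __m128i hi = _mm256_extracti128_si256(v, 1);   /* vextracti128 */
    return (uint32_t)_mm_extract_epi32(hi, 3);     /* vpextrd      */
}

/* VPERMD: route any element (index chosen at run time) to the low dword,
   keeping the result in a YMM register; idx would be prepared outside the
   critical loop, e.g. _mm256_set1_epi32(7). */
static inline __m256i extract_to_low_dword(__m256i v, __m256i idx)
{
    return _mm256_permutevar8x32_epi32(v, idx);    /* vpermd */
}

/* VEXTRACTI128 + VPINSRD + VINSERTI128: replace dword 7 with a GPR value */
static inline __m256i insert_dword7(__m256i v, uint32_t x)
{
    __m128i hi = _mm256_extracti128_si256(v, 1);
    hi = _mm_insert_epi32(hi, (int)x, 3);          /* vpinsrd      */
    return _mm256_inserti128_si256(v, hi, 1);      /* vinserti128  */
}
```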
bronxzv wrote:
To be consistent with the AVX2 philosophy for all promoted SSEn instructions (same behavior in both 128-bit lanes, with no cross-lane dependency), a 256-bit VPEXTRD would have to return two results in two destination GPRs, which isn't possible with VEX encoding.
But I disagree!
While for the other instructions doing the same thing in the lower and upper lanes is essential, the INSERT/EXTRACT instructions are a different thing altogether -- they should not have been promoted in the same way. Their purpose is scalar access to vector elements, not parallel processing, so they should simply have been extended to allow access to all elements.
bronxzv wrote:
Extracts: depending on your use case, a single VPERMD will do the trick (with the proper indices in a register initialized outside your critical loop); you'll have the result in the low doubleword of the destination YMM. If you really need the result in a GPR, the fastest sequence AFAIK is VEXTRACTI128 followed by VPEXTRD.
Inserts: for insertions from a GPR I suggest a VPINSRD, VINSERTI128 sequence.
Yes, I had figured that out, but it would still be better if the instruction set had been made orthogonal to begin with. I see no good reason not to extend PEXTRD/PINSRD to allow indices from 4 to 7.
Igor Levicki wrote:
But I disagree!
The choice was probably made to simplify the hardware design rather than for the programmer's convenience. One can also argue that pack/unpack isn't convenient the way it was expanded to 256 bits, or that the 128-bit shifts weren't promoted to 256-bit shifts, which isn't "orthogonal" either.
All in all, I'd say that VPERMD is more convenient than the legacy extracts, since the element index can be set dynamically (in a YMM index register) instead of statically (as an immediate value). It is incredibly useful for a lot of other use cases as well; I found a new use for it just yesterday, for example: a dynamically specified broadcast -- unlike the native broadcast, which always replicates the low element, you can specify the index of the element to be replicated.
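For example, a sketch of that dynamically specified broadcast with intrinsics (the index is chosen at run time; in real code the index register would of course be prepared outside the hot loop):

```c
#include <immintrin.h>

/* Replicate the element whose index i (0..7) is chosen at run time,
   instead of always replicating element 0 as vpbroadcastd does. */
static inline __m256i broadcast_element(__m256i v, int i)
{
    __m256i idx = _mm256_set1_epi32(i);            /* index in every lane */
    return _mm256_permutevar8x32_epi32(v, idx);    /* vpermd              */
}
```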
I wonder... did you manage to get the theoretical 50% speedup with AVX2 integer code compared to SSE2/SSSE3/SSE4.1 integer code?
I am seeing ~33% so far; this may well be caused by the "simplified hardware design" you mention.
Igor Levicki wrote:
I wonder... did you manage to get the theoretical 50% speedup with AVX2 integer code compared to SSE2/SSSE3/SSE4.1 integer code?
Actually the maximum theoretical speedup is 2x, i.e. 100% (even more with new instructions like VPERMD), but I have no test with integer instructions only, so I can't report any real-world values for integer-only code. The best speedup I have measured with production code is 1.82x (82%), for mixed int and fp, comparing an SSE2 path with an AVX2 path (incl. FMA). Note that this is for a single kernel with high L1D cache locality, not for a full application.
Igor Levicki wrote:
I am seeing ~33% so far; this may well be caused by the "simplified hardware design" you mention.
my "simplified design" remark was for the two fully distinct execution stacks with duplicated 128-bit execution units, it has nothing to do with any throughput limitation, your deceptive speedup may be due to incomplete vectorization (hint: you mentioned scalar inserts/extracts as important for you so I suppose they are used in some of your hotspots) or L2$/LLC$/memory bandwidth limitation (or both)
if you want better optimization advices I'll suggest to post code snippets of your hotspots
When I said 50% I actually meant 50% shorter execution time, which would translate into a 2x speedup. Sorry for the confusion.
Attached is the code with a simple test driver. My results are:
test_C : 6345.035 ms
test_SSE4.1 : 3944.771 ms
test_AVX2 : 2190.420 ms
The difference is 1.80x here too, but that difference gets smaller (1.51x) if you change the pragma for the SSE4.1 function and let the compiler generate 128-bit SSE with the VEX prefix and 3-operand syntax. However, that also exposes an issue with intrinsics and arch optimization -- the compiler uses vpbroadcastb, which is not in the SSE4.1 set. I didn't bother to check whether the speedup is due to vpbroadcastb or to VEX+3op, but I personally doubt vpbroadcastb is that much faster. Also, there is a much more sinister issue with intrinsics -- if you don't specify the arch, the compiler will generate plain SSE2/SSSE3 instructions for _mm256_set1_epi8() in the middle of AVX2+VEX+3op code, causing a severe performance penalty through state transitions.
The CPI for test_AVX2() is 0.345 against a theoretical 0.250. Not sure if it can get any better than that, but you are welcome to try.
Finally, I don't understand why the compiler avoids aligned memory accesses in AVX2 code when the memory is aligned -- it still uses vmovdqu. I think I will just go back to pure assembler and live with the nightmare of maintaining two versions of ASM code for 32-bit and 64-bit, rather than letting the compiler do whatever it wants with intrinsics.
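For reference, one possible way to keep that broadcast on an AVX2 path is to spell it out explicitly with intrinsics (a sketch only -- whether the compiler emits VEX encodings for the movd still depends on the target/arch switches):

```c
#include <immintrin.h>

/* Broadcast a byte constant via explicit AVX2 intrinsics ((v)movd +
   vpbroadcastb) instead of relying on how the compiler chooses to
   expand _mm256_set1_epi8(). */
static inline __m256i broadcast_byte(char c)
{
    __m128i tmp = _mm_cvtsi32_si128((int)(unsigned char)c);
    return _mm256_broadcastb_epi8(tmp);
}
```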
Igor Levicki wrote:
When I said 50% I actually meant 50% shorter execution time, which would translate into a 2x speedup. Sorry for the confusion.
So the 33% you mentioned stands for a 1.49x speedup as per this definition: http://en.wikipedia.org/wiki/Speedup -- that looks pretty good already.
Igor Levicki wrote:
Attached is the code with a simple test driver. My results are:
test_C : 6345.035 ms
test_SSE4.1 : 3944.771 ms
test_AVX2 : 2190.420 ms
The difference is 1.80x here too,
A 1.80x speedup looks very good to me; there is maybe not much room left for improvement, and probably nothing obvious, I suppose.
Igor Levicki wrote:
but that difference gets smaller (1.51x) if you change the pragma for the SSE4.1 function and let the compiler generate 128-bit SSE with the VEX prefix and 3-operand syntax. However, that also exposes an issue with intrinsics and arch optimization -- the compiler uses vpbroadcastb, which is not in the SSE4.1 set. I didn't bother to check whether the speedup is due to vpbroadcastb or to VEX+3op, but I personally doubt vpbroadcastb is that much faster. Also, there is a much more sinister issue with intrinsics -- if you don't specify the arch, the compiler will generate plain SSE2/SSSE3 instructions for _mm256_set1_epi8() in the middle of AVX2+VEX+3op code, causing a severe performance penalty through state transitions.
The CPI for test_AVX2() is 0.345 against a theoretical 0.250. Not sure if it can get any better than that, but you are welcome to try.
This CPI looks very good indeed, so I suppose your optimizations are already well done.
Igor Levicki wrote:
Finally, I don't understand why the compiler avoids aligned memory accesses in AVX2 code when the memory is aligned -- it still uses vmovdqu.
Because the encoding is more compact AFAIK (so potentially slightly fewer uop-cache/icache misses in a big application). Aside from second-order effects like icache misses, vmovdqu is exactly as fast as vmovdqa. Note that it is the same for fp code, where vmovups is preferred (by the Intel compiler) over vmovaps.
> Finally, I don't understand why the compiler avoids aligned memory accesses in AVX2 code when the memory is aligned
AFAIK, in Sandy Bridge and later CPUs, movdqa and movdqu are equivalent when the memory is aligned. See the Architecture Optimization Manual, Table C-12a. vmovdqa and vmovdqu are even closer, the only remaining difference being that vmovdqa faults on unaligned memory while vmovdqu does not. I think I even saw a recommendation somewhere to always use vmovdqu, but I can't remember the document now.
andysem wrote:
AFAIK, in Sandy Bridge and later CPUs, movdqa and movdqu are equivalent when the memory is aligned. See the Architecture Optimization Manual, Table C-12a. vmovdqa and vmovdqu are even closer, the only remaining difference being that vmovdqa faults on unaligned memory while vmovdqu does not. I think I even saw a recommendation somewhere to always use vmovdqu, but I can't remember the document now.
Well, the 14.0 beta on Linux seems to emit aligned loads for those constants. I guess we will never know what is right.
There's no unaligned penalty on Sandy Bridge, Ivy Bridge, and Haswell (for 128-bit loads), as long as you stay within the same cache line. When a memory access spans a cache-line or a page boundary, you take a significant latency hit of ~5 and ~28 clks respectively on that load. So, as long as you don't cross cache lines or pages, your loads -- whether aligned or unaligned, SSE or AVX -- will not take longer.
Perfwise
Igor Levicki wrote:
Well, the 14.0 beta on Linux seems to emit aligned loads for those constants. I guess we will never know what is right.
As I posted above (sorry, my post was delayed by moderation for several days!), maybe the compiler uses unaligned moves because the encoding is more compact (to be verified).
Just an update: on Haswell, for 256-bit loads, there's no alignment penalty for loads that are misaligned with respect to 256-bit alignment when using VMOVUPS, but there is a penalty for spanning a cache-line boundary or a page boundary.
Perfwise
But if you write const __m256i var = something;, isn't the compiler free to align and place that value properly in the read-only data segment?
Why would it ever need to use unaligned loads, then, when it can guarantee that the data will be properly aligned even without explicitly specifying __declspec(align(32))?
By the way, specifying alignment on a __m256i variable doesn't force aligned loads in 13.1 update 5.
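For the record, this is the kind of declaration I mean (a sketch; even with the alignment request honored in the data segment, the choice of vmovdqa vs vmovdqu still remains with the compiler):

```c
#include <immintrin.h>
#include <stdint.h>

/* 32-byte-aligned read-only constant (MSVC/ICC __declspec syntax) */
static __declspec(align(32)) const uint32_t mask_data[8] =
    { 1, 2, 3, 4, 5, 6, 7, 8 };

static inline __m256i load_mask(void)
{
    /* explicitly aligned load; the compiler may still emit vmovdqu */
    return _mm256_load_si256((const __m256i *)mask_data);
}
```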
Regarding the test_C function, I wonder if removing the branch from inside the loop and putting the for loop inside the if/else blocks could lead to some speedup in execution.
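Something along these lines (a hypothetical illustration only, since the actual test_C source is in the attachment and not reproduced here):

```c
/* before: the loop-invariant condition is tested on every iteration */
static void process_branchy(int *out, const int *in, int n, int add_mode)
{
    for (int i = 0; i < n; i++) {
        if (add_mode)
            out[i] = in[i] + 1;   /* hypothetical "then" work */
        else
            out[i] = in[i] - 1;   /* hypothetical "else" work */
    }
}

/* after: test once, keep the inner loops branch-free */
static void process_hoisted(int *out, const int *in, int n, int add_mode)
{
    if (add_mode) {
        for (int i = 0; i < n; i++)
            out[i] = in[i] + 1;
    } else {
        for (int i = 0; i < n; i++)
            out[i] = in[i] - 1;
    }
}
```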