Question #1
I have:
xmm0/mem128 = A3 A2 A1 A0
And I want to have:
ymm0 = A3 A3 A2 A2 A1 A1 A0 A0
Question #2
I have:
ymm0 = B3 A3 B2 A2 B1 A1 B0 A0
And I want to have:
xmm1/mem128 = A3 A2 A1 A0
xmm2/mem128 = B3 B2 B1 B0
Question #3
I have:
xmm1/mem128 = A3 A2 A1 A0
xmm2/mem128 = B3 B2 B1 B0
And I want to have:
ymm0 = B3 A3 B2 A2 B1 A1 B0 A0
How can I accomplish these seemingly trivial transformations, given AVX's cross-lane limitations?
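For reference, here is a minimal scalar sketch of the three transformations in C, assuming 32-bit elements and index 0 as the lowest element (array names are placeholders, not from the question):

// #1: duplicate each element of a[0..3] into y[0..7]
for (int i = 0; i < 4; i++) { y[2*i] = a[i]; y[2*i + 1] = a[i]; }
// #2: deinterleave y[0..7] = A0 B0 A1 B1 A2 B2 A3 B3 into a[0..3] and b[0..3]
for (int i = 0; i < 4; i++) { a[i] = y[2*i]; b[i] = y[2*i + 1]; }
// #3: interleave a[0..3] and b[0..3] back into y[0..7]
for (int i = 0; i < 4; i++) { y[2*i] = a[i]; y[2*i + 1] = b[i]; }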
#1:
org: ymm0 = x x x x a3 a2 a1 a0
vpermq ymm0,ymm0,0x10 => ymm0 = x x a3 a2 x x a1 a0 / select qwords x1x0
vpunpckldq ymm0,ymm0,ymm0 => ymm0 = a3 a3 a2 a2 a1 a1 a0 a0 / interleave low dwords
#2:
org: ymm0 = b3 a3 b2 a2 b1 a1 b0 a0
vpshufd ymm1,ymm0,0x08 => ymm1 = x x a3 a2 x x a1 a0 / select dwords xx20
vpshufd ymm2,ymm0,0x0d => ymm2 = x x b3 b2 x x b1 b0 / select dwords xx31
vpermq ymm1,ymm1,0x08 => ymm1 = x x x x a3 a2 a1 a0 / select qwords xx20
vpermq ymm2,ymm2,0x08 => ymm2 = x x x x b3 b2 b1 b0 / select qwords xx20
#3:
org: ymm1 = x x x x a3 a2 a1 a0; ymm2 = x x x x b3 b2 b1 b0
vpermq ymm1,ymm1,0x10 => ymm1 = x x a3 a2 x x a1 a0 / select qwords x1x0
vpermq ymm2,ymm2,0x10 => ymm2 = x x b3 b2 x x b1 b0 / select qwords x1x0
vpunpckldq ymm0,ymm1,ymm2 => ymm0 = b3 a3 b2 a2 b1 a1 b0 a0 / interleave low dwords
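The same three sequences expressed as AVX2 intrinsics, as a sketch (function and variable names are placeholders; compile with AVX2 enabled):

#include <immintrin.h>

// #1: A3 A2 A1 A0 -> A3 A3 A2 A2 A1 A1 A0 A0
__m256i dup_each(__m128i a)
{
    __m256i v = _mm256_castsi128_si256(a);      //             x x x x a3 a2 a1 a0
    v = _mm256_permute4x64_epi64(v, 0x10);      // vpermq:     x x a3 a2 x x a1 a0
    return _mm256_unpacklo_epi32(v, v);         // vpunpckldq: a3 a3 a2 a2 a1 a1 a0 a0
}

// #2: B3 A3 B2 A2 B1 A1 B0 A0 -> A3 A2 A1 A0 and B3 B2 B1 B0
void deinterleave(__m256i v, __m128i *a, __m128i *b)
{
    __m256i ta = _mm256_shuffle_epi32(v, 0x08); // vpshufd: x x a3 a2 x x a1 a0
    __m256i tb = _mm256_shuffle_epi32(v, 0x0d); // vpshufd: x x b3 b2 x x b1 b0
    *a = _mm256_castsi256_si128(_mm256_permute4x64_epi64(ta, 0x08)); // a3 a2 a1 a0
    *b = _mm256_castsi256_si128(_mm256_permute4x64_epi64(tb, 0x08)); // b3 b2 b1 b0
}

// #3: A3 A2 A1 A0 and B3 B2 B1 B0 -> B3 A3 B2 A2 B1 A1 B0 A0
__m256i interleave(__m128i a, __m128i b)
{
    __m256i va = _mm256_permute4x64_epi64(_mm256_castsi128_si256(a), 0x10);
    __m256i vb = _mm256_permute4x64_epi64(_mm256_castsi128_si256(b), 0x10);
    return _mm256_unpacklo_epi32(va, vb);       // vpunpckldq
}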
1. That doesn't help at all with AVX, where it is not possible to cross lanes.
2. It takes too many instructions, even with AVX2.
I really don't know what Intel CPU engineers were thinking when they designed AVX.
#2:
vpermilps ymm0,ymm0,216 ; %11011000
vextractf128 xmm3,ymm0,1
vunpcklpd xmm1,xmm0,xmm3
vunpckhpd xmm2,xmm0,xmm3
#3:
vunpcklps xmm3,xmm1,xmm2
vunpckhps xmm4,xmm1,xmm2
vinsertf128 ymm0,ymm3,xmm4,1
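A sketch of those AVX-only sequences with intrinsics (names are placeholders; the vunpck*pd steps on float data need bit-preserving casts; assumes <immintrin.h>):

// #2 (AVX): B3 A3 B2 A2 B1 A1 B0 A0 -> A3 A2 A1 A0 and B3 B2 B1 B0
void deinterleave_avx(__m256 v, __m128 *a, __m128 *b)
{
    __m256 t  = _mm256_permute_ps(v, 0xD8);  // vpermilps: b3 b2 a3 a2 | b1 b0 a1 a0
    __m128 lo = _mm256_castps256_ps128(t);   //                          b1 b0 a1 a0
    __m128 hi = _mm256_extractf128_ps(t, 1); // vextractf128:            b3 b2 a3 a2
    *a = _mm_castpd_ps(_mm_unpacklo_pd(_mm_castps_pd(lo), _mm_castps_pd(hi))); // vunpcklpd: a3 a2 a1 a0
    *b = _mm_castpd_ps(_mm_unpackhi_pd(_mm_castps_pd(lo), _mm_castps_pd(hi))); // vunpckhpd: b3 b2 b1 b0
}

// #3 (AVX): A3 A2 A1 A0 and B3 B2 B1 B0 -> B3 A3 B2 A2 B1 A1 B0 A0
__m256 interleave_avx(__m128 a, __m128 b)
{
    __m128 lo = _mm_unpacklo_ps(a, b);       // vunpcklps: b1 a1 b0 a0
    __m128 hi = _mm_unpackhi_ps(a, b);       // vunpckhps: b3 a3 b2 a2
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1); // vinsertf128
}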
with AVX2 you can enjoy tight code with the 8x32 generic permute; for example, for #1 simply writing:
vpermps ymm0,ymm1,ymm0
will do the trick; ymm1 should be initialized (typically a loop invariant, initialized once) with the proper offsets, i.e. 3 3 2 2 1 1 0 0 in this case.
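In intrinsics form this is roughly the following sketch (assumes <immintrin.h> and AVX2; _mm256_setr_epi32 lists elements from lowest to highest, and v holds a3..a0 in its low lane):

const __m256i idx = _mm256_setr_epi32(0, 0, 1, 1, 2, 2, 3, 3); // loop invariant
__m256 r = _mm256_permutevar8x32_ps(v, idx);                   // vpermps: a3 a3 a2 a2 a1 a1 a0 a0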
AVX is just a step toward AVX2. A lot of developers are skipping AVX because AVX2 is clearly a much more complete instruction set.
I think the Intel engineers envisioned AVX2 from the start, but it wasn't feasible to implement it all in one go, so they had to choose which parts to implement first. I think extending the registers to 256 bits and implementing the floating-point instructions first (by making the integer SIMD stack capable of floating-point operations) was the best compromise they could have made. But even so, AVX is unfortunately only useful for a relatively small range of applications.
That said, AVX2 is intended to be a 'vertical' SIMD instruction set that enables efficient SPMD programming. Think of OpenCL: each lane executes the same operation on different data elements (i.e. different iterations of a loop), so you're not really supposed to do much, if any, cross-lane work.
It's pretty brilliant to bring such GPU technology into the CPU, but you have to let go of the old 'horizontal' SIMD programming models to get the most out of it.
I have to disagree with this -- you need cross-lane operations to get data into the proper position for processing, especially if you are not in control of the data layout in memory.
@bronxzy:
Thanks, I will take a look and try your suggestions in some code to see how it performs.
That's what gather is for.
And yes, I know it's not part of AVX. But that brings us back to AVX being an intermediate step toward AVX2. It's just not suited for all cases of SPMD programming. Having wide floating-point vectors but no gather limits its usability, so you'll have to accept sticking to SSE (or AVX-128) in some situations. Besides, Sandy/Ivy Bridge don't have sufficient cache bandwidth for a large speedup anyway; Haswell is expected to double it.
If AVX naturally fits your use case, great; otherwise just wait for AVX2 instead of messing around with cross-lane operations.
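For example, the deinterleave from question #2 can be done straight from memory with AVX2 gathers, as in this sketch (src is a placeholder pointer to the interleaved A0 B0 A1 B1 ... stream of at least 16 floats; assumes <immintrin.h>):

const __m256i idx = _mm256_setr_epi32(0, 2, 4, 6, 8, 10, 12, 14); // loop invariant
__m256 a = _mm256_i32gather_ps(src,     idx, 4);                  // A0..A7
__m256 b = _mm256_i32gather_ps(src + 1, idx, 4);                  // B0..B7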
Don't get me wrong, Sandy Bridge is an awesome CPU.
But putting in 256-bit vectors where the majority of opcodes can only operate on their 128-bit halves, where you have no cross-lane operations, and where you don't have enough bandwidth for a 2x speedup over SSE except in synthetic benchmarks under the most contrived conditions is something I would personally call "beta" and wouldn't even bother to release and sell.
Can't wait for Haswell.
With that out of the way, I really don't understand why they didn't design the shuffle instructions to take a GPR instead of an immediate to begin with -- that would give up to 64 bits (in x64 mode) of element-reordering indices.
For me AVX seems to be a bit like SSE (and partially AMD's 3DNow!): an appetizer and test balloon that allows for some impressive benchmarks.
I simply skipped SSE because I almost exclusively work with integers; with SSE2 and especially SSSE3 (pshufb) things got much better.
Also, I don't like being forced to do almost all work in-lane, and I have already complained about that.
On the other hand, the decidedly non-orthogonal and mostly lane-wise MMX/SSE/AVX instruction sets almost always get the job done with reasonable effort.
Restricting most instructions to in-lane operation makes the CPUs much simpler, allowing e.g. future Atoms to act on YMM or even larger registers (e.g. the Larrabee / Knights family) as well without making their dies much larger.
As you have probably noticed, assuming unlimited parallel execution and one cycle per instruction, the AVX2 solution costs 2/2/2 cycles and the AVX solution (bronxzy's) 2/3/2 cycles. Using vpermps with preinitialized shuffle masks even gets the cycle count down to 1/1/1; however, the performance of vpermps on simple CPUs will probably be low.
BTW: My AVX2 integer solution is easily transformed into a float solution by replacing vpermq => vpermpd, vpunpckldq => vunpcklps, and vpshufd => vshufps. Unfortunately vpermpd shuffles every two singles as one double (type mismatch); I'm not sure whether this costs cycles.
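As a sketch, the float version of #1 with those substitutions looks like this (the casts only change the type, not the bits; assumes <immintrin.h> and AVX2, with a3..a0 in the low lane of v):

__m256  v = _mm256_castps128_ps256(x);                                    //            x x x x a3 a2 a1 a0
__m256d t = _mm256_permute4x64_pd(_mm256_castps_pd(v), 0x10);             // vpermpd:   x x a3 a2 x x a1 a0
__m256  r = _mm256_unpacklo_ps(_mm256_castpd_ps(t), _mm256_castpd_ps(t)); // vunpcklps: a3 a3 a2 a2 a1 a1 a0 a0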