AVX is just a step toward AVX2. A lot of developers are skipping AVX because AVX2 is clearly a much more complete instruction set.
I think the Intel engineers envisioned AVX2 from the start, but it wasn't feasible to implement it all in one go, so they had to choose which parts to implement first. I think extending the registers to 256 bits and implementing the floating-point instructions first (by making the integer SIMD stack capable of floating-point operations) was the best compromise they could have made. But even so, AVX is unfortunately only useful for a relatively small range of applications.
That said, AVX2 is intended to be a 'vertical' SIMD instruction set, to enable efficient SPMD programming. Think of OpenCL. Each lane executes the same operation on different data elements (i.e. different iterations of a loop). So you're not really supposed to do much, if any, cross-lane work.
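To make the 'vertical' model concrete, here is a minimal sketch in C. The function name saxpy_vertical and the choice of loop are mine, purely for illustration. It only needs 256-bit floating-point intrinsics, so plain AVX is enough for this one, and it falls back to a scalar loop when the compiler isn't targeting AVX.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical example: y[i] = a * x[i] + y[i].
 * Each SIMD lane handles one iteration of the loop, and no lane
 * ever looks at another lane's data. */
void saxpy_vertical(float a, const float *x, float *y, size_t n)
{
    size_t i = 0;
#if defined(__AVX__)
    const __m256 va = _mm256_set1_ps(a);      /* same scalar in all 8 lanes */
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);   /* iterations i .. i+7 */
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_add_ps(_mm256_mul_ps(va, vx), vy);
        _mm256_storeu_ps(y + i, vy);
    }
#endif
    /* Scalar tail, and the whole loop when AVX is not enabled. */
    for (; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Note that the vector body is just the scalar body with every operation widened to 8 lanes. That's the whole point of the SPMD approach.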
It's pretty brilliant to bring such GPU technology within the CPU, but you have to let go of old 'horizontal' SIMD programming models to get the most out of it.
That's what gather is for.
And yes, I know it's not part of AVX. But that brings us back to AVX being an intermediate step toward AVX2. It's just not suited for all cases of SPMD programming. Having wide floating-point vectors but no gather limits its usability. You'll have to accept sticking to SSE (or AVX-128) in some situations. Besides, Sandy/Ivy Bridge don't have sufficient cache bandwidth for a large speedup anyway. Haswell is expected to double it.
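As a rough sketch of what gather buys you, consider a table lookup where each loop iteration reads from a data-dependent index. The name gather_lookup is made up for this example. _mm256_i32gather_ps is the AVX2 gather intrinsic, and the scalar loop is what you fall back to when the compiler isn't targeting AVX2.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical example: out[i] = table[idx[i]].
 * Every idx[i] is assumed to be a valid index into table. */
void gather_lookup(const float *table, const int32_t *idx,
                   float *out, size_t n)
{
    size_t i = 0;
#if defined(__AVX2__)
    for (; i + 8 <= n; i += 8) {
        /* Load 8 indices, one per lane. */
        __m256i vidx = _mm256_loadu_si256((const __m256i *)(idx + i));
        /* Each lane fetches table[idx] on its own; scale 4 = sizeof(float). */
        __m256 v = _mm256_i32gather_ps(table, vidx, 4);
        _mm256_storeu_ps(out + i, v);
    }
#endif
    /* Scalar tail, and the whole loop when AVX2 is not enabled. */
    for (; i < n; ++i)
        out[i] = table[idx[i]];
}
```

Without gather you end up doing the eight loads one by one and stitching the vector together with inserts, which is exactly the kind of cross-lane work the vertical model is supposed to avoid.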
If AVX naturally fits your use case, great, but otherwise just wait for AVX2 instead of messing around with cross-lane operations.