The intrinsic you are looking for should be _mm256_shuffle_ps, because shufps was promoted to 256 bits (vshufps), albeit acting in-lane with the same mask for the lower and the upper half.
If you proceed like before but for YMM instead of XMM, you will get the sum of the 4 singles of each lane in the lowest single of that lane. Alternatively you could apply vhaddps twice with essentially the same result, since all these commands work in-lane. The time needed for vhaddps should be about the same as for vshufps and vaddps together (Agner Fog's timing measurements say so), but vhaddps produces shorter code.
If you want to add all 8 values, you can thereafter use vperm2f128 (I don't know all the _MM* functions) and a simple vaddps.
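A minimal sketch of both variants with intrinsics, in case it helps. The function names are made up for illustration, and I am assuming GCC or Clang here: the target attribute is only there so the file builds without -mavx, and the code still needs a CPU with AVX to run.

```c
#include <immintrin.h>

/* Variant 1: vhaddps twice (in-lane), then vperm2f128 + vaddps.
   hsum8_hadd is a hypothetical name, not a library function. */
__attribute__((target("avx")))
float hsum8_hadd(const float *p)
{
    __m256 v = _mm256_loadu_ps(p);       /* x0 .. x7 */
    v = _mm256_hadd_ps(v, v);            /* per lane: a+b, c+d, a+b, c+d */
    v = _mm256_hadd_ps(v, v);            /* per lane: lane sum in every element */
    __m256 swapped = _mm256_permute2f128_ps(v, v, 0x01); /* swap the two lanes */
    v = _mm256_add_ps(v, swapped);       /* total of all 8 in every element */
    return _mm_cvtss_f32(_mm256_castps256_ps128(v));
}

/* Variant 2: vshufps + vaddps twice (in-lane), then the same lane swap.
   hsum8_shuf is a hypothetical name as well. */
__attribute__((target("avx")))
float hsum8_shuf(const float *p)
{
    __m256 v = _mm256_loadu_ps(p);
    /* per lane: c, d, a, b (swap the 64-bit halves of each lane) */
    __m256 t = _mm256_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2));
    v = _mm256_add_ps(v, t);             /* per lane: a+c, b+d, a+c, b+d */
    /* per lane: swap neighbouring elements */
    t = _mm256_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    v = _mm256_add_ps(v, t);             /* per lane: lane sum in every element */
    t = _mm256_permute2f128_ps(v, v, 0x01);
    v = _mm256_add_ps(v, t);
    return _mm_cvtss_f32(_mm256_castps256_ps128(v));
}
```

Both take a pointer to 8 floats (unaligned is fine because of the loadu) and return the sum. The order of the additions differs between the two, so results may differ in the last bits for general floating-point input.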
In the instruction set reference manual the vshufps command is described as follows:

VSHUFPS (VEX.256 encoded version)
DEST[31:0] ← Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] ← Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] ← Select4(SRC2[127:0], imm8[5:4]);
DEST[127:96] ← Select4(SRC2[127:0], imm8[7:6]);
DEST[159:128] ← Select4(SRC1[255:128], imm8[1:0]);
DEST[191:160] ← Select4(SRC1[255:128], imm8[3:2]);
DEST[223:192] ← Select4(SRC2[255:128], imm8[5:4]);
DEST[255:224] ← Select4(SRC2[255:128], imm8[7:6]);
In other words: The upper 4 singles are shuffled in exactly the same way as the lower ones (the shuffle mask is used twice). There is no possibility to shuffle the upper ones differently from the lower ones with this command.
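To make the in-lane behaviour concrete, here is a plain scalar model of the 256-bit vshufps in C. It is only a sketch for illustration (the names select4 and vshufps256_model are my own, and it uses no intrinsics), written directly from the pseudocode in the manual:

```c
#include <stdint.h>

/* Select4 from the manual: pick one of four singles by a 2-bit index. */
static float select4(const float *src, unsigned idx)
{
    return src[idx & 3u];
}

/* Scalar model of VSHUFPS ymm, ymm, ymm, imm8 (hypothetical helper).
   dst must not alias src1 or src2. */
void vshufps256_model(float dst[8], const float src1[8],
                      const float src2[8], uint8_t imm8)
{
    for (int lane = 0; lane < 2; ++lane) {
        const float *a = src1 + 4 * lane;   /* this lane of SRC1 */
        const float *b = src2 + 4 * lane;   /* this lane of SRC2 */
        float *d = dst + 4 * lane;
        /* the same imm8 is applied to both lanes */
        d[0] = select4(a, imm8 >> 0);
        d[1] = select4(a, imm8 >> 2);
        d[2] = select4(b, imm8 >> 4);
        d[3] = select4(b, imm8 >> 6);
    }
}
```

The loop over lane is the whole point: nothing in imm8 can make lane 1 pick different positions than lane 0, and no element ever crosses from one lane to the other.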
I suggest avoiding vhaddps. Instead, bring the upper lane down to the lower lane (vperm2f128) and add vertically. Then use vmovhlps to move the upper 2 elements down and add. Then shuffle and add. Pseudo code like this:

    // ymm0 -> x0 x1 x2 x3 x4 x5 x6 x7
    vperm2f128(ymm1, ymm0, ymm0, select upper half);  // ymm1 -> x4 x5 x6 x7 x4 x5 x6 x7
    vaddps(ymm1, ymm0);         // ymm1 -> x0+x4, x1+x5, x2+x6, x3+x7 (from here on only the low 128 bits matter)
    vmovhlps(ymm0, ymm1);       // ymm0 -> x2+x6, x3+x7, -, -
    vaddps(ymm1, ymm0);         // ymm1 -> x0+x4+x2+x6, x1+x5+x3+x7
    vshufps(ymm0, ymm1, 0x01);  // ymm0 -> x1+x5+x3+x7 (imm8[1:0] = 1 selects element 1)
    vaddps(ymm1, ymm0);         // ymm1 -> x0+x4+x2+x6+x1+x5+x3+x7
I believe this will be faster than the HADDPS approach. However, keep in mind that this is a dependency chain, so you may want to do some extra work in between these instructions. Since most of these instructions use only port 1 and port 5, any instruction that runs on port 0, 3 or 4 (multiply, load, store for the next loop iteration) will increase the performance. E.g. you can load the data for the next iteration, or calculate the indexes for the next iteration.
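With intrinsics, the lane-first reduction could look roughly like the sketch below. The function name is hypothetical, the target attribute assumes GCC or Clang (so that no -mavx flag is needed to build), and once the upper lane has been added the code simply continues on the low 128 bits with the SSE intrinsics.

```c
#include <immintrin.h>

/* Lane-first horizontal sum of 8 floats (hypothetical helper name). */
__attribute__((target("avx")))
float hsum8_lane_first(const float *p)
{
    __m256 v  = _mm256_loadu_ps(p);                      /* x0 .. x7 */
    __m256 hi = _mm256_permute2f128_ps(v, v, 0x11);      /* x4..x7 in both lanes */
    /* low 128 bits: x0+x4, x1+x5, x2+x6, x3+x7 */
    __m128 s  = _mm256_castps256_ps128(_mm256_add_ps(v, hi));
    /* move the upper two elements down and add:
       s[0] = x0+x4+x2+x6, s[1] = x1+x5+x3+x7 */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    /* bring element 1 to position 0 (imm8[1:0] = 1) and add the lowest single */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x01));
    return _mm_cvtss_f32(s);
}
```

In a loop over many vectors it is usually better to accumulate vertically with vaddps and run a reduction like this only once at the end, which also takes most of the pressure off the dependency chain.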
It probably depends on the processor whether vshufps+vaddps or vhaddps is faster. At least vhaddps has the potential to become faster than vshufps+vaddps, because it is only one command and there is no more to calculate than with a simple add. On the other hand, vhaddps (like the other "horizontal" commands) is often implemented poorly. On Atom, for example, the "horizontal" commands are a sheer catastrophe performance-wise. It's sort of a self-fulfilling prophecy: the command is slow and thus avoided, and hence processor optimizations for it are omitted; repeat forever. This is the same sad story as for other handy commands like loop/jecxz/enter/leave or even inc/dec/lea...
For the current Sandy Bridge implementation things look like this:

command: latency / reciprocal throughput
vshufps y,y,y,i: 1 / 1
vaddps y,y,y: 3 / 1
vhaddps y,y,y: 5 / 2
As you can see, the summed-up reciprocal throughput is the same (1 + 1 = 2 cycles for vshufps+vaddps versus 2 for vhaddps), albeit vhaddps has one cycle more latency (5 versus 1 + 3 = 4). Fewer commands mean less memory and cache usage. I hope in future processors vhaddps will become as fast as vaddps - I don't see any reason against it. Thus let's use vhaddps and hope for a better future implementation.