What is the best way to sum up values in __m256?

missing__zlw · ‎09-26-2011

When using SSE, I used to use _mm_add_ps twice and _mm_shuffle_ps to sum up all 4 values in __m128.

For __m256, what is the best way?

Also, I used to have mask like _MM_SHUFFLE(3,2,1,0) to create a mask for my _mm_shuffle_ps.

How should I create the mask for _mm256_shuffle_ps now? I don't see a _MM256_SHUFFLE.

Thanks.

sirrida · ‎09-27-2011

There should be a command like _MM256_SHUFFLE because shufps was promoted, albeit acting in-lane with the same masks for the lower and the higher halves.

If you proceed as before but with YMM instead of XMM registers, you will get the sum of the 4 singles of each lane in the lowest single of that lane. Alternatively, you could apply vhaddps twice with essentially the same result, since all these commands work in-lane.
The time needed for vhaddps should be about the same as for vshufps and vaddps together (Agner Fog's timing measurements say so), but vhaddps produces less code.

If you want to add all 8 values, you can thereafter use vperm2f128 (I don't know all the _MM* functions) and a simple vaddps.

My solution (untested):
vhaddps ymm0,ymm0,ymm0
vhaddps ymm0,ymm0,ymm0
vperm2f128 ymm1,ymm0,ymm0,0x11
vaddps ymm0,ymm0,ymm1
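
For readers working with intrinsics, the four instructions above might be written roughly as follows. This is only a sketch: the function name hsum256_hadd and the pointer-based interface are made up for illustration, and the target attribute is an assumption for GCC/Clang so that the file builds without -mavx.

```c
#include <immintrin.h>

/* Horizontal sum of 8 floats, mirroring the vhaddps/vperm2f128 sequence.
   hsum256_hadd is a hypothetical name, not part of any library. */
__attribute__((target("avx")))
float hsum256_hadd(const float *p)
{
    __m256 v = _mm256_loadu_ps(p);                   /* x0 .. x7 */
    v = _mm256_hadd_ps(v, v);                        /* pairwise sums inside each lane */
    v = _mm256_hadd_ps(v, v);                        /* each lane now holds its own 4-element sum */
    __m256 hi = _mm256_permute2f128_ps(v, v, 0x11);  /* upper lane copied into both lanes */
    v = _mm256_add_ps(v, hi);                        /* element 0 = sum of all 8 values */
    return _mm_cvtss_f32(_mm256_castps256_ps128(v)); /* read element 0 */
}
```

Calling it on an array holding 1.0f through 8.0f should return 36.0f.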

missing__zlw · ‎09-27-2011

Thank you for your reply.

This is what I tried to sum up all the values.

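x3 = _mm256_add_ps(x0, _mm256_movehdup_ps(x0));  // per lane: element 0 = e0+e1, element 2 = e2+e3
x4 = _mm256_unpackhi_ps(x3, x3);                 // per lane: bring e2+e3 down to element 0
x4 = _mm256_add_ps(x3, x4);                      // per lane: element 0 = sum of that lane
x5 = _mm256_permute2f128_ps(x4, x4, 0x01);       // swap the two 128-bit lanes
x5 = _mm256_add_ps(x5, x4);                      // element 0 = sum of all 8 values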

One question, I still don't know how the control value (mask) works in permute. Any document? I read the program guide and I still don't get it.

sirrida · ‎09-27-2011

In the instruction set reference manual the vshufps command is described as follows:
VSHUFPS (VEX.256 encoded version)
DEST[31:0] ← Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] ← Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] ← Select4(SRC2[127:0], imm8[5:4]);
DEST[127:96] ← Select4(SRC2[127:0], imm8[7:6]);
DEST[159:128] ← Select4(SRC1[255:128], imm8[1:0]);
DEST[191:160] ← Select4(SRC1[255:128], imm8[3:2]);
DEST[223:192] ← Select4(SRC2[255:128], imm8[5:4]);
DEST[255:224] ← Select4(SRC2[255:128], imm8[7:6]);

In other words: The upper 4 singles are shuffled in exactly the same way as the lower ones (the shuffle mask is used twice). There is no possibility to shuffle the upper ones differently from the lower ones with this command.
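
Because the immediate has the same 8-bit layout as for the SSE shufps, the existing _MM_SHUFFLE macro can be passed to _mm256_shuffle_ps unchanged. A small sketch of the in-lane behaviour, with reverse_in_lanes as a made-up helper name and the target attribute as a GCC/Clang assumption:

```c
#include <immintrin.h>

/* Reverses the 4 floats inside each 128-bit lane of in[0..7] and writes
   the result to out[0..7]. reverse_in_lanes is a hypothetical name. */
__attribute__((target("avx")))
void reverse_in_lanes(const float *in, float *out)
{
    __m256 v = _mm256_loadu_ps(in);
    /* _MM_SHUFFLE(0, 1, 2, 3): result element 0 takes source element 3,
       element 1 takes 2, element 2 takes 1, element 3 takes 0.
       The same selection is applied to the upper lane. */
    __m256 r = _mm256_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm256_storeu_ps(out, r);
}
```

With the input 0, 1, 2, 3, 4, 5, 6, 7 the output is 3, 2, 1, 0, 7, 6, 5, 4: each lane is reversed on its own, and nothing crosses the 128-bit boundary.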

To get manuals you can follow the link http://www.intel.com/products/processor/manuals/index.htm.
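
As for the control value of vperm2f128 (_mm256_permute2f128_ps): bits [1:0] of the immediate pick the 128-bit source for the lower half of the result and bits [5:4] pick it for the upper half, where 0 and 1 mean the lower and upper half of the first operand and 2 and 3 mean the lower and upper half of the second operand. Setting bit 3 or bit 7 zeroes the respective half instead. A sketch with the two values used in this thread; permute_demo is a made-up name and the target attribute is a GCC/Clang assumption:

```c
#include <immintrin.h>

/* Shows two control values of _mm256_permute2f128_ps on in[0..7].
   permute_demo is a hypothetical name. */
__attribute__((target("avx")))
void permute_demo(const float *in, float *swapped, float *high_twice)
{
    __m256 v = _mm256_loadu_ps(in);
    /* 0x01: lower half <- selector 1 (upper half of v),
             upper half <- selector 0 (lower half of v) */
    _mm256_storeu_ps(swapped, _mm256_permute2f128_ps(v, v, 0x01));
    /* 0x11: both halves <- selector 1 (upper half of v) */
    _mm256_storeu_ps(high_twice, _mm256_permute2f128_ps(v, v, 0x11));
}
```

For the input 0 .. 7, swapped receives 4, 5, 6, 7, 0, 1, 2, 3 and high_twice receives 4, 5, 6, 7, 4, 5, 6, 7.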

Brijender_B_Intel · ‎09-27-2011

I would suggest avoiding vhaddps. Instead, bring the upper lane down to the lower lane (vperm2f128) and add vertically. Then use vmovhps to move the upper 2 elements down and add. Then shuffle and add.
Pseudo code like this:
vperm2f128(ymm1, ymm0, ymm0, select upper half); // ymm0 -> x0 x1 x2 x3 x4 x5 x6 x7
                                                 // ymm1 -> x4 x5 x6 x7 x4 x5 x6 x7
vaddps(ymm1, ymm0);        // ymm1 -> x0+x4, x1+x5, x2+x6, x3+x7
vmovhps(ymm0, ymm1);       // ymm0 -> x2+x6, x3+x7, -, -
vaddps(ymm1, ymm0);        // ymm1 -> x0+x4+x2+x6, x1+x5+x3+x7
vshufps(ymm0, ymm1, 0x55); // ymm0 -> x1+x5+x3+x7 (select element 1)
vaddps(ymm1, ymm0);        // ymm1 -> x0+x4+x2+x6+x1+x5+x3+x7

I believe this will be faster than the HADDPS approach. However, keep in mind that this is a dependency chain, so you may want to do some extra work in between these instructions. Most of these instructions use only port 1 and port 5, so any instruction that runs on port 0, 3 or 4 (multiply, load, store for the next loop iteration) will increase performance; e.g. you can load the data for the next iteration, or calculate the indexes for the next iteration.
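
The pseudo code above could be expressed with intrinsics along these lines. This is a sketch, not a tuned implementation: it uses _mm256_extractf128_ps to fetch the upper lane in place of vperm2f128, and _mm_movehl_ps for the register-to-register "move the upper 2 elements down" step; both substitutions are assumptions of this sketch, as are the name hsum256_extract and the GCC/Clang target attribute.

```c
#include <immintrin.h>

/* Horizontal sum of 8 floats without vhaddps: fold the upper lane onto the
   lower lane, then reduce the remaining 4 values with shuffles and adds.
   hsum256_extract is a hypothetical name. */
__attribute__((target("avx")))
float hsum256_extract(const float *p)
{
    __m256 v  = _mm256_loadu_ps(p);
    __m128 lo = _mm256_castps256_ps128(v);            /* x0 x1 x2 x3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);          /* x4 x5 x6 x7 */
    __m128 s  = _mm_add_ps(lo, hi);                   /* x0+x4, x1+x5, x2+x6, x3+x7 */
    __m128 t  = _mm_movehl_ps(s, s);                  /* upper two elements moved down */
    s = _mm_add_ps(s, t);                             /* element 0: x0+x4+x2+x6, element 1: x1+x5+x3+x7 */
    t = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1)); /* element 1 in every position */
    s = _mm_add_ss(s, t);                             /* element 0 = sum of all 8 values */
    return _mm_cvtss_f32(s);
}
```

After the first add only 128-bit operations remain, and every step depends on the previous one, so the remark about interleaving independent work applies here as well.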

sirrida · ‎09-27-2011

It probably depends on the processor whether vshufps+vaddps or vhaddps is faster.
At least vhaddps has the potential to become faster than vshufps+vaddps because it's simply only one command and there is not more to be calculated than with a simple add. On the other hand vhaddps (like the other "horizontal" commands) is often implemented poorly.
On Atom, for example, the "horizontal" commands are a sheer catastrophe performance-wise.
It's sort of a self-fulfilling prophecy: The command is slow and thus avoided and hence processor optimizations thereof are omitted; repeat this until forever.
This is the same sad story as for other handy commands like loop/jecxz/enter/leave or even inc/dec/lea...

For the current Sandy Bridge implementation things look like this:
command: latency / reciprocal throughput
vshufps y,y,y,i: 1/1
vaddps y,y,y: 3/1
vhaddps y,y,y: 5/2

As you can see, the combined throughput is the same (although vhaddps has one cycle more latency than vshufps+vaddps). Fewer commands mean less memory and cache usage.
I hope in future processors vhaddps will become as fast as vaddps - I don't see any reason against it.
Thus let's use vhaddps and hope for a better future implementation.