
I need to perform an intra-register (horizontal) sum many times using intrinsics. For example:

x += a[0] + a[1] + a[2] + a[3]

where a is of type __m128.

How can I do that? Which is the fastest way?

Thanks in advance!


5 Replies


The parentheses in

x += (a[0] + a[1]) + (a[2] + a[3]);

are likely to be ignored by icc -fast (default) or even gcc -ffast-math.

If you can use SSE3, you can write this with the horizontal add instruction (haddps). It will not be the fastest option on all CPU types, although it should produce the minimum number of instructions.


**cikikamakuro,**

The definition of hadd for two vectors a and b is:

result = b2+b3 | b1+b0 | a2+a3 | a1+a0

This is not what I want, which is:

a0+a1+a2+a3

I can do it using shifts or something similar, but not in a single assembly instruction.
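For reference, applying hadd twice with the same register as both operands does collapse the four elements into the full sum, at the cost of two instructions rather than one. A minimal sketch, assuming SSE3 is available and a GCC/Clang-style compiler (the `target` attribute is compiler-specific, and `hsum_hadd` is a hypothetical name used only for illustration):

```c
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Sketch: horizontal sum of the four floats in v using two hadd steps.
   The target attribute is GCC/Clang-specific (an assumption); with other
   compilers, enable SSE3 through the compiler's own option instead. */
__attribute__((target("sse3")))
float hsum_hadd(__m128 v)
{
    /* hadd(v, v) = v2+v3 | v0+v1 | v2+v3 | v0+v1  (high to low) */
    __m128 pairs = _mm_hadd_ps(v, v);
    /* second hadd: every element now holds (v0+v1) + (v2+v3) */
    __m128 total = _mm_hadd_ps(pairs, pairs);
    /* return the low element as a scalar */
    return _mm_cvtss_f32(total);
}
```

Note that the additions are grouped as (v0+v1) + (v2+v3), so the rounding can differ from a strict left-to-right scalar sum.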


Quoting unrue

a0+a1+a2+a3

I can do it using shifts or something similar, but not in a single assembly instruction.


// Accumulating xmm0[0]+xmm0[1]+xmm0[2]+xmm0[3] into the low element of xmm0

//SSE3:

haddps xmm0,xmm0

haddps xmm0,xmm0

//SSE2:

movhlps xmm1, xmm0 // Get bits 64-127 from xmm0

addps xmm0, xmm1 // Sums are in 2 dwords

pshufd xmm1, xmm0, 1 // Get bits 32-63 from xmm0

addss xmm0, xmm1 // Sum is in one dword

//SSE:

movaps xmm1, xmm0

shufps xmm1, xmm1,(2+4*3+16*0+64*1)

addps xmm0, xmm1

movaps xmm1, xmm0

shufps xmm1, xmm0,(1+4*1+16*3+64*3)

addss xmm0, xmm1
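The SSE-only sequence above maps fairly directly onto C intrinsics. The following is a minimal sketch of that mapping, not a tuned implementation; `hsum_sse` is a hypothetical name, and the shuffle constants are the same values as in the assembly, written with the `_MM_SHUFFLE` macro:

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_shuffle_ps, _mm_add_ps, ... */

/* Sketch: horizontal sum of the four floats in v with SSE only. */
float hsum_sse(__m128 v)
{
    /* swap the 64-bit halves: shuf = v1 | v0 | v3 | v2  (high to low) */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2));
    /* sums = v3+v1 | v2+v0 | v1+v3 | v0+v2 */
    __m128 sums = _mm_add_ps(v, shuf);
    /* bring element 1 (v1+v3) down to element 0 */
    shuf = _mm_shuffle_ps(sums, sums, _MM_SHUFFLE(3, 3, 1, 1));
    /* low element = (v0+v2) + (v1+v3) */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```

A call such as `x += hsum_sse(a);` then corresponds to the accumulation in the original question. The compiler will usually drop the explicit register copies (movaps) where it can.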

------------------------------

I have not yet found how to do the same thing with AVX __m256 (ymm[0]+ymm[1]+...+ymm[7]).

If anyone has done it, please let me know here.


AVX has a lane concept: 4 elements are in the upper lane (x4-x7) and 4 elements are in the lower lane (x0-x3). First you need to bring the upper 4 elements down.

__m256 uLane = _mm256_permute2f128_ps(ymm0, ymm0, 0x01); // swap the two 128-bit lanes

// Depending on the order in which you add the elements, the result may differ, as Tim pointed out earlier.

// An efficient way is to add the two lanes now:

ymm0 = _mm256_add_ps(ymm0, uLane);

Now follow the SSE2 code, as the lower lane of ymm0 holds the 4 remaining partial sums.
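Putting the two steps together, a complete 8-element reduction could look like the sketch below. It assumes a GCC/Clang-style compiler (the `target` attribute is compiler-specific) and a CPU with AVX support; `hsum_avx` is a hypothetical name, and the function takes a pointer rather than a `__m256` so that callers compiled without AVX enabled can still call it:

```c
#include <immintrin.h>  /* AVX: __m256, _mm256_* intrinsics */

/* Sketch: sum of the 8 floats starting at p (unaligned load).
   The target attribute is GCC/Clang-specific and is an assumption here. */
__attribute__((target("avx")))
float hsum_avx(const float *p)
{
    __m256 v = _mm256_loadu_ps(p);
    /* swap the 128-bit lanes so the upper 4 elements come down */
    __m256 swapped = _mm256_permute2f128_ps(v, v, 0x01);
    /* each lane now holds x0+x4, x1+x5, x2+x6, x3+x7 */
    __m256 sum4 = _mm256_add_ps(v, swapped);
    /* keep only the lower lane and reduce it as in the SSE2 sequence */
    __m128 lo = _mm256_castps256_ps128(sum4);
    __m128 hi = _mm_movehl_ps(lo, lo);                   /* upper two partial sums */
    lo = _mm_add_ps(lo, hi);                             /* two partial sums left */
    hi = _mm_shuffle_ps(lo, lo, _MM_SHUFFLE(1, 1, 1, 1)); /* element 1 to element 0 */
    lo = _mm_add_ss(lo, hi);                             /* final sum in low element */
    return _mm_cvtss_f32(lo);
}
```

The summation order here is ((x0+x4)+(x2+x6)) + ((x1+x5)+(x3+x7)), so the result can differ in the last bits from a sequential scalar loop.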

....
