Here is a question received by Intel Developer Services Support, along with the response provided by our Application Engineers:
Q. I'm using the SSE2 extensions quite heavily, and there's one step of my calculations that I can't find a good way to do, but it's simple enough that I think there's a "right" way to do it. I've got eight signed 16-bit words in an SSE2 register, and I want to obtain the sum of all of the words held in the register. That's it. My current method uses the shift and add instructions until I get the whole sum, which takes about nine instructions. These eight values are the result of a series of parallel SSE2 operations, so I'm pretty much stuck with that layout. If you could give me suggestions, I would very much appreciate it.
A. I would recommend using the PSHUFD instruction to copy the four high words into the lower 64 bits of a second XMM register, and then using the PSHUFLW and PADDW instructions to combine the results. Something like this (assume the eight 16-bit words are located in the xmm0 register):
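PSHUFD  xmm1, xmm0, 0xEE    ; xmm1 low 64 bits = words 4..7 of xmm0
PADDW   xmm0, xmm1          ; words 0..3 now hold four pairwise sums
PSHUFLW xmm1, xmm0, 0xEE    ; xmm1 words 0..1 = words 2..3 of xmm0
PADDW   xmm0, xmm1          ; words 0..1 now hold two partial sums
PSHUFLW xmm1, xmm0, 0x01    ; xmm1 word 0 = word 1 of xmm0
PADDW   xmm0, xmm1          ; word 0 now holds the total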
At the end, the sum of the eight 16-bit values should be located in the lower 16 bits of the xmm0 register; the remaining bits of xmm0 hold intermediate values and should be ignored. Because PADDW wraps on overflow, the result is the sum modulo 2^16. This uses six instructions (including the initial PSHUFD instruction), and the PSHUFLW instructions are more efficient than the byte-wise shift instructions. This should be more efficient than your current implementation.
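For anyone working in C rather than assembly, the same shuffle-and-add reduction can be sketched with the SSE2 intrinsics from `<emmintrin.h>`. This is only an illustration under a few assumptions: the helper name `hsum_epi16` is made up for this example, the final shuffle uses the immediate 0x01 so that word 1 is moved into word 0, and the result wraps modulo 2^16 just as PADDW does.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper (name chosen for this example only):
 * returns the sum of the eight signed 16-bit lanes of v.
 * The additions wrap on overflow, like PADDW. */
static int16_t hsum_epi16(__m128i v)
{
    /* PSHUFD 0xEE: copy the high 64 bits (words 4..7) into the low 64 bits */
    __m128i t = _mm_shuffle_epi32(v, 0xEE);
    v = _mm_add_epi16(v, t);            /* words 0..3 = four pairwise sums */

    /* PSHUFLW 0xEE: move words 2..3 into words 0..1 */
    t = _mm_shufflelo_epi16(v, 0xEE);
    v = _mm_add_epi16(v, t);            /* words 0..1 = two partial sums */

    /* PSHUFLW 0x01: move word 1 into word 0 */
    t = _mm_shufflelo_epi16(v, 0x01);
    v = _mm_add_epi16(v, t);            /* word 0 = total */

    /* Only the low 16 bits are meaningful; the rest is discarded here */
    return (int16_t)_mm_cvtsi128_si32(v);
}
```

A caller would typically load the eight values with `_mm_loadu_si128` (or `_mm_load_si128` for 16-byte-aligned data) and pass the register to the helper. If the total can exceed the 16-bit range, the lanes would need to be widened before summing, which this sketch does not attempt.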
==
Lexi S.
Message Edited by intel.software.network.support on 12-07-2005 04:46 PM