SSE accelerated atoi/atof?

cfspc · ‎04-29-2009

Greetings,

Does the Intel compiler provide an SSE(2)-accelerated implementation of atoi/atof?

If not, do you know of such an implementation?

I need something that works with SSE2 and up.

Best regards,

Carlos

TimP · ‎04-29-2009

As a matter of curiosity, do you mean something like a vector+parallel implementation of strtod?
The only implementation of strtod I find in the icc libraries is in libdecimal, maybe similar to the bid in gcc.
In general, it doesn't seem like something that SSE2 would accelerate.

cfspc · ‎04-29-2009

Hi Tim,

I meant a vector implementation of strtod, or at least strtoi. Here is
a very handwavy description:

Take:
char str[] = "1234" ;

Using a vector instruction subtract '0' from each of the entries.
Using a vector instruction multiply the entries by (1000,100,10,1).
Add up the entries "horizontally".

For something like "1234567" I could use the operation above to
process "1234" and "567" (maybe padded with 0?) or "0123" and
"4567".

Thanks,

Carlos
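
A minimal sketch of the steps described above, using only SSE2 intrinsics. It assumes the input is exactly 4 (or 8) ASCII digits with no sign and no validation; the function names are hypothetical. PMADDWD (`_mm_madd_epi16`) does the multiply by (1000,100,10,1) and the first pairwise add in one step, and a byte shift plus add stands in for the final "horizontal" add. Whether this beats a scalar loop is not claimed here and would need measuring.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert exactly 4 ASCII digits, no validation. */
static uint32_t parse4_digits_sse2(const char *s)
{
    int32_t raw;
    memcpy(&raw, s, 4);                              /* load the 4 characters */
    __m128i v = _mm_cvtsi32_si128(raw);              /* place them in bytes 0..3 */
    v = _mm_sub_epi8(v, _mm_set1_epi8('0'));         /* subtract '0' from each entry */
    v = _mm_unpacklo_epi8(v, _mm_setzero_si128());   /* widen bytes to 16-bit lanes */
    /* Weights for the 4 digits; the zero weights discard the unused lanes. */
    const __m128i weights = _mm_setr_epi16(1000, 100, 10, 1, 0, 0, 0, 0);
    /* PMADDWD: lane0 = d0*1000 + d1*100, lane1 = d2*10 + d3 */
    v = _mm_madd_epi16(v, weights);
    /* Add lane1 into lane0 by shifting the register right by 4 bytes. */
    v = _mm_add_epi32(v, _mm_srli_si128(v, 4));
    return (uint32_t)_mm_cvtsi128_si32(v);
}

/* Hypothetical helper: 8 digits handled as two groups of 4, e.g. "0123" and "4567". */
static uint32_t parse8_digits_sse2(const char *s)
{
    return parse4_digits_sse2(s) * 10000u + parse4_digits_sse2(s + 4);
}
```

With this sketch, `parse4_digits_sse2("1234")` yields 1234, and "1234567" can be handled by left-padding it to "01234567" before calling `parse8_digits_sse2`.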

JD_Patel · ‎04-29-2009

We're evaluating the potential to accelerate these functions via new instruction set extensions, but nothing is available yet.

cfspc · ‎04-29-2009

Hi Patel,

Even though it would be nice to get more acceleration by using SSE4.* instructions,
I am actually interested in something that works with the "old" SSE2.

Thanks,

Carlos

TimP · ‎04-29-2009

SSE2 doesn't include "horizontal add" or other instructions implied in your description. Nor is horizontal add particularly efficient on Intel CPUs. You are welcome to compile your own version and check whether SSE2 has an advantage.
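
For anyone trying this: SSE2 has no single horizontal-add instruction, but the sum of four 32-bit lanes can still be formed with two shuffles and two adds. The following is a minimal sketch of that workaround (the function name is hypothetical), not a claim that it is faster than scalar code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Hypothetical helper: sum the four 32-bit lanes (a, b, c, d) of v using SSE2 only. */
static int32_t hsum_epi32_sse2(__m128i v)
{
    /* Swap the 64-bit halves and add: lanes become (a+c, b+d, a+c, b+d). */
    __m128i t = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    /* Swap adjacent 32-bit lanes and add: every lane now holds a+b+c+d. */
    t = _mm_add_epi32(t, _mm_shuffle_epi32(t, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(t);  /* extract lane 0 */
}
```

For example, the lanes (1000, 200, 30, 4) sum to 1234.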

SHIH_K_Intel · ‎04-29-2009

Quoting - cfspc

Hi Patel,

Even though it would be nice to get more acceleration by using SSE4.* instructions,
I am actually interested in something that works with the "old" SSE2.

Thanks,

Carlos

Hi. The C runtime library doesn't include SIMD-enhanced versions of the string-to-decimal conversion functions, but the functionality provided by strtod, atol and atof can be accelerated using combinations of the newer extensions.

Here are some hints based on my own experimentation, which may be helpful if you would like to roll your own customized version.

To implement the equivalent of the functionality provided by strtod/atol, the things to consider include:

1. Validating that the character stream is a proper decimal string representation (0-9, + and -). This is where SSE4.2 provides excellent primitives. If you control the input data stream and can take a shortcut that bypasses character-set validation, you can do without SSE4.2 in your customized implementation.

2. The validated chunks of characters can have variable lengths, so you probably want to pad your valid byte chunks with 0 bytes at the end. I find pblendv useful, which requires SSE4.1, but you can find workarounds using SSE2.

3. If you want to support a result range beyond +-2G, you will need int64 arithmetic, so you will find PMULUDQ and PMULLD useful, in addition to PMADDWD. If you are satisfied with a result range within +-2G, you can probably do without PMULLD (an SSE4.1 instruction).

4. You will need PHADDD, like Tim mentioned, which is an SSSE3 instruction. Although PHADDD latency ranges from 3 to 5 cycles across the 3 generations of hardware that support it, decimal conversion requires quite a few other SIMD instructions, and the dependency chains are more piecewise than one big continuous chain, so the frequency with which PHADDD is needed should not make its latency an issue.
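
As an illustration of points 1 and 2 on SSE2-only hardware, here is a minimal sketch. It is one possible workaround, not the implementation described above. It assumes a readable 16-byte input buffer, checks only for the digits 0-9 (sign handling is left out), and all function names are hypothetical. The byte select built from AND/ANDNOT/OR stands in for an SSE4.1 byte blend and requires each mask byte to be all ones or all zeros.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Bit i of the result is set when byte i of v is an ASCII digit '0'..'9'.
   Signed compares work here: bytes >= 0x80 are negative and fail the first test. */
static int digit_mask_sse2(__m128i v)
{
    __m128i ge0 = _mm_cmpgt_epi8(v, _mm_set1_epi8('0' - 1));
    __m128i le9 = _mm_cmplt_epi8(v, _mm_set1_epi8('9' + 1));
    return _mm_movemask_epi8(_mm_and_si128(ge0, le9));
}

/* Per byte: take a where the mask byte is 0xFF, b where it is 0x00. */
static __m128i select_bytes_sse2(__m128i mask, __m128i a, __m128i b)
{
    return _mm_or_si128(_mm_and_si128(mask, a), _mm_andnot_si128(mask, b));
}

/* Hypothetical helper: reads 16 bytes at s, returns the number n of leading
   ASCII digits, and stores their values (0..9) in *digits with bytes n..15
   padded with 0. */
static int leading_digits_sse2(const char *s, __m128i *digits)
{
    __m128i v = _mm_loadu_si128((const __m128i *)s);
    int m = digit_mask_sse2(v);
    int n = 0;
    while (n < 16 && ((m >> n) & 1))  /* count consecutive digits from byte 0 */
        n++;
    const __m128i idx = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
    __m128i keep = _mm_cmplt_epi8(idx, _mm_set1_epi8((char)n)); /* 0xFF where index < n */
    __m128i vals = _mm_sub_epi8(v, _mm_set1_epi8('0'));         /* ASCII -> digit value */
    *digits = select_bytes_sse2(keep, vals, _mm_setzero_si128());
    return n;
}
```

The padded digit vector can then be fed to a PMADDWD-based multiply-and-add step; aligning the digits so that the weights match the actual length n is a separate step not shown here.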

If you need further help, you can email me at shihjong.kuo@intel.com