
So I'm optimizing a series of slow math operations into a single vectorized routine using SSE.

This routine does a bunch of 64-bit integer math with SSE, converts to floating point (doing a store / load / store / load combo with the FPU), does a bunch of floating-point math with SSE, then finally converts back to 64-bit integers and stores the results.

From timing tests I've done, avoiding the FPU load & store that the int->float conversion requires would make the routine almost three times as fast on large data sets.

I've got that last step (going from 64-bit floats to 64-bit ints) running quite fast within SSE, since I know the range of the data (< 2^52) and can just add in rounding and normalizing constants and mask out the mantissa to extract it as an integer.

Does anyone know of any easy tricks (*i.e.*, ones that stay within xmm registers and don't require branching) to convert 64-bit ints to 64-bit floats, given that there's no SSE instruction for doing this (and please correct me if I'm wrong)?

I was thinking that CVTDQ2PD (convert dword ints to doubles) could be used for this by doing the high dwords and low dwords separately and combining them after adjusting the exponent, but that got messy. There's also BSR (bit scan reverse), which works on general-purpose registers and could help figure out the exponent (once the value was made positive and the sign stored away), but that was also less than satisfying. Both of these lines of thought got so complicated that I'm not sure they'd be faster than just going back and forth to memory.

Any ideas?

Thanks!!

-PV

---

Here is C code that performs conversion of **unsigned** ints < 2^52. The first step needs a shift and a bitwise OR, the second an fp subtraction, and the third an integer addition; I think it could be coded in at most 10 instructions.

---cut---
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
    // convert UNSIGNED ints
    // assumption: x < 2^52
    // (note: x << 12 must fit in the 52-bit mantissa)
    uint64_t x = strtoll(argv[1], NULL, 10);
    uint64_t x_double;

    // 1. construct double 2^0 * (1 + (x * 2^12) / 2^52) = 1 + x/2^40
    x_double = ((uint64_t)0x3ff << 52) | (x << 12);
    //         ^^^^ exponent ^^^^^       ^^ mantissa = x * 2^12

    // 2. subtract 1 from the double -- mantissa will be automatically normalized
    (*(double*)&x_double) -= 1.0;
    // at this point the float value is x/2^40

    // 3. multiply by 2^40 -- just adjust the exponent
    x_double += (uint64_t)40 << 52;

    // all done
    printf("hack = %f\n", *(double*)&x_double);
    printf("FPU  = %f\n", (double)x);
    return 0;
}
---cut---

HTH

w.

---

uint64_t x = strtoll(argv[1], NULL, 10);
uint64_t x_double;

// 1. construct double 2^52 + x
x_double = ((uint64_t)(0x3ff + 52) << 52) | x;
//         ^^^^^^ exponent ^^^^^^^         ^ mantissa

// 2. subtract 2^52 from the double -- mantissa will be automatically normalized
(*(double*)&x_double) -= (double)(1LL << 52);

// all done

That works for unsigned values, and getting it to work for signed integers is conceptually straightforward: store the sign bit of the source integer, run the trick above on the absolute value of the integer, then set the sign bit of the double to match the source. Easy. Or so I'd hope...

The thing I'm struggling with is that there's no PABSQ instruction, i.e., no instruction to get the absolute value of a 64-bit integer. There are several SSE4 instructions that would help get the absolute value, such as PCMPGTQ (compare quadwords, which could test for less-than-zero) or BLENDVPD (swap out values based on the sign bit; this would be perfect), but I can't use those since I'm limited to SSE2 and SSE3 instructions (the code needs to run on older machines).

But I can get the absolute value through a bunch of shifts, logic ops, and a multiply by -1. But d'oh! There's no 64-bit integer multiply in SSE! I can do the multiply with three 32-bit partial multiplies and adds, but this is starting to get complex, and it makes me wonder if there's a better way. Maybe doing comparisons and branching in the inner loop is faster? I'm not sure; it's conceptually less than satisfying.

FWIW, here's the signed integer version of that same code:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
    // convert SIGNED ints
    // assumption: |x| < 2^52
    int64_t x = strtoll(argv[1], NULL, 10);
    uint64_t x_double;

    // sign = 1 for positive values, -1 for negative. Shift the sign bit all
    // the way over, then back, to get 2 or 0. Subtract from 1 to make -1 or 1.
    int64_t sign = 1 - (((*(uint64_t *)&x) >> 63) << 1);

    // Get the absolute value -- but we can't do this in SSE without a bunch of steps!
    int64_t abs = x * sign;

    // 1. construct double 2^52 + abs
    x_double = ((uint64_t)(0x3ff + 52) << 52) | abs;
    //         ^^^^^^ exponent ^^^^^^^         ^^^ mantissa

    // 2. subtract 2^52 from the double -- mantissa will be automatically normalized
    (*(double*)&x_double) -= (double)(1LL << 52);

    // 3. Set the sign bit based on the original.
    x_double |= (*(uint64_t *)&x) & (1ULL << 63);

    // all done
    printf("hack = %f\n", *(double*)&x_double);
    printf("FPU  = %f\n", (double)x);
    return 0;
}

Cheers,

-PV

---

To get the absolute value you can use this branch-less approach:

movl %eax, %ebx                  ; copy eax
sarl $31, %ebx                   ; fill ebx with the sign bit
xorl %ebx, %eax                  ; negate eax (if negative)
subl %ebx, %eax                  ; add 1 to eax (if negative)

and SSE2 code:

pshufd $0b11110101, %xmm0, %xmm1 ; populate dwords 3 and 1
psrad $31, %xmm1                 ; fill quadwords with the sign bit
pxor %xmm1, %xmm0                ; negate
psubq %xmm1, %xmm0               ; add 1

w.

---

**pvercello!** I've developed a very fast method for signed int conversion. Let's start with a feature of two's complement: a negative number treated as unsigned has the value (1 << bits_in_word) - abs(x) -- in our case bits_in_word = 52.

First we construct the double: (1 << 52) + (x & mask_lower_52_bits).

For positive x the value of this double is `(1 << 52) + x` -- to get x, double(1 << 52) has to be subtracted.

For negative x the value is `(1 << 52) + ((1 << 52) - abs(x)) = (1 << 53) - abs(x)` -- to get -abs(x) we have to subtract double(1 << 53).

double(1 << 52) = 0x43300..00 and double(1 << 53) = 0x43400..00 -- the difference is 0x00100..00. We can load double(1 << 52) and then add 0x00100..00 if x is negative -- this doesn't need any branches; the method I showed in an earlier post can be used.

SSE2 implementation:

void SSE_convert(int64_t in[2], double out[2]) {
    __asm__ volatile (
        " movdqa (%%eax), %%xmm0 \n"
        " movdqa %%xmm0, %%xmm1  \n"
        " movdqa CONST, %%xmm2   \n"
        // step 1.
        " pand MASK, %%xmm0      \n"
        " por EXP, %%xmm0        \n"
        // step 2.
        " psrad $31, %%xmm1      \n"
        " pand NEG, %%xmm1       \n"
        " paddd %%xmm1, %%xmm2   \n"
        // step 3.
        " subpd %%xmm2, %%xmm0   \n"
        " movapd %%xmm0, (%%ebx) \n"
        : /* no output */
        : "a" (in),
          "b" (out)
        : "memory"
    );
}

The above procedure comes from a test program I wrote (it's 4 kB, so I didn't paste it here): http://republika.pl/wmula/snippets/asm/convert_int52_double.c

Results from my machine (C2D E8200 @ 2.6GHz):

$ gcc -O3 -DASM convert_int52_double.c -o test
$ ./test 2 0
verify
all ok!
$ time ./test 0 100000
FPU, 100000 iterations
real  0m2.487s
user  0m2.476s
sys   0m0.000s
$ time ./test 1 100000
SSE, 100000 iterations
real  0m0.987s
user  0m0.984s
sys   0m0.000s

In every iteration 4096 pairs of ints were converted. The SSE procedure is 2.4 times faster than the FPU (for the FPU conversion GCC generated just two pairs of fild/fstp).

Note: the SSE procedure reloads all masks on every call; in a real application the masks should be kept in registers.

HTH

w.

---

That's clever, and it saves two instructions over the arithmetic-shift/shuffle/xor approach. That's the solution I was looking for.

Unfortunately, by my timing tests, it's still just a little bit slower than using the FPU.

count  max  min  total  average
  6     52   51   307   **51**   Convert_64s64f  - FPU
  6     79   75   468   **78**   Convert_64s64f2 - SSE

That's running six times on a buffer with 40000 elements, with a warm data cache (that is, the buffer was processed once before calling these routines; both times are about double when the data is not in the cache).

That's also with a pipeline-friendly FPU loop that looks like this:

void Convert_64s64f(const Int64* pSrc, Float64* pDst, Int32 len)
{
    Int32 lenDiv4 = len >> 2;
    while( lenDiv4-- )
    {
        // A little loop unrolling here (makes a big difference)
        *pDst     = (Float64)*pSrc;
        *(pDst+1) = (Float64)*(pSrc+1);
        *(pDst+2) = (Float64)*(pSrc+2);
        *(pDst+3) = (Float64)*(pSrc+3);
        pSrc += 4;
        pDst += 4;
    }
    len = len & 3;
    while( len-- )
        *(pDst++) = (Float64)*(pSrc++);
}

So I think I might have to stick with the FPU conversions after all. Oh well, it was a fun exercise.

-pv

---

However, when all masks/constants are kept in XMM registers the difference is not significant; for example, on my machine FPU conversion of 128000 vectors took 2.61 s and SSE took 2.64 s.

BTW, step 2 of the SSE conversion could be done a bit more simply. Since |x| < 2^52, if x < 0 then bits 52:63 are all set -- no shift is needed to populate the sign. The integer addition can be replaced with an xor: 0x433.. xor 0x007.. = 0x434... Result: no transitions between the integer and fp units. Code outline: "pand packed_qword(0x007..000), %xmm1; pxor %xmm1, %xmm2".

w.
