Re: Int64 <-> double conversion with SSE?

pvercello · ‎06-07-2008

Hi there,

So I'm optimizing a series of slow math operations into a single vectorized routine using SSE instructions / intrinsics. This is in 32 bit code. I'm trying to figure out an elegant way to convert 64 bit integers to double precision floating point without using the FPU and incurring the hit of having to store into memory then reload into xmm registers to get back into the SSE world. SSE doesn't support 64 bit int <-> double conversion, as far as I can tell, so using the FPU is the only obvious way. Does anyone know of an elegant way to do this that would be faster than going to memory with the FPU?

From timing tests I've done, avoiding the FPU load & store that this int->float conversion requires would make the routine almost three times as fast with large data sets.

This routine does a bunch of 64 bit integer math with SSE, converts to floating point (doing a store / load / store / load combo with the FPU), does a bunch of floating point with SSE, then finally returns to 64 bit integers and stores the results.

I've got that last step going from 64 bit floats to 64 bit ints quite fast within SSE, since I know the range of the data (<2^52) and can just add in rounding & normalizing constants and mask out the mantissa to extract it as an integer.

Does anyone know of any easy tricks (i.e., one that stays within xmm registers and doesn't require branching) to convert 64 bit ints to 64 bit floats, given that there's no SSE instruction for doing this (and please correct me if I'm wrong)?

I was thinking that the CVTDQ2PD (convert DWord ints to doubles) could be utilised for this by doing the high dwords and low dwords separately and combining them after adjusting the exponent, but that got messy. There's also the BSR - bit scan reverse - instruction which works on regular registers which could help figure out the exponent (once the value was made positive and the sign stored away), but that was also less than satisfying. Either of these lines of thought were so complicated that I'm not sure they're faster than just going back and forth to memory.

Any ideas?

Thanks!!

-PV

wmula · ‎06-07-2008

Hi!

Here is C code that perform conversion of **unsiged** ints < 2^52. First step need shift and bitwise or, second fp-subtraction and third integer addition; I think it could be coded with max 10 instructions.

---cut---

#include 
#include 
#include 

int main(int argc, char* argv[]) {
	// convert UNSINGED ints
	// assumption: x < 2^52
	
	uint64_t x = strtoll(argv[1], NULL, 10);
	uint64_t x_double;

	// 1. construct double 2^0 * (1 + (x * 2^12) / 2^52) = 1 + x/2^40
	x_double = ((uint64_t)0x3ff << 52) |   (x << 12);
	//         ^^^^^ exponent ^^^^^^^^   mantissa * 2^12

	// 2. extract 1 from double -- mantissa will be automatically normalized
	(*(double*)&x_double) -= 1.0;
	// at this point float values is x/2^40

	// 3. multiply by 2^40- -- just adjust exponent
	x_double += (uint64_t)40 << 52;

	// all done
	printf("hack = %f
", *(double*)&x_double);
	printf("FPU  = %f
", (double)x);

	return 0;
}
---cut---

HTH
w.

pvercello · ‎06-08-2008

Thanks - that's great. Yeah that's the type of approach I'm looking for. Now FWIW, the one you post there only works for 40 bits of integer because of the "x << 12" - you can actually save a step and get the full 52 bits by doing the following:

uint64_t x = strtoll(argv[1], NULL, 10);
uint64_t x_double;

// 1. construct double x + 2^53
x_double = ((uint64_t)(0x3ff + 52) << 52) | x;
// ^^^^^ exponent ^^^^^^^^ mantissa

// 2. extract 2^53 from double -- mantissa will be automatically normalized
(*(double*)&x_double) -= (double)(1LL << 52);

// all done

That works for unsigned, but getting this to work for signed integers is also conceptually straightforward - store the sign bit of the source integer, use the absolute value of the integer for this trick mentioned above, then set the sign bit of the double to match the source. Easy. Or so I'd hope...

The thing I'm struggling with is the fact that there's no PABSQ instruction - no instruction to get the absolute value of a 64 bit integer. There're several SSE4 instructions that would help to get the absolute value - PCMPLTQ (compare quadword to see if it's less than zero), or BLENDVPD (swap out values based on the sign bit - this would be perfect), but I can't use those since I'm limited to using only SSE2 & 3 instructions (need to run on older machines).

But I can get the absolute value through a bunch of shifts and logic ops and a multiply by -1 - But d'oh! There's no 64-bit integer multiply in SSE! So I can do the multiply by doing 3 32-bit integer partial multiplies and adds, but this is starting to get complex, and makes me wonder if there's a better way. Maybe doing comparisons and branching in the inner loop is faster? I'm not sure; it's conceptually less than satisfying.

FWIW, here's the signed integer version of that same code:

int main(int argc, char* argv[]) {
// convert SIGNED ints
// assumption: x < 2^52

int64_t x = strtoll(argv[1], NULL, 10);
uint64_t x_double;

// sign = 1 for positive value, -1 for negative. Shift sign bit all
// the way over, then back to equal 2 or 0. Subtract from 1 to make -1 or 1
int64_t sign = 1 - (((*(uint64_t *)&x) >> 63) << 1);

// Get absolute value - But we can't do this in SSE without a bunch of steps!
int64_t abs = x * sign;

// 1. construct double x + 2^53
x_double = ((uint64_t)(0x3ff + 52) << 52) | abs;
// ^^^^^ exponent ^^^^^^^^ mantissa

// 2. extract 2^53 from d ouble -- mantissa will be automatically normalized
(*(double*)&x_double) -= (double)(1LL << 52);

// 3. Set sign bit based on the original.
x_double |= (*(uint64_t *)&x) & (1LL << 63);

// all done
printf("hack = %f ", *(double*)&x_double);
printf("FPU = %f ", (double)x);

return 0;
}

Cheers,

-PV

wmula · ‎06-08-2008

Shifting x by 12 was mistake, sorry -- at 2-3AM brain don't work well. :)

To get absolute value you can use this branch-less approach:

mov  %eax, %ebx	; copy eax
sarl  $32, %ebx	; fill ebx with sign bit
xorl %ebx, %eax	; negate eax (if negative)
subl %ebx, %eax	; add 1 to eax (if negative)

and SSE2 code:

pshufd $0b11110101, %xmm0, %xmm1 ; populate dwords 3 and 1
psrad    $32, %xmm1	; fill quad words with sign bit
pxor   %xmm1, %xmm0	; negate
psubq  %xmm1, %xmm0	; add 1

w.

wmula · ‎06-09-2008

Hi pvercello!

I've developed very fast method for signed int conversion. Let start with feature of U2: negative number treated as unsigned has value (1 << bits_in_word) - abs(x) -- in our case bits_in_word=52.

First we construct double: (1 << 52) + (x & mask_lower_52_bits).

For positive x value of such double is (1 << 52) + x -- to get x double(1 << 52) have to be subtracted.
For negative x value is (1 << 52) + ((1 << 52) - abs(x)) = (1 << 53) - abs(x) --- to get -abs(x) we have to subtract double(1 << 53).

double(1 << 52) = 0x43300..00, and double(1 << 52) = 0x43400..00 -- difference is 0x00100..00. We can load double(1 << 52) and then add 0x00100..00 if x is negative -- this don't need any branches, method I've shown in earlier post could be used.

SSE2 implementation:

void SSE_convert(int64_t in[2], double out[2]) {
	__asm__ volatile (
	"	movdqa (%%eax), %%xmm0		
"
	"	movdqa  %%xmm0, %%xmm1		
"
	"	movdqa   CONST, %%xmm2		
"

	// step 1.
	"	pand      MASK, %%xmm0		
"
	"	por        EXP, %%xmm0		
"

	// step 2.
	"	psrad      $32, %%xmm1		
"
	"	pand       NEG, %%xmm1		
"
	"					
"
	"	paddd     %%xmm1, %%xmm2	
"

	// step 3.
	"	subpd     %%xmm2, %%xmm0	
"
	"	movapd    %%xmm0, (%%ebx)	
"

	: /* no output */
	: "a" (in),
	  "b" (out)
	: "memory"
	);
}

Above procedure comes from test program I wrote (it has 4kB, so I didn't paste here): http://republika.pl/wmula/snippets/asm/convert_int52_double.c

Results from my machine (C2D E8200 @ 2.6GHz):

$ gcc -O3 -DASM convert_int52_double.c -o test
$ ./test 2 0
verify
all ok!

$ time ./test 0 100000
FPU, 100000 iterations

real    0m2.487s
user    0m2.476s
sys     0m0.000s

$ time ./test 1 100000
SSE, 100000 iterations

real    0m0.987s
user    0m0.984s
sys     0m0.000s

In every iteration 4096 pairs of ints were converted. SSE proc. is 2,4 times faster then FPU (for FPU conv. GCC generated just two pairs of fild/fstp).

Note: SSE procedure reloads all masks on every call, in real application masks should be stored in registers.

HTH
w.

pvercello · ‎06-09-2008

That's clever, and it saves two instuctions over the arithmetic shift->shuffle->xor approach. That's the solution I was looking for.

Now unfortunately, by my timing tests, it's still just a little bit slower than using the FPU.

count max min total average
6 52 51 307 51 Convert_64s64f - FPU
6 79 75 468 78 Convert_64s64f2 - SSE

That's running six times on a buffer with 40000 elements. Also, that's with a fresh data cache (that is, the buffer was processed once before calling these routines. Times are both about double when data is not in the cache)

That's also with an pipeline-friendly FPU loop that looks like this:

Convert_64s64f(const Int64* pSrc, Float64* pDst, Int32 len)
{
Int32 lenDiv4 = len >> 2;
while( lenDiv4-- )
{
*pDst = (Float64)*pSrc;// A little loop unwinding here(makes a big difference)
*(pDst+1) = (Float64)*(pSrc+1);
*(pDst+2) = (Float64)*(pSrc+2);
*(pDst+3) = (Float64)*(pSrc+3);
pSrc += 4;
pDst += 4;
}
len = len & 3;
while( len-- )
*(pDst++) = (Float64)*(pSrc++);
}

So I think I might have to stick to using the FPU conversions after all. Oh well, was a fun exercise.

-pv

wmula · ‎06-13-2008

Indeed, for large data SSE method is slower. Almost 50% of time is consumed by memory writes! Even if non-temporal stores are used SSE code is not faster.

However, when all mask/constants are stored in XMM registers differences are not significant; for example on my machine FPU conv. of 128000 vectors took 2.61 s, SSE -- 2.64 s.

BTW step 2 of SSE conversion could be done bit simplier. Since |x| < 2^52, if x < 0 then bits 52:61 are set -- shift isn't needed to populate sign bit. Integer addition can be replaced with xor: 0x433.. xor 0x007.. = 0x434. Result: no transitions between integer and fp-unit. Code outline: "pand packed_qword(0x007..000), %xmm1; pxor %xmm1, %xmm2".

w.