Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.

bit interleave instruction

torusle
Beginner
1,283 Views
Hi there.

This is the place to ask for new instructions, right? Here is something I have wanted for years: bit interleaving.

The idea is to take the lower 16 bits of an integer, mux them into all the odd bits of the result, and set all the even bits of the result to zero.

Such an instruction could give huge performance improvements for all kinds of two-dimensional array accesses:

Assume we have a big matrix to work on, big enough that it no longer fits into the L1D cache. For simplicity, also assume that width and height are powers of two.

The addressing of such a matrix looks like this:

index = x * elements_per_line + y;
element = matrix_base[index];

Accessing this matrix is very fast if we access the data sequentially, e.g. in rows. That is not always possible, and there are access patterns that give near worst-case performance figures (matrix transpose is one of them).

With bit interleaving we could simply store the matrix in Morton order. This gets around the worst-case cache behaviour for matrix transpose, and gives good results for other access patterns as well, as long as x and y have some locality.

index = bit_interleave(x) + bit_interleave(y)*2;
element = matrix_base[index];

In some cases one can work around the problem and do the Morton ordering incrementally, but unfortunately that is not always possible. Interleaving bits on the CPU, on the other hand, is very slow (still worth it if it saves a cache miss, though). I know that matrix transpose is a bad example because there are better ways to do it, but I think it's OK for showing the principle.
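
The incremental trick mentioned above can be sketched in C for the case where x lives in the even bits of the Morton index (the mask names are mine, chosen for illustration): to step x by one, temporarily force all y bits to 1 so the carry of the increment ripples straight through them.

```c
#include <stdint.h>

#define MASK_X 0x55555555u  /* even bits of the index hold x */
#define MASK_Y 0xAAAAAAAAu  /* odd bits of the index hold y  */

/* given the Morton index of (x, y), return the index of (x + 1, y):
   setting the y bits to 1 lets the +1 carry skip over them, then the
   original y bits are restored */
uint32_t morton_inc_x(uint32_t m)
{
    return (((m | MASK_Y) + 1) & MASK_X) | (m & MASK_Y);
}
```

Stepping y works the same way with the roles of the masks swapped, which is why fixed-step walks over a Morton-ordered matrix do not need a full interleave per element.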

Btw - bit interleaving has other uses as well. I use it to do arithmetic on multiple bit-fields packed into a single integer, much like MMX, with the difference that my bit-fields are arbitrary and not always of identical size. The zero-bit gaps between the bits can be used to stop a carry from moving out of an overflowing bit-field into the next one during additions. In general it has great bit-twiddling potential.
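
As a minimal sketch of that carry-blocking idea (the packing and field sizes here are my own example, not MMX): two 7-bit counters share one 16-bit word, each topped by a zero guard bit, so a single ordinary add updates both fields, and clearing the guard bits afterwards truncates an overflowed field instead of letting it corrupt its neighbour.

```c
#include <stdint.h>

/* pack two 7-bit values (each < 128) into one 16-bit word,
   leaving bits 7 and 15 as zero guard bits */
uint16_t pack2(uint8_t lo, uint8_t hi)
{
    return (uint16_t)lo | ((uint16_t)hi << 8);
}

/* one plain add updates both fields; a carry out of the low
   field is caught by its guard bit instead of reaching 'hi' */
uint16_t packed_add(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}

/* clear the guard bits so an overflowed field is truncated
   modulo 128 rather than spilling into its neighbour */
uint16_t packed_fixup(uint16_t v)
{
    return v & 0x7F7F;
}
```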

The TI C64x+ DSP has such instructions, called deal and shfl. At first I thought these were special-purpose instructions to speed up some common DSP algorithm, but after a year of DSP assembler coding I found out how useful they are in general.

I really miss them on the x86.

Nils
0 Kudos
9 Replies
capens__nicolas
New Contributor I

Hi Nils,

That's an interesting suggestion. It could potentially improve cache behaviour a lot for multi-dimensional data structures. I have three remarks though:

Why not do the interleaving of the 'x' and 'y' coordinates in one instruction? Instead of inserting 0 bits, insert the bits of the other operand. This saves the shift and add. If you still want 0's, just use a zeroed second operand.

I also wonder to what extent we can already accomplish the address swizzling using the pshufb instruction. Instead of interleaving each bit, interleave bytes.

Last but not least, I wonder what effect this has on speculative prefetch. An L1 cache miss is hidden by out-of-order execution, but an L2 miss can be devastating for performance. If the accesses happen in a predictable way, the data can be prefetched to avoid misses. Interleaving bits (or bytes) breaks that predictability, so automatic prefetching no longer works.

Cheers,

Nicolas Capens

torusle
Beginner
Hi Nicolas,

c0d1f1ed:
Why not do the interleaving of the 'x' and 'y' coordinates in one instruction? Instead of inserting 0 bits, insert the bits of the other operand. This saves the shift and add. If you still want 0's, just use a zeroed second operand.


Good idea. Makes it even more flexible.

c0d1f1ed:
I also wonder to what extend we can already accomplish the address swizzling using the pshufb instruction. Instead of interleaving each bit, interleave bytes.


Consider this little test here (which just does a matrix copy, once in best-case and once in worst-case order):

#include <stdio.h>
#include <windows.h>

float matrix1[256*256];
float matrix2[256*256];

void test_copy1 (void)
{
  int x,y;
  for (x=0; x<256; x++)
    for (y=0; y<256; y++)
      matrix2[y*256+x] = matrix1[y*256+x];
}

void test_copy2 (void)
{
  int x,y;
  for (x=0; x<256; x++)
    for (y=0; y<256; y++)
      matrix2[x*256+y] = matrix1[x*256+y];
}

int main (int argc, char **args)
{
  int i, t1;

  (void) argc;
  (void) args;

  t1 = GetTickCount();
  for (i=0; i<1000; i++) test_copy1();
  printf ("time test_copy1 = %d\n", GetTickCount()-t1);
  t1 = GetTickCount();
  for (i=0; i<1000; i++) test_copy2();
  printf ("time test_copy2 = %d\n", GetTickCount()-t1);
  return 0;
}
I know - not the scientific way to measure time, but it shows (on my system at least) that the straight copy version is still six times faster than the cache-killer version. The larger the matrices, the worse the cache effect gets. 8-bit address-chunk swizzling does not cut it for a small first-level data cache.

This synthetic test can't show the benefit of bit interleaving, but I've seen a performance degradation of only 4 to 10 percent when using swizzled access compared to a straight linear access pattern. Granted - that was on an entirely different platform where bit-interleave instructions are available. The cache mechanism was more or less comparable though.

The speculative prefetching mechanism is great and does a good job for general-purpose tasks, but for test cases such as my code above it does not help much. Keep in mind that after the first few iterations I kick out an entire cache line worth of data to make space for new data to arrive. I only read one humble dword from each cache line, so I consume a lot more bandwidth than necessary.

Bit-swizzled addressing will not turn local matrix walking into linear access, but it increases the locality of data accesses a lot. I have tried it, and I do it on x86 as well, because you can do quite a bit of fast arithmetic with pre-swizzled integers (it needs some bit-twiddling but is well worth it). However, these tricks only work when walking in fixed steps through the matrices. For random access a bit-swizzle instruction would be a great thing.

The point is: I'd be glad to just use a general-purpose bit-interleave function, but writing one creates a horrible dependency chain. It won't run fast and there is little the CPU can run in parallel. It's still better than the penalty of a cache miss, though.

Cheers,
Nils
torusle
Beginner
For reference (and for the Intel guys)...

Here's the version with linear access, worst-case access, and one that does the accesses with bit interleaving on the CPU. I know that in this case it could be done incrementally, but in the general case it can't, and that is the case I care about.

Compiled with GCC 4.3.0 -O3:

Nils@Nils-PC ~
$ ./a.exe
time test_copy1 (bad access) = 563
time test_copy2 (linear access) = 94
time test_copy3 (swizzled access) = 453

Code snippet:

#include <stdio.h>
#include <windows.h>

float matrix1[256*256];
float matrix2[256*256];

int bit_interleave (int u, int v)
{
  // bit-interleave two 16-bit numbers into a
  // Morton-order index using a magic-number
  // bit-twiddling hack.
  u = (u | (u << 8)) & 0x00FF00FF;
  u = (u | (u << 4)) & 0x0F0F0F0F;
  u = (u | (u << 2)) & 0x33333333;
  u = (u | (u << 1)) & 0x55555555;
  v = (v | (v << 8)) & 0x00FF00FF;
  v = (v | (v << 4)) & 0x0F0F0F0F;
  v = (v | (v << 2)) & 0x33333333;
  v = (v | (v << 1)) & 0x55555555;
  return (u | (v << 1));
}

void test_copy1 (void)
{
  int x,y;
  for (x=0; x<256; x++)
    for (y=0; y<256; y++)
      matrix2[y*256+x] = matrix1[y*256+x];
}

void test_copy2 (void)
{
  int x,y;
  for (x=0; x<256; x++)
    for (y=0; y<256; y++)
      matrix2[x*256+y] = matrix1[x*256+y];
}

void test_copy3 (void)
{
  int x,y;
  for (x=0; x<256; x++)
    for (y=0; y<256; y++)
    {
      int idx = bit_interleave (x,y);
      matrix2[idx] = matrix1[idx];
    }
}

int main (int argc, char **args)
{
  int i, t1;

  (void) argc;
  (void) args;

  t1 = GetTickCount();
  for (i=0; i<1000; i++)
    test_copy1();

  printf ("time test_copy1 (bad access) = %d\n", GetTickCount()-t1);

  t1 = GetTickCount();
  for (i=0; i<1000; i++)
    test_copy2();

  printf ("time test_copy2 (linear access) = %d\n", GetTickCount()-t1);

  t1 = GetTickCount();
  for (i=0; i<1000; i++)
    test_copy3();

  printf ("time test_copy3 (swizzled access) = %d\n", GetTickCount()-t1);

  return 0;
}

Intel low-level guys: wouldn't test_copy3 run a lot faster if we could do the bit_interleave step in a single cycle? What do you think?

Cheers,
Nils
levicki
Valued Contributor I

I might be wrong, but it looks to me like your bit_interleave doesn't interleave bits but does a bit rotate instead.

jimdempseyatthecove
Honored Contributor III

Nils,

Interleaving the bits of the (x,y) index does mean that small-vector instructions could not be used. Maybe that is not a concern for the code you intend to use this on.

If you consider adding a "spread" instruction, then to make it orthogonal you would also want a "squish", where each instruction takes the number of bits in the spread/squish and potentially the shift.

This can get complicated really fast. You might want to spread 1-bit fields, 2-bit fields, ..., n-bit fields, with each bit field being spread 1, 2, ..., n bits. There may be arguments for just about anything.
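
For what it's worth, the "squish" direction for the simplest case (1-bit fields spread 1 bit apart) already has a magic-number implementation: it is just Nils' bit_interleave run in reverse. A sketch:

```c
#include <stdint.h>

/* compact the 16 even bits of v into the low 16 bits of the
   result - the inverse of the magic-number spread, applying the
   same masks in the opposite order */
uint32_t bit_compact(uint32_t v)
{
    v &= 0x55555555;
    v = (v | (v >> 1)) & 0x33333333;
    v = (v | (v >> 2)) & 0x0F0F0F0F;
    v = (v | (v >> 4)) & 0x00FF00FF;
    v = (v | (v >> 8)) & 0x0000FFFF;
    return v;
}
```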

Also, you can effectively do the interleaving by advancing the x and y indices by something other than 1, and/or using a barrel shifter prior to the add.

Another technique to reduce worst-case cache performance would be to randomize the indices in a reproducible manner (and without collisions). A CRC function might do the trick here. In your first post your main interest was in reducing the probability of worst-case cache interaction; a randomizing (hashing) function should do the trick.

Jim

wmula
Beginner
torusle:
This is the place to ask for new instructions, right? Here is something that I want for years. Bit interleaving.

The idea is to take the lower 16 bits of an integer, and muxing them into all odd bits of the result and, setting all even bits of the result to zero.


It can be done for 128-bit words with a few instructions:

; input - xmm0

movdqa %xmm0, %xmm1
psrlw  $4, %xmm1
pand   packed_byte(0x0f), %xmm0   ; xmm0 - lower nibbles
pand   packed_byte(0x0f), %xmm1   ; xmm1 - higher nibbles

movdqa lookup_table, %xmm2
movdqa lookup_table, %xmm3
pshufb %xmm0, %xmm2               ; xmm2 - lower nibbles interleaved
pshufb %xmm1, %xmm3               ; xmm3 - higher nibbles interleaved

lookup_table contains the precalculated results for all 16 nibbles.
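
A scalar C model of the same nibble-lookup idea (my sketch, not wmula's exact code): precompute the 1-bit spread of all 16 nibbles once, then build the 32-bit result from four table lookups.

```c
#include <stdint.h>

/* precalculated spread of each 4-bit nibble: bit i moves to bit 2*i */
static const uint8_t spread4[16] = {
    0x00, 0x01, 0x04, 0x05, 0x10, 0x11, 0x14, 0x15,
    0x40, 0x41, 0x44, 0x45, 0x50, 0x51, 0x54, 0x55
};

/* spread the 16 bits of u across the even bits of a 32-bit word */
uint32_t interleave_lut(uint16_t u)
{
    return  (uint32_t)spread4[ u        & 15]
         | ((uint32_t)spread4[(u >>  4) & 15] <<  8)
         | ((uint32_t)spread4[(u >>  8) & 15] << 16)
         | ((uint32_t)spread4[(u >> 12) & 15] << 24);
}
```

The pshufb version above performs sixteen of these table lookups per instruction, which is what makes the SIMD route attractive.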

w.
levicki
Valued Contributor I

A Hilbert space-filling curve gives much better locality than the Z curve, and the index is easier to calculate.

Max_L
Employee
Nils, hi
It may be interesting for you to find out that Larrabee actually implements a few forms of bit-interleave instructions; you can check e.g. http://www.ddj.com/hpc-high-performance-computing/216402188?pgno=5 for more details.
-Max
sirrida
Beginner
With BMI2 we will get PDEP and PEXT among the Haswell new instructions.
Assuming x=eax and y=edx, the following sequence should do the requested operation:

mov ecx,0x55555555
pdep eax,eax,ecx
pdep edx,edx,ecx
lea eax,[eax+edx*2]
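
In C this corresponds to the following (with real BMI2 hardware one would call the `_pdep_u32` intrinsic from immintrin.h; the helper below is a plain-C model of PDEP so the sketch runs anywhere):

```c
#include <stdint.h>

/* plain-C model of the PDEP instruction: deposit the low bits of
   src, in order, into the positions where mask has a 1 bit */
uint32_t pdep32(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    while (mask) {
        uint32_t lowest = mask & -mask;  /* lowest set bit of mask */
        if (src & 1)
            result |= lowest;
        src >>= 1;
        mask &= mask - 1;                /* clear that mask bit */
    }
    return result;
}

/* sirrida's sequence: x into the even bits, y into the odd bits */
uint32_t morton_index(uint32_t x, uint32_t y)
{
    return pdep32(x, 0x55555555u) + pdep32(y, 0x55555555u) * 2;
}
```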