SIMD byte problems

tom_r · ‎02-09-2011

hi everyone,
I'm going to try some SIMD byte manipulation, but I noticed that byte operations are missing.
I tried to do byte add/sub by treating the bytes as words or doublewords. It works, but I don't think it's a good idea. What should I do if I need this:

new_byte = (byte * 200 - 50) for each of 16 bytes within a simd reg?

I tried to map the bytes to words, but it's a waste of memory. Is there any other way?

thanks,
Tom

Ilnar · ‎02-09-2011

Are the bytes signed? Which values are possible for a byte? -1, 0, 1? If not, how should overflows be resolved?

tom_r · ‎02-10-2011

unsigned char, from 0 to 255 (like RGB color values). Is it possible to manage bytes? About overflow, I use a formula with which overflow never occurs:

new_char = char * 20 / 150 + 40

if char is 255, new_char is 74, so no problem with overflow.

thanks

Ilnar · ‎02-10-2011

I have no knowledge or idea of how to implement this using SIMD, sorry.
I can suggest using a precalculated transformation table of 256 elements and unrolling the loop to reduce branches:
char table[256] = {40, 40, 40, 40, 40, 40, 40, 40, 41, 41, 41, 41, 41, 41, 41, 42, ...., 74};
If the formula is fixed, this is preferable to multiplying and dividing.
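
A minimal sketch of the table approach, assuming the formula new = old * 20 / 150 + 40 with the intermediate product held in an int. The names build_table and apply_table are hypothetical, and the unroll factor of four is an arbitrary choice for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a 256-entry table with old * 20 / 150 + 40, computed in int so the
   intermediate product does not wrap at 8 bits. */
static void build_table(uint8_t table[256])
{
    for (int i = 0; i < 256; ++i)
        table[i] = (uint8_t)(i * 20 / 150 + 40);
}

/* Replace every byte of buf with its table entry, four bytes per iteration. */
static void apply_table(uint8_t *buf, size_t n, const uint8_t table[256])
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        buf[i]     = table[buf[i]];
        buf[i + 1] = table[buf[i + 1]];
        buf[i + 2] = table[buf[i + 2]];
        buf[i + 3] = table[buf[i + 3]];
    }
    for (; i < n; ++i)          /* leftover bytes when n is not a multiple of 4 */
        buf[i] = table[buf[i]];
}
```

The table is built once and reused for every buffer, so the per-byte cost is a single load.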

tom_r · ‎02-10-2011

A lookup table is a good idea, but I need to write SIMD for other reasons too.

Ilnar · ‎02-10-2011

You could do some simple operations using SIMD on 16 packed bytes with SSE (+, -, mul or div by shifting), and some harder operations by treating 8 of the bytes as words (mul and div).
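
For the simple cases, SSE2 does have packed byte add and subtract; what it lacks is a byte multiply and a byte shift. A small sketch, assuming SSE2 is available and unsigned data; the function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* dst[i] = min(a[i] + b[i], 255) for 16 unsigned bytes (saturating add). */
static void add_bytes_saturating(const uint8_t a[16], const uint8_t b[16],
                                 uint8_t dst[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_adds_epu8(va, vb));
}

/* dst[i] = (a[i] - b[i]) mod 256 for 16 bytes (wrapping subtract). */
static void sub_bytes_wrapping(const uint8_t a[16], const uint8_t b[16],
                               uint8_t dst[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_sub_epi8(va, vb));
}
```

Whether saturation or wraparound is the right overflow behaviour depends on the data, which is why both variants are shown.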

tom_r · ‎02-10-2011

I try to treat a word as a pair of bytes:

W O R D
00000011 00000001
B Y T E B Y T E

If I sum word_A + word_B, it's the same as summing byte_A0 + byte_B0 and byte_A1 + byte_B1 (at least as long as no byte sum exceeds 255, so nothing carries into the neighbouring byte).
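
A scalar sketch of that observation, using hypothetical helpers pack2 and add_as_word. It shows the word add matching the two byte adds when there is no carry, and the carry leaking into the upper byte when there is.

```c
#include <stdint.h>

/* Pack two bytes into one 16-bit word, hi in the upper 8 bits. */
static uint16_t pack2(uint8_t hi, uint8_t lo)
{
    return (uint16_t)((hi << 8) | lo);
}

/* Add two packed words with a single 16-bit addition. */
static uint16_t add_as_word(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}
```

With bytes (3, 1) and (10, 20) the word add gives (13, 21), the same as two byte adds. With low bytes 200 and 100 the sum 300 does not fit, so the low byte becomes 44 and a stray 1 is carried into the high byte.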

But doing * and / by shifting sounds a bit harder, because there are no byte shift instructions. If I shift that word right:

00000011 00000001 --> shift right 2 bits--> 00000000 11000000

so the left byte is ok, but the right one is not..

Ilnar · ‎02-10-2011

Just shift and then apply a mask:
rshift2mask = 00111111 00111111
00000011 00000001 --> shift right 2 bits--> 00000000 11000000
after applying mask by bitwise and:
00000000 00000000
you just need 8 masks for rshift and 8 for lshift
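
A sketch of the shift-then-mask idea for a right shift by 2, assuming SSE2 and unsigned bytes. The function name shr2_bytes is hypothetical; the mask 0x3F is the 00111111 pattern from above repeated in every byte.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* dst[i] = src[i] >> 2 for 16 unsigned bytes. */
static void shr2_bytes(const uint8_t src[16], uint8_t dst[16])
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    /* 16-bit shift: the low 2 bits of each upper byte leak into the
       top 2 bits of the lower byte of the same word. */
    __m128i shifted = _mm_srli_epi16(v, 2);
    /* Clear the top 2 bits of every byte to remove the leaked bits. */
    __m128i mask = _mm_set1_epi8(0x3F);
    _mm_storeu_si128((__m128i *)dst, _mm_and_si128(shifted, mask));
}
```

A left shift works the same way with a mask that clears the low bits instead, for example 0xFC for a shift by 2.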

jimdempseyatthecove · ‎02-10-2011

>>new_char = char * 20 / 150 + 40

using unsigned char for arithmetic, the result of (char*20) cannot exceed 255
The result of (x<256) / 150 in unsigned char arithmetic can only be 0 or 1
Therefore your end result can only be a list of bytes containing 40 or 41

While I won't write the code for you, the gist would be

multiply the 16 bytes by 20
compare result against 16 bytes of 150 producing a mask
negate the mask
add 40 to all bytes in result.

EDIT

However, you state:

>>if char is 255, new_char is 74, so no prob with overflow..

Therefore the original problem statement should have been stated clearly as:

new_char = (char)((int)char * 20 / 150 + 40)

For this you would modify the above by first converting 8 uchars to 8 16-bit unsigned ints
then multiply those by 20 to produce temps
zero results
loop on
compare temps against 150s to produce a mask (set where temps >= 150)
if mask is all zeros, exit
negate mask
add mask to results
subtract (150s and compare mask) from temps
end loop
add 40s to results
convert 16-bit results back to 8 chars (shuffle)
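
A rough sketch of this loop, not a tuned implementation. It assumes SSE2 only, so the widening uses _mm_unpacklo_epi8 with zero and the narrowing uses _mm_packus_epi16 rather than a shuffle. The function name scale8 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* dst[i] = src[i] * 20 / 150 + 40 for 8 unsigned bytes,
   dividing by repeated subtraction as outlined above. */
static void scale8(const uint8_t src[8], uint8_t dst[8])
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i c150 = _mm_set1_epi16(150);

    /* Widen 8 bytes to 8 unsigned 16-bit lanes, then multiply by 20. */
    __m128i bytes = _mm_loadl_epi64((const __m128i *)src);
    __m128i temps = _mm_unpacklo_epi8(bytes, zero);
    temps = _mm_mullo_epi16(temps, _mm_set1_epi16(20));  /* at most 5100 */

    __m128i results = zero;
    for (;;) {
        /* Lane is 0xFFFF where temps >= 150. A signed compare is safe
           here because temps never exceeds 5100. */
        __m128i mask = _mm_cmpgt_epi16(temps, _mm_set1_epi16(149));
        if (_mm_movemask_epi8(mask) == 0)
            break;
        results = _mm_sub_epi16(results, mask);  /* minus -1 adds 1 */
        temps = _mm_sub_epi16(temps, _mm_and_si128(mask, c150));
    }

    results = _mm_add_epi16(results, _mm_set1_epi16(40));
    /* Narrow back to bytes; only the low 8 bytes of the pack are stored. */
    _mm_storel_epi64((__m128i *)dst, _mm_packus_epi16(results, zero));
}
```

The loop runs at most 34 times (5100 / 150), and it stops early when every lane has dropped below 150.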

Jim Dempsey

tom_r · ‎02-10-2011

Yes, because of integer division, for * and / the byte was supposed to become an int (or short) and then a byte again. Otherwise, as you said, dividing by 150 can only give 0 or 1.

By "mul bytes" do you mean using the word multiply instruction? If I move 16 bytes to a register, do I need to convert all 16 bytes to integers by shuffling data, or do you mean converting before moving to the register?

thanks for your reply guys, I will try those solutions soon

jimdempseyatthecove · ‎02-10-2011

If you have SSE4.1 or later use

__m128i _mm_cvtepu8_epi16 (__m128i a);

This converts 8 uchars into 8 shorts

If you have an earlier version (SSSE3), use

__m128i _mm_shuffle_epi8 (__m128i a, __m128i b);

Then shuffle can be used afterwards to convert back from 16-bit to 8-bit.

Properly constructed, you could load 16 bytes into SSE register then using shuffle, convert 8 of those to 16 bits, mung those 8, producing 8 results in SSE register, then convert the other 8 bytes to 16-bits, and mung those. IOW one 16-byte load, one 16-byte store (two passes to produce results).
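
A sketch of that one-load, two-pass, one-store layout under a few assumptions of mine. It uses SSE2 unpack-with-zero in place of _mm_cvtepu8_epi16 or _mm_shuffle_epi8 so it runs on any SSE2 machine. It also replaces the division with a multiply-high by 4370, a constant I picked because x * 20 / 150 equals 2x / 15 and floor(2x * 4370 / 65536) matches floor(2x / 15) for every x from 0 to 255. The names scale_lanes and scale16 are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Per-lane x * 20 / 150 + 40 on 8 unsigned 16-bit lanes holding 0..255. */
static __m128i scale_lanes(__m128i x)
{
    __m128i twice = _mm_add_epi16(x, x);  /* x * 20 / 150 == 2x / 15 */
    /* High 16 bits of 2x * 4370, which equals floor(2x / 15) for 2x <= 510. */
    __m128i q = _mm_mulhi_epu16(twice, _mm_set1_epi16(4370));
    return _mm_add_epi16(q, _mm_set1_epi16(40));
}

/* Transform 16 unsigned bytes with one load and one store. */
static void scale16(const uint8_t src[16], uint8_t dst[16])
{
    const __m128i zero = _mm_setzero_si128();
    __m128i v  = _mm_loadu_si128((const __m128i *)src);    /* one 16-byte load */
    __m128i lo = scale_lanes(_mm_unpacklo_epi8(v, zero));  /* bytes 0..7  */
    __m128i hi = scale_lanes(_mm_unpackhi_epi8(v, zero));  /* bytes 8..15 */
    /* Narrow both halves back to bytes and write them in one store. */
    _mm_storeu_si128((__m128i *)dst, _mm_packus_epi16(lo, hi));
}
```

The multiply-high trick only holds for this input range and divisor; a different formula would need its own constant and its own check.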

Jim Dempsey

tom_r · ‎02-10-2011

you're right, I managed that way..

thanks