Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

## SIMD byte problems

Beginner
554 Views
Hi everyone,
I'm trying some SIMD byte manipulation, but I noticed that byte operations are missing.
I tried doing byte add/sub by treating the bytes as words or doublewords. It works, but I don't think it's a good idea. What should I do if I need this:

new_byte = (byte * 200 - 50) for each of the 16 bytes within a SIMD register?

I tried widening the bytes to words, but that wastes memory. Is there any other way?

Thanks,
Tom
11 Replies
Beginner
Are the bytes signed? Which values are possible for a byte? -1, 0, 1? If not, how should overflow be handled?
Beginner
Unsigned char, from 0 to 255 (like RGB color values). Is it possible to handle bytes directly? As for overflow, I use a formula where overflow never occurs:

new_char = char * 20 / 150 + 40

If char is 255, new_char is 74, so there is no problem with overflow.

thanks
Beginner
Sorry, I have no idea how to implement this using SIMD.
I can suggest using a precalculated 256-element transformation table and unrolling the loop to reduce branches:
char table[256] = {40, 40, 40, 40, 40, 40, 40, 40, 41, 41, 41, 41, 41, 41, 41, 42, ...., 74};
If the formula is fixed, a table lookup is preferable to multiplying and dividing.
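
A minimal sketch of the table approach, assuming the formula new_char = char * 20 / 150 + 40 quoted earlier in the thread, evaluated in int before narrowing back to a byte. The names `xform_table`, `build_table`, and `apply_table` are hypothetical, and the unroll factor of four is an arbitrary choice:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One precomputed result per possible input byte (hypothetical name). */
static uint8_t xform_table[256];

/* Fill the table once; the arithmetic is done in int, then narrowed. */
static void build_table(void)
{
    for (int i = 0; i < 256; ++i)
        xform_table[i] = (uint8_t)(i * 20 / 150 + 40);
}

/* Transform n bytes by lookup. The main loop is unrolled by four
   to reduce the number of loop branches. */
static void apply_table(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = xform_table[src[i]];
        dst[i + 1] = xform_table[src[i + 1]];
        dst[i + 2] = xform_table[src[i + 2]];
        dst[i + 3] = xform_table[src[i + 3]];
    }
    for (; i < n; ++i)              /* leftover bytes */
        dst[i] = xform_table[src[i]];
}
```

Call `build_table()` once at startup, then `apply_table(out, in, count)` for each buffer.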
Beginner
A lookup table is a good idea, but I need to write SIMD code for other reasons too.

Beginner
You could do some simple operations on 16 packed bytes using SSE (add, subtract, and multiply or divide by a power of two via shifting), and handle the harder operations (general mul and div) by treating 8 bytes as words.
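
For the simple cases, SSE2 does provide packed byte add and subtract directly. A small sketch, assuming an SSE2-capable x86 target; the function names are hypothetical, while the intrinsics (`_mm_add_epi8`, `_mm_adds_epu8`, `_mm_subs_epu8`) come from `<emmintrin.h>`:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* dst[i] = a[i] + b[i], wrapping modulo 256 in each byte lane. */
static void add_bytes_wrap(uint8_t dst[16], const uint8_t a[16], const uint8_t b[16])
{
    assert(dst && a && b);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi8(va, vb));
}

/* dst[i] = min(a[i] + b[i], 255): unsigned saturating add. */
static void add_bytes_sat(uint8_t dst[16], const uint8_t a[16], const uint8_t b[16])
{
    assert(dst && a && b);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_adds_epu8(va, vb));
}

/* dst[i] = max(a[i] - b[i], 0): unsigned saturating subtract. */
static void sub_bytes_sat(uint8_t dst[16], const uint8_t a[16], const uint8_t b[16])
{
    assert(dst && a && b);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_subs_epu8(va, vb));
}
```

The saturating variants are often the better fit for color data, since they clamp at 0 and 255 instead of wrapping.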
Beginner
I tried treating words as pairs of bytes:

W O R D
00000011 00000001
B Y T E B Y T E

If I compute word_A + word_B, it's the same as computing byte_A0 + byte_B0 and byte_A1 + byte_B1 (at least as long as each byte sum stays at or below 255, so no carry crosses into the neighboring byte).

But * and / by shifting sound a bit harder, because there are no byte shift instructions. If I shift that word right:

00000011 00000001 --> shift right 2 bits--> 00000000 11000000

So the left byte is OK, but the right one is not: it picked up the bits shifted out of the left byte.
Beginner
00000011 00000001 --> shift right 2 bits--> 00000000 11000000
After applying a mask with bitwise AND:
00000000 00000000
You just need 8 masks for right shifts and 8 for left shifts (one per shift count).
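
As a sketch of the shift-then-mask idea, assuming SSE2 and a fixed shift count of 2: shift the 16-bit lanes, then AND every byte with a mask that clears the bits that leaked in from the neighboring byte. The function names are hypothetical:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* dst[i] = src[i] >> 2 for 16 bytes, built from a 16-bit lane shift. */
static void shr2_bytes(uint8_t dst[16], const uint8_t src[16])
{
    assert(dst && src);
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    __m128i shifted = _mm_srli_epi16(v, 2);
    /* 0x3F keeps the low 6 bits of each byte and clears the top 2,
       which hold bits shifted in from the byte above. */
    __m128i mask = _mm_set1_epi8(0x3F);
    _mm_storeu_si128((__m128i *)dst, _mm_and_si128(shifted, mask));
}

/* dst[i] = (src[i] << 2) & 0xFF for 16 bytes. */
static void shl2_bytes(uint8_t dst[16], const uint8_t src[16])
{
    assert(dst && src);
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    __m128i shifted = _mm_slli_epi16(v, 2);
    /* 0xFC clears the low 2 bits of each byte, which hold bits
       shifted in from the byte below. */
    __m128i mask = _mm_set1_epi8((char)0xFC);
    _mm_storeu_si128((__m128i *)dst, _mm_and_si128(shifted, mask));
}
```

For a shift count n, the right-shift mask is `0xFF >> n` and the left-shift mask is `(0xFF << n) & 0xFF`, which is where the 8 + 8 masks come from.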
Black Belt
>>new_char = char * 20 / 150 + 40

using unsigned char for arithmetic, the result of (char*20) cannot exceed 255
The result of (x<256) / 150 in unsigned char arithmetic can only be 0 or 1
Therefore your end result can only be a list of bytes containing 40 or 41
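
A scalar sketch of that point, under the assumption that every intermediate value is truncated to 8 bits (as it would be in 8-bit SIMD lanes), next to the same formula evaluated in int. Both function names are hypothetical:

```c
#include <stdint.h>

/* Every intermediate is forced back into 8 bits. */
static uint8_t transform_8bit(uint8_t c)
{
    uint8_t product  = (uint8_t)(c * 20);         /* wraps modulo 256 */
    uint8_t quotient = (uint8_t)(product / 150);  /* can only be 0 or 1 */
    return (uint8_t)(quotient + 40);              /* so 40 or 41 */
}

/* Intermediates are held in int; only the final result is narrowed. */
static uint8_t transform_int(uint8_t c)
{
    return (uint8_t)(c * 20 / 150 + 40);
}
```

For an input of 255, `transform_8bit` gives 41 (5100 wraps to 236, and 236 / 150 is 1), while `transform_int` gives 74.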

While I won't write the code for you, the gist would be

multiply the 16 bytes by 20
compare result against 16 bytes of 150 producing a mask
add 40 to all bytes in result.

EDIT

However, you state:

>>if char is 255, new_char is 74, so no prob with overflow..

Therefore the original problem statement should have been stated clearly as:

new_char = (char)((int)char * 20 / 150 + 40)

For this you would modify the above by first converting 8 uchars to 8 16-bit uints
then multiply the uints by 20 to produce temps
zero results
loop while any temp is at least 150
compare temps against 150s to produce a mask
add 1 to results where the mask is set
subtract 150 from temps where the mask is set
end loop
add 40 to results
convert 16-bit results back to 8 chars (shuffle)
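
The steps above can be sketched roughly as follows, assuming SSE2 only. The function name `transform8` is hypothetical. Two details are swapped in rather than taken from the post: the loop runs a fixed 34 times (the largest possible quotient, 255 * 20 / 150) instead of testing for completion, and the narrowing uses `_mm_packus_epi16` in place of a shuffle:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* dst[i] = src[i] * 20 / 150 + 40 for 8 bytes, using 16-bit lanes. */
static void transform8(uint8_t dst[8], const uint8_t src[8])
{
    assert(dst && src);
    const __m128i zero = _mm_setzero_si128();
    const __m128i v150 = _mm_set1_epi16(150);
    const __m128i v149 = _mm_set1_epi16(149);

    /* Load 8 bytes and zero-extend each to a 16-bit lane. */
    __m128i bytes = _mm_loadl_epi64((const __m128i *)src);
    __m128i temps = _mm_unpacklo_epi8(bytes, zero);

    /* Multiply by 20; the largest product is 255 * 20 = 5100. */
    temps = _mm_mullo_epi16(temps, _mm_set1_epi16(20));

    /* Divide by 150 through repeated compare and masked subtract. */
    __m128i results = zero;
    for (int i = 0; i < 34; ++i) {
        /* mask lane is 0xFFFF (-1) where temps >= 150, else 0 */
        __m128i mask = _mm_cmpgt_epi16(temps, v149);
        results = _mm_sub_epi16(results, mask);  /* subtracting -1 adds 1 */
        temps = _mm_sub_epi16(temps, _mm_and_si128(mask, v150));
    }

    results = _mm_add_epi16(results, _mm_set1_epi16(40));

    /* Narrow the 8 words to 8 bytes and store the low 64 bits. */
    _mm_storel_epi64((__m128i *)dst, _mm_packus_epi16(results, zero));
}
```

The fixed trip count keeps the loop branch-free per lane, at the cost of always doing 34 iterations.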

Jim Dempsey

Beginner
Yes, because of integer division, for * and / the byte was meant to be promoted to int (or short) and then narrowed to a byte again. Otherwise, as you said, dividing by 150 can only give 0 or 1.

By "multiply bytes", do you mean using the word multiply instruction? If I move 16 bytes into a register, do I need to convert all 16 bytes to integers by shuffling the data, or do you mean converting before moving them into the register?

Black Belt
If you have SSE4.1 or later, use

__m128i _mm_cvtepu8_epi16 (__m128i a);

This converts the low 8 uchars into 8 shorts.

If you have an earlier version of SSE (this intrinsic needs SSSE3), use

__m128i _mm_shuffle_epi8 (__m128i a, __m128i b);

Then shuffle can be used afterwards to convert back from 16-bit to 8-bit.

Properly constructed, you could load 16 bytes into an SSE register, then use shuffle to convert 8 of them to 16 bits and mung those 8, producing 8 results in an SSE register; then convert the other 8 bytes to 16 bits and mung those. In other words: one 16-byte load, one 16-byte store (two passes to produce the results).
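
A rough sketch of that one-load, two-pass, one-store layout, assuming SSE2 only and the formula new_char = char * 20 / 150 + 40 from earlier in the thread. Several things here are assumptions rather than the method described above: `_mm_unpacklo_epi8` / `_mm_unpackhi_epi8` with zero stand in for the shuffle, `_mm_packus_epi16` does the narrowing, and the division uses a reciprocal multiply (x * 20 / 150 equals floor(2x / 15), and multiplying by 4370 and keeping the high 16 bits is exact for 2x up to 510). The function names are hypothetical:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Per-lane x * 20 / 150 + 40 for eight 16-bit lanes holding 0..255. */
static __m128i transform_words(__m128i x)
{
    __m128i twice = _mm_slli_epi16(x, 1);            /* 2x, at most 510 */
    /* high 16 bits of 2x * 4370 == floor(2x / 15) in this range */
    __m128i q = _mm_mulhi_epu16(twice, _mm_set1_epi16(4370));
    return _mm_add_epi16(q, _mm_set1_epi16(40));
}

/* One 16-byte load, two 8-lane passes, one 16-byte store. */
static void transform16(uint8_t dst[16], const uint8_t src[16])
{
    assert(dst && src);
    const __m128i zero = _mm_setzero_si128();
    __m128i v = _mm_loadu_si128((const __m128i *)src);

    __m128i lo = _mm_unpacklo_epi8(v, zero);  /* bytes 0..7  as words */
    __m128i hi = _mm_unpackhi_epi8(v, zero);  /* bytes 8..15 as words */

    lo = transform_words(lo);
    hi = transform_words(hi);

    /* Narrow both halves back to bytes, preserving the original order. */
    _mm_storeu_si128((__m128i *)dst, _mm_packus_epi16(lo, hi));
}
```

The widened data lives only in registers, so memory traffic stays at one 16-byte load and one 16-byte store per 16 input bytes.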

Jim Dempsey
Beginner
You're right, I managed it that way.

thanks