hi everyone,
I'm going to try some SIMD byte manipulation, but I noticed that byte operations are missing.
I tried doing byte add/sub by treating the bytes as words or doublewords. It works, but I don't think it's a good idea. What should I do if I need this:
new_byte = (byte * 200 - 50) for each of the 16 bytes within a SIMD register?
I tried mapping the bytes to words, but that's a waste of memory. Is there any other way?
thanks,
Tom
11 Replies
Are the bytes signed? Which values are possible for a byte: -1, 0, 1? If not, how should overflows be handled?
Unsigned char, from 0 to 255 (like RGB color values). Is it possible to handle bytes? As for overflow, I use a formula where overflow never occurs:
new_char = char * 20 / 150 + 40
If char is 255, new_char is 74, so there is no problem with overflow.
thanks
I have no knowledge of or ideas for an implementation using SIMD, sorry.
I can suggest using a precalculated transformation table of 256 elements and unrolling the loop to reduce the number of branches:
char table[256] = {40, 40, 40, 40, 40, 40, 40, 40, 41, 41, 41, 41, 41, 41, 41, 42, ...., 74};
If the formula is fixed, this is preferable to multiplying and dividing.
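A minimal sketch of the table approach suggested above, assuming the formula is evaluated in int and truncated as described in the thread. The names build_table and apply_table are illustrative and do not come from the thread, and the table is computed at startup instead of being typed out by hand.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill table[i] = i * 20 / 150 + 40, computed in int so i * 20 cannot wrap. */
static void build_table(uint8_t table[256])
{
    for (int i = 0; i < 256; ++i)
        table[i] = (uint8_t)(i * 20 / 150 + 40);
}

/* Replace each of the n bytes in buf by its table entry (hypothetical helper). */
static void apply_table(uint8_t *buf, size_t n, const uint8_t table[256])
{
    for (size_t i = 0; i < n; ++i)
        buf[i] = table[buf[i]];
}
```

Building the table once and then calling apply_table on each buffer keeps the multiply and divide out of the inner loop.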
A lookup table is a good idea, but I need to write SIMD code for other reasons too.
You could do some simple operations with SIMD on 16 packed bytes using SSE (+, -, and mul or div by shifting), and some harder operations (mul and div) by handling 8 bytes at a time as words.
I tried treating a word as a pair of bytes:
W O R D
00000011 00000001
B Y T E B Y T E
If I compute word_A + word_B, it's the same as computing byte_A0 + byte_B0 and byte_A1 + byte_B1 (at least as long as I keep the byte sums from exceeding 255).
But * and / sound a bit harder to do by shifting, because there are no byte shift instructions. If I shift that word right:
00000011 00000001 --> shift right 2 bits --> 00000000 11000000
So the left byte is OK, but the right one is not.
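The two observations above can be checked with plain scalar C before bringing in any SIMD. This is only an illustrative sketch: pack2, add_as_word and shift_right_as_word are made-up helper names, and a single 16-bit word stands in for one lane of a SIMD register.

```c
#include <stdint.h>

/* Pack two bytes into one 16-bit word: hi is the left byte, lo the right byte. */
static uint16_t pack2(uint8_t hi, uint8_t lo)
{
    return (uint16_t)((hi << 8) | lo);
}

/* One word add behaves like two byte adds only while the right-byte sum
 * stays at or below 255; otherwise the carry spills into the left byte. */
static uint16_t add_as_word(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}

/* A plain word shift lets the low bits of the left byte fall into the right byte. */
static uint16_t shift_right_as_word(uint16_t w, unsigned n)
{
    return (uint16_t)(w >> n);
}
```

With the word from the post, shift_right_as_word(pack2(3, 1), 2) gives 00000000 11000000, which matches the result shown above.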
Just combine the shift with a mask:
rshift2mask = 00111111 00111111
00000011 00000001 --> shift right 2 bits --> 00000000 11000000
After applying the mask with a bitwise AND:
00000000 00000000
You just need 8 masks for right shifts and 8 for left shifts.
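The shift-then-mask idea above could look like this with SSE2 intrinsics. This is a sketch, assuming an x86 target with SSE2; srli2_epu8 is a made-up name, and only the shift-by-2 case with its 00111111 mask is shown.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Shift each of the 16 unsigned bytes in v right by 2 bits. */
static __m128i srli2_epu8(__m128i v)
{
    /* Word-wide shift: the low 2 bits of each left byte leak into the
     * top 2 bits of the byte to its right. */
    __m128i shifted = _mm_srli_epi16(v, 2);
    /* rshift2mask = 00111111 in every byte clears the leaked bits. */
    return _mm_and_si128(shifted, _mm_set1_epi8(0x3F));
}
```

The same pattern with _mm_slli_epi16 and a mask of 11111100 in every byte would give a per-byte left shift by 2.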
>>new_char = char * 20 / 150 + 40
Using unsigned char arithmetic, the result of (char*20) cannot exceed 255.
The result of (x<256) / 150 in unsigned char arithmetic can only be 0 or 1.
Therefore your end result can only be a list of bytes containing 40 or 41.
While I won't write the code for you, the gist would be:
multiply the 16 bytes by 20
compare the result against 16 bytes of 150, producing a mask
negate the mask
add 40 to all bytes in the result
EDIT
However, you state:
>>if char is 255, new_char is 74, so no prob with overflow..
Therefore the original problem statement should have been stated clearly as
new_char = (char)((int)char * 20 / 150 + 40)
For this you would modify the above by first converting 8 uchars to 8 16-bit uints,
then multiply the uints by 20 to produce temps
zero results
loop
  compare temps against 150s to produce a mask
  if the mask is all zeros, exit
  negate mask
  add mask to results
  subtract 150s from temps
  AND with mask
end loop
convert the 16-bit results back to 8 chars (shuffle)
Jim Dempsey
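The loop outlined above could be sketched as follows with SSE2 intrinsics. This is one reading of the pseudocode, not the poster's code: the function name scale8 is made up, the widening uses _mm_unpacklo_epi8 and the narrowing uses _mm_packus_epi16 in place of a shuffle, and the final "+ 40" is carried over from the first outline.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* out[i] = in[i] * 20 / 150 + 40 for 8 bytes, using eight 16-bit lanes. */
static void scale8(const uint8_t in[8], uint8_t out[8])
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i c150 = _mm_set1_epi16(150);
    const __m128i c149 = _mm_set1_epi16(149);

    /* Convert 8 uchars to 8 16-bit lanes, then multiply by 20 (max 5100). */
    __m128i bytes = _mm_loadl_epi64((const __m128i *)in);
    __m128i temps = _mm_mullo_epi16(_mm_unpacklo_epi8(bytes, zero),
                                    _mm_set1_epi16(20));
    __m128i results = zero;

    for (;;) {
        /* mask lane = all ones where temps >= 150 (signed compare is safe here). */
        __m128i mask = _mm_cmpgt_epi16(temps, c149);
        if (_mm_movemask_epi8(mask) == 0)
            break;
        /* Negating an all-ones lane gives 1, so this counts one subtraction. */
        results = _mm_add_epi16(results, _mm_sub_epi16(zero, mask));
        /* Subtract 150 everywhere, then zero the lanes that were already done. */
        temps = _mm_and_si128(_mm_sub_epi16(temps, c150), mask);
    }

    results = _mm_add_epi16(results, _mm_set1_epi16(40));
    /* Narrow the 16-bit results back to 8 bytes (values fit, so no saturation). */
    _mm_storel_epi64((__m128i *)out, _mm_packus_epi16(results, zero));
}
```

Since char * 20 is at most 5100, the loop runs at most 34 times per group of 8 bytes.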
Yes, because of integer division, for * and / the byte was supposed to be widened to an int (or short) and then narrowed to a byte again; otherwise, as you said, dividing by 150 can only give 0 or 1.
By "mul bytes" do you mean using the word multiply instruction? If I move 16 bytes into a register, do I need to convert all 16 bytes to integers by shuffling the data, or do you mean converting before moving them into the register?
Thanks for your replies, guys. I will try those solutions soon.
If you have SSE4.1 or later, use
__m128i _mm_cvtepu8_epi16 (__m128i a);
This converts 8 uchars into 8 shorts.
If you have an earlier version of SSE (this one needs SSSE3), use
__m128i _mm_shuffle_epi8 (__m128i a, __m128i b);
Shuffle can then be used afterwards to convert back from 16-bit to 8-bit.
Properly constructed, you could load 16 bytes into an SSE register, then use shuffle to convert 8 of them to 16 bits, mung those 8, producing 8 results in an SSE register, then convert the other 8 bytes to 16 bits and mung those. IOW one 16-byte load and one 16-byte store (two passes to produce the results).
Jim Dempsey
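The one-load, two-pass, one-store layout described above might look like the sketch below. It makes several substitutions that are not from the thread, so that it needs only SSE2: _mm_unpacklo_epi8 and _mm_unpackhi_epi8 against zero do the widening that _mm_cvtepu8_epi16 or a shuffle would do, _mm_packus_epi16 does the narrowing, and the divide is replaced by a multiply-high with the constant 4370, which gives floor(2x / 15) for every 2x up to 510. The names scale_lanes and scale16 are made up.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* x * 20 / 150 + 40 on eight 16-bit lanes holding values 0..255. */
static __m128i scale_lanes(__m128i w)
{
    __m128i t = _mm_slli_epi16(w, 1);                      /* 2x, since 20/150 == 2/15 */
    __m128i q = _mm_mulhi_epu16(t, _mm_set1_epi16(4370));  /* (2x * 4370) >> 16 == 2x / 15 */
    return _mm_add_epi16(q, _mm_set1_epi16(40));
}

/* One 16-byte load, two 8-lane passes, one 16-byte store. */
static void scale16(const uint8_t in[16], uint8_t out[16])
{
    const __m128i zero = _mm_setzero_si128();
    __m128i v  = _mm_loadu_si128((const __m128i *)in);
    __m128i lo = scale_lanes(_mm_unpacklo_epi8(v, zero));  /* bytes 0..7  as 16-bit */
    __m128i hi = scale_lanes(_mm_unpackhi_epi8(v, zero));  /* bytes 8..15 as 16-bit */
    /* Pack both halves back to bytes, keeping the original byte order. */
    _mm_storeu_si128((__m128i *)out, _mm_packus_epi16(lo, hi));
}
```

The multiply-high constant is specific to dividing by 15 with inputs up to 510; a different formula would need its own constant, or the subtraction loop from the earlier reply.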
You're right, I managed it that way.
thanks