SSE4.2 STTNI 'equal each' instruction customization

tanwar_1981 · ‎05-07-2012

Hi,

I need to write a C program where the STTNI 'equal each' instruction seems useful because I want to achieve parallelism. Here src1 should be equal to src2 for an array of 16.

At the same time, my requirement is that instead of comparing for an exact value, src1 should lie between (src2 - (src2/10)) and (src2 + (src2/10)).

I searched but did not find anything I can use directly.

Can I have some C source code equivalent to the 'equal each' function where I can put my range condition?

Thanks

AT

jimdempseyatthecove · ‎05-08-2012

>>for an array of 16.

Of what?

Your description sounds like you should be looking at CMP... and UCMP... (or VCMP... and VUCMP...)

Jim Dempsey

tanwar_1981 · ‎05-08-2012

It's an array of byte.

So, will CMP... and UCMP... give performance equivalent to the SSE4.2 instruction?

-AT

jimdempseyatthecove · ‎05-09-2012

>>It's an array of byte.

Signed or unsigned?

How do you handle saturation? ((src2 + (src2/10)) above max, (src2-(src2/10)) below min)

Jim Dempsey

tanwar_1981 · ‎05-09-2012

It is unsigned.

Here the max would be 255, the maximum byte value. In the given problem src2 never exceeds 100, so no further logic is required for (src2 + (src2/10)). The minimum value of src2 is zero, so 0 - 0/10 won't go below 0.

Is there a direct instruction for less-than and greater-than comparison where I can supply 16 bytes together for parallel computation, just like pcmpestr?

Also, I didn't see much performance difference with the pcmpestr instruction over direct comparison using a for loop of 256. Is there any switch I need to use when compiling the code, or in the code?

-AT

jimdempseyatthecove · ‎05-10-2012

>>unsigned...src2 never crosses 100

src2/10 is problematic because you do not have an SSE/AVX instruction to divide integers, much less divide unsigned chars. This is why I suggest you examine your requirements to see if you can use divide by 8 or divide by 16. These results can be produced using a shift and mask (x/8 = (xmm >> 3) & maskOf31s), (x/16 = (xmm >> 4) & maskOf15s). IOW using two SSE instructions.

pseudo code

xmm2 = [src2]
xmm1 = [src1]
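xmm0 = xmm2 >> shift; (shift right by 3 or 4 bit positions; SSE has no per-byte shift, so use a 16-bit lane shift)
xmm0 = xmm0 & mask; (31's or 15's in every byte, to clear bits shifted in from the neighboring byte)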
xmm3 = xmm2 + xmm0; (src2 + fraction of src2)
xmm4 = xmm2 - xmm0; (src2 - fraction of src2)
xmm3 = xmm1 - xmm3; (src1 - (src2 + fraction of src2))
xmm3 = .not. xmm3
xmm4 = xmm1 - xmm4; (src1 - (src2 - fraction of src2))
xmm4 = xmm3 | xmm4; (bitwise OR of the two tests)

at this point any byte with the msb set is outside your range
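
A minimal SSE2 sketch of the pseudo code above, for illustration only. It assumes a divide by 8, unsigned bytes, and inclusive bounds, none of which the thread has settled. It also swaps the msb test for saturating subtraction so that it holds over the full 0..255 range. The function name out_of_range_mask is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */

/* Returns a 16-bit mask; bit i is set when src1[i] lies outside
   [src2[i] - src2[i]/8, src2[i] + src2[i]/8] (bounds inclusive). */
static int out_of_range_mask(const uint8_t *src1, const uint8_t *src2)
{
    assert(src1 != NULL && src2 != NULL);
    __m128i a = _mm_loadu_si128((const __m128i *)src1);
    __m128i b = _mm_loadu_si128((const __m128i *)src2);

    /* x/8 per byte: shift the 16-bit lanes, then mask off the bits
       that crossed over from the neighboring byte. */
    __m128i frac = _mm_and_si128(_mm_srli_epi16(b, 3), _mm_set1_epi8(0x1F));

    __m128i hi = _mm_adds_epu8(b, frac); /* upper bound, saturating at 255 */
    __m128i lo = _mm_subs_epu8(b, frac); /* lower bound, saturating at 0   */

    /* Saturating subtraction is nonzero only when the left side is larger. */
    __m128i above = _mm_subs_epu8(a, hi); /* nonzero where src1 > upper */
    __m128i below = _mm_subs_epu8(lo, a); /* nonzero where src1 < lower */
    __m128i bad   = _mm_or_si128(above, below);

    /* ok is 0xFF in every in-range byte; invert the byte mask to flag the rest. */
    __m128i ok = _mm_cmpeq_epi8(bad, _mm_setzero_si128());
    return ~_mm_movemask_epi8(ok) & 0xFFFF;
}
```

A return value of 0 means all 16 bytes of src1 are within range of the matching bytes of src2.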

Jim Dempsey

tanwar_1981 · ‎05-14-2012

Hi Jim,

I am trying to use AVX instructions for the above code. Is there a reference guide for this instruction set?

I need to process integers, but I found that AVX has no integer support. So I tried to convert my integers to floating point using _mm256_cvtepi32_ps(ymm_t);

My code:

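unsigned small_pic[16][16];

for(i = 0; i < 16; ++i) {
    for(j = 0; j < 16; ++j) {
        fscanf(small_f, "%u", &small_pic[i][j]); // read one value into row i, column j
    }
}

for(i = 0; i < 16; ++i) {
    // load the first 8 values of row i (a row of 16 unsigned needs two 256-bit loads)
    __m256i ymm_t = _mm256_loadu_si256((__m256i *) small_pic[i]);
    __m256 ymm_jj = _mm256_cvtepi32_ps(ymm_t); // convert 8 x int32 to 8 x float
}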

Is this the right code to start with in terms of performance?

Next, how can I check whether the data was loaded correctly? For example, I would like to printf the contents of ymm_t.

Also, if I need to read large data (4096x4096) from a file that contains space-separated integers (max 8 bits), is there a faster way than the one below?

for(i = 0; i < 4096; ++i) {
    for(j = 0; j < 4096; ++j) {
        fscanf(big_f, "%d", &big_pic[i][j]);
    }
}

Thanks

AT

jimdempseyatthecove · ‎05-15-2012

From my understanding of your posts, you have 8-bit (unsigned) integers (0:100?).
Why are you trying to convert to float?

It may be beneficial to the readers if you state what you are trying to accomplish (as opposed to how you think you can do it).

RE: fscanf

Can you read the whole array in one shot?
Then use your own conversion function to convert space delimited text integers to "char" and store via p++.

if(readBigArray(big_f, bigBuff)) exit(-1);
if(convert(bigBuff, big_pic, 4096*4096)) exit(-2);
...

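// requires <ctype.h> for isspace/isdigit
// returns the number of values NOT converted (0 on success)
int convert(const char* b, unsigned char* c, int nc)
{
    // scan and convert
    while(nc > 0)
    {
        int i = 0;
        while(isspace((unsigned char)*b))
            ++b;
        if(!isdigit((unsigned char)*b))
            break; // end of buffer or unexpected character
        while(isdigit((unsigned char)*b))
            i = i*10 + (*b++ - '0');
        *c++ = (unsigned char)i;
        --nc;
    }
    return nc;
}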

You can add additional error tests if you like.
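
The "read the whole array in one shot" step could look roughly like the sketch below. The helper name read_all and its signature are hypothetical (the readBigArray call above is not defined anywhere in this thread). The sketch assumes the file fits in memory, and it NUL-terminates the buffer so that a digit scan like the one in convert stops at the end of the data.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads from the current position of f to EOF into a
   freshly malloc'ed, NUL-terminated buffer. Returns 0 on success and
   nonzero on failure. The caller frees *out. */
static int read_all(FILE *f, char **out, size_t *len)
{
    assert(f != NULL && out != NULL && len != NULL);
    size_t cap = (size_t)1 << 16; /* initial capacity, doubled as needed */
    size_t n = 0;                 /* bytes stored so far */
    char *buf = malloc(cap);
    if (buf == NULL)
        return -1;

    for (;;) {
        /* Always leave one byte free for the terminating NUL. */
        size_t got = fread(buf + n, 1, cap - n - 1, f);
        n += got;
        if (got == 0)
            break; /* EOF or read error */
        if (n + 1 == cap) { /* buffer full: grow it */
            char *p = realloc(buf, cap * 2);
            if (p == NULL) { free(buf); return -1; }
            buf = p;
            cap *= 2;
        }
    }
    if (ferror(f)) { free(buf); return -1; }

    buf[n] = '\0';
    *out = buf;
    *len = n;
    return 0;
}
```

One large fread followed by a hand-written scan avoids the per-value format parsing that fscanf does 4096*4096 times.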

Jim Dempsey

tanwar_1981 · ‎05-15-2012

Thanks Jim for the answer.

I need to compare each integer read from one file with an integer read from a different file.

In SSE 4.2 I could not find any instruction that provides add/sub/gt/lt for parallel 16-byte comparison. It only covers string equality checks.

In AVX I found functions for add/sub/gt/lt and many more. Then I learned that all of these functions are only available for float/double types. For integers they will be available in AVX2.

That's why I want to convert the integers to float and then use the functions above.

Do you know a better way?

-AT

jimdempseyatthecove · ‎05-15-2012

Is there some reason why you have to use src2-src2/10 as opposed to /8 or /16?

As stated earlier you can perform these divisions using the shift and mask (2 instructions) for all 16-bytes.
As opposed to 16 x (maybe 3 instructions).

Jim Dempsey

tanwar_1981 · ‎05-15-2012

Hi Jim,

This is part of the problem. Besides a/10, I also need to compute a%10.

Instead of 16 sequential operations, I am thinking of using subtract and compare in parallel.

Something like the equivalent of the following, using AVX instructions.

while(a>9)

{

a-=10;

++b;

}

Now a will have remainder and b will have quotient.
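
For illustration, the subtract-and-compare loop above can be run on 16 unsigned bytes at once with plain SSE2 integer instructions, without going through float. This is only a sketch: the function name divmod10_bytes is hypothetical, and the fixed 25 rounds assume nothing about the input range (10 rounds would be enough if no value exceeds 100).

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */

/* For each of the 16 bytes: quot[i] = in[i] / 10 and rem[i] = in[i] % 10,
   computed by repeated conditional subtraction of 10. */
static void divmod10_bytes(const uint8_t in[16], uint8_t quot[16], uint8_t rem[16])
{
    assert(in != NULL && quot != NULL && rem != NULL);
    const __m128i ten = _mm_set1_epi8(10);
    __m128i a = _mm_loadu_si128((const __m128i *)in);
    __m128i b = _mm_setzero_si128(); /* quotient accumulator */

    /* 255/10 == 25, so 25 rounds cover every possible byte value. */
    for (int round = 0; round < 25; ++round) {
        /* ge is 0xFF in bytes where a >= 10 (unsigned compare), else 0x00. */
        __m128i ge = _mm_cmpeq_epi8(_mm_max_epu8(a, ten), a);
        a = _mm_sub_epi8(a, _mm_and_si128(ge, ten)); /* a -= 10 where a >= 10 */
        b = _mm_sub_epi8(b, ge);                     /* b += 1: 0xFF is -1    */
    }

    _mm_storeu_si128((__m128i *)quot, b);
    _mm_storeu_si128((__m128i *)rem, a);
}
```

Every round costs about four instructions for all 16 bytes, so the loop is branch-free on the data but still grows with the largest possible quotient.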

One more thing: why did you mention 16 bytes? I didn't find any AVX instruction that operates on bytes/integers. Everything I am considering is in terms of float. Can you please give your input on this, as it is important for me to know whether I can have integer instructions. I am using gcc 4.4.6.

levicki · ‎05-17-2012

For integer AVX instructions you will have to wait until AVX2 is out which will be in 2013 with Haswell CPU if I am not mistaken.

Even then, you will not have parallel division for bytes.

Basically, as Jim has already suggested, if you are looking for help it is better to state what you want to accomplish instead of trying to guess how when you are obviously not aware of architectural and instruction set limitations.

jimdempseyatthecove · ‎05-18-2012

The following (non-SSE/AVX) untested code might produce what you want.
Or at least produce what you have told us you want.

struct div10_s
{
    // us[k] holds (high byte of k)/10 in its high byte and (low byte of k)/10 in its low byte
    unsigned short us[256*256];
    div10_s()
    {
        for(int k = 0; k < 256*256; ++k)
            us[k] = (unsigned short)((((k >> 8) / 10) << 8) | ((k & 0xFF) / 10));
    } // div10_s()
    __m128i _mm_div10_epu8(__m128i a)
    {
        __declspec(align(16)) // MSVC/Intel syntax; with gcc use __attribute__((aligned(16)))
        unsigned short asShorts[8];
        _mm_store_si128((__m128i*)asShorts, a);
        for(int i = 0; i < 8; ++i)
            asShorts[i] = us[asShorts[i]]; // one lookup divides two bytes
        return _mm_load_si128((__m128i*)asShorts);
    }
};

You still should consider /8 or /16 as this can be performed entirely within the SSE instruction set.
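
If a multiply is acceptable, a divide by 10 can also stay inside SSE2. The sketch below is not from this thread: it relies on the fact that (x * 205) >> 11 equals x / 10 for every x in 0..255, applied to bytes widened to 16-bit lanes. The function name div10_bytes is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */

/* out[i] = in[i] / 10 for 16 unsigned bytes, via q = (x * 205) >> 11. */
static void div10_bytes(const uint8_t in[16], uint8_t out[16])
{
    assert(in != NULL && out != NULL);
    const __m128i zero  = _mm_setzero_si128();
    const __m128i magic = _mm_set1_epi16(205);
    __m128i a = _mm_loadu_si128((const __m128i *)in);

    /* Widen to 16-bit lanes; the largest product, 255 * 205 = 52275,
       still fits in 16 unsigned bits. */
    __m128i lo = _mm_unpacklo_epi8(a, zero);
    __m128i hi = _mm_unpackhi_epi8(a, zero);
    lo = _mm_srli_epi16(_mm_mullo_epi16(lo, magic), 11);
    hi = _mm_srli_epi16(_mm_mullo_epi16(hi, magic), 11);

    /* Quotients are at most 25, so the saturating pack loses nothing. */
    _mm_storeu_si128((__m128i *)out, _mm_packus_epi16(lo, hi));
}
```

The remainder then follows as x - 10*q, and src2/10 from this routine can feed the same upper/lower bound comparison that the /8 and /16 variants use.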

Jim Dempsey
