Intel® ISA Extensions

Q on memory comparison optimization

Ravi_K_
Beginner
Hi All,

I am using AVX/SSE instructions to replace memcmp. Our workload includes comparing 64 bytes and occasionally 128 bytes. I am using the following function cmp32 for 32-byte comparisons and extend it 2 times for 64 bytes or 4 times for 128 bytes, yet I am hardly getting a 1% performance improvement. Testing was done on an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, Ubuntu 14.04 x86_64.

I tried replacing the following lines

vcmp = _mm256_cmpeq_epi64(xmm0, xmm1);
vmask = _mm256_movemask_epi8(vcmp);

with

vcmp = _mm256_xor_si256(xmm0, xmm1);
result = _mm256_testz_si256(vcmp, vcmp);

and the performance numbers are the same. Secondly, I tried replacing the unaligned loads with aligned loads, and that did not help either. Is there any additional optimization that can be done to improve performance?

static inline int cmp32(const uint8_t *src_1, const uint8_t *src_2)
{
    __m256i xmm0;
    __m256i xmm1;
    __m256i vcmp;
    uint32_t vmask;    /* unsigned, so the all-equal mask compares equal to 0xffffffff */

    xmm0 = _mm256_loadu_si256((const __m256i *)src_1);
    xmm1 = _mm256_loadu_si256((const __m256i *)src_2);

    vcmp = _mm256_cmpeq_epi64(xmm0, xmm1);
    vmask = _mm256_movemask_epi8(vcmp);
    if (likely(vmask == 0xffffffff)) {
        return 0;
    } else {
        vcmp = _mm256_cmpgt_epi64(xmm0, xmm1);
        vmask = _mm256_movemask_epi8(vcmp);
        if (vmask == 0xffffffff)
            return 1;
        else
            return -1;
    }
}
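For reference, a complete routine for the xor/testz variant described above might look like the sketch below. cmp32_testz is a name made up for illustration; note that, unlike memcmp, it only reports equality (1 for equal, 0 for different).

#include <immintrin.h>
#include <stdint.h>

static inline int cmp32_testz(const uint8_t *src_1, const uint8_t *src_2)
{
    __m256i a    = _mm256_loadu_si256((const __m256i *)src_1);
    __m256i b    = _mm256_loadu_si256((const __m256i *)src_2);
    __m256i diff = _mm256_xor_si256(a, b);    /* non-zero only where the inputs differ */

    return _mm256_testz_si256(diff, diff);    /* 1 if all 32 bytes are equal, 0 otherwise */
}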
1 Solution
Vladimir_Sedach
New Contributor I

Hi Ravi,

You're actually measuring the time of sprintf() and rand(),
since they are *much* slower than my_cmp64() or memcmp().

In the case of memcmp(), the compiler does not call it at all -- it knows the result in advance.
That's why memcmp() appears "faster".

I've tested my_cmp64() with the code below. It is about 10 times as fast as memcmp().

Please note the line:
 char * volatile    src = _src;


volatile makes the compiler really call my_cmp64() or memcmp() rather than optimize them out.

Call with, say, (src + 1) to test unaligned access.

===================================================

__inline int my_cmp64(const void* src_1, const void* src_2)
{
    const __m256i* src1 = (const __m256i*)src_1;
    const __m256i* src2 = (const __m256i*)src_2;

    __m256i mm11 = _mm256_lddqu_si256(src1);
    __m256i mm12 = _mm256_lddqu_si256(src1 + 1);

    __m256i mm21 = _mm256_lddqu_si256(src2);
    __m256i mm22 = _mm256_lddqu_si256(src2 + 1);

    __m256i mm1 = _mm256_xor_si256(mm11, mm21);
    __m256i mm2 = _mm256_xor_si256(mm12, mm22);
    __m256i mm = _mm256_or_si256(mm1, mm2);

    return !_mm256_testz_si256(mm, mm);
}

void test_cmp()
{
    __declspec(align(64)) char    _src[200];
    char * volatile    src = _src;
    double    time;
    int        i, result = 0;

    for (i = 0; i < sizeof(_src); i++)
        _src[i] = i;

    time = vx_time();
    for (i = 0; i < 100 * 1000 * 1000; i++)
        result += my_cmp64(src, src);
//        result += memcmp(src, src, 64);

    pl("time: %.3f, %d", vx_time(time), result);
}
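vx_time() and pl() are helpers from Vladimir's own library. A rough portable equivalent of the harness (a sketch only, not part of the original post) that reuses my_cmp64() from above and GCC's alignment attribute in place of __declspec:

#include <stdio.h>
#include <time.h>

static double now_sec(void)                    /* wall-clock seconds; stand-in for vx_time() */
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

void test_cmp_portable(void)
{
    char _src[200] __attribute__((aligned(64)));
    char * volatile src = _src;
    double start;
    int i, result = 0;

    for (i = 0; i < (int)sizeof(_src); i++)
        _src[i] = (char)i;

    start = now_sec();
    for (i = 0; i < 100 * 1000 * 1000; i++)
        result += my_cmp64(src, src);

    printf("time: %.3f, %d\n", now_sec() - start, result);
}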

51 Replies
andysem
New Contributor III

Try ORing my_cmp16 results on return into a single accumulator instead of lddqu and movemask.
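One way to read this for a 64-byte compare is to fold the per-16-byte XOR results into a single vector accumulator and test once at the end. A sketch of that reading (not andysem's code; cmp64_acc is a hypothetical name, and it only reports equal/not-equal):

#include <immintrin.h>
#include <stdint.h>

static inline int cmp64_acc(const uint8_t *src_1, const uint8_t *src_2)
{
    __m128i acc = _mm_setzero_si128();
    int i;

    for (i = 0; i < 4; i++) {
        __m128i a = _mm_lddqu_si128((const __m128i *)(src_1 + i * 16));
        __m128i b = _mm_lddqu_si128((const __m128i *)(src_2 + i * 16));
        acc = _mm_or_si128(acc, _mm_xor_si128(a, b));    /* accumulate byte differences */
    }

    return !_mm_testz_si128(acc, acc);    /* 0 if equal, 1 if any byte differed */
}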

 

Ravi_K_
Beginner
Thanks andysem, your earlier responses have been very helpful in understanding SIMD. I am assuming you are suggesting the version below; if yes, I had tried it before. Code snippet:

int my_cmp256(const void *src_1, const void *src_2)
{
    int ret;

    ret = my_cmp16((const uint8_t *)src_1 + 0 * 16, (const uint8_t *)src_2 + 0 * 16);
    ret |= my_cmp16((const uint8_t *)src_1 + 1 * 16, (const uint8_t *)src_2 + 1 * 16);
    ret |= my_cmp16((const uint8_t *)src_1 + 2 * 16, (const uint8_t *)src_2 + 2 * 16);
    ret |= my_cmp16((const uint8_t *)src_1 + 3 * 16, (const uint8_t *)src_2 + 3 * 16);
    ...
    ret |= my_cmp16((const uint8_t *)src_1 + 15 * 16, (const uint8_t *)src_2 + 15 * 16);

    return ret;
}

Unfortunately performance drastically increases. I debugged this and found that if I remove the last "ret |=", i.e. the "15 * 16" comparison, the performance number is in line with memcmp. I looked at the code generated by gcc and didn't see anything suspicious, so I am not sure yet why it happens.

Without the last ORing:

gcc -mavx2 -m64 -O3 test-memcmp-256-1-sse.c
/home/rkerur/compare# ./a.out
Time: 163 ticks (1533742 memcmp/tick)

With the last ORing I get:

gcc -mavx2 -m64 -O3 test-memcmp-256-1-sse.c
/home/rkerur/compare# ./a.out
Time: 1499 ticks (166777 memcmp/tick)

If this is not what you had in mind, please let me know; I am willing to try out your suggestion. If you are interested in the actual code I can send it out. Thanks.
Ravi_K_
Beginner
Also, the above logic may not work if I try to mimic the memcmp return values (0 == equal, negative == lesser, positive == greater). Thanks.
andysem
New Contributor III

Ravi K. wrote:

Unfortunately performance drastically increases. I debugged this and found that if I remove the last "ret |=", i.e. the "15 * 16" comparison, the performance number is in line with memcmp.

I assume you meant that the performance decreased. I'm not sure what you mean by removing the last OR - did you remove the whole last call to my_cmp16, or just replace the OR with something else? Note that if you assign to ret instead of combining it with its previous value, the compiler may eliminate all earlier code that composed that previous value.

There is a significant difference between a typical memcmp and your algorithm: memcmp will stop comparing when it finds the first difference, while your algorithm completes the traversal anyway. Depending on the input data, this may result in a different performance ratio between memcmp and my_cmp256. You have to take into account the real-world data that will most probably be fed to your algorithm.
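For illustration, an equality-only 256-byte routine that stops at the first mismatching 32-byte block could look like the sketch below (an illustrative example, not andysem's code; it assumes AVX2 and still does not provide memcmp's ordering result):

#include <immintrin.h>
#include <stdint.h>

static inline int cmp256_early_exit(const uint8_t *src_1, const uint8_t *src_2)
{
    int i;

    for (i = 0; i < 256; i += 32) {
        __m256i a = _mm256_loadu_si256((const __m256i *)(src_1 + i));
        __m256i b = _mm256_loadu_si256((const __m256i *)(src_2 + i));
        __m256i d = _mm256_xor_si256(a, b);

        if (!_mm256_testz_si256(d, d))    /* stop at the first difference, as memcmp does */
            return 1;
    }
    return 0;
}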

Ravi K. wrote:

Also, the above logic may not work if I try to mimic the memcmp return values (0 == equal, negative == lesser, positive == greater).

Your my_cmp16 and my_cmp256 did not support that interface, so as long as that is OK, ORing is also OK. If you do need ordering as well as equality, then you need to modify both my_cmp16 and my_cmp256 to accommodate that. See my initial post in this thread for an example.

 

Vladimir_Sedach
New Contributor I

andysem,

In my_memcmp() above it must be:
return rcmp - cmp;
instead of
return cmp - rcmp;

What's the origin of this method, especially its bit manipulation of ints?

Unfortunately it is much slower than byte-swapping and comparing 8-, 1-, 2-, and 4-byte values.
 

andysem
New Contributor III

Vladimir Sedach wrote:

andysem,

In my_memcmp() above must be:
return rcmp - cmp;
instead of
return cmp - rcmp;

Oh, right, thanks. I've fixed the post.

Vladimir Sedach wrote:

What's the origin of this method, especially its bit manipulation of ints?

I wrote similar code for Boost.UUID and adapted it for this post.

Vladimir Sedach wrote:

Unfortunately it is much slower than byte swapping and comparison of 8, 1, 2, 4 byte values.

I'm not sure I understand. Do you mean _mm_shuffle_epi8 to swap bytes? For 256-bit registers that would also require _mm256_permute2f128_si256, and that would only save a few integer instructions on the cmp/rcmp in the end. Did you benchmark it or did I misunderstand your suggestion?

 

Ravi_K_
Beginner
andysem wrote:

I assume you meant that the performance decreased. I'm not sure what you mean by removing the last OR - did you remove the whole last call to my_cmp16, or just replace the OR with something else? Note that if you assign to ret instead of combining it with its previous value, the compiler may eliminate all earlier code that composed that previous value.

No, it drastically increases CPU ticks and that's what is puzzling. When I say remove the last OR:

ret |= my_cmp16((const uint8_t *)src_1 + 15 * 16,
                (const uint8_t *)src_2 + 15 * 16)

With the last ORing I get:

gcc -mavx2 -m64 -O3 test-memcmp-256-1-sse.c
/home/rkerur/compare# ./a.out
Time: 1499 ticks (166777 memcmp/tick)

Instead of the above code, if I have:

ret = my_cmp16((const uint8_t *)src_1 + 15 * 16,
               (const uint8_t *)src_2 + 15 * 16)

gcc -mavx2 -m64 -O3 test-memcmp-256-1-sse.c
/home/rkerur/compare# ./a.out
Time: 163 ticks (1533742 memcmp/tick)

Since I am testing for equality, the return value didn't matter, and I was debugging to find out why the CPU ticks increase. If I use the logic to mimic memcmp return codes, then I cannot use just "ORing" logic, correct? Thanks.
andysem
New Contributor III

Ravi K. wrote:

No, it drastically increases CPU ticks and that's what is puzzling. When I say remove the last OR:

ret |= my_cmp16((const uint8_t *)src_1 + 15 * 16,
                (const uint8_t *)src_2 + 15 * 16)

With the last ORing I get:

gcc -mavx2 -m64 -O3 test-memcmp-256-1-sse.c
/home/rkerur/compare# ./a.out
Time: 1499 ticks (166777 memcmp/tick)

Instead of the above code, if I have:

ret = my_cmp16((const uint8_t *)src_1 + 15 * 16,
               (const uint8_t *)src_2 + 15 * 16)

gcc -mavx2 -m64 -O3 test-memcmp-256-1-sse.c
/home/rkerur/compare# ./a.out
Time: 163 ticks (1533742 memcmp/tick)

Since I am testing for equality, the return value didn't matter, and I was debugging to find out why the CPU ticks increase.

I'm pretty sure the compiler removed all calls to my_cmp16 except the last one, and that explains the difference in numbers. Have a look at the disassembled code of your test to verify that.
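One way to check is to disassemble the binary (for example with objdump -d), and one way to keep every call alive is to put a compiler barrier inside the timed loop. A sketch for GCC/Clang, in the same spirit as the volatile-pointer trick used in the accepted answer (my_cmp256, src_1 and src_2 are the names from the posts above, not defined here):

#include <stdint.h>
#include <stdio.h>

/* An empty asm statement that claims to modify memory: the compiler can neither
   delete the inlined comparisons as dead code nor hoist them out of the loop,
   because it must assume the buffers may have changed on every iteration. */
#define CLOBBER_MEMORY() __asm__ __volatile__("" ::: "memory")

void bench_cmp256(const uint8_t *src_1, const uint8_t *src_2)
{
    int i, result = 0;

    for (i = 0; i < 100 * 1000 * 1000; i++) {
        result += my_cmp256(src_1, src_2);
        CLOBBER_MEMORY();
    }
    printf("%d\n", result);    /* printing also keeps 'result' live */
}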

Ravi K. wrote:

If I use the logic to mimic memcmp return codes, then I cannot use just "ORing" logic, correct?

Yes.

 

Vladimir_Sedach
New Contributor I

Andy,

I mean:
__m128i    idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
__m128i    v0 = _mm_loadu_si128(a);
__m128i    v1 = _mm_loadu_si128(b);

if (v0 != v1)
{
    v0 = _mm_shuffle_epi8(v0, idx); //reverse byte order
    v1 = _mm_shuffle_epi8(v1, idx);

    v0 = _mm_xor_si128(v0, _mm_set1_epi8(0x80)); //to compare unsigned bytes with  instructions for signed bytes
    v1 = _mm_xor_si128(v1, _mm_set1_epi8(0x80));

    return _mm_movemask_epi8(v0 > v1) - _mm_movemask_epi8(v1 > v0);
}

Though _byteswap_uint64() and the like turned out to be faster:

#define CMP_64(a, b) { \
    __u64    x = bswap_64(*(__u64 *)(a)); \
    __u64    y = bswap_64(*(__u64 *)(b)); \
    if (x != y) return (x < y) ? -1 : 1; }

GCC generates very fast code for the last line.
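As a usage sketch (not from the thread): a 64-byte comparison can be built by applying CMP_64 at each 8-byte offset, with the first differing quadword deciding the result. It assumes glibc's <byteswap.h> for bswap_64 and a __u64 typedef like the one posted later in this thread; my_cmp64_bswap is a hypothetical name.

#include <byteswap.h>    /* bswap_64() on glibc */
#include <stdint.h>

typedef uint64_t __u64;   /* stand-in for the library typedef */

static inline int my_cmp64_bswap(const void *a, const void *b)
{
    const char *p = (const char *)a;
    const char *q = (const char *)b;
    int i;

    for (i = 0; i < 64; i += 8)
        CMP_64(p + i, q + i)    /* returns -1 or 1 at the first differing 8 bytes */

    return 0;                   /* all 64 bytes are equal */
}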
 

 

Ravi_K_
Beginner
Vladimir, what is the idx value in _mm_shuffle_epi8(v0, idx)? I am just trying to understand the code, hence the question. Thanks.
Vladimir_Sedach
New Contributor I

Ravi,

Sorry, I missed it:
__m128i    idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);


v0 = _mm_shuffle_epi8(v0, idx) reverses the byte order in v0.

Ravi_K_
Beginner
Vladimir,

Thanks for the information. I tested out

#define CMP_64(a, b) { \
    __u64    x = bswap_64(*(__u64 *)(a)); \
    __u64    y = bswap_64(*(__u64 *)(b)); \
    if (x != y) return (x < y) ? -1 : 1; }

and it is faster than SIMD. My question is: why swap before the comparison? Wouldn't a direct comparison be even faster? Thanks.
Ravi_K_
Beginner
andysem wrote:

I'm pretty sure the compiler removed all calls to my_cmp16 except the last one, and that explains the difference in numbers. Have a look at the disassembled code of your test to verify that.

Andy, my mistake, I should have paid attention. Yes, your explanation is correct. Thanks.
Vladimir_Sedach
New Contributor I

Ravi,

When comparing integers, we compare the high-order bytes first. On a little-endian machine they sit at offsets +7, +6, ... in memory.
In memcmp() we compare bytes at offsets +0, +1, ...
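A concrete illustration (not from the thread; it assumes a little-endian x86 host) of why the swap is needed before the integer compare:

#include <byteswap.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* memcmp says a < b because the byte at offset 0 decides: 0x01 < 0x02. */
    unsigned char a[8] = { 0x01, 0xFF, 0, 0, 0, 0, 0, 0 };
    unsigned char b[8] = { 0x02, 0x00, 0, 0, 0, 0, 0, 0 };
    uint64_t x, y;

    memcpy(&x, a, 8);    /* little-endian load: x = 0xFF01 */
    memcpy(&y, b, 8);    /* y = 0x0002 */

    printf("raw:     x > y -> %d (wrong order)\n", x > y);
    printf("swapped: x < y -> %d (matches memcmp)\n", bswap_64(x) < bswap_64(y));
    printf("memcmp:  a < b -> %d\n", memcmp(a, b, 8) < 0);
    return 0;
}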

andysem
New Contributor III

Vladimir Sedach wrote:

Andy,

I mean:
__m128i    idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
__m128i    v0 = _mm_loadu_si128(a);
__m128i    v1 = _mm_loadu_si128(b);

if (v0 != v1)
{
    v0 = _mm_shuffle_epi8(v0, idx); //reverse byte order
    v1 = _mm_shuffle_epi8(v1, idx);

    v0 = _mm_xor_si128(v0, _mm_set1_epi8(0x80)); //to compare unsigned bytes with  instructions for signed bytes
    v1 = _mm_xor_si128(v1, _mm_set1_epi8(0x80));

    return _mm_movemask_epi8(v0 > v1) - _mm_movemask_epi8(v1 > v0);
}

Though _byteswap_uint64() and the like turned out to be faster:

#define CMP_64(a, b) { \
    __u64    x = bswap_64(*(__u64 *)(a)); \
    __u64    y = bswap_64(*(__u64 *)(b)); \
    if (x != y) return (x < y) ? -1 : 1; }

GCC generates very fast code for the last line.
 

I'll have to run some tests with these variants. The SIMD variant requires two more memory loads, but you eliminated one _mm_xor_si128 and scalar integer math. The bswap version also looks very interesting.

Can I use your code in my projects? I might update Boost.UUID if it proves beneficial.

 

Vladimir_Sedach
New Contributor I

Andy,

Yes, you can freely use the code.
The code below is an excerpt from my library that will appear publicly one day.
memeq() returns 1 for equality and 0 otherwise.
The functions are faster when called with a constant "size" parameter, thanks to smart inlining.

typedef char                    __i8;
typedef unsigned char            __u8;
typedef short                    __i16;
typedef unsigned short            __u16;
typedef int                        __i32;
typedef unsigned int            __u32;
typedef long long                __i64;    // GCC's stdint.h int64_t is either "long long int" (32-bit) or "long int" (64-bit)
typedef unsigned long long        __u64;

#ifdef __GNUC__
__inline __u16 bswap_16(__u16 a)
{    return __builtin_bswap16(a);}

__inline __u32 bswap_32(__u32 a)
{    return __builtin_bswap32(a);}

__inline __u64 bswap_64(__u64 a)
{    return __builtin_bswap64(a);}

#else

__inline __u16 bswap_16(__u16 a)
{    return _byteswap_ushort(a);}

__inline __u32 bswap_32(__u32 a)
{    return _byteswap_ulong(a);}

__inline __u64 bswap_64(__u64 a)
{    return _byteswap_uint64(a);}
#endif

//****************************************
#define CMP_1(a, b) { \
    __u8    x = *(__u8 *)(a); \
    __u8    y = *(__u8 *)(b); \
    if (x != y) return x - y; }

#define _CMP_1(a, b) \
    return *(__u8 *)(a) - *(__u8 *)(b);
//****************************************
#define CMP_2(a, b) { \
    __u16    x = bswap_16(*(__u16 *)(a)); \
    __u16    y = bswap_16(*(__u16 *)(b)); \
    if (x != y) return x - y; }

#define _CMP_2(a, b) { \
    __u16    x = bswap_16(*(__u16 *)(a)); \
    __u16    y = bswap_16(*(__u16 *)(b)); \
    return x - y; }
//****************************************
#define CMP_4(a, b) { \
    __u32    x = bswap_32(*(__u32 *)(a)); \
    __u32    y = bswap_32(*(__u32 *)(b)); \
    if (x != y) return (x < y) ? -1 : 1; }

#define _CMP_4(a, b) { \
    __u32    x = bswap_32(*(__u32 *)(a)); \
    __u32    y = bswap_32(*(__u32 *)(b)); \
    return (x < y) ? -1 : (x > y) ? 1 : 0; }
//****************************************
#define CMP_8(a, b) { \
    __u64    x = bswap_64(*(__u64 *)(a)); \
    __u64    y = bswap_64(*(__u64 *)(b)); \
    if (x != y) return (x < y) ? -1 : 1; }

#define _CMP_8(a, b) { \
    __u64    x = bswap_64(*(__u64 *)(a)); \
    __u64    y = bswap_64(*(__u64 *)(b)); \
    return (x < y) ? -1 : (x > y) ? 1 : 0; }
//****************************************

//*************************************************************************
static __inline __i32 _memcmp(const void *_a, const void *_b, size_t _size)
//*************************************************************************
{
    __u8    *a = (__u8 *)_a;
    __u8    *b = (__u8 *)_b;
    ptrdiff_t    size = _size;
    __u64    x, y;
    ptrdiff_t    i;

    if (!size)
        return 0;

    CMP_1(a, b)

    if (size >= 32)
        goto cmp_long;

    for (i = 0; i <= size - 16; i += 16, a += 16, b += 16)
    {
        CMP_8(a + 0, b + 0)
        CMP_8(a + 8, b + 8)
    }

cmp_15:
    switch (size - i)
    {
    case 0:
        return 0;
    case 1:
        _CMP_1(a, b)
    case 2:
        _CMP_2(a, b)
    case 3:
        CMP_2(a, b)
        _CMP_1(a + 2, b + 2)
    case 4:
        _CMP_4(a, b)
    case 5:
        CMP_4(a, b)
        _CMP_1(a + 4, b + 4)
    case 6:
        CMP_4(a, b)
        _CMP_2(a + 4, b + 4)
    case 7:
        CMP_4(a, b)
        CMP_2(a + 4, b + 4)
        _CMP_1(a + 6, b + 6)
    case 8:
        _CMP_8(a, b)
    case 9:
        CMP_8(a, b)
        _CMP_1(a + 8, b + 8)
    case 10:
        CMP_8(a, b)
        _CMP_2(a + 8, b + 8)
    case 11:
        CMP_8(a, b)
        CMP_2(a + 8, b + 8)
        _CMP_1(a + 10, b + 10)
    case 12:
        CMP_8(a, b)
        _CMP_4(a + 8, b + 8)
    case 13:
        CMP_8(a, b)
        CMP_4(a + 8, b + 8)
        _CMP_1(a + 12, b + 12)
    case 14:
        CMP_8(a, b)
        CMP_4(a + 8, b + 8)
        _CMP_2(a + 12, b + 12)
    case 15:
        CMP_8(a, b)
        CMP_4(a + 8, b + 8)
        CMP_2(a + 12, b + 12)
        _CMP_1(a + 14, b + 14)
    } //switch

cmp_long:
    for (i = 0; i <= size - 32; i += 32, a += 32, b += 32)
    {
        x = *(__u64 *)(a +  0);    if (x != *(__u64 *)(b +  0))    goto ret0;
        x = *(__u64 *)(a +  8);    if (x != *(__u64 *)(b +  8))    goto ret8;
        x = *(__u64 *)(a + 16);    if (x != *(__u64 *)(b + 16))    goto ret16;
        x = *(__u64 *)(a + 24);    if (x != *(__u64 *)(b + 24))    goto ret24;
    }

    if (size - i < 16)
        goto cmp_15;

    x = *(__u64 *)(a + 0);    if (x != *(__u64 *)(b + 0))    goto ret0;
    x = *(__u64 *)(a + 8);    if (x != *(__u64 *)(b + 8))    goto ret8;

    a += 16;
    b += 16;
    i += 16;
    goto cmp_15;

ret0:    y = *(__u64 *)(b +  0);    goto ret;
ret8:    y = *(__u64 *)(b +  8);    goto ret;
ret16:    y = *(__u64 *)(b + 16);    goto ret;
ret24:    y = *(__u64 *)(b + 24);    goto ret;

ret:
    x = bswap_64(x);
    y = bswap_64(y);
    return (x < y) ? -1 : (x > y) ? 1 : 0;
} //_memcmp

//***********************************************
#define CMPEQ_1(a, b) \
    if (*(__u8 *)(a) != *(__u8 *)(b)) return 0;

#define _CMPEQ_1(a, b) \
    return *(__u8 *)(a) == *(__u8 *)(b);
//***********************************************
#define CMPEQ_2(a, b) \
    if (*(__u16 *)(a) != *(__u16 *)(b)) return 0;

#define _CMPEQ_2(a, b) \
    return *(__u16 *)(a) == *(__u16 *)(b);
//***********************************************
#define CMPEQ_4(a, b) \
    if (*(__u32 *)(a) != *(__u32 *)(b)) return 0;

#define _CMPEQ_4(a, b) \
    return *(__u32 *)(a) == *(__u32 *)(b);
//***********************************************
#define CMPEQ_8(a, b) \
    if (*(__u64 *)(a) != *(__u64 *)(b)) return 0;

#define _CMPEQ_8(a, b) \
    return *(__u64 *)(a) == *(__u64 *)(b);
//***********************************************

//***********************************************************************
static __inline __i32 memeq(const void *_a, const void *_b, size_t _size)
//***********************************************************************
{
    __u8    *a = (__u8 *)_a;
    __u8    *b = (__u8 *)_b;
    ptrdiff_t    size = _size;
    ptrdiff_t    i;
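    /* __v1i8 and loadu() below come from Vladimir's library (not shown here); they are only used in the VX_SSE2 branch. */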
    __v1i8    v0, v1;

    if (!size)
        return 1;

    CMPEQ_1(a, b)

#if 1
    for (i = 0; i <= size - 16; i += 16, a += 16, b += 16)
    {
        CMPEQ_8(a + 0, b + 0)
        CMPEQ_8(a + 8, b + 8)
    }
#elif VX_SSE2
    for (i = 0; i <= size - 16; i += 16, a += 16, b += 16)
    {
        v0 = loadu(a);
        v1 = loadu(b);

        if (v0 != v1)
            return 0;
    }
#endif

    switch (size - i)
    {
    case 0:
        return 1;
    case 1:
        _CMPEQ_1(a, b)
    case 2:
        _CMPEQ_2(a, b)
    case 3:
        CMPEQ_2(a, b)
        _CMPEQ_1(a + 2, b + 2)
    case 4:
        _CMPEQ_4(a, b)
    case 5:
        CMPEQ_4(a, b)
        _CMPEQ_1(a + 4, b + 4)
    case 6:
        CMPEQ_4(a, b)
        _CMPEQ_2(a + 4, b + 4)
    case 7:
        CMPEQ_4(a, b)
        CMPEQ_2(a + 4, b + 4)
        _CMPEQ_1(a + 6, b + 6)
    case 8:
        _CMPEQ_8(a, b)
    case 9:
        CMPEQ_8(a, b)
        _CMPEQ_1(a + 8, b + 8)
    case 10:
        CMPEQ_8(a, b)
        _CMPEQ_2(a + 8, b + 8)
    case 11:
        CMPEQ_8(a, b)
        CMPEQ_2(a + 8, b + 8)
        _CMPEQ_1(a + 10, b + 10)
    case 12:
        CMPEQ_8(a, b)
        _CMPEQ_4(a + 8, b + 8)
    case 13:
        CMPEQ_8(a, b)
        CMPEQ_4(a + 8, b + 8)
        _CMPEQ_1(a + 12, b + 12)
    case 14:
        CMPEQ_8(a, b)
        CMPEQ_4(a + 8, b + 8)
        _CMPEQ_2(a + 12, b + 12)
    case 15:
        CMPEQ_8(a, b)
        CMPEQ_4(a + 8, b + 8)
        CMPEQ_2(a + 12, b + 12)
        _CMPEQ_1(a + 14, b + 14)
    } //switch
} //memeq
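A quick sanity check of the two entry points might look like this (an added sketch, assuming the remaining pieces of the library excerpt such as __v1i8/loadu are available; it only checks that the sign of _memcmp() agrees with the C library's memcmp()):

#include <assert.h>
#include <string.h>

void sanity_check(void)
{
    char a[64] = "hello", b[64] = "hellp";    /* differ at offset 4 */

    assert(memeq(a, a, sizeof(a)) == 1);      /* 1 == equal */
    assert(memeq(a, b, sizeof(a)) == 0);      /* 0 == different */

    /* _memcmp() should agree in sign with the C library memcmp(). */
    assert((_memcmp(a, b, sizeof(a)) < 0) == (memcmp(a, b, sizeof(a)) < 0));
}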

Vladimir_Sedach
New Contributor I

The above _memcmp() and memeq() became faster because they compare the first bytes first.

Ravi_K_
Beginner
Vladimir, Andy,

I tried the bswap_64 approach for comparison, and it works fine for small sizes. As the comparison size increases (256, 512, 1024, ...), this mechanism just dies down and can't beat memcmp. I hope I am not misunderstanding the code. I had used 16-byte SSE comparisons before and couldn't beat memcmp with those either, so I wanted to know whether your use case involves larger comparisons and whether it has worked there. Thanks.
jimdempseyatthecove
Honored Contributor III

Ravi,

Here is another option, which I will describe and let you write the code for (it should be simple enough to do):

a) perform two compares, one for gt and the other for eq; the third outcome (lt) is anything else.
b) perform a logical AND on each, producing a 2-bit field (2 == gt, 1 == eq, 0 == lt), with each successive 2-bit field shifted left by 2 bits in either little-endian or big-endian order, depending on how you organize your data. The "shifting" is performed by the constant used in the AND mask.
c) perform a logical OR on the two masked results, completing each 2-bit field.
d) perform a horizontal add (hadd) to combine the fields into a composite number (repeat as often as necessary for the width of the original input data: 64-bit, 32-bit, etc.).

What you essentially are doing is producing

(least significant 2-bit field) +
(next significant 2-bit field)<<2 +
(next significant 2-bit field)<<4 +
(next significant 2-bit field)<<6 +
...

You'd produce as many 2-bit results as fit within an __m128i (or __m256i) register before writing to memory. Then use GP-register instructions to complete the process.

As to whether this will be faster than what is posted above, I cannot say; it will have to be tested.

Jim Dempsey

Vladimir_Sedach
New Contributor I

Ravi,

If you want to compare for equality only, it's better to use memeq() above.
I've updated _memcmp() to handle large memory chunks fast.

Ravi_K_
Beginner
Thanks, Vladimir, for sharing your code; I will look into it in my free time. For now I am mostly interested in SIMD comparisons, and one thing that struck me in your code is the vector comparison:

__m128i    idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
__m128i    v0 = _mm_loadu_si128(a);
__m128i    v1 = _mm_loadu_si128(b);

if (v0 != v1)
{
    ...
}

I couldn't find any intrinsic function that can return an int here, hence I have to use _xor_ and _testz_. Did you avoid using _xor_ and _testz_? If yes, could you please share your knowledge?

The pseudo code I am using for testing is below.

/* compares 16 bytes */
static inline int my_cmp16(const void *src_1, const void *src_2)
{
    __m128i xmm0, xmm1, xmm2;

    xmm0 = _mm_lddqu_si128((const __m128i *)src_1);
    xmm1 = _mm_lddqu_si128((const __m128i *)src_2);
    xmm2 = _mm_xor_si128(xmm0, xmm1);
    ...
}

/* compares 0 to 15 bytes */
static inline int my_memcmp_regular(const uint8_t *src_1u, const uint8_t *src_2u, size_t n)
{
    int ret = 1;

    /*
     * Compare less than 16 bytes
     */
    if (n & 0x08) {
        ret = (*(const uint64_t *)src_1u == *(const uint64_t *)src_2u);
        if ((ret != 1))
            goto exit_8;
        n -= 0x8;
        src_1u += 0x8;
        src_2u += 0x8;
    }
    ...
}

static inline int my_cmp32(const void *src_1, const void *src_2)
{
    my_cmp16(...);
    my_cmp16(...);
}

static inline int my_cmp48(const void *src_1, const void *src_2)
{
    my_cmp16(...);
    my_cmp16(...);
    my_cmp16(...);
}

Similarly I have versions for 64-, 128- and 256-byte comparisons.

static inline int my_memcmp(const void *_src_1, const void *_src_2, size_t n)
{
    const uint8_t *src_1 = (const uint8_t *)_src_1;
    const uint8_t *src_2 = (const uint8_t *)_src_2;
    int ret = 0;

    if (n < 16)
        return my_memcmp_regular(src_1, src_2, n);

    if (n <= 32) {
        ret = my_cmp16(src_1, src_2);
        if (unlikely(ret != 0))
            return ret;

        return my_cmp16(src_1 - 16 + n, src_2 - 16 + n);
    }

    if (n <= 48) {
        ret = my_cmp32(src_1, src_2);
        if (unlikely(ret != 0))
            return ret;

        return my_cmp16(src_1 - 16 + n, src_2 - 16 + n);
    }

    if (n <= 64) {
        ret = my_cmp32(src_1, src_2);
        if (unlikely(ret != 0))
            return ret;

        ret = my_cmp16(src_1 + 32, src_2 + 32);
        if (unlikely(ret != 0))
            return ret;

        return my_cmp16(src_1 - 16 + n, src_2 - 16 + n);
    }
    ...
}

Thanks.