Tags: Intel® Advanced Vector Extensions (Intel® AVX), Intel® Streaming SIMD Extensions, Parallel Computing
Hi Ravi,
You're actually measuring the time of sprintf() and rand(), since they are *much* slower than my_cmp64() or memcmp().
In the case of memcmp(), the compiler does not call it at all -- it knows the result in advance. That's why memcmp() appears "faster".
I've tested my_cmp64() with the code below. It is about 10 times as fast as memcmp().
Please note the line:
char * volatile src = _src;
The volatile forces the compiler to really call my_cmp64() or memcmp() instead of optimizing the calls away.
Call with, say, (src + 1) to test unaligned access.
===================================================
#include <immintrin.h>

__inline int my_cmp64(const void *src_1, const void *src_2)
{
    const __m256i *src1 = (const __m256i *)src_1;
    const __m256i *src2 = (const __m256i *)src_2;
    __m256i mm11 = _mm256_lddqu_si256(src1);        //these loads tolerate unaligned pointers
    __m256i mm12 = _mm256_lddqu_si256(src1 + 1);
    __m256i mm21 = _mm256_lddqu_si256(src2);
    __m256i mm22 = _mm256_lddqu_si256(src2 + 1);
    __m256i mm1 = _mm256_xor_si256(mm11, mm21);     //nonzero bits mark differences
    __m256i mm2 = _mm256_xor_si256(mm12, mm22);
    __m256i mm = _mm256_or_si256(mm1, mm2);
    return !_mm256_testz_si256(mm, mm);             //0 == equal, 1 == different
}

void test_cmp()
{
    __declspec(align(64)) char _src[200];
    char * volatile src = _src;
    double time;
    int i, result = 0;
    for (i = 0; i < sizeof(_src); i++)
        _src[i] = (char)i;
    time = vx_time();                               //vx_time() and pl() are my helpers
    for (i = 0; i < 100 * 1000 * 1000; i++)
        result += my_cmp64(src, src);
    // result += memcmp(src, src, 64);
    pl("time: %.3f, %d", vx_time(time), result);
}
Try ORing the my_cmp16 results into a single accumulator on return, instead of lddqu and movemask.
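A minimal sketch of the idea, assuming the my_cmp16() from earlier in the thread (which returns nonzero on inequality); the 16-chunk loop over 256 bytes is an assumption about the data layout:

#include <stdint.h>

/* Sketch: OR each 16-byte result into one accumulator and test once at the
   end, instead of branching per chunk. */
static __inline int my_cmp256_acc(const void *src_1, const void *src_2)
{
    int ret = 0;
    int i;
    for (i = 0; i < 16; i++)        /* 16 chunks x 16 bytes = 256 bytes */
        ret |= my_cmp16((const uint8_t *)src_1 + i * 16,
                        (const uint8_t *)src_2 + i * 16);
    return ret;                     /* 0 == equal, nonzero == different */
}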
Ravi K. wrote:
unfortunately performance drastically increases. I had debugged this and found out that if I removed the last "ret |=", i.e. from the "15 * 16" comparison, the performance number is in line with memcmp.
I assume you meant that the performance decreased. I'm not sure what you mean by removing the last OR - did you remove the whole last call to my_cmp16 or just replace the OR with something else? Note that if you assign to ret instead of combining it with its previous value, the compiler may eliminate all the earlier code that composed that previous value.
There is a significant difference between a typical memcmp and your algorithm: memcmp stops comparing when it finds the first difference, while your algorithm completes the traversal regardless. Depending on the input data, this may result in a different performance ratio between memcmp and my_cmp256. You have to take into account the real-world data that will most probably be fed to your algorithm.
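For illustration, a short-circuiting SSE2 loop might look like this (just a sketch with names of my own choosing; it assumes the length is a multiple of 16):

#include <emmintrin.h>
#include <stddef.h>

/* Sketch: stop at the first differing 16-byte chunk, like a typical memcmp */
static __inline int memeq_early_exit(const void *a, const void *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i += 16)
    {
        __m128i va = _mm_loadu_si128((const __m128i *)((const char *)a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)((const char *)b + i));
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF)
            return 0;   /* difference found: the rest is never touched */
    }
    return 1;
}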
Ravi K. wrote:
Also, the above logic may not work if I try to mimic memcmp return values (0 == equal, negative == less, positive == greater).
Your my_cmp16 and my_cmp256 did not support that interface, so as long as that is OK, ORing is also OK. If you do need ordering as well as equality, then you need to modify both my_cmp16 and my_cmp256 to accommodate that. See my initial post in this thread for an example.
andysem,
In my_memcmp() above it must be:
return rcmp - cmp;
instead of
return cmp - rcmp;
What's the origin of this method, especially its bit manipulation of ints?
Unfortunately it is much slower than byte swapping and comparing 8-, 4-, 2-, and 1-byte values.
Vladimir Sedach wrote:
andysem,
In my_memcmp() above it must be:
return rcmp - cmp;
instead of
return cmp - rcmp;
Oh, right, thanks. I've fixed the post.
Vladimir Sedach wrote:
What's the origin of this method, especially its bit manipulation of ints?
I wrote similar code for Boost.UUID and adapted it for this post.
Vladimir Sedach wrote:
Unfortunately it is much slower than byte swapping and comparing 8-, 4-, 2-, and 1-byte values.
I'm not sure I understand. Do you mean _mm_shuffle_epi8 to swap bytes? For 256-bit registers that would also require _mm256_permute2f128_si256, and that would only save a few integer instructions on the cmp/rcmp in the end. Did you benchmark it or did I misunderstand your suggestion?
Ravi K. wrote:
No, it drastically increases CPU ticks, and that's what is puzzling. When I say "remove the last OR": with the last ORing,

ret |= my_cmp16((const uint8_t *)src_1 + 15 * 16,
                (const uint8_t *)src_2 + 15 * 16);

I get

gcc -mavx2 -m64 -O3 test-memcmp-256-1-sse.c
/home/rkerur/compare# ./a.out
Time: 1499 ticks (166777 memcmp/tick)

Instead of the above code, if I have

ret = my_cmp16((const uint8_t *)src_1 + 15 * 16,
               (const uint8_t *)src_2 + 15 * 16);

gcc -mavx2 -m64 -O3 test-memcmp-256-1-sse.c
/home/rkerur/compare# ./a.out
Time: 163 ticks (1533742 memcmp/tick)

Since I am testing for equality, the return value didn't matter, and I was debugging to find out why the increase in CPU ticks.
I'm pretty sure the compiler removed all calls to my_cmp16 except the last one, and that explains the difference in numbers. Have a look at the disassembled code of your test to verify that.
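For example (two standard ways to inspect it, using your GCC toolchain):

gcc -mavx2 -m64 -O3 -S test-memcmp-256-1-sse.c    # emit assembly to a .s file
objdump -d a.out                                  # or disassemble the binary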
Ravi K. wrote:
If I use the logic to mimic memcmp return codes, then I cannot use just "ORing" logic, correct?
Yes.
Andy,
I mean:
__m128i idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
__m128i v0 = _mm_loadu_si128((const __m128i *)a);
__m128i v1 = _mm_loadu_si128((const __m128i *)b);
if (_mm_movemask_epi8(_mm_cmpeq_epi8(v0, v1)) != 0xFFFF)
{
    v0 = _mm_shuffle_epi8(v0, idx); //reverse byte order
    v1 = _mm_shuffle_epi8(v1, idx);
    v0 = _mm_xor_si128(v0, _mm_set1_epi8((char)0x80)); //to compare unsigned bytes with instructions for signed bytes
    v1 = _mm_xor_si128(v1, _mm_set1_epi8((char)0x80));
    return _mm_movemask_epi8(_mm_cmpgt_epi8(v0, v1)) - _mm_movemask_epi8(_mm_cmpgt_epi8(v1, v0));
}

Though _byteswap_uint64() and the like appeared to be faster:

#define CMP_64(a, b) { \
    __u64 x = bswap_64(*(__u64 *)(a)); \
    __u64 y = bswap_64(*(__u64 *)(b)); \
    if (x != y) return (x < y) ? -1 : 1; }

GCC generates very fast code for the last line.
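For instance, a 16-byte key comparison just chains it (a hypothetical usage sketch, not from my library):

/* Compare two 16-byte keys as two big-endian 64-bit words */
static __inline int cmp_16_bytes(const void *a, const void *b)
{
    CMP_64((const char *)a + 0, (const char *)b + 0)
    CMP_64((const char *)a + 8, (const char *)b + 8)
    return 0; /* all 16 bytes are equal */
}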
Ravi,
Sorry, I missed it:
__m128i idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
v0 = _mm_shuffle_epi8(v0, idx) reverses the byte order in v0.
Ravi,
When comparing integers, we effectively compare the high-order bytes first. On little-endian x86 those sit at offsets +7, +6, ... in memory.
In memcmp() we compare bytes at offsets +0, +1, ...
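A tiny sketch of the point (little-endian x86 assumed; load_be64 is only an illustration, not from my library):

#include <stdint.h>
#include <string.h>

/* On little-endian x86 the byte at offset +0 is the LOW-order byte of an
   integer, so a raw 64-bit compare orders by the wrong end of the buffer.
   Byte swapping puts the byte at offset +0 into the most significant
   position -- exactly the order memcmp() needs. */
static uint64_t load_be64(const void *p)
{
    uint64_t x;
    memcpy(&x, p, 8);
    return __builtin_bswap64(x);
}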
Vladimir Sedach wrote:
Andy,
I mean:

__m128i idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
__m128i v0 = _mm_loadu_si128((const __m128i *)a);
__m128i v1 = _mm_loadu_si128((const __m128i *)b);
if (_mm_movemask_epi8(_mm_cmpeq_epi8(v0, v1)) != 0xFFFF)
{
    v0 = _mm_shuffle_epi8(v0, idx); //reverse byte order
    v1 = _mm_shuffle_epi8(v1, idx);
    v0 = _mm_xor_si128(v0, _mm_set1_epi8((char)0x80)); //to compare unsigned bytes with instructions for signed bytes
    v1 = _mm_xor_si128(v1, _mm_set1_epi8((char)0x80));
    return _mm_movemask_epi8(_mm_cmpgt_epi8(v0, v1)) - _mm_movemask_epi8(_mm_cmpgt_epi8(v1, v0));
}

Though _byteswap_uint64() and the like appeared to be faster:

#define CMP_64(a, b) { \
    __u64 x = bswap_64(*(__u64 *)(a)); \
    __u64 y = bswap_64(*(__u64 *)(b)); \
    if (x != y) return (x < y) ? -1 : 1; }

GCC generates very fast code for the last line.
I'll have to run some tests with these variants. The SIMD variant requires two more memory loads, but you eliminated one _mm_xor_si128 and the scalar integer math. The bswap version also looks very interesting.
Can I use your code in my projects? I might update Boost.UUID if it proves beneficial.
Andy,
Yes, you can freely use the code.
The code below is an excerpt from my library, which will appear publicly one day.
memeq() returns 1 for equality and 0 otherwise.
The functions are faster if called with a constant "size" parameter, thanks to smart inlining.
typedef char __i8;
typedef unsigned char __u8;
typedef short __i16;
typedef unsigned short __u16;
typedef int __i32;
typedef unsigned int __u32;
typedef long long __i64; //GCC stdint.h->int64_t is either "long long int" (32-bit) or "long int" (64-bit)
typedef unsigned long long __u64;
#ifdef __GNUC__
__inline __u16 bswap_16(__u16 a)
{ return __builtin_bswap16(a);}
__inline __u32 bswap_32(__u32 a)
{ return __builtin_bswap32(a);}
__inline __u64 bswap_64(__u64 a)
{ return __builtin_bswap64(a);}
#else
__inline __u16 bswap_16(__u16 a)
{ return _byteswap_ushort(a);}
__inline __u32 bswap_32(__u32 a)
{ return _byteswap_ulong(a);}
__inline __u64 bswap_64(__u64 a)
{ return _byteswap_uint64(a);}
#endif
//****************************************
#define CMP_1(a, b) { \
__u8 x = *(__u8 *)(a); \
__u8 y = *(__u8 *)(b); \
if (x != y) return x - y; }
#define _CMP_1(a, b) \
return *(__u8 *)(a) - *(__u8 *)(b);
//****************************************
#define CMP_2(a, b) { \
__u16 x = bswap_16(*(__u16 *)(a)); \
__u16 y = bswap_16(*(__u16 *)(b)); \
if (x != y) return x - y; }
#define _CMP_2(a, b) { \
__u16 x = bswap_16(*(__u16 *)(a)); \
__u16 y = bswap_16(*(__u16 *)(b)); \
return x - y; }
//****************************************
#define CMP_4(a, b) { \
__u32 x = bswap_32(*(__u32 *)(a)); \
__u32 y = bswap_32(*(__u32 *)(b)); \
if (x != y) return (x < y) ? -1 : 1; }
#define _CMP_4(a, b) { \
__u32 x = bswap_32(*(__u32 *)(a)); \
__u32 y = bswap_32(*(__u32 *)(b)); \
return (x < y) ? -1 : (x > y) ? 1 : 0; }
//****************************************
#define CMP_8(a, b) { \
__u64 x = bswap_64(*(__u64 *)(a)); \
__u64 y = bswap_64(*(__u64 *)(b)); \
if (x != y) return (x < y) ? -1 : 1; }
#define _CMP_8(a, b) { \
__u64 x = bswap_64(*(__u64 *)(a)); \
__u64 y = bswap_64(*(__u64 *)(b)); \
return (x < y) ? -1 : (x > y) ? 1 : 0; }
//****************************************
//*************************************************************************
static __inline __i32 _memcmp(const void *_a, const void *_b, size_t _size)
//*************************************************************************
{
    __u8 *a = (__u8 *)_a;
    __u8 *b = (__u8 *)_b;
    ptrdiff_t size = _size;
    __u64 x, y;
    ptrdiff_t i;

    if (!size)
        return 0;
    CMP_1(a, b)
    if (size >= 32)
        goto cmp_long;
    for (i = 0; i <= size - 16; i += 16, a += 16, b += 16)
    {
        CMP_8(a + 0, b + 0)
        CMP_8(a + 8, b + 8)
    }
cmp_15:
    switch (size - i)
    {
    case 0:
        return 0;
    case 1:
        _CMP_1(a, b)
    case 2:
        _CMP_2(a, b)
    case 3:
        CMP_2(a, b)
        _CMP_1(a + 2, b + 2)
    case 4:
        _CMP_4(a, b)
    case 5:
        CMP_4(a, b)
        _CMP_1(a + 4, b + 4)
    case 6:
        CMP_4(a, b)
        _CMP_2(a + 4, b + 4)
    case 7:
        CMP_4(a, b)
        CMP_2(a + 4, b + 4)
        _CMP_1(a + 6, b + 6)
    case 8:
        _CMP_8(a, b)
    case 9:
        CMP_8(a, b)
        _CMP_1(a + 8, b + 8)
    case 10:
        CMP_8(a, b)
        _CMP_2(a + 8, b + 8)
    case 11:
        CMP_8(a, b)
        CMP_2(a + 8, b + 8)
        _CMP_1(a + 10, b + 10)
    case 12:
        CMP_8(a, b)
        _CMP_4(a + 8, b + 8)
    case 13:
        CMP_8(a, b)
        CMP_4(a + 8, b + 8)
        _CMP_1(a + 12, b + 12)
    case 14:
        CMP_8(a, b)
        CMP_4(a + 8, b + 8)
        _CMP_2(a + 12, b + 12)
    case 15:
        CMP_8(a, b)
        CMP_4(a + 8, b + 8)
        CMP_2(a + 12, b + 12)
        _CMP_1(a + 14, b + 14)
    } //switch
cmp_long:
    for (i = 0; i <= size - 32; i += 32, a += 32, b += 32)
    {
        x = *(__u64 *)(a + 0); if (x != *(__u64 *)(b + 0)) goto ret0;
        x = *(__u64 *)(a + 8); if (x != *(__u64 *)(b + 8)) goto ret8;
        x = *(__u64 *)(a + 16); if (x != *(__u64 *)(b + 16)) goto ret16;
        x = *(__u64 *)(a + 24); if (x != *(__u64 *)(b + 24)) goto ret24;
    }
    if (size - i < 16)
        goto cmp_15;
    x = *(__u64 *)(a + 0); if (x != *(__u64 *)(b + 0)) goto ret0;
    x = *(__u64 *)(a + 8); if (x != *(__u64 *)(b + 8)) goto ret8;
    a += 16;
    b += 16;
    i += 16;
    goto cmp_15;
ret0: y = *(__u64 *)(b + 0); goto ret;
ret8: y = *(__u64 *)(b + 8); goto ret;
ret16: y = *(__u64 *)(b + 16); goto ret;
ret24: y = *(__u64 *)(b + 24); goto ret;
ret:
    x = bswap_64(x);
    y = bswap_64(y);
    return (x < y) ? -1 : (x > y) ? 1 : 0;
} //_memcmp
//***********************************************
#define CMPEQ_1(a, b) \
if (*(__u8 *)(a) != *(__u8 *)(b)) return 0;
#define _CMPEQ_1(a, b) \
return *(__u8 *)(a) == *(__u8 *)(b);
//***********************************************
#define CMPEQ_2(a, b) \
if (*(__u16 *)(a) != *(__u16 *)(b)) return 0;
#define _CMPEQ_2(a, b) \
return *(__u16 *)(a) == *(__u16 *)(b);
//***********************************************
#define CMPEQ_4(a, b) \
if (*(__u32 *)(a) != *(__u32 *)(b)) return 0;
#define _CMPEQ_4(a, b) \
return *(__u32 *)(a) == *(__u32 *)(b);
//***********************************************
#define CMPEQ_8(a, b) \
if (*(__u64 *)(a) != *(__u64 *)(b)) return 0;
#define _CMPEQ_8(a, b) \
return *(__u64 *)(a) == *(__u64 *)(b);
//***********************************************
//***********************************************************************
static __inline __i32 memeq(const void *_a, const void *_b, size_t _size)
//***********************************************************************
{
    __u8 *a = (__u8 *)_a;
    __u8 *b = (__u8 *)_b;
    ptrdiff_t size = _size;
    ptrdiff_t i;

    if (!size)
        return 1;
    CMPEQ_1(a, b)
#if 1
    for (i = 0; i <= size - 16; i += 16, a += 16, b += 16)
    {
        CMPEQ_8(a + 0, b + 0)
        CMPEQ_8(a + 8, b + 8)
    }
#elif VX_SSE2
    for (i = 0; i <= size - 16; i += 16, a += 16, b += 16)
    {
        __v1i8 v0 = loadu(a); //__v1i8 and loadu() are from my library
        __v1i8 v1 = loadu(b);
        if (v0 != v1)
            return 0;
    }
#endif
    switch (size - i)
    {
    case 0:
        return 1;
    case 1:
        _CMPEQ_1(a, b)
    case 2:
        _CMPEQ_2(a, b)
    case 3:
        CMPEQ_2(a, b)
        _CMPEQ_1(a + 2, b + 2)
    case 4:
        _CMPEQ_4(a, b)
    case 5:
        CMPEQ_4(a, b)
        _CMPEQ_1(a + 4, b + 4)
    case 6:
        CMPEQ_4(a, b)
        _CMPEQ_2(a + 4, b + 4)
    case 7:
        CMPEQ_4(a, b)
        CMPEQ_2(a + 4, b + 4)
        _CMPEQ_1(a + 6, b + 6)
    case 8:
        _CMPEQ_8(a, b)
    case 9:
        CMPEQ_8(a, b)
        _CMPEQ_1(a + 8, b + 8)
    case 10:
        CMPEQ_8(a, b)
        _CMPEQ_2(a + 8, b + 8)
    case 11:
        CMPEQ_8(a, b)
        CMPEQ_2(a + 8, b + 8)
        _CMPEQ_1(a + 10, b + 10)
    case 12:
        CMPEQ_8(a, b)
        _CMPEQ_4(a + 8, b + 8)
    case 13:
        CMPEQ_8(a, b)
        CMPEQ_4(a + 8, b + 8)
        _CMPEQ_1(a + 12, b + 12)
    case 14:
        CMPEQ_8(a, b)
        CMPEQ_4(a + 8, b + 8)
        _CMPEQ_2(a + 12, b + 12)
    case 15:
        CMPEQ_8(a, b)
        CMPEQ_4(a + 8, b + 8)
        CMPEQ_2(a + 12, b + 12)
        _CMPEQ_1(a + 14, b + 14)
    } //switch
    return 1; //not reached: the switch covers 0..15 and always returns
} //memeq
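A quick usage sketch (my own example, not part of the library):

char a[40] = "The quick brown fox jumps over the dog";
char b[40] = "The quick brown fox jumps over the fog";

int eq  = memeq(a, b, sizeof(a));    //1 if equal, 0 otherwise: 0 here
int ord = _memcmp(a, b, sizeof(a));  //-1 here: 'd' < 'f' at the first difference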
The above _memcmp() and memeq() became faster due to comparing the first bytes first (the CMP_1/CMPEQ_1 check at the top), which returns early when the buffers differ right at the start.
Ravi,
Here is another option, which I will describe and let you write the code for (it should be simple enough to do):
a) Perform two compares, one for gt, the other for eq; the third outcome (lt) is anything else.
b) Perform a logical AND on each result, producing a 2-bit code (2 == gt, 1 == eq, 0 == lt), with each successive 2-bit field shifted left by 2 more bits, in either little-endian or big-endian order depending on how you organize your data. The "shift" is carried by the constant used in the AND mask.
c) Perform a logical OR on the two masked results, completing each 2-bit code.
d) Perform hadd to combine the results into a composite number (repeat as often as necessary for the width of the original input data: 64-bit, 32-bit, etc.).
What you are essentially producing is
(least significant 2-bit field) +
(next significant 2-bit field) << 2 +
(next significant 2-bit field) << 4 +
(next significant 2-bit field) << 6 +
...
You'd produce as many 2-bit results as fit within an __m128i (or __m256i) register before writing to memory, then use GP-register instructions to complete the process. A sketch of the idea is below.
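A minimal sketch of steps a) through d) for four signed 32-bit lanes (the lane width, the SSSE3 hadd, and all names here are assumptions of mine, not tested code):

#include <tmmintrin.h>  /* SSSE3: _mm_hadd_epi32 (pulls in SSE2 as well) */

/* Encode each lane as 2 == gt, 1 == eq, 0 == lt. The AND-mask constants carry
   both the encoding and the per-lane <<2 shift, so an OR plus two hadds fold
   everything into one composite 8-bit code. */
static __inline int cmp4_epi32_code(__m128i a, __m128i b)
{
    __m128i gt = _mm_cmpgt_epi32(a, b);     /* all-ones where a > b  */
    __m128i eq = _mm_cmpeq_epi32(a, b);     /* all-ones where a == b */
    __m128i r;
    gt = _mm_and_si128(gt, _mm_setr_epi32(2 << 0, 2 << 2, 2 << 4, 2 << 6));
    eq = _mm_and_si128(eq, _mm_setr_epi32(1 << 0, 1 << 2, 1 << 4, 1 << 6));
    r = _mm_or_si128(gt, eq);               /* one pre-shifted 2-bit field per lane */
    r = _mm_hadd_epi32(r, r);               /* fold 4 fields -> 2 */
    r = _mm_hadd_epi32(r, r);               /* fold 2 fields -> 1 */
    return _mm_cvtsi128_si32(r);            /* composite result in the low 8 bits */
}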
As to whether this will be faster than what is posted above, I cannot say; it will have to be tested.
Jim Dempsey
Ravi,
If you want to compare for equality only, it's better to use memeq() above.
I've updated _memcmp() to handle large memory chunks fast.