Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Beginner
380 Views

Q on memory comparison optimization

Jump to solution
Hi All, I am using AVX/SSE instructions to replace memcmp and our workload includes comparing 64 bytes and occasionally 64 and 128 bytes. I am using following function cmp32 for 32byte comparisons and extend it 2 times for 64 or 4 times for 128 bytes and I am hardly getting 1% performance improvement. Testing was done on Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, Ubuntu 14.04 x86_64. I tried replacing following lines vcmp = _mm256_cmpeq_epi64(xmm0, xmm1); vmask = _mm256_movemask_epi8(vcmp); with vcmp = _mm_xor_si128(xmm0, xmm1); result = _mm_testz_si128(vcmp, vcmp); performance numbers are same. Secondly I tried replacing unaligned loads with aligned loads and still no help. Any additional optimization that can be done to improve performance?? static inline int cmp32(const uint8_t *src_1, const uint8_t *src_2) { __m256i xmm0; __m256i xmm1; __m256i vcmp; int64_t vmask; xmm0 = _mm256_loadu_si256((const __m256i *)src_1); xmm1 = _mm256_loadu_si256((const __m256i *)src_2); vcmp = _mm256_cmpeq_epi64(xmm0, xmm1); vmask = _mm256_movemask_epi8(vcmp); if (likely(vmask == 0xffffffff)) { return 0; } else { vcmp = _mm256_cmpgt_epi64(xmm0, xmm1); vmask = _mm256_movemask_epi8(vcmp); if (vmask == 0xffffffff) return 1; else return -1; } }
0 Kudos
51 Replies
New Contributor I
53 Views

Ravi,

If you want to compare for equality only, it's better to use memeq() above.
I've updated _memcmp() to handle large memory chunks fast.

0 Kudos
Beginner
53 Views
Thanks Vladimir for sharing your code, will look into it during my free time. As of now, I am mostly interested in SIMD comparisons and one thing that struck me from your code is vector comparison. __m128i idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0); __m128i v0 = _mm_loadu_si128(a); __m128i v1 = _mm_loadu_si128(b); if (v0 != v1) { ... } I couldn't find any intrinsics function which can return int hence have to use _xor_ and _testz_. Did you avoid using _xor_ and _testz_, if yes, could you please share your knowledge. Pseudo code I am using for testing is below /* compares 16 bytes*/ static inline int my_cmp16(const void *src_1, const void *src_2) { __m128i xmm0, xmm1, xmm2; xmm0 = _mm_lddqu_si128((const __m128i *)src_1); xmm1 = _mm_lddqu_si128((const __m128i *)src_2); xmm2 = _mm_xor_si128(xmm0, xmm1); ... } /* compares 0 to 15 bytes */ static inline int my_memcmp_regular(const uint8_t *src_1u, const uint8_t *src_2u, size_t n) { int ret = 1; /** * Compare less than 16 bytes */ if (n & 0x08) { ret = (*(const uint64_t *)src_1u == *(const uint64_t *)src_2u); if ((ret != 1)) goto exit_8; n -= 0x8; src_1u += 0x8; src_2u += 0x8; } ... } static inline int my_cmp32(const void *src_1, const void *src_2) { my_cmp16(...); my_cmp16(...) } static inline int my_cmp48(const void *src_1, const void *src_2) { my_cmp16(...); my_cmp16(...); my_cmp16(...); } Similarly I have for 64, 128 and 256 bytes comparison. static inline int my_memcmp(const void *_src_1, const void *_src_2, size_t n) { const uint8_t *src_1 = (const uint8_t *)_src_1; const uint8_t *src_2 = (const uint8_t *)_src_2; int ret = 0; if (n < 16) return my_memcmp_regular(src_1, src_2, n); if (n <= 32) { ret = my_cmp16(src_1, src_2); if (unlikely(ret != 0)) return ret; return my_cmp16(src_1 - 16 + n, src_2 - 16 + n); } if (n <= 48) { ret = my_cmp32(src_1, src_2); if (unlikely(ret != 0)) return ret; return my_cmp16(src_1 - 16 + n, src_2 - 16 + n); } if (n <= 64) { ret = my_cmp32(src_1, src_2); if (unlikely(ret != 0)) return ret; ret = my_cmp16(src_1 + 32, src_2 + 32); if (unlikely(ret != 0)) return ret; return my_cmp16(src_1 - 16 + n, src_2 - 16 + n); } ... } Thanks.
0 Kudos
New Contributor I
53 Views

Ravi,

(v0 != v1)
could be:

(_mm_movemask_epi8(_mm_cmpeq_epi8(v0, v1)) != 0xFFFF)

 

0 Kudos
New Contributor III
53 Views

I have finally run some benchmarks to compare my SIMD-based solution vs. bswap-based one.

TL;DR: bswap-based can be faster than SIMD but only when compiler has a good optimizer and you hit the fast path. Otherwise SIMD is faster.

I was experimenting on 16-byte data, which is what `boost::uuid` is. The following are the test subjects:

mem_equal, mem_less - The reference functions that use std::memcmp to compare the two UUIDs for equality and less.
simd_equal, simd_less - The functions that use SSE2 for comparing the UUIDs. This is the current implementation on Boost.UUID (and simd_less is equivalent to the one I provided in the comment #2).
reg_equal - The function that compares the UUIDs for equality by loading and comparing 2 pairs of uint64_t comrising the UUIDs data. The compilers use general redisters to implement that (I checked the disassembler).
reg_less, reg32_less - Use bswap instruction to swap the loaded words and compare for less order. reg_less uses 64-bit loads and bswap, reg32_less uses 32-bit ones (and consequently does twice as many comparisons). Basically, this is Vladimir's version.

All above functions receive two UUIDs and return bool as the result. Each test subject was tested 4 times:

  • With 2 different UUIDs placed on the stack (and likely close to each other)
  • With 2 different UUIDs placed on the heap
  • With 2 different but equal UUIDs placed on the stack
  • With 2 different but equal UUIDs placed on the heap

For each test, duration of 500000000 comparisons was measured, as well as an approximate number of ticks per one call (this I don't consider accurate). The hardware is a Sandy Bridge CPU. You can find the test code in attach.

 

Here are the results:

gcc 4.9.2, Linux, 64-bit target (g++ -g2 -O3 -I. -o uuid_operators uuid_operators.cpp)

    Values placed on stack:
    mem_equal        duration: 2909054936 ns, clocks per iteration: 54
    simd_equal       duration: 703210732 ns, clocks per iteration: 30
    reg_equal        duration: 701833340 ns, clocks per iteration: 30
    mem_less         duration: 2805026359 ns, clocks per iteration: 57
    simd_less        duration: 924337219 ns, clocks per iteration: 30
    reg_less         duration: 935014123 ns, clocks per iteration: 30
    reg32_less       duration: 704345068 ns, clocks per iteration: 27

    Values placed on heap:
    mem_equal        duration: 2803484224 ns, clocks per iteration: 64
    simd_equal       duration: 702860158 ns, clocks per iteration: 37
    reg_equal        duration: 700742185 ns, clocks per iteration: 30
    mem_less         duration: 2806557934 ns, clocks per iteration: 64
    simd_less        duration: 920703610 ns, clocks per iteration: 30
    reg_less         duration: 935851585 ns, clocks per iteration: 30
    reg32_less       duration: 704654313 ns, clocks per iteration: 30

    Same values placed on stack:
    mem_equal        duration: 1877397010 ns, clocks per iteration: 51
    simd_equal       duration: 700662250 ns, clocks per iteration: 27
    reg_equal        duration: 932869196 ns, clocks per iteration: 37
    mem_less         duration: 1882268763 ns, clocks per iteration: 51
    simd_less        duration: 925608825 ns, clocks per iteration: 37
    reg_less         duration: 822508496 ns, clocks per iteration: 30
    reg32_less       duration: 1289300793 ns, clocks per iteration: 33

    Same values placed on heap:
    mem_equal        duration: 1875647450 ns, clocks per iteration: 73
    simd_equal       duration: 700736792 ns, clocks per iteration: 30
    reg_equal        duration: 932544893 ns, clocks per iteration: 37
    mem_less         duration: 1889143565 ns, clocks per iteration: 58
    simd_less        duration: 922657593 ns, clocks per iteration: 37
    reg_less         duration: 825816472 ns, clocks per iteration: 30
    reg32_less       duration: 1292292925 ns, clocks per iteration: 55

gcc 4.9.2, Linux, 32-bit target (g++ -m32 -msse2 -g2 -O3 -I. -o uuid_operators32 uuid_operators.cpp)

    Values placed on stack:
    mem_equal        duration: 3003129535 ns, clocks per iteration: 55
    simd_equal       duration: 815852105 ns, clocks per iteration: 40
    reg_equal        duration: 957879657 ns, clocks per iteration: 33
    mem_less         duration: 2909670077 ns, clocks per iteration: 45
    simd_less        duration: 1351182395 ns, clocks per iteration: 36
    reg_less         duration: 1702336738 ns, clocks per iteration: 49
    reg32_less       duration: 816181695 ns, clocks per iteration: 33

    Values placed on heap:
    mem_equal        duration: 2910065687 ns, clocks per iteration: 55
    simd_equal       duration: 816033224 ns, clocks per iteration: 33
    reg_equal        duration: 958156388 ns, clocks per iteration: 33
    mem_less         duration: 2909768581 ns, clocks per iteration: 45
    simd_less        duration: 1351213470 ns, clocks per iteration: 36
    reg_less         duration: 1702484718 ns, clocks per iteration: 45
    reg32_less       duration: 816183720 ns, clocks per iteration: 33

    Same values placed on stack:
    mem_equal        duration: 2544227736 ns, clocks per iteration: 48
    simd_equal       duration: 815765735 ns, clocks per iteration: 40
    reg_equal        duration: 1381228041 ns, clocks per iteration: 33
    mem_less         duration: 2472541973 ns, clocks per iteration: 55
    simd_less        duration: 1351399333 ns, clocks per iteration: 36
    reg_less         duration: 2419782306 ns, clocks per iteration: 48
    reg32_less       duration: 1456508305 ns, clocks per iteration: 39

    Same values placed on heap:
    mem_equal        duration: 2536636251 ns, clocks per iteration: 48
    simd_equal       duration: 815715163 ns, clocks per iteration: 37
    reg_equal        duration: 1381799951 ns, clocks per iteration: 33
    mem_less         duration: 2474383405 ns, clocks per iteration: 55
    simd_less        duration: 1351475715 ns, clocks per iteration: 36
    reg_less         duration: 2419964420 ns, clocks per iteration: 48
    reg32_less       duration: 1451296170 ns, clocks per iteration: 36

gcc 4.4.7, Linux, 64-bit target (g++-4.4 -g2 -O3 -I. -o uuid_operators-4.4 uuid_operators.cpp)

    Values placed on stack:
    mem_equal        duration: 6495902852 ns, clocks per iteration: 63
    simd_equal       duration: 701597240 ns, clocks per iteration: 27
    reg_equal        duration: 704913667 ns, clocks per iteration: 27
    mem_less         duration: 6490310490 ns, clocks per iteration: 63
    simd_less        duration: 941798322 ns, clocks per iteration: 37
    reg_less         duration: 700693695 ns, clocks per iteration: 30
    reg32_less       duration: 703242622 ns, clocks per iteration: 178

    Values placed on heap:
    mem_equal        duration: 6377764566 ns, clocks per iteration: 70
    simd_equal       duration: 703149156 ns, clocks per iteration: 27
    reg_equal        duration: 702795927 ns, clocks per iteration: 34
    mem_less         duration: 6505660554 ns, clocks per iteration: 63
    simd_less        duration: 938937460 ns, clocks per iteration: 30
    reg_less         duration: 701984104 ns, clocks per iteration: 30
    reg32_less       duration: 701489491 ns, clocks per iteration: 21

    Same values placed on stack:
    mem_equal        duration: 10366218787 ns, clocks per iteration: 97
    simd_equal       duration: 702945481 ns, clocks per iteration: 30
    reg_equal        duration: 932854987 ns, clocks per iteration: 37
    mem_less         duration: 10573932471 ns, clocks per iteration: 97
    simd_less        duration: 942492857 ns, clocks per iteration: 30
    reg_less         duration: 1051130774 ns, clocks per iteration: 36
    reg32_less       duration: 1058324348 ns, clocks per iteration: 42

    Same values placed on heap:
    mem_equal        duration: 10366752789 ns, clocks per iteration: 97
    simd_equal       duration: 700994832 ns, clocks per iteration: 30
    reg_equal        duration: 932596190 ns, clocks per iteration: 37
    mem_less         duration: 10558350284 ns, clocks per iteration: 90
    simd_less        duration: 939857638 ns, clocks per iteration: 30
    reg_less         duration: 1059823697 ns, clocks per iteration: 40
    reg32_less       duration: 1064250646 ns, clocks per iteration: 39

gcc 4.4.7, Linux, 32-bit target (g++-4.4 -m32 -msse2 -g2 -O3 -I. -o uuid_operators32-4.4 uuid_operators.cpp)

    Values placed on stack:
    mem_equal        duration: 7035034313 ns, clocks per iteration: 54
    simd_equal       duration: 814800509 ns, clocks per iteration: 30
    reg_equal        duration: 1051117502 ns, clocks per iteration: 28
    mem_less         duration: 7372365518 ns, clocks per iteration: 72
    simd_less        duration: 1270579133 ns, clocks per iteration: 36
    reg_less         duration: 1648952515 ns, clocks per iteration: 36
    reg32_less       duration: 787097322 ns, clocks per iteration: 30

    Values placed on heap:
    mem_equal        duration: 6922481102 ns, clocks per iteration: 69
    simd_equal       duration: 814379042 ns, clocks per iteration: 30
    reg_equal        duration: 1050458651 ns, clocks per iteration: 30
    mem_less         duration: 7271476652 ns, clocks per iteration: 79
    simd_less        duration: 1269998969 ns, clocks per iteration: 30
    reg_less         duration: 1634621334 ns, clocks per iteration: 43
    reg32_less       duration: 788067392 ns, clocks per iteration: 30

    Same values placed on stack:
    mem_equal        duration: 10531383059 ns, clocks per iteration: 100
    simd_equal       duration: 814549940 ns, clocks per iteration: 30
    reg_equal        duration: 1507881322 ns, clocks per iteration: 30
    mem_less         duration: 10893784524 ns, clocks per iteration: 106
    simd_less        duration: 1270398106 ns, clocks per iteration: 30
    reg_less         duration: 2155156482 ns, clocks per iteration: 60
    reg32_less       duration: 1477907964 ns, clocks per iteration: 40

    Same values placed on heap:
    mem_equal        duration: 10530000955 ns, clocks per iteration: 97
    simd_equal       duration: 814384822 ns, clocks per iteration: 37
    reg_equal        duration: 1582379702 ns, clocks per iteration: 61
    mem_less         duration: 10892357714 ns, clocks per iteration: 96
    simd_less        duration: 1269953983 ns, clocks per iteration: 30
    reg_less         duration: 2155506224 ns, clocks per iteration: 60
    reg32_less       duration: 1478063429 ns, clocks per iteration: 33

clang 3.6.0, Linux, 64-bit target (clang -stdlib=libstdc++ -g2 -O3 -I. -o uuid_operators-clang uuid_operators.cpp -lstdc++)

    Values placed on stack:
    mem_equal        duration: 2915831378 ns, clocks per iteration: 57
    simd_equal       duration: 702978963 ns, clocks per iteration: 27
    reg_equal        duration: 932846599 ns, clocks per iteration: 30
    mem_less         duration: 2801806046 ns, clocks per iteration: 64
    simd_less        duration: 903058419 ns, clocks per iteration: 30
    reg_less         duration: 935835507 ns, clocks per iteration: 33
    reg32_less       duration: 931503301 ns, clocks per iteration: 33

    Values placed on heap:
    mem_equal        duration: 2801707798 ns, clocks per iteration: 54
    simd_equal       duration: 699887194 ns, clocks per iteration: 30
    reg_equal        duration: 931113078 ns, clocks per iteration: 30
    mem_less         duration: 2798621990 ns, clocks per iteration: 64
    simd_less        duration: 900708358 ns, clocks per iteration: 37
    reg_less         duration: 931509307 ns, clocks per iteration: 33
    reg32_less       duration: 932174809 ns, clocks per iteration: 40

    Same values placed on stack:
    mem_equal        duration: 1869707957 ns, clocks per iteration: 51
    simd_equal       duration: 732869633 ns, clocks per iteration: 27
    reg_equal        duration: 698069511 ns, clocks per iteration: 37
    mem_less         duration: 1861519606 ns, clocks per iteration: 48
    simd_less        duration: 894058684 ns, clocks per iteration: 37
    reg_less         duration: 831131467 ns, clocks per iteration: 30
    reg32_less       duration: 949342085 ns, clocks per iteration: 30

    Same values placed on heap:
    mem_equal        duration: 1861275241 ns, clocks per iteration: 58
    simd_equal       duration: 698240258 ns, clocks per iteration: 30
    reg_equal        duration: 700417340 ns, clocks per iteration: 30
    mem_less         duration: 1885464546 ns, clocks per iteration: 58
    simd_less        duration: 894038482 ns, clocks per iteration: 30
    reg_less         duration: 829288790 ns, clocks per iteration: 30
    reg32_less       duration: 930727650 ns, clocks per iteration: 30

The following were run on a virtual machine running on the same hardware Linux host. Do not compare timings to the above results.

MSVC14 (VS2015), Windows, 64-bit target (cl -EHsc -FAcs -I. -O2 uuid_operators.cpp)

    Values placed on stack:
    mem_equal        duration: 1132779724 ns, clocks per iteration: 91
    simd_equal       duration: 1051041682 ns, clocks per iteration: 24
    reg_equal        duration: 1324786530 ns, clocks per iteration: 91
    mem_less         duration: 2401681498 ns, clocks per iteration: 169
    simd_less        duration: 1137397350 ns, clocks per iteration: 115
    reg_less         duration: 1022363456 ns, clocks per iteration: 24
    reg32_less       duration: 823214402 ns, clocks per iteration: 85

    Values placed on heap:
    mem_equal        duration: 1279553406 ns, clocks per iteration: 160
    simd_equal       duration: 995784101 ns, clocks per iteration: 24
    reg_equal        duration: 1071507691 ns, clocks per iteration: 82
    mem_less         duration: 2537630620 ns, clocks per iteration: 166
    simd_less        duration: 1107802527 ns, clocks per iteration: 21
    reg_less         duration: 978329089 ns, clocks per iteration: 30
    reg32_less       duration: 924786809 ns, clocks per iteration: 87

    Same values placed on stack:
    mem_equal        duration: 922507748 ns, clocks per iteration: 91
    simd_equal       duration: 939009846 ns, clocks per iteration: 93
    reg_equal        duration: 969571272 ns, clocks per iteration: 117
    mem_less         duration: 2920631812 ns, clocks per iteration: 300
    simd_less        duration: 1396546767 ns, clocks per iteration: 33
    reg_less         duration: 1357923143 ns, clocks per iteration: 90
    reg32_less       duration: 1421889653 ns, clocks per iteration: 196

    Same values placed on heap:
    mem_equal        duration: 868460935 ns, clocks per iteration: 133
    simd_equal       duration: 973166142 ns, clocks per iteration: 67
    reg_equal        duration: 1060500426 ns, clocks per iteration: 27
    mem_less         duration: 2873302053 ns, clocks per iteration: 360
    simd_less        duration: 1114337995 ns, clocks per iteration: 106
    reg_less         duration: 1426126784 ns, clocks per iteration: 27
    reg32_less       duration: 1326662187 ns, clocks per iteration: 121

MSVC14 (VS2015), Windows, 32-bit target (cl -EHsc -FAcs -O2 -I. uuid_operators.cpp)

    Values placed on stack:
    mem_equal        duration: 1097685040 ns, clocks per iteration: 139
    simd_equal       duration: 899783352 ns, clocks per iteration: 52
    reg_equal        duration: 948021047 ns, clocks per iteration: 96
    mem_less         duration: 1709203544 ns, clocks per iteration: 77
    simd_less        duration: 1676794955 ns, clocks per iteration: 76
    reg_less         duration: 1515213805 ns, clocks per iteration: 130
    reg32_less       duration: 1186983541 ns, clocks per iteration: 66

    Values placed on heap:
    mem_equal        duration: 1024332142 ns, clocks per iteration: 90
    simd_equal       duration: 1310634172 ns, clocks per iteration: 61
    reg_equal        duration: 957639308 ns, clocks per iteration: 228
    mem_less         duration: 1338495255 ns, clocks per iteration: 85
    simd_less        duration: 1698868990 ns, clocks per iteration: 55
    reg_less         duration: 1337584804 ns, clocks per iteration: 91
    reg32_less       duration: 1092318437 ns, clocks per iteration: 91

    Same values placed on stack:
    mem_equal        duration: 2074361685 ns, clocks per iteration: 87
    simd_equal       duration: 930216829 ns, clocks per iteration: 45
    reg_equal        duration: 1460527245 ns, clocks per iteration: 136
    mem_less         duration: 1961324972 ns, clocks per iteration: 87
    simd_less        duration: 1751332920 ns, clocks per iteration: 46
    reg_less         duration: 1815786363 ns, clocks per iteration: 108
    reg32_less       duration: 1760446090 ns, clocks per iteration: 152

    Same values placed on heap:
    mem_equal        duration: 1880232543 ns, clocks per iteration: 84
    simd_equal       duration: 851140577 ns, clocks per iteration: 91
    reg_equal        duration: 1362192401 ns, clocks per iteration: 130
    mem_less         duration: 2043999726 ns, clocks per iteration: 224
    simd_less        duration: 1761549303 ns, clocks per iteration: 64
    reg_less         duration: 1924742111 ns, clocks per iteration: 136
    reg32_less       duration: 1985236671 ns, clocks per iteration: 151

I have also run a few tests on a Nehalem machine but the general disposition of the results was similar.

Analysis:

  1. reg_equal does not perform faster than simd_equal, and particularly in 32-bit tests it performs worse.
  2. The 64-bit bswap has greater latency than 32-bit, and on some CPUs also the worse throughput. I believe that's the reason why in most 64-bit test results reg_less performed slower than reg32_less.
  3. reg_less and reg32_less performance largely depends on the compiler. gcc and MSVC can optimize reg_less to the point where it is close or slightly faster than simd_less, but only when the fast path is used (i.e. when the function returns after the first comparison). However that is not the case with clang (I suspect, this is because of inefficient code ordering - it generated a single ret at the end of the function, with conditional jumps to it from the middle, whereas gcc produced ret right after the first condition).
  4. reg32_less typically performs better than simd_less but only when you hit the fast path. Most compilers produced the code where the fast path was when the input UUIDs are different. When UUIDs are the same the performance drops below to that of simd_less. reg_less has the same property but to a lesser extent since the difference between the fast and slow paths is less pronounced.
  5. For 32-bit targets reg_less is generally slower than reg32_less because the compiler still loads 8 bytes of data from each UUID and performs 4 bswaps per comparison, even if the comparison fails. This reduces the effect of the fast path compared to reg32_less.

In general I'd say simd_less showed more consistent results across the board. As much as reg32_less can be faster, it can also be slower, so it's not a clear cut. reg_less has more stable performance but it is only ever beneficial on 64-bit targets.

 

0 Kudos
New Contributor I
53 Views

Андрей,

I understand this suggestions are too tricky, nevertheless:

1. You can first compare only the first bytes. If they are different (with probability 255/256) in case of "equal" that's it.
2. You can set your own rule for "less" and don't swap bytes at all.

 

0 Kudos
New Contributor III
53 Views

1. You can first compare only the first bytes. If they are different (with probability 255/256) in case of "equal" that's it.

Yes, I tried that for the simd_less algorithm. Indeed, this fast path speeds up the case when the input UUIDs are different (in this case it outperforms reg_less and reg32_less and is the fastest). But the branch kills performance when the slow path is taken - it becomes slightly slower than reg_less (and much slower than the original simd_less as well). I think the additional branch is taking too much clocks to process, especially in case of mispredictions.

2. You can set your own rule for "less" and don't swap bytes at all.

Yes, that would be ideal, and I would have done that if I was not constrained with the legacy behavior which used std::lexicographical_compare. Enabling optimizations should not change the behavior.

 

0 Kudos
New Contributor I
53 Views

1. Average time with random UUIDs is:
  time = (time_1*255 + time_not_1*1) / 256
so, time_not_1 is negligible.

2. There's an old method to cope with the problem: add "_" prefix to the names.
 

0 Kudos
53 Views

Ravi,

In re-reviewing this thread, and taking into consideration your response to my first reply ("100% of comparisons are for equality (i.e. check for "== 0") in our code").... there comes to mind some additional alternatives.

Assuming your key comes in as an unknown and your function is to see if the unknown key exists in a list of known keys. A possible technique is in knowing the characteristic pattern of the unknown input key and known stored keys is to determine an optimal number of bytes per key to compare such that multiple keys can be packed into a 256-bit entry of a search table. Example: assuming 4 bytes of the incoming keys vary the most, then you would pack 8 copies of the unknown key into each int32 lane of the ymm register used for search, then each read of the condensed table can perform 8 tests at once. Your first level table would have no duplicates. You would have to work something out for handling duplicates when a hit occurs and you go on to the second 4 bytes (or more bytes) for the second tier table. You have representative data, so you can write a program to determine the optimal packing (including ten 3-byte key sub-fields, or ten 3-byte plus one 2-byte).

This would necessitate NOT using something like cmp32, cmp64 etc, and replace it with find_entry(key_t* unknown) which returns -1 for not found or index of key when found. Note, if you want to put extra effort in, the class/struct could determine the optimal keytable sub-key selection (1 to 32 bytes, extended to 64 bytes for KNC/KNL/AVX512)

Jim Dempsey

0 Kudos
Beginner
53 Views

Thanks Jim for your inputs. Thanks andysem/vladimir for reporting extensive test results on your implementation. My case is primarily for flow (layer-2 mac + layer-3 ipv4/ipv6 + layer-4 tcp/udp ports) and i think perf will benefit using simd. 

0 Kudos
New Contributor III
53 Views

Vladimir Sedach wrote:

1. Average time with random UUIDs is:
  time = (time_1*255 + time_not_1*1) / 256
so, time_not_1 is negligible.

That is true, as long as you expect most of the UUIDs to be different. That's probably a fair expectation, but I'm not sure it's justified enough to code it into a generic component, such as 'uuid'.

I tested the modified version of simd_less with a fast path in a few of my projects. Adding a branch for the fast path often results in compiler splitting the ordering function into two parts - the fast path, which gets inlined, and the rest, which is not inlined. This makes the difference between the fast and slow paths even mode significant. The branchless version always gets inlined completely.

Vladimir Sedach wrote:

2. There's an old method to cope with the problem: add "_" prefix to the names.

Unfortunately, that's not an option for me. The operation is actually spelled as 'operator<' and is required to be equivalent to lexicographical_compare. Custom comparison functions are possible, of course, but my aim is to optimize the generic library implementation.

 

0 Kudos
New Contributor I
53 Views

Andrey,

1. Could you use
    #define BOOST_..._EXPECT_DIFFERENT

2. Perhaps also
    #define BOOST_..._NON_LEXICOGRAPHICAL_COMPARE

before #include to allow the user to choose standard/nonstandard approach.

0 Kudos
New Contributor III
53 Views

No, config macros are generally evil as they can easily lead to ODR violations and other subtle issues.

 

0 Kudos