Solved: Q on memory comparison optimization

Ravi_K_ · ‎04-24-2015

Hi All, I am using AVX/SSE instructions to replace memcmp and our workload includes comparing 64 bytes and occasionally 64 and 128 bytes. I am using following function cmp32 for 32byte comparisons and extend it 2 times for 64 or 4 times for 128 bytes and I am hardly getting 1% performance improvement. Testing was done on Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, Ubuntu 14.04 x86_64. I tried replacing following lines vcmp = _mm256_cmpeq_epi64(xmm0, xmm1); vmask = _mm256_movemask_epi8(vcmp); with vcmp = _mm_xor_si128(xmm0, xmm1); result = _mm_testz_si128(vcmp, vcmp); performance numbers are same. Secondly I tried replacing unaligned loads with aligned loads and still no help. Any additional optimization that can be done to improve performance?? static inline int cmp32(const uint8_t *src_1, const uint8_t *src_2) { __m256i xmm0; __m256i xmm1; __m256i vcmp; int64_t vmask; xmm0 = _mm256_loadu_si256((const __m256i *)src_1); xmm1 = _mm256_loadu_si256((const __m256i *)src_2); vcmp = _mm256_cmpeq_epi64(xmm0, xmm1); vmask = _mm256_movemask_epi8(vcmp); if (likely(vmask == 0xffffffff)) { return 0; } else { vcmp = _mm256_cmpgt_epi64(xmm0, xmm1); vmask = _mm256_movemask_epi8(vcmp); if (vmask == 0xffffffff) return 1; else return -1; } }

Vladimir_Sedach · ‎05-02-2015

Hi Ravi,

You're actually measuring the time of sprintf() and rand()
since they are *much* slower than my_cmp64() or memcmp().

In case of memcmp() compiler does not call it at all -- it knows its result in advance.
That's why memcmp() is "faster".

I've tested my_cmp64() with the code below. It is about 10 time as fast as memcmp().

Please note the line:
char * volatile src = _src;

volatile makes compiler to really call my_cmp64() or memcmp() and not to optimize them out.

Call with, say, (src + 1) to test unaligned access.
===================================================

__inline int my_cmp64(const void* src_1, const void* src_2)
{
const __m256i* src1 = (const __m256i*)src_1;
const __m256i* src2 = (const __m256i*)src_2;

__m256i mm11 = _mm256_lddqu_si256(src1);
__m256i mm12 = _mm256_lddqu_si256(src1 + 1);

__m256i mm21 = _mm256_lddqu_si256(src2);
__m256i mm22 = _mm256_lddqu_si256(src2 + 1);

   __m256i mm1 = _mm256_xor_si256(mm11, mm21);
   __m256i mm2 = _mm256_xor_si256(mm12, mm22);
   __m256i mm = _mm256_or_si256(mm1, mm2);

return !_mm256_testz_si256(mm, mm);
}

void test_cmp()
{
   __declspec(align(64)) char   _src[200];
   char * volatile   src = _src;
   double   time;
   int       i, result = 0;

for (i = 0; i < sizeof(_src); i++)
_src = i;

time = vx_time();
for (i = 0; i < 100 * 1000 * 1000; i++)
result += my_cmp64(src, src);
// result += memcmp(src, src, 64);

pl("time: %.3f, %d", vx_time(time), result);
}

View solution in original post

andysem · ‎04-25-2015

If you tend to compare large regions I would recommend trying to amortize for vptest latency. You can perform multiple XORs and then combine the results with ORs and perform a single test on the combined result.

int my_memcmp(const void* src_1, const void* src_2, size_t size)
{
    const __m256i* src1 = (const __m256i*)src_1;
    const __m256i* src2 = (const __m256i*)src_2;
    const size_t n = size / 64u;

    for (size_t i = 0; i < n; ++i, src1 += 2, src2 += 2)
    {
        __m256i mm11 = _mm256_lddqu_si256(src1);
        __m256i mm12 = _mm256_lddqu_si256(src1 + 1);
        __m256i mm21 = _mm256_lddqu_si256(src2);
        __m256i mm22 = _mm256_lddqu_si256(src2 + 1);

        __m256i mm1 = _mm256_xor_si256(mm11, mm21);
        __m256i mm2 = _mm256_xor_si256(mm12, mm22);

        __m256i mm = _mm256_or_si256(mm1, mm2);

        if (unlikely(!_mm256_testz_si256(mm, mm)))
        {
            // Find out which of the two 32-byte blocks are different
            if (_mm256_testz_si256(mm1, mm1))
            {
                mm11 = mm12;
                mm21 = mm22;
                mm1 = mm2;
            }

            // Produce the comparison result
            __m256i mm_cmp = _mm256_cmpgt_epi8(mm21, mm11);
            __m256i mm_rcmp = _mm256_cmpgt_epi8(mm11, mm21);

            mm_cmp = _mm256_xor_si256(mm1, mm_cmp);
            mm_rcmp = _mm256_xor_si256(mm1, mm_rcmp);

            uint32_t cmp = _mm256_movemask_epi8(mm_cmp);
            uint32_t rcmp = _mm256_movemask_epi8(mm_rcmp);

            cmp = (cmp - 1u) ^ cmp;
            rcmp = (rcmp - 1u) ^ rcmp;

            return (int32_t)rcmp - (int32_t)cmp;
        }
    }

    // Compare tail bytes, if needed

    return 0;
}

Ravi_K_ · ‎04-27-2015

Thanks Andysem, will try out shortly and get back with results.

Ravi_K_ · ‎04-30-2015

I tested with new code and for 64 bytes workload I get 1 ~ 2 % performance improvement and for 128 bytes workload I get ~3% improvement. Not sure if performance is limited by the system I have. I am testing on Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, DDR3 16G, Quad-core with hyperthreading enabled, Ubuntu 14.04 x86_64. Thanks.

jimdempseyatthecove · ‎05-01-2015

Ravi,

In the initial post example, the return value indicates

0 == all equal
1 == all greater
-1== don't know if value is greater or lesser (as there is a mix of greater, lesser, and equal).

What are the probability of the returned values? You might want to tune your tests and return conditions based on this.

Note, arbitrary string matches (unsorted text), would almost always return -1. The you would need to perform additional tests to determine if the text string is greater or lesser.

If this is the case (-1 preponderance and leading to additional test), then I recommend performing the cmpgt and cmplt tests, running the movemask on each, then performing an unsigned 32-bit compare, the greater value will indicate which cmp?t was larger (if 0 then the two were equal). This would eliminate the cmpeq test as well as eliminating additional processing as you may be doing now when -1 is returned.

Jim Dempsey

Ravi_K_ · ‎05-01-2015

Jim, Thanks for your thoughtful response. 100% of comparisons are for equality (i.e. check for "== 0") in our code. I modified original code and wrote a sample test code as shown below to measure ticks and frankly AVX2 doesn't achieve any better. Testing was done on Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, DDR3 16G, Quad-core with hyperthreading enabled, Ubuntu 14.04 x86_64. Not sure whether code is an issue or compilation has to be done differently or its a bottleneck in the system. #include #include #include #include #include #include #include static inline int my_cmp64 (const void* src_1, const void* src_2) { const __m256i* src1 = (const __m256i*)src_1; const __m256i* src2 = (const __m256i*)src_2; bool result; __m256i mm11 = _mm256_lddqu_si256(src1); __m256i mm12 = _mm256_lddqu_si256(src1 + 1); __m256i mm21 = _mm256_lddqu_si256(src2); __m256i mm22 = _mm256_lddqu_si256(src2 + 1); __m256i mm1 = _mm256_xor_si256(mm11, mm21); __m256i mm2 = _mm256_xor_si256(mm12, mm22); __m256i mm = _mm256_or_si256(mm1, mm2); result = _mm256_testz_si256(mm, mm); return !result; } int main(void) { int i; char src[64]; int dupes = 0; int start = times(NULL); for (i=0; i<1024 * 1024 * 1024; i++) { int num = rand(); snprintf(src, 64, "%d", num); if (!my_cmp64((void *)src, (void *)src)) { dupes++; } num = rand(); } int ticks = times(NULL) - start; printf("Time: %d ticks (%d memcmp/tick)\n", ticks, dupes/ticks); return 0; } compiled via "gcc -mavx2 -O3 test-memcmp-64.c" # ./a.out Time: 8587 ticks (125042 memcmp/tick) # # ./a.out Time: 8629 ticks (124434 memcmp/tick) # Replaced my_cmp64 with memcmp in above code and compiled with "gcc -O3 test-memcmp-64.c" results are # ./a.out Time: 8505 ticks (126248 memcmp/tick) # ./a.out Time: 8505 ticks (126248 memcmp/tick) # Thanks.

Vladimir_Sedach · ‎05-02-2015

Hi Ravi,

You're actually measuring the time of sprintf() and rand()
since they are *much* slower than my_cmp64() or memcmp().

In case of memcmp() compiler does not call it at all -- it knows its result in advance.
That's why memcmp() is "faster".

I've tested my_cmp64() with the code below. It is about 10 time as fast as memcmp().

Please note the line:
char * volatile src = _src;

volatile makes compiler to really call my_cmp64() or memcmp() and not to optimize them out.

Call with, say, (src + 1) to test unaligned access.
===================================================

__inline int my_cmp64(const void* src_1, const void* src_2)
{
const __m256i* src1 = (const __m256i*)src_1;
const __m256i* src2 = (const __m256i*)src_2;

__m256i mm11 = _mm256_lddqu_si256(src1);
__m256i mm12 = _mm256_lddqu_si256(src1 + 1);

__m256i mm21 = _mm256_lddqu_si256(src2);
__m256i mm22 = _mm256_lddqu_si256(src2 + 1);

   __m256i mm1 = _mm256_xor_si256(mm11, mm21);
   __m256i mm2 = _mm256_xor_si256(mm12, mm22);
   __m256i mm = _mm256_or_si256(mm1, mm2);

return !_mm256_testz_si256(mm, mm);
}

void test_cmp()
{
   __declspec(align(64)) char   _src[200];
   char * volatile   src = _src;
   double   time;
   int       i, result = 0;

for (i = 0; i < sizeof(_src); i++)
_src = i;

time = vx_time();
for (i = 0; i < 100 * 1000 * 1000; i++)
result += my_cmp64(src, src);
// result += memcmp(src, src, 64);

pl("time: %.3f, %d", vx_time(time), result);
}

jimdempseyatthecove · ‎05-03-2015

Vladimir is spot on.

If you want to test the performance of a function, a) do not include the setup/initialization in the timing, and b) make sure that the compiler does not out smart you by eliminating code that appears to produce unused results.

Situation a) when not observed, is particularly puzzling when testing parallel programmers because most noobs are not aware that random number generators are serializing (have critical section). And this throws off any timing runs.

Jim Dempsey

Ravi_K_ · ‎05-04-2015

Thank you Vladimir and Jim for your help. I was trying to mimic our test suite. "Volatile" keyword did all the difference for true performance comparison. I will test it out on our test suite and come back if I have additional questions. Thanks again for your help.

Christian_M_2 · ‎05-04-2015

As to random numbers: I always create small test vectors and fill them with random data and then start the test.Results can be stored as you said in volatile variables. I never heard of a compiler eliminating volatile results, but the other way it could be. Secondly, check the generated assembly, nontheless whether results are expected or unexpected. The interesting part you will code, should not be to hard to understand with some training.

It might also be a good idea to run the test several times. Especially, if this is the real usage scenario later. Use the same data again, this way you get some cache warm up and if data is small enought to fit in L1 you will eliminate memory bandwith at least to a certain extent.

Vladimir_Sedach · ‎05-05-2015

Christian,

It's important that volatile is the attribute of a pointer, not variable or array it points to.
The pointer actually might never be changed.
This is much better because one don't need to care much about the attribute and contents of the variable.
Besides that, VC doesn't allow volatile __mXXX vars.
Compiler cannot skip access with such a pointer since it can't predict its value.
The only thing remaining is you need to print some result of calculation.

Unfortunately, I figured out that just recently. Before that, I also used volatile vars.

Ravi_K_ · ‎05-05-2015

Agree with Vladimir, volatile pointer makes all the difference here. Compiler emits different code for volatile and non-volatile pointer access. In some of our test-cases, volatile-pointer + AVX performs better compared to non-volatile-pointer + gcc optimized memcmp.

Christian_M_2 · ‎05-06-2015

Vladimir,

thanks for the info!

I read my post again, and realized I wrote a little bit unclear what I meant: I assumed a case where you have several values, or a whole result array - still small for the test. Then I always generate a random index in the range of the output array and read this item to a voltatile variable. But this var is of type float, double depending on the prior AVX, SSE code. This way you need no printing. And the compiler can't omit any write to result array, as there is at least one read afterwards, but at an uncertain position. At least, this is what I think and I never had problems this way.

What you mentioned sounds interesing, too. You use pointer to _m128 opr _m256 types and make the pointer itselft volatile right? This seems more universal. But instead of printing you could try, declaring a volatile float var and it with a _mm_store_ss/sd. I mean if the pointer is still volatile, the compiler can't optimize it and if you store on element of the vector type in a normal volatile variable, this can't be optimized either. What do you think?

Vladimir_Sedach · ‎05-06-2015

Christian,

You can replace any random access by an access with volatile pointer (or index) to avoid optimization.

If you're storing to a var (even volatile) in a loop, a smart compiler (eg LLVM) could skip all steps except the last one.
If you're storing (even a const) with a volatile pointer to a var (even non-volatile), compiler has to really store because it does not know the value of the pointer in advance. It should assume it is anywhere in the memory.
In other words, compiler never omits any write operator with volatile pointer no matter how simple it is to optimize otherwise.
The only obvious exception is when you don't use the result.

Another approach could be:
   for (...)
   x ^= foo(...);
   print(x);
Compiler must do all the calls to combine all results.

Random access is not a good idea since it takes time to call rand() and it is much slower compared with regular sequential access.
As a result you could get completely unrealistic timing.

Christian_M_2 · ‎05-06-2015

Vladimir,

that's a good point. The thing with the loop, is something I know. Therefore I added the result array, where each result has its own location. So the compiler can't do any skipping of iterations. And yes this adds memory bandwith issues, but if it is the way it is used in reality or your programm, this definitely is legitim. Next thing is that the random access appeared after timing the test, and only once. So I think additional affort did at least not affect test timings. And you can replace print (x) with volatile tmp = x. The compiler can't optimize it, as this is the last and only write.

But I like your volatile pointer approch, seems a good idea to use from now on.

andysem · ‎05-09-2015

> If you're storing to a var (even volatile) in a loop, a smart compiler (eg LLVM) could skip all steps except the last one.

Volatile stores/loads cannot be omitted. It doesn't matter whether the variable itself is volatile or the pointer is to a volatile.

Christian_M_2 · ‎05-10-2015

andysem wrote:

> If you're storing to a var (even volatile) in a loop, a smart compiler (eg LLVM) could skip all steps except the last one.

Volatile stores/loads cannot be omitted. It doesn't matter whether the variable itself is volatile or the pointer is to a volatile.

This is also, what I thought it to be.

So the question is, whether there is a real example demonstrating the opposite for volatile variables.

Ravi_K_ · ‎05-11-2015

andysem, I had tried with volatile data and didn't see performance improvement until I changed it to volatile pointer. I will double check on that and get back to you with the results.

Ravi_K_ · ‎05-12-2015

With "volatile data" compiler optimizes and doesn't even call "memcmp". With "volatile pointer" no optimization is done by compiler and I could measure performance results correctly.

Ravi_K_ · ‎05-14-2015

Hi, I had an additional question but this time it is with SSE which is limited to 16bytes comparison. I have similar comparison code as shown above for 16bytes my_cmp16(...) and it works fine i.e. beats memcmp cpu ticks for everything upto 256 bytes. Later on things gets worse. I have following code. Please note issue is not with 16 comparisons, assuming that data is equal and if I ignore all return values and return 0, it is way faster than regular memcmp. The actual issue is with the return code check which needs to be done 16times and it really slows down comparison function. Just wanted to know any better way to do that? I tried to split my_cmp256 into 2 my_cmp128's and it gets worse. Any inputs appreciated. static inline int my_cmp256(const void *src_1, const void *src_2) { int ret_0, ret_1, ret_2, ret_3, ret_4, ret_5, ret_6, ret_7; ret_0 = my_cmp16((const uint8_t *)src_1 + 0 * 16, (const uint8_t *)src_2 + 0 * 16); ret_1 = my_cmp16((const uint8_t *)src_1 + 1 * 16, (const uint8_t *)src_2 + 1 * 16); ret_2 = my_cmp16((const uint8_t *)src_1 + 2 * 16, (const uint8_t *)src_2 + 2 * 16); ret_3 = my_cmp16((const uint8_t *)src_1 + 3 * 16, (const uint8_t *)src_2 + 3 * 16); ret_4 = my_cmp16((const uint8_t *)src_1 + 4 * 16, (const uint8_t *)src_2 + 4 * 16); ret_5 = my_cmp16((const uint8_t *)src_1 + 5 * 16, (const uint8_t *)src_2 + 5 * 16); ret_6 = my_cmp16((const uint8_t *)src_1 + 6 * 16, (const uint8_t *)src_2 + 6 * 16); ret_7 = my_cmp16((const uint8_t *)src_1 + 7 * 16, (const uint8_t *)src_2 + 7 * 16); ret_8 = my_cmp16((const uint8_t *)src_1 + 8 * 16, (const uint8_t *)src_2 + 8 * 16); ret_9 = my_cmp16((const uint8_t *)src_1 + 9 * 16, (const uint8_t *)src_2 + 9 * 16); ret_10 = my_cmp16((const uint8_t *)src_1 + 10 * 16, (const uint8_t *)src_2 + 10 * 16); ret_11 = my_cmp16((const uint8_t *)src_1 + 11 * 16, (const uint8_t *)src_2 + 11 * 16); ret_12 = my_cmp16((const uint8_t *)src_1 + 12 * 16, (const uint8_t *)src_2 + 12 * 16); ret_13 = my_cmp16((const uint8_t *)src_1 + 13 * 16, (const uint8_t *)src_2 + 13 * 16); ret_14 = my_cmp16((const uint8_t *)src_1 + 14 * 16, (const uint8_t *)src_2 + 14 * 16); ret_15 = my_cmp16((const uint8_t *)src_1 + 15 * 16, (const uint8_t *)src_2 + 15 * 16); if (unlikely(ret_0 != 0)) return ret_0; else if (unlikely(ret_1 != 0)) return ret_1; else if (unlikely(ret_2 != 0)) return ret_2; ... else return ret_15; } Thanks Ravi

Ravi_K_ · ‎05-18-2015

I tried following things, instead of 16 comparison + 16 checks for return value, I did following changes static inline int my_cmp128(const void *src_1, const void *src_2) { static int8_t ret[16] __attribute__ ((aligned (16))) = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; /* int8_t ret[8] = { 0, 0, 0, 0, 0, 0, 0, 0}; */ uint32_t cmp; __m128i mm1; __m128i zero = _mm_setzero_si128(); ret[0] = my_cmp16((const uint8_t *)src_1 + 0 * 16, (const uint8_t *)src_2 + 0 * 16); ret[1] = my_cmp16((const uint8_t *)src_1 + 1 * 16, (const uint8_t *)src_2 + 1 * 16); ret[2] = my_cmp16((const uint8_t *)src_1 + 2 * 16, (const uint8_t *)src_2 + 2 * 16); ret[3] = my_cmp16((const uint8_t *)src_1 + 3 * 16, (const uint8_t *)src_2 + 3 * 16); ... ret[16] = my_cmp16((const uint8_t *)src_1 + 15 * 16, (const uint8_t *)src_2 + 15 * 16); mm1 = _mm_lddqu_si128((const __m128i *)&ret[0]); cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(mm1, zero)); if (cmp == 0xffff) return 0; else { return 1; } } This function performs worse than "16 comparisons and 16 if conditions" from earlier version. I was expecting newer version to perform better but confused why? Inputs appreciated. Thanks.