Ravi_K_
Beginner
420 Views

Q on memory comparison optimization

Hi All,

I am using AVX/SSE instructions to replace memcmp. Our workload mostly compares 32 bytes, and occasionally 64 and 128 bytes. I am using the function cmp32 below for 32-byte comparisons and chaining it two times for 64 bytes or four times for 128 bytes, but I am hardly getting a 1% performance improvement. Testing was done on Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz, Ubuntu 14.04 x86_64.

I tried replacing the following lines

vcmp = _mm256_cmpeq_epi64(xmm0, xmm1);
vmask = _mm256_movemask_epi8(vcmp);

with

vcmp = _mm_xor_si128(xmm0, xmm1);
result = _mm_testz_si128(vcmp, vcmp);

and the performance numbers are the same. Secondly, I tried replacing unaligned loads with aligned loads, still no help. Is there any additional optimization that can be done to improve performance?

static inline int cmp32(const uint8_t *src_1, const uint8_t *src_2)
{
    __m256i xmm0;
    __m256i xmm1;
    __m256i vcmp;
    uint32_t vmask;

    xmm0 = _mm256_loadu_si256((const __m256i *)src_1);
    xmm1 = _mm256_loadu_si256((const __m256i *)src_2);
    vcmp = _mm256_cmpeq_epi64(xmm0, xmm1);
    /* movemask returns int; keep the mask as uint32_t, otherwise the
       all-equal mask sign-extends to -1 and never equals 0xffffffff */
    vmask = (uint32_t)_mm256_movemask_epi8(vcmp);
    if (likely(vmask == 0xffffffff)) {
        return 0;
    } else {
        vcmp = _mm256_cmpgt_epi64(xmm0, xmm1);
        vmask = (uint32_t)_mm256_movemask_epi8(vcmp);
        if (vmask == 0xffffffff)
            return 1;
        else
            return -1;
    }
}
51 Replies
Ravi_K_
Beginner
I tried the following: instead of 16 comparisons + 16 checks of the return value, I made these changes:

static inline int my_cmp128(const void *src_1, const void *src_2)
{
    static int8_t ret[16] __attribute__ ((aligned (16))) =
        { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
    /* int8_t ret[8] = { 0, 0, 0, 0, 0, 0, 0, 0}; */
    uint32_t cmp;
    __m128i mm1;
    __m128i zero = _mm_setzero_si128();

    ret[0] = my_cmp16((const uint8_t *)src_1 + 0 * 16, (const uint8_t *)src_2 + 0 * 16);
    ret[1] = my_cmp16((const uint8_t *)src_1 + 1 * 16, (const uint8_t *)src_2 + 1 * 16);
    ret[2] = my_cmp16((const uint8_t *)src_1 + 2 * 16, (const uint8_t *)src_2 + 2 * 16);
    ret[3] = my_cmp16((const uint8_t *)src_1 + 3 * 16, (const uint8_t *)src_2 + 3 * 16);
    ...
    ret[15] = my_cmp16((const uint8_t *)src_1 + 15 * 16, (const uint8_t *)src_2 + 15 * 16);

    mm1 = _mm_lddqu_si128((const __m128i *)&ret[0]);
    cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(mm1, zero));
    if (cmp == 0xffff)
        return 0;
    else
        return 1;
}

This function performs worse than the "16 comparisons and 16 if conditions" version from earlier. I was expecting the newer version to perform better and am confused as to why it doesn't. Inputs appreciated. Thanks.
andysem
New Contributor III

Try ORing my_cmp16 results on return into a single accumulator instead of lddqu and movemask.
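For example, a rough sketch of what I mean (my_cmp16 below is just a stand-in that returns 0 when the 16-byte blocks are equal and non-zero otherwise):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Stand-in 16-byte equality test: 0 if equal, non-zero otherwise. */
static inline int my_cmp16(const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);
    return _mm_movemask_epi8(eq) ^ 0xFFFF;  /* 0 iff all 16 bytes match */
}

/* OR all sub-results into one accumulator: no stores, no extra loads. */
static inline int my_cmp128(const uint8_t *a, const uint8_t *b)
{
    int ret = 0;
    for (int i = 0; i < 8; i++)             /* 8 x 16 = 128 bytes */
        ret |= my_cmp16(a + i * 16, b + i * 16);
    return ret;                             /* 0 iff all 128 bytes equal */
}
```

The point is that the accumulator stays in a register the whole time, instead of bouncing the sub-results through memory and re-loading them with lddqu.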

 

Ravi_K_
Beginner
Thanks andysem, your earlier responses have been very helpful in understanding SIMD. I am assuming you are suggesting the version below; if yes, I had tried it before:

static inline int my_cmp256(const void *src_1, const void *src_2)
{
    int ret;

    ret  = my_cmp16((const uint8_t *)src_1 + 0 * 16, (const uint8_t *)src_2 + 0 * 16);
    ret |= my_cmp16((const uint8_t *)src_1 + 1 * 16, (const uint8_t *)src_2 + 1 * 16);
    ret |= my_cmp16((const uint8_t *)src_1 + 2 * 16, (const uint8_t *)src_2 + 2 * 16);
    ret |= my_cmp16((const uint8_t *)src_1 + 3 * 16, (const uint8_t *)src_2 + 3 * 16);
    ...
    ret |= my_cmp16((const uint8_t *)src_1 + 15 * 16, (const uint8_t *)src_2 + 15 * 16);
    return ret;
}

Unfortunately performance drastically increases. I had debugged this and found out that if I removed the last "ret |=" (i.e. the "15 * 16" comparison), the performance number is in line with memcmp. I looked at the code generated by gcc and didn't see anything suspicious. I am not sure yet why it happens.

Without the last ORing:

gcc -mavx2 -m64 -O3 test-memcmp-256-1-sse.c
/home/rkerur/compare# ./a.out
Time: 163 ticks (1533742 memcmp/tick)

With the last ORing I get:

gcc -mavx2 -m64 -O3 test-memcmp-256-1-sse.c
/home/rkerur/compare# ./a.out
Time: 1499 ticks (166777 memcmp/tick)

If your input is not what I had tried, please let me know; I am willing to try out your suggestion. If you are interested in the actual code I can send it out. Thanks.
Ravi_K_
Beginner
Also, the above logic may not work if I try to mimic memcmp return values (0 == equal, negative == lesser, positive == greater). Thanks,
andysem
New Contributor III

Ravi K. wrote:

unfortunately performance drastically increases. I had debugged this and found out that if I removed last "ret |=" i.e. "from 15 * 16 comparison" performance number is inline with memcmp.

I assume you meant that the performance decreased. I'm not sure what you mean by removing the last OR - did you remove the whole last call to my_cmp16 or just replace the OR with something else? Note that if you assign to ret instead of combining with its previous value, the compiler may eliminate all of the earlier code that composed that previous value.

There is a significant difference between a typical memcmp and your algorithm: memcmp stops comparing as soon as it finds the first difference, while your algorithm completes the traversal regardless. Depending on the input data, this may result in a different performance ratio between memcmp and my_cmp256. You have to take into account the real-world data that will most probably be fed to your algorithm.
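For illustration only, an early-exit scan over 16-byte chunks (a sketch, not your my_cmp256; memeq16n is a made-up name) would look like:

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Early-exit equality scan: stop at the first differing 16-byte block,
   the way a typical memcmp stops at the first differing byte.
   n is assumed to be a multiple of 16. Returns 0 iff equal. */
static inline int memeq16n(const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF)
            return 1;  /* mismatch: no need to scan the rest */
    }
    return 0;
}
```

Whether this beats the branch-free full traversal depends entirely on where, statistically, your real inputs first differ.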

Ravi K. wrote:

Also above logic may not work if I try to mimic memcmp return values (0== equal, -ve == lesser, +ve == greater).

Your my_cmp16 and my_cmp256 did not support that interface, so as long as that is ok ORing is also ok. If you do need ordering as well as equality then you need to modify both my_cmp16 and my_cmp256 to accommodate that. See my initial post in this thread for an example.

 

Vladimir_Sedach
New Contributor I

andysem,

In my_memcmp() above it must be:
return rcmp - cmp;
instead of
return cmp - rcmp;

What's the origin of this method, especially its bit manipulation of ints?

Unfortunately it is much slower than byte swapping and comparison of 8, 1, 2, 4 byte values.
 

andysem
New Contributor III

Vladimir Sedach wrote:

andysem,

In my_memcmp() above must be:
return rcmp - cmp;
instead of
return cmp - rcmp;

Oh, right, thanks. I've fixed the post.

Vladimir Sedach wrote:

What's the origin of this method, especially its bit manipulation of ints?

I wrote similar code for Boost.UUID and adapted it for this post.

Vladimir Sedach wrote:

Unfortunately it is much slower than byte swapping and comparison of 8, 1, 2, 4 byte values.

I'm not sure I understand. Do you mean _mm_shuffle_epi8 to swap bytes? For 256-bit registers that would also require _mm256_permute2f128_si256, and that would only save a few integer instructions on the cmp/rcmp in the end. Did you benchmark it or did I misunderstand your suggestion?

 

Ravi_K_
Beginner
andysem wrote:

I assume you meant that the performance decreased. I'm not sure what you mean by removing the last OR - did you remove the whole last call to my_cmp16 or just replaced OR with something else? Note that if you assign to ret instead of combining it with its previous value the compiler may eliminate all earlier code that composed that previous value.

No, it drastically increases CPU ticks, and that's what is puzzling. When I say remove the last OR: with

ret |= my_cmp16((const uint8_t *)src_1 + 15 * 16,
                (const uint8_t *)src_2 + 15 * 16);

i.e. with the last ORing, I get

gcc -mavx2 -m64 -O3 test-memcmp-256-1-sse.c
/home/rkerur/compare# ./a.out
Time: 1499 ticks (166777 memcmp/tick)

If instead of the above code I have

ret = my_cmp16((const uint8_t *)src_1 + 15 * 16,
               (const uint8_t *)src_2 + 15 * 16);

gcc -mavx2 -m64 -O3 test-memcmp-256-1-sse.c
/home/rkerur/compare# ./a.out
Time: 163 ticks (1533742 memcmp/tick)

Since I am testing for equality the return value didn't matter, and I was debugging to find out why the CPU ticks increase. If I use the logic to mimic memcmp return codes, then I cannot use just the "ORing" logic, correct? Thanks.
andysem
New Contributor III

Ravi K. wrote:

No it drastically increases cpu ticks and thats what is puzzling. When I say remove last OR,

ret |= my_cmp16((const uint8_t *)src_1 + 15 * 16,
(const uint8_t *)src_2 + 15 * 16)

with last ORing I get

gcc -mavx2 -m64 -O3 test-memcmp-256-1-sse.c
/home/rkerur/compare# ./a.out
Time: 1499 ticks (166777 memcmp/tick)

instead of above code if I have

ret = my_cmp16((const uint8_t *)src_1 + 15 * 16,
(const uint8_t *)src_2 + 15 * 16)

gcc -mavx2 -m64 -O3 test-memcmp-256-1-sse.c

/home/rkerur/compare# ./a.out
Time: 163 ticks (1533742 memcmp/tick)

Since I am testing for equality return didn't matter and I was debugging to find out why the increase in cpu ticks.

I'm pretty sure the compiler removed all calls to my_cmp16 except the last one, which explains the difference in numbers. Have a look at the disassembled code of your test to verify that.
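One simple way to defeat that kind of dead-code elimination in a benchmark is to make the accumulated result observable, e.g. (a sketch; my_cmp128 here is just a trivial stand-in for whatever routine is being timed):

```c
#include <stdint.h>

/* Hypothetical stand-in for the routine under test. */
static int my_cmp128(const uint8_t *a, const uint8_t *b)
{
    return a[0] != b[0];
}

/* If the benchmark loop discards the result, the optimizer is free to
   delete the work being timed. Writing the accumulated result to a
   volatile sink makes it observable, so every call must execute. */
static volatile int sink;

void bench(const uint8_t *a, const uint8_t *b, int iters)
{
    int acc = 0;
    for (int i = 0; i < iters; i++)
        acc |= my_cmp128(a, b);
    sink = acc;  /* result escapes: the loop cannot be optimized away */
}
```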

Ravi K. wrote:

If I use the logic to mimic memcmp return codes, then I cannot use just "ORing" logic, correct?

Yes.

 

Vladimir_Sedach
New Contributor I

Andy,

I mean:
__m128i    idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
__m128i    v0 = _mm_loadu_si128((const __m128i *)a);
__m128i    v1 = _mm_loadu_si128((const __m128i *)b);

if (_mm_movemask_epi8(_mm_cmpeq_epi8(v0, v1)) != 0xFFFF) //v0 != v1
{
    v0 = _mm_shuffle_epi8(v0, idx); //reverse byte order
    v1 = _mm_shuffle_epi8(v1, idx);

    v0 = _mm_xor_si128(v0, _mm_set1_epi8(0x80)); //flip sign bits to compare unsigned bytes with instructions for signed bytes
    v1 = _mm_xor_si128(v1, _mm_set1_epi8(0x80));

    return _mm_movemask_epi8(_mm_cmpgt_epi8(v0, v1)) - _mm_movemask_epi8(_mm_cmpgt_epi8(v1, v0));
}

Though, _byteswap_uint64() and likes appeared to be faster:

#define CMP_64(a, b) { \
    __u64    x = bswap_64(*(__u64 *)(a)); \
    __u64    y = bswap_64(*(__u64 *)(b)); \
    if (x != y) return (x < y) ? -1 : 1; }

GCC generates very fast code for the last line.
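For example, a minimal self-contained check that this macro reproduces memcmp's ordering (cmp8 is a made-up wrapper name; on GCC bswap_64 maps to __builtin_bswap64, on MSVC it would be _byteswap_uint64):

```c
typedef unsigned long long __u64;
#define bswap_64(x) __builtin_bswap64(x)

#define CMP_64(a, b) { \
    __u64 x = bswap_64(*(__u64 *)(a)); \
    __u64 y = bswap_64(*(__u64 *)(b)); \
    if (x != y) return (x < y) ? -1 : 1; }

/* memcmp-style result for exactly 8 bytes: one load + one bswap per side. */
static int cmp8(const void *a, const void *b)
{
    CMP_64(a, b)
    return 0;
}
```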
 

 

Ravi_K_
Beginner
Vladimir, What is the idx value in _mm_shuffle_epi8(v0, idx); ? I am just trying to understand the code, hence the question. Thanks.
Vladimir_Sedach
New Contributor I

Ravi,

Sorry, I missed it:
__m128i    idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);


v0 = _mm_shuffle_epi8(v0, idx) reverses the byte order in v0.

Ravi_K_
Beginner
Vladimir, thanks for the information. I tested

#define CMP_64(a, b) { \
    __u64    x = bswap_64(*(__u64 *)(a)); \
    __u64    y = bswap_64(*(__u64 *)(b)); \
    if (x != y) return (x < y) ? -1 : 1; }

and it's faster than SIMD. My question is: why swap before comparison? Wouldn't a direct comparison be even faster? Thanks
Ravi_K_
Beginner
andysem wrote:

I'm pretty sure the compiler removed all calls to my_cmp16 except the last one, and that explains the difference in numbers. Have a look at the disassembled code of your test to verify that.

Andy, my mistake, I should have paid attention. Yes, your explanation is correct. Thanks,
Vladimir_Sedach
New Contributor I

Ravi,

When comparing integers, the high-order bytes decide the result, and on a little-endian machine they sit at offsets +7, +6, ... in memory.
In memcmp() we compare bytes at offsets +0, +1, ... first, so we byte-swap to make the integer comparison match memcmp's byte order.
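A two-byte example makes this concrete (lex_cmp16 is a hypothetical name for illustration):

```c
#include <stdint.h>
#include <string.h>

/* memcmp ranks {0x01, 0x02} below {0x02, 0x01}: the byte at offset +0
   decides. Loaded as little-endian uint16_t the values read 0x0201 and
   0x0102, so a direct integer compare gives the opposite answer;
   byte-swapping first restores memcmp's lexicographic order. */
static int lex_cmp16(const uint8_t *a, const uint8_t *b)
{
    uint16_t x, y;
    memcpy(&x, a, 2);
    memcpy(&y, b, 2);
    x = (uint16_t)__builtin_bswap16(x);
    y = (uint16_t)__builtin_bswap16(y);
    return (x < y) ? -1 : (x > y) ? 1 : 0;
}
```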

andysem
New Contributor III

Vladimir Sedach wrote:

Andy,

I mean:
__m128i    idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
__m128i    v0 = _mm_loadu_si128((const __m128i *)a);
__m128i    v1 = _mm_loadu_si128((const __m128i *)b);

if (_mm_movemask_epi8(_mm_cmpeq_epi8(v0, v1)) != 0xFFFF) //v0 != v1
{
    v0 = _mm_shuffle_epi8(v0, idx); //reverse byte order
    v1 = _mm_shuffle_epi8(v1, idx);

    v0 = _mm_xor_si128(v0, _mm_set1_epi8(0x80)); //flip sign bits to compare unsigned bytes with instructions for signed bytes
    v1 = _mm_xor_si128(v1, _mm_set1_epi8(0x80));

    return _mm_movemask_epi8(_mm_cmpgt_epi8(v0, v1)) - _mm_movemask_epi8(_mm_cmpgt_epi8(v1, v0));
}

Though, _byteswap_uint64() and likes appeared to be faster:

#define CMP_64(a, b) { \
    __u64    x = bswap_64(*(__u64 *)(a)); \
    __u64    y = bswap_64(*(__u64 *)(b)); \
    if (x != y) return (x < y) ? -1 : 1; }

GCC creates a very fast code for the last line.
 

I'll have to run some tests with these variants. The SIMD variant requires two more memory loads, but you eliminated one _mm_xor_si128 and scalar integer math. The bswap version also looks very interesting.

Can I use your code in my projects? I might update Boost.UUID if it proves beneficial.

 

Vladimir_Sedach
New Contributor I

Andy,

Yes, you can freely use the code.
The code below is an excerpt from my library, which will appear publicly one day.
memeq() returns 1 for equality and 0 otherwise.
The functions are faster if called with a constant "size" parameter, due to smart inlining.

typedef char                    __i8;
typedef unsigned char            __u8;
typedef short                    __i16;
typedef unsigned short            __u16;
typedef int                        __i32;
typedef unsigned int            __u32;
typedef long long                __i64;    //GCC stdint.h: int64_t is either "long long int" (32-bit) or "long int" (64-bit)
typedef unsigned long long        __u64;

#ifdef __GNUC__
__inline __u16 bswap_16(__u16 a)
{    return __builtin_bswap16(a);}

__inline __u32 bswap_32(__u32 a)
{    return __builtin_bswap32(a);}

__inline __u64 bswap_64(__u64 a)
{    return __builtin_bswap64(a);}

#else

__inline __u16 bswap_16(__u16 a)
{    return _byteswap_ushort(a);}

__inline __u32 bswap_32(__u32 a)
{    return _byteswap_ulong(a);}

__inline __u64 bswap_64(__u64 a)
{    return _byteswap_uint64(a);}
#endif

//****************************************
#define CMP_1(a, b) { \
    __u8    x = *(__u8 *)(a); \
    __u8    y = *(__u8 *)(b); \
    if (x != y) return x - y; }

#define _CMP_1(a, b) \
    return *(__u8 *)(a) - *(__u8 *)(b);
//****************************************
#define CMP_2(a, b) { \
    __u16    x = bswap_16(*(__u16 *)(a)); \
    __u16    y = bswap_16(*(__u16 *)(b)); \
    if (x != y) return x - y; }

#define _CMP_2(a, b) { \
    __u16    x = bswap_16(*(__u16 *)(a)); \
    __u16    y = bswap_16(*(__u16 *)(b)); \
    return x - y; }
//****************************************
#define CMP_4(a, b) { \
    __u32    x = bswap_32(*(__u32 *)(a)); \
    __u32    y = bswap_32(*(__u32 *)(b)); \
    if (x != y) return (x < y) ? -1 : 1; }

#define _CMP_4(a, b) { \
    __u32    x = bswap_32(*(__u32 *)(a)); \
    __u32    y = bswap_32(*(__u32 *)(b)); \
    return (x < y) ? -1 : (x > y) ? 1 : 0; }
//****************************************
#define CMP_8(a, b) { \
    __u64    x = bswap_64(*(__u64 *)(a)); \
    __u64    y = bswap_64(*(__u64 *)(b)); \
    if (x != y) return (x < y) ? -1 : 1; }

#define _CMP_8(a, b) { \
    __u64    x = bswap_64(*(__u64 *)(a)); \
    __u64    y = bswap_64(*(__u64 *)(b)); \
    return (x < y) ? -1 : (x > y) ? 1 : 0; }
//****************************************

//*************************************************************************
static __inline __i32 _memcmp(const void *_a, const void *_b, size_t _size)
//*************************************************************************
{
    __u8    *a = (__u8 *)_a;
    __u8    *b = (__u8 *)_b;
    ptrdiff_t    size = _size;
    __u64    x, y;
    ptrdiff_t    i;

    if (!size)
        return 0;

    CMP_1(a, b)

    if (size >= 32)
        goto cmp_long;

    for (i = 0; i <= size - 16; i += 16, a += 16, b += 16)
    {
        CMP_8(a + 0, b + 0)
        CMP_8(a + 8, b + 8)
    }

cmp_15:
    switch (size - i)
    {
    case 0:
        return 0;
    case 1:
        _CMP_1(a, b)
    case 2:
        _CMP_2(a, b)
    case 3:
        CMP_2(a, b)
        _CMP_1(a + 2, b + 2)
    case 4:
        _CMP_4(a, b)
    case 5:
        CMP_4(a, b)
        _CMP_1(a + 4, b + 4)
    case 6:
        CMP_4(a, b)
        _CMP_2(a + 4, b + 4)
    case 7:
        CMP_4(a, b)
        CMP_2(a + 4, b + 4)
        _CMP_1(a + 6, b + 6)
    case 8:
        _CMP_8(a, b)
    case 9:
        CMP_8(a, b)
        _CMP_1(a + 8, b + 8)
    case 10:
        CMP_8(a, b)
        _CMP_2(a + 8, b + 8)
    case 11:
        CMP_8(a, b)
        CMP_2(a + 8, b + 8)
        _CMP_1(a + 10, b + 10)
    case 12:
        CMP_8(a, b)
        _CMP_4(a + 8, b + 8)
    case 13:
        CMP_8(a, b)
        CMP_4(a + 8, b + 8)
        _CMP_1(a + 12, b + 12)
    case 14:
        CMP_8(a, b)
        CMP_4(a + 8, b + 8)
        _CMP_2(a + 12, b + 12)
    case 15:
        CMP_8(a, b)
        CMP_4(a + 8, b + 8)
        CMP_2(a + 12, b + 12)
        _CMP_1(a + 14, b + 14)
    } //switch

cmp_long:
    for (i = 0; i <= size - 32; i += 32, a += 32, b += 32)
    {
        x = *(__u64 *)(a +  0);    if (x != *(__u64 *)(b +  0))    goto ret0;
        x = *(__u64 *)(a +  8);    if (x != *(__u64 *)(b +  8))    goto ret8;
        x = *(__u64 *)(a + 16);    if (x != *(__u64 *)(b + 16))    goto ret16;
        x = *(__u64 *)(a + 24);    if (x != *(__u64 *)(b + 24))    goto ret24;
    }

    if (size - i < 16)
        goto cmp_15;

    x = *(__u64 *)(a + 0);    if (x != *(__u64 *)(b + 0))    goto ret0;
    x = *(__u64 *)(a + 8);    if (x != *(__u64 *)(b + 8))    goto ret8;

    a += 16;
    b += 16;
    i += 16;
    goto cmp_15;

ret0:    y = *(__u64 *)(b +  0);    goto ret;
ret8:    y = *(__u64 *)(b +  8);    goto ret;
ret16:    y = *(__u64 *)(b + 16);    goto ret;
ret24:    y = *(__u64 *)(b + 24);    goto ret;

ret:
    x = bswap_64(x);
    y = bswap_64(y);
    return (x < y) ? -1 : (x > y) ? 1 : 0;
} //_memcmp

//***********************************************
#define CMPEQ_1(a, b) \
    if (*(__u8 *)(a) != *(__u8 *)(b)) return 0;

#define _CMPEQ_1(a, b) \
    return *(__u8 *)(a) == *(__u8 *)(b);
//***********************************************
#define CMPEQ_2(a, b) \
    if (*(__u16 *)(a) != *(__u16 *)(b)) return 0;

#define _CMPEQ_2(a, b) \
    return *(__u16 *)(a) == *(__u16 *)(b);
//***********************************************
#define CMPEQ_4(a, b) \
    if (*(__u32 *)(a) != *(__u32 *)(b)) return 0;

#define _CMPEQ_4(a, b) \
    return *(__u32 *)(a) == *(__u32 *)(b);
//***********************************************
#define CMPEQ_8(a, b) \
    if (*(__u64 *)(a) != *(__u64 *)(b)) return 0;

#define _CMPEQ_8(a, b) \
    return *(__u64 *)(a) == *(__u64 *)(b);
//***********************************************

//***********************************************************************
static __inline __i32 memeq(const void *_a, const void *_b, size_t _size)
//***********************************************************************
{
    __u8    *a = (__u8 *)_a;
    __u8    *b = (__u8 *)_b;
    ptrdiff_t    size = _size;
    ptrdiff_t    i;
    __v1i8    v0, v1;

    if (!size)
        return 1;

    CMPEQ_1(a, b)

#if 1
    for (i = 0; i <= size - 16; i += 16, a += 16, b += 16)
    {
        CMPEQ_8(a + 0, b + 0)
        CMPEQ_8(a + 8, b + 8)
    }
#elif VX_SSE2
    for (i = 0; i <= size - 16; i += 16, a += 16, b += 16)
    {
        v0 = loadu(a);
        v1 = loadu(b);

        if (v0 != v1)
            return 0;
    }
#endif

    switch (size - i)
    {
    case 0:
        return 1;
    case 1:
        _CMPEQ_1(a, b)
    case 2:
        _CMPEQ_2(a, b)
    case 3:
        CMPEQ_2(a, b)
        _CMPEQ_1(a + 2, b + 2)
    case 4:
        _CMPEQ_4(a, b)
    case 5:
        CMPEQ_4(a, b)
        _CMPEQ_1(a + 4, b + 4)
    case 6:
        CMPEQ_4(a, b)
        _CMPEQ_2(a + 4, b + 4)
    case 7:
        CMPEQ_4(a, b)
        CMPEQ_2(a + 4, b + 4)
        _CMPEQ_1(a + 6, b + 6)
    case 8:
        _CMPEQ_8(a, b)
    case 9:
        CMPEQ_8(a, b)
        _CMPEQ_1(a + 8, b + 8)
    case 10:
        CMPEQ_8(a, b)
        _CMPEQ_2(a + 8, b + 8)
    case 11:
        CMPEQ_8(a, b)
        CMPEQ_2(a + 8, b + 8)
        _CMPEQ_1(a + 10, b + 10)
    case 12:
        CMPEQ_8(a, b)
        _CMPEQ_4(a + 8, b + 8)
    case 13:
        CMPEQ_8(a, b)
        CMPEQ_4(a + 8, b + 8)
        _CMPEQ_1(a + 12, b + 12)
    case 14:
        CMPEQ_8(a, b)
        CMPEQ_4(a + 8, b + 8)
        _CMPEQ_2(a + 12, b + 12)
    case 15:
        CMPEQ_8(a, b)
        CMPEQ_4(a + 8, b + 8)
        CMPEQ_2(a + 12, b + 12)
        _CMPEQ_1(a + 14, b + 14)
    } //switch
} //memeq

Vladimir_Sedach
New Contributor I

The above _memcmp() and memeq() became faster because they compare the first bytes first.

Ravi_K_
Beginner
Vladimir, Andy, I tried the bswap_64 function for comparison, and it works fine for small comparisons. As the comparison size increases (256, 512, 1024, ...), this mechanism just dies down and can't beat memcmp. I hope I am not misunderstanding the code. I had used 16-byte SSE comparison before and couldn't beat memcmp either, so I wanted to know whether your use case involves larger comparisons and it has worked there. Thanks.
jimdempseyatthecove
Black Belt

Ravi,

Here is another option, which I will describe and let you write the code for (it should be simple enough to do):

a) perform two compares, one for gt and the other for eq; the third outcome (lt) is anything else.
b) perform a logical AND on each, producing a 2-bit field (2 == gt, 1 == eq, 0 == lt), with each 2-bit field shifted left by 2 bits relative to the next, in little-endian or big-endian order depending on how you organize your data. The "shift" is built into the constant used as the AND mask.
c) perform a logical OR on the two masked results, completing each 2-bit result.
d) perform hadd to combine the results into a composite number (repeat as often as necessary for the width of the original input data: 64-bit, 32-bit, ??-bit).

What you essentially are doing is producing

(least significant 2-bit field) +
(next significant 2-bit field)<<2 +
(next significant 2-bit field)<<4 +
(next significant 2-bit field)<<6 +
...

You'd produce as many 2-bit results as necessary that fit within an __m128i (or ___m256i) register before writing to memory. Then use the GP register commands to complete the process.
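In scalar form, just to illustrate the 2-bit encoding (pack_cmp2 is a made-up helper over hypothetical 64-bit elements, not the SIMD version):

```c
#include <stdint.h>

/* Pack per-element compare results into 2-bit fields:
   2 = greater, 1 = equal, 0 = less. Element 0 (lowest address)
   ends up in the least significant field. */
static uint32_t pack_cmp2(const uint64_t *a, const uint64_t *b, int n)
{
    uint32_t acc = 0;
    for (int i = n - 1; i >= 0; i--) {   /* highest element first */
        uint32_t f = (a[i] > b[i]) ? 2u : (a[i] == b[i]) ? 1u : 0u;
        acc = (acc << 2) | f;
    }
    return acc;
}
```

An all-equal input packs to the constant 0b01 repeated n times, so equality reduces to a single integer comparison against that constant.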

As to whether this will be faster than what is posted above, I cannot say. It will have to be tested.

Jim Dempsey