_mm256_blend_epi16 doesn't work as documented

Jeff_D_2 · ‎12-31-2014

The documentation for _mm256_blend_epi16 doesn't indicate that it operates on individual 128-bit channels, but this is the behavior I am seeing. Is this the correct behavior? The reproducer below shows the behavior of _mm256_blend_epi16 and _mm256_blend_epi32 when I attempt to insert a value into the first position of a vector using the blend instructions.

#include <stdint.h>
#include <stdio.h>

#include <immintrin.h>

typedef union {
    __m256i m;
    int32_t v[8];
} __m256i_32_t;

typedef union {
    __m256i m;
    int16_t v[16];
} __m256i_16_t;

void print_m256i_32(__m256i a) {
    __m256i_32_t t;
    t.m = a;
    printf("{%d,%d,%d,%d,%d,%d,%d,%d}",
            t.v[0], t.v[1], t.v[2], t.v[3],
            t.v[4], t.v[5], t.v[6], t.v[7]);
}

void print_m256i_16(__m256i a) {
    __m256i_16_t t;
    t.m = a;
    printf("{%d,%d,%d,%d,%d,%d,%d,%d,"
            "%d,%d,%d,%d,%d,%d,%d,%d}",
            t.v[ 0], t.v[ 1], t.v[ 2], t.v[ 3],
            t.v[ 4], t.v[ 5], t.v[ 6], t.v[ 7],
            t.v[ 8], t.v[ 9], t.v[10], t.v[11],
            t.v[12], t.v[13], t.v[14], t.v[15]);
}

int main(int argc, char **argv)
{
    __m256i a32 = _mm256_set_epi32(1,2,3,4,5,6,7,8);
    __m256i a16 = _mm256_set_epi16(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);
    __m256i z32 = _mm256_set1_epi32(99);
    __m256i z16 = _mm256_set1_epi16(99);
    __m256i insert32 = _mm256_blend_epi32(a32, z32, 1);
    printf("insert32 = _mm256_blend_epi32(a32, z32, 1)\n");
    print_m256i_32(insert32);
    printf("\n");
    __m256i insert16 = _mm256_blend_epi16(a16, z16, 1);
    printf("insert16 = _mm256_blend_epi16(a16, z16, 1)\n");
    print_m256i_16(insert16);
    printf("\n");
    return 0;
}

The output on my system is the following:

insert32 = _mm256_blend_epi32(a32, z32, 1)
{99,7,6,5,4,3,2,1}
insert16 = _mm256_blend_epi16(a16, z16, 1)
{99,15,14,13,12,11,10,9,99,7,6,5,4,3,2,1}

If this is indeed the case then I must use _mm256_blendv_epi8 to accomplish what I am trying to do using _mm256_blend_epi16, but the latency and throughput are not as good.

Is the documentation incorrect, then, and is the intrinsic behaving as intended?
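
One possible way to replace only element 0 without _mm256_blendv_epi8 is to chain two immediate blends: _mm256_blend_epi16 with mask 1 (which touches elements 0 and 8), followed by _mm256_blend_epi32 with mask 1 to keep only the low 32 bits of that result. The sketch below is an illustration, not something confirmed in this thread. The helper names blend_first_epi16 and blend_first_epi16_avx2 are hypothetical, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions assumed here so the file builds without -mavx2.

```c
#include <stdint.h>
#include <immintrin.h>

/* Hypothetical helper: AVX2 path. dst = a, except 16-bit element 0 comes from b. */
__attribute__((target("avx2")))
static void blend_first_epi16_avx2(int16_t dst[16], const int16_t a[16],
                                   const int16_t b[16])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* Step 1: mask 0x01 is applied to each 128-bit half, so this
     * replaces 16-bit elements 0 and 8. */
    __m256i both = _mm256_blend_epi16(va, vb, 0x01);

    /* Step 2: take only 32-bit element 0 (16-bit elements 0 and 1)
     * from 'both'; everything else, including element 8, comes from va. */
    __m256i r = _mm256_blend_epi32(va, both, 0x01);

    _mm256_storeu_si256((__m256i *)dst, r);
}

/* Hypothetical wrapper: uses the AVX2 path when the CPU supports it,
 * otherwise a plain scalar loop with the same result. */
void blend_first_epi16(int16_t dst[16], const int16_t a[16],
                       const int16_t b[16])
{
    if (__builtin_cpu_supports("avx2")) {
        blend_first_epi16_avx2(dst, a, b);
    } else {
        for (int i = 0; i < 16; i++)
            dst[i] = (i == 0) ? b[i] : a[i];
    }
}
```

Whether two immediate blends actually beat a single _mm256_blendv_epi8 depends on the microarchitecture, so it is worth measuring on the target CPU.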

Vladimir_Sedach · ‎01-01-2015

Jeff D. wrote:

The documentation for _mm256_blend_epi16 doesn't indicate that it operates on individual 128-bit channels, but this is the behavior I am seeing. Is this the correct behavior?

Immediate constant parameters of *all* Intel intrinsics are 8 bits long, so _mm256_blend_epi16() can't blend 16 elements individually; the same 8 mask bits are applied to each 128-bit half.
Your doc is incomplete.
I'd recommend using the "instruction set reference, A-Z" at
https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
This is the *only* *complete* source of info on Intel instructions/intrinsics.

If you need info on an intrinsic, just find it with Ctrl+F and read the section above it.
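
To make the point about the 8-bit immediate concrete, here is a minimal scalar model of the 256-bit 16-bit blend. It is a sketch of the semantics as described above, not Intel's code, and the name blend_epi16_model is hypothetical. Bit (i % 8) of the immediate selects element i, so bit 0 controls both element 0 and element 8.

```c
#include <stdint.h>

/* Hypothetical scalar model of _mm256_blend_epi16(a, b, imm8):
 * the 8 mask bits are reused for each 128-bit half, so element i
 * is selected by bit (i % 8) of imm8. */
void blend_epi16_model(int16_t dst[16], const int16_t a[16],
                       const int16_t b[16], uint8_t imm8)
{
    for (int i = 0; i < 16; i++) {
        int take_b = (imm8 >> (i % 8)) & 1;
        dst[i] = take_b ? b[i] : a[i];
    }
}
```

With a holding 16 down to 1 in element order, b holding 99 everywhere, and imm8 = 1, the model yields 99 in elements 0 and 8, which matches the insert16 output in the original post.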

bronxzv · ‎01-01-2015

Jeff D. wrote:

The documentation for _mm256_blend_epi16 doesn't indicate that it operates on individual 128-bit channels, but this is the behavior I am seeing.

I'm not sure which documentation you are referring to, but I see that the Intrinsics Guide is indeed wrong.

At least this source is correct for this intrinsic: https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-5369B2B5-B1E1-4D96-85AB-2019982667B4.htm

andysem · ‎01-01-2015

bronxzv wrote:

I'm not sure which documentation you are referring to, but I see that the Intrinsics Guide is indeed wrong.

Perhaps this should be reported in the dedicated thread.

bronxzv · ‎01-01-2015

andysem wrote:

bronxzv wrote:

I'm not sure which documentation you are referring to, but I see that the Intrinsics Guide is indeed wrong.

Perhaps this should be reported in the dedicated thread.

Done!