

One question: are you writing this code in assembly or with intrinsics? If you are using intrinsics, the compiler should not mix up XMM and YMM in this particular case and should generate the right form of the instruction.

However, I think you can use the _mm256_permute_ps intrinsic (not permute2_ps). But again, you may have to use permute2f128 or an extractf128/insertf128 combination to bring part of the upper lane down to the lower lane. If you can post the code, it will be easier to suggest the best way to do it. As you know, there is no 256-bit alignr; it is 128-bit only.


vpermil2ps is now part of AMD's XOP (to be featured in upcoming Bulldozer products along with FMA4); AFAIK no Intel CPU has been announced with support for it, though.

It looks like you want to shift values between the two 128-bit lanes, and I'm afraid there isn't any fast option for that in standard AVX yet.


If it can be useful, here is an emulated _mm256_alignr_ps in at most 4 instructions; note that the index argument must be a compile-time constant.

FORCEINLINE is __forceinline in the MS world, inline __attribute__((always_inline)) in the GCC world.

[cpp]
#include <immintrin.h>

template <int N>
FORCEINLINE __m256 _mm256_alignr_ps_emul(__m256 const & b, __m256 const & a)
{
    // [a4 a5 a6 a7 | b0 b1 b2 b3]: upper lane of a glued to lower lane of b
    __m256 a47b03 = _mm256_permute2f128_ps(a, b, _MM_SHUFFLE(0, 2, 0, 1));

    if (N == 0) return a;
    else if (N == 8) return b;
    else if (N == 4) return a47b03;
    else if (N < 4)
    {
        // variable names describe the N = 1 case
        __m256 a13Xa57X   = _mm256_permute_ps(a,
            _MM_SHUFFLE((3 + N) & 3, (2 + N) & 3, (1 + N) & 3, N & 3));
        __m256 XXXa4XXXb0 = _mm256_permute_ps(a47b03,
            _MM_SHUFFLE((3 + N) & 3, (2 + N) & 3, (1 + N) & 3, N & 3));
        return _mm256_blend_ps(a13Xa57X, XXXa4XXXb0,
            ((((N & 3) > 0) << 7) | (((N & 3) > 1) << 6) | (((N & 3) > 2) << 5) | (((N & 3) > 3) << 4))
          | ((((N & 3) > 0) << 3) | (((N & 3) > 1) << 2) | (((N & 3) > 2) << 1) | (((N & 3) > 3) << 0)));
    }
    else
    {
        // same naming scheme, shifted up one lane (names describe the N = 5 case)
        __m256 a13Xa57X   = _mm256_permute_ps(a47b03,
            _MM_SHUFFLE((3 + N) & 3, (2 + N) & 3, (1 + N) & 3, N & 3));
        __m256 XXXa4XXXb0 = _mm256_permute_ps(b,
            _MM_SHUFFLE((3 + N) & 3, (2 + N) & 3, (1 + N) & 3, N & 3));
        return _mm256_blend_ps(a13Xa57X, XXXa4XXXb0,
            ((((N & 3) > 0) << 7) | (((N & 3) > 1) << 6) | (((N & 3) > 2) << 5) | (((N & 3) > 3) << 4))
          | ((((N & 3) > 0) << 3) | (((N & 3) > 1) << 2) | (((N & 3) > 2) << 1) | (((N & 3) > 3) << 0)));
    }
}

#define _mm256_alignr_ps(x, y, n) _mm256_alignr_ps_emul<(n)>((x), (y))
[/cpp]


There is no instruction that can shift 32-byte operands. If your data are stored in an array, then the most efficient solution is to do an unaligned read. Unaligned reads are fast on all processors that have AVX, and definitely faster than a permute.


Hi,

In my use case, the data are already in registers.

I replaced the emulation with an unaligned load and got a 10% penalty over the whole algorithm, which is a bad penalty.

To be fair, the algorithm I'm benchmarking is no faster in AVX than in SSE, so the penalty from the emulated alignr is too high (and the one from the unaligned load is even worse).

Best regards


To clarify: unaligned reads are NOT always fast on Intel. Intel has a penalty for misaligned loads (the FP load latency goes from 8 clks to 13 clks in 256-bit movups scenarios; 128-bit loads take 7 clks), and if your load spans a cacheline boundary it's a much stiffer penalty (load latency goes from 8 to 33 clks).

You don't see this in non-vector code because your alignment is natural: you do 4-byte or 8-byte loads on data that is usually 4 or 8 bytes in size and aligned to at least 8 bytes, so no issue is observed. When you vectorize code on, say, 64-bit data, you're accessing addresses that are 8-byte aligned, but your loads may span a cacheline. With 128-bit loads, if you're 8-byte but not 16-byte aligned, on average 1 access in 4 will span a cacheline. With 256-bit loads that may be 1 in 2.

So alignment is important; I just wanted to elucidate that. When you can't enforce it, you will pay some penalty. Intel likely buffers all this quite well with their large number of load-queue entries, but if your code is sensitive to the load latency you may still pay. Last point: when misaligned and spanning a 64-byte boundary, your throughput also drops from 2 loads per clock to 1, but the real hammer in performance is the latency, not the throughput.

perfwise
