

One question: are you writing this code in assembly or with intrinsics? If you are using intrinsics, the compiler should not mix up XMM and YMM in this particular case and should generate the right form of the instruction.

However, I think you can use the _mm256_permute_ps intrinsic (not permute2_ps). But again, you may have to use permute2f128 or an extractf128/insertf128 combination to bring part of the upper lane down to the lower lane. If you can post the code, it will be easier to suggest the best way to do it. As you know, there is no 256-bit alignr; it is 128-bit only.


vpermil2ps is now part of AMD's XOP (to be featured in upcoming Bulldozer products along with FMA4); AFAIK no Intel CPU has been announced with support for it, though.

It looks like you want to shift values between the two 128-bit lanes, and I'm afraid there isn't any fast option for that in standard AVX yet.


If it can be useful, here is an emulated _mm256_alignr_ps in at most 4 instructions; note that the index argument must be a compile-time constant.

FORCEINLINE is __forceinline in the MS world, inline __attribute__((always_inline)) in the GCC world.

[cpp]
#include <immintrin.h>

template <int N>
FORCEINLINE __m256 _mm256_alignr_ps_emul(__m256 const & b, __m256 const & a)
{
    // [a4 a5 a6 a7 | b0 b1 b2 b3]: upper lane of a glued to lower lane of b
    __m256 a47b03 = _mm256_permute2f128_ps(a, b, _MM_SHUFFLE(0, 2, 0, 1));

    if (N == 0) return a;
    else if (N == 8) return b;
    else if (N == 4) return a47b03;
    else if (N < 4)
    {
        // variable names describe the N = 1 case
        __m256 a13Xa57X   = _mm256_permute_ps(a,
            _MM_SHUFFLE((3 + N) & 3, (2 + N) & 3, (1 + N) & 3, N & 3));
        __m256 XXXa4XXXb0 = _mm256_permute_ps(a47b03,
            _MM_SHUFFLE((3 + N) & 3, (2 + N) & 3, (1 + N) & 3, N & 3));
        return _mm256_blend_ps(a13Xa57X, XXXa4XXXb0,
            ((((N & 3) > 0) << 7) | (((N & 3) > 1) << 6) | (((N & 3) > 2) << 5) | (((N & 3) > 3) << 4))
          | ((((N & 3) > 0) << 3) | (((N & 3) > 1) << 2) | (((N & 3) > 2) << 1) | (((N & 3) > 3) << 0)));
    }
    else
    {
        // same naming scheme, shifted up one lane (names describe the N = 5 case)
        __m256 a13Xa57X   = _mm256_permute_ps(a47b03,
            _MM_SHUFFLE((3 + N) & 3, (2 + N) & 3, (1 + N) & 3, N & 3));
        __m256 XXXa4XXXb0 = _mm256_permute_ps(b,
            _MM_SHUFFLE((3 + N) & 3, (2 + N) & 3, (1 + N) & 3, N & 3));
        return _mm256_blend_ps(a13Xa57X, XXXa4XXXb0,
            ((((N & 3) > 0) << 7) | (((N & 3) > 1) << 6) | (((N & 3) > 2) << 5) | (((N & 3) > 3) << 4))
          | ((((N & 3) > 0) << 3) | (((N & 3) > 1) << 2) | (((N & 3) > 2) << 1) | (((N & 3) > 3) << 0)));
    }
}

#define _mm256_alignr_ps(x, y, n) _mm256_alignr_ps_emul<(n)>((x), (y))
[/cpp]


There is no instruction that can shift 32-byte operands. If your data are stored in an array, then the most efficient solution is to do an unaligned read. Unaligned reads are fast on all processors that have AVX, and definitely faster than a permute.


Hi,

In my use case, the data are already in registers.

I replaced the emulation with an unaligned load and got a 10% penalty over the whole algorithm, which is a bad penalty.

To be fair, the algorithm I'm benchmarking is no faster in AVX than in SSE, so the penalty from the emulated alignr is too high (and the one from the unaligned load is even worse).

Best regards


To clarify: unaligned reads are NOT always fast on Intel. Intel has a penalty for misaligned loads (the FP load latency goes from 8 clks to 13 clks in 256-bit movups scenarios; 128-bit loads take 7 clks), and if your load spans a cacheline boundary it's a much stiffer penalty (load latency goes from 8 to 33 clks).

You don't see this in non-vector code because your alignment is natural: you do 4-byte or 8-byte loads on data that is usually 4 or 8 bytes in size and aligned to at least 8 bytes, so no issue is observed. When you vectorize code on, say, 64-bit data, you're accessing addresses that are 8-byte aligned, but your loads may span a cacheline. With 128-bit loads, if you're 8-byte but not 16-byte aligned, on average 1 access in 4 will span a cacheline. With 256-bit loads that may be 1 in 2.

So alignment is important; I just wanted to elucidate that. When you can't enforce it, you will pay some penalty. Intel likely buffers all this quite well with their large number of load-queue entries, but if your code is sensitive to the load latency you may still pay. Last point: when misaligned and spanning a 64-byte boundary, your throughput also drops from 2 loads per clock to 1, but the real hammer in performance is the latency, not the throughput.

perfwise
