Intel® ISA Extensions

Different ways to turn an AoS into an SoA

Diego_Caballero
Beginner

Hi,

I'm trying to implement a permutation that turns an AoS (where the structure has 4 floats) into an SoA, using SSE, AVX, AVX2 and KNC, without using gather operations, to find out whether it's worth it.

For example, using KNC, I would like to use 4 zmm registers:

{A0, A1, ... A15}

{B0, B1, ... B15}

{C0, C1, ... C15}

{D0, D1, ... D15}

to end up having something like:

{A0, A4, A8, A12, B0, B4, B8, B12, C0, C4, C8, C12, D0, D4, D8, D12}

{A1, A5, A9, ...}

{A2, A6, A10, ...}

{A3, A7, A11, ...}
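
In plain C, the whole transformation is equivalent to this scalar reference (a sketch; the struct and the field names f0..f3 are just for illustration):

    #include <stddef.h>

    struct Elem { float f0, f1, f2, f3; };    /* one AoS element: 4 floats */

    /* Scalar reference: n AoS elements in, 4 SoA arrays out. Each of the
       four output registers above would then hold 16 consecutive values
       of one of these arrays. */
    void aos_to_soa_ref(const struct Elem *in, size_t n,
                        float *o0, float *o1, float *o2, float *o3)
    {
        for (size_t i = 0; i < n; i++) {
            o0[i] = in[i].f0;
            o1[i] = in[i].f1;
            o2[i] = in[i].f2;
            o3[i] = in[i].f3;
        }
    }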

Since the permutation instructions differ significantly among architectures and I don't want to reinvent the wheel, I would be glad if someone could point me to information about this, or share their knowledge.

Thank you in advance.

Diego_Caballero
Beginner

Thank you Tim, very useful.

However, I'm looking for something more closely related to these data layout transformations using SSE/AVX/KNC intrinsics.

Christopher_H_
Beginner
I have not written any KNC code so far, but the principle is the same, and there are equivalent instructions for doing this. The code below swaps AAAA,BBBB,CCCC,DDDD to ABCD,ABCD,ABCD,ABCD and vice versa.

SSE:

    __m128 a[4], tmp[4];

    a[0] = _mm_loadu_ps(A);
    a[1] = _mm_loadu_ps(B);
    a[2] = _mm_loadu_ps(C);
    a[3] = _mm_loadu_ps(D);
    tmp[0] = _mm_unpacklo_ps(a[0], a[1]);    /* A0 B0 A1 B1 */
    tmp[2] = _mm_unpacklo_ps(a[2], a[3]);    /* C0 D0 C1 D1 */
    tmp[1] = _mm_unpackhi_ps(a[0], a[1]);    /* A2 B2 A3 B3 */
    tmp[3] = _mm_unpackhi_ps(a[2], a[3]);    /* C2 D2 C3 D3 */
    a[0] = _mm_shuffle_ps(tmp[0], tmp[2], _MM_SHUFFLE(1,0,1,0));    /* A0 B0 C0 D0 */
    a[1] = _mm_shuffle_ps(tmp[0], tmp[2], _MM_SHUFFLE(3,2,3,2));    /* A1 B1 C1 D1 */
    a[2] = _mm_shuffle_ps(tmp[1], tmp[3], _MM_SHUFFLE(1,0,1,0));    /* A2 B2 C2 D2 */
    a[3] = _mm_shuffle_ps(tmp[1], tmp[3], _MM_SHUFFLE(3,2,3,2));    /* A3 B3 C3 D3 */

AVX (note that the unpack/shuffle operations work within each 128-bit lane):

    __m256 a[4], tmp[4];

    a[0] = _mm256_loadu_ps(A);
    a[1] = _mm256_loadu_ps(B);
    a[2] = _mm256_loadu_ps(C);
    a[3] = _mm256_loadu_ps(D);
    tmp[0] = _mm256_unpacklo_ps(a[0], a[1]);
    tmp[2] = _mm256_unpacklo_ps(a[2], a[3]);
    tmp[1] = _mm256_unpackhi_ps(a[0], a[1]);
    tmp[3] = _mm256_unpackhi_ps(a[2], a[3]);
    a[0] = _mm256_shuffle_ps(tmp[0], tmp[2], _MM_SHUFFLE(1,0,1,0));
    a[1] = _mm256_shuffle_ps(tmp[0], tmp[2], _MM_SHUFFLE(3,2,3,2));
    a[2] = _mm256_shuffle_ps(tmp[1], tmp[3], _MM_SHUFFLE(1,0,1,0));
    a[3] = _mm256_shuffle_ps(tmp[1], tmp[3], _MM_SHUFFLE(3,2,3,2));
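
For reference, the SSE sequence is the same unpack/shuffle pattern the stock _MM_TRANSPOSE4_PS macro from xmmintrin.h expands to. A minimal sketch using the macro instead (A..D are the same float pointers as above):

    #include <xmmintrin.h>

    /* 4x4 transpose via the standard SSE macro; it rewrites its four
       arguments in place. */
    __m128 row0 = _mm_loadu_ps(A);
    __m128 row1 = _mm_loadu_ps(B);
    __m128 row2 = _mm_loadu_ps(C);
    __m128 row3 = _mm_loadu_ps(D);
    _MM_TRANSPOSE4_PS(row0, row1, row2, row3);    /* row0 = A0 B0 C0 D0, ... */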
Diego_Caballero
Beginner

Thank you. That's exactly what I was looking for.

Vladimir_Sedach
New Contributor I

Diego:

Christopher's AVX solution is not exactly what you're looking for :)
It produces a0, b0, ... instead of a0, a4, b0, ...

    __m256    a, b, c, d;    /* inputs: a = {a0 .. a7}, b = {b0 .. b7}, c, d likewise */
    __m256    r0, r1, r2, r3;

    /* "|" marks the 128-bit lane boundary. */
    r0 = _mm256_unpacklo_ps(a, b);                /* a0 b0 a1 b1 | a4 b4 a5 b5 */
    r1 = _mm256_unpacklo_ps(c, d);                /* c0 d0 c1 d1 | c4 d4 c5 d5 */
    r2 = _mm256_permute2f128_ps(r0, r1, 0x20);    /* a0 b0 a1 b1 | c0 d0 c1 d1 */
    r3 = _mm256_permute2f128_ps(r0, r1, 0x31);    /* a4 b4 a5 b5 | c4 d4 c5 d5 */
    r0 = _mm256_unpackhi_ps(a, b);                /* a2 b2 a3 b3 | a6 b6 a7 b7 */
    r1 = _mm256_unpackhi_ps(c, d);                /* c2 d2 c3 d3 | c6 d6 c7 d7 */
    a = _mm256_unpacklo_ps(r2, r3);               /* a0 a4 b0 b4 c0 c4 d0 d4 */
    b = _mm256_unpackhi_ps(r2, r3);               /* a1 a5 b1 b5 c1 c5 d1 d5 */
    r2 = _mm256_permute2f128_ps(r0, r1, 0x20);    /* a2 b2 a3 b3 | c2 d2 c3 d3 */
    r3 = _mm256_permute2f128_ps(r0, r1, 0x31);    /* a6 b6 a7 b7 | c6 d6 c7 d7 */
    c = _mm256_unpacklo_ps(r2, r3);               /* a2 a6 b2 b6 c2 c6 d2 d6 */
    d = _mm256_unpackhi_ps(r2, r3);               /* a3 a7 b3 b7 c3 c7 d3 d7 */

This operation could be named deinterleave() (https://www.google.com/search?q=deinterleave).
With a, b, c, d loaded from memory, it is surprisingly fast: about 3.5x faster than the corresponding four AVX2 gather() calls with index = {0, 4, 8, 12, 16, 20, 24, 28}.
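
For comparison, the gather version being measured against would look roughly like this (a sketch; src is a name I'm introducing here for the interleaved input pointer, so that a..d above come from src, src+8, src+16, src+24):

    #include <immintrin.h>

    /* AVX2 gather alternative: pull every 4th float starting at each of
       the four field offsets 0..3. Scale 4 = sizeof(float). These four
       gathers produce the same output registers as the shuffle sequence. */
    __m256i idx = _mm256_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28);
    __m256  f0  = _mm256_i32gather_ps(src + 0, idx, 4);
    __m256  f1  = _mm256_i32gather_ps(src + 1, idx, 4);
    __m256  f2  = _mm256_i32gather_ps(src + 2, idx, 4);
    __m256  f3  = _mm256_i32gather_ps(src + 3, idx, 4);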
 

Diego_Caballero
Beginner

Sorry, I've been offline for a while.

Yes, both approaches are interesting for my research.

Thank you!
