Hi,

I'm trying to implement a permutation that turns an AoS (where the structure has 4 floats) into an SoA, using SSE, AVX, AVX2 and KNC, and without using gather operations, to find out whether it is worth it.

For example, using KNC, I would like to use 4 zmm registers:

{A0, A1, ... A15}

{B0, B1, ... B15}

{C0, C1, ... C15}

{D0, D1, ... D15}

to end up having something like:

{A0, A4, A8, A12, B0, B4, B8, B12, C0, C4, C8, C12, D0, D4, D8, D12}

{A1, A5, A9, ...}

{A2, A6, A10, ...}

{A3, A7, A11, ...}
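The requested lane mapping can be written down as a scalar reference, which is handy for checking any SIMD version against. This is only a sketch: the function name `aos_to_soa_ref` and the flat 4 x 16 array layout (rows A, B, C, D stored back to back) are assumptions made for illustration, not part of any library.

```c
#include <stddef.h>

enum { REGS = 4, LANES = 16 };  /* four zmm-sized rows of 16 floats each */

/* Scalar model of the permutation described above (hypothetical helper).
 * in  holds rows A, B, C, D back to back: in[r * LANES + i] is lane i of row r.
 * out row j becomes { A[j], A[j+4], A[j+8], A[j+12], B[j], B[j+4], ... }. */
static void aos_to_soa_ref(const float *in, float *out)
{
    for (size_t j = 0; j < REGS; ++j)                  /* field index within a struct */
        for (size_t r = 0; r < REGS; ++r)              /* source row (A, B, C, D)     */
            for (size_t k = 0; k < LANES / 4; ++k)     /* struct index within the row */
                out[j * LANES + 4 * r + k] = in[r * LANES + 4 * k + j];
}
```

Filling `in` with the values 0..63 and calling the function should give an `out` whose first row starts with 0, 4, 8, 12, 16, matching {A0, A4, A8, A12, B0, ...}.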

Since the permutation instructions differ significantly across architectures and I would rather not reinvent the wheel, I would be glad if someone could point me to information about this, or share their knowledge.

Thank you in advance.

- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing

You're probably aware of the article posted on IDZ (Intel Developer Zone):

http://software.intel.com/en-us/articles/memory-layout-transformations

for pretty slides:

for combining tiling with SoA:

http://impact.crhc.illinois.edu/shared/papers/dl_inpar2012_ack.pdf

Thank you Tim, very useful.

However, I'm looking for something more specific to these data layout transformations using SSE/AVX/KNC intrinsics.

Thank you. That's exactly what I was looking for.

Diego:

Christopher's AVX solution is not exactly what you're looking for :)

It produces a0, b0, ... instead of a0, a4, b0, ... The following produces the latter:

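__m256 a, b, c, d;      // inputs: 8 consecutive AoS floats each; overwritten with the results
__m256 r0, r1, r2, r3;  // temporaries

// Interleave the low pairs of each 128-bit lane: r0 = {a0, b0, a1, b1 | a4, b4, a5, b5}
r0 = _mm256_unpacklo_ps(a, b);
r1 = _mm256_unpacklo_ps(c, d);
// Regroup the 128-bit lanes: r2 = {a0, b0, a1, b1, c0, d0, c1, d1}, r3 = {a4, b4, a5, b5, c4, d4, c5, d5}
r2 = _mm256_permute2f128_ps(r0, r1, 0x20);
r3 = _mm256_permute2f128_ps(r0, r1, 0x31);
// Same for the high pairs; a, b, c, d must be read here, before they are overwritten
r0 = _mm256_unpackhi_ps(a, b);
r1 = _mm256_unpackhi_ps(c, d);
a = _mm256_unpacklo_ps(r2, r3);  // {a0, a4, b0, b4, c0, c4, d0, d4}
b = _mm256_unpackhi_ps(r2, r3);  // {a1, a5, b1, b5, c1, c5, d1, d5}
r2 = _mm256_permute2f128_ps(r0, r1, 0x20);
r3 = _mm256_permute2f128_ps(r0, r1, 0x31);
c = _mm256_unpacklo_ps(r2, r3);  // {a2, a6, b2, b6, c2, c6, d2, d6}
d = _mm256_unpackhi_ps(r2, r3);  // {a3, a7, b3, b7, c3, c7, d3, d7}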

This function could be named deinterleave() (https://www.google.com/search?q=deinterleave).

With a, b, c, d loaded from memory, it is surprisingly much faster (~3.5x) than the corresponding four AVX2 gather() calls with index = {0, 4, 8, 12, 16, 20, 24, 28}.
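For the SSE case mentioned in the original question, the same deinterleave reduces to a 4x4 transpose, since each 128-bit register holds exactly one 4-float structure. Below is a minimal sketch using the `_MM_TRANSPOSE4_PS` macro from `<xmmintrin.h>`; the helper name `aos4_to_soa_sse` and its pointer parameters are assumptions for illustration, and no performance claim is made for it.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_storeu_ps, _MM_TRANSPOSE4_PS */

/* Hypothetical helper: aos points to 4 structs of 4 floats {x, y, z, w}
 * (16 floats in total); xs, ys, zs, ws each receive 4 floats. */
static void aos4_to_soa_sse(const float *aos,
                            float *xs, float *ys, float *zs, float *ws)
{
    /* One struct per register: r0 = {x0, y0, z0, w0}, r1 = {x1, y1, z1, w1}, ... */
    __m128 r0 = _mm_loadu_ps(aos + 0);
    __m128 r1 = _mm_loadu_ps(aos + 4);
    __m128 r2 = _mm_loadu_ps(aos + 8);
    __m128 r3 = _mm_loadu_ps(aos + 12);

    /* In-place 4x4 transpose: afterwards r0 = {x0, x1, x2, x3}, r1 = {y0, ...}, ... */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    /* Unaligned stores, so the caller does not need 16-byte aligned buffers. */
    _mm_storeu_ps(xs, r0);
    _mm_storeu_ps(ys, r1);
    _mm_storeu_ps(zs, r2);
    _mm_storeu_ps(ws, r3);
}
```

The unaligned load/store variants are used here only to keep the sketch free of alignment requirements; with 16-byte aligned data, `_mm_load_ps` and `_mm_store_ps` could be used instead.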

Sorry I've been offline for a while.

Yes, both approaches are interesting for my research.

Thank you!
