Diego_Caballero

Beginner

02-08-2014
03:13 AM

Different ways to turn an AoS into an SoA

Hi,

I'm trying to implement a permutation that turns an AoS (where the structure has 4 floats) into an SoA, using SSE, AVX, AVX2 and KNC, without using gather operations, to find out whether it is worth it.

For example, using KNC, I would like to use 4 zmm registers:

{A0, A1, ... A15}

{B0, B1, ... B15}

{C0, C1, ... C15}

{D0, D1, ... D15}

to end up having something like:

{A0, A4, A8, A12, B0, B4, B8, B12, C0, C4, C8, C12, D0, D4, D8, D12}

{A1, A5, A9, ...}

{A2, A6, A10, ...}

{A3, A7, A11, ...}
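
To pin down the target layout before vectorizing, the mapping above can be written as a scalar reference. This is only a sketch: the function name `aos4_to_soa_ref` is hypothetical, and it assumes the four registers are loaded from 64 consecutive floats, so that A0..A15 are elements 0..15, B0..B15 are elements 16..31, and so on.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for the requested permutation (hypothetical helper).
 * in  : n_structs structures of 4 floats each, stored back to back (AoS)
 * out : 4 rows of n_structs floats each, stored back to back (SoA)
 * Row k of the output receives field k of every structure. */
void aos4_to_soa_ref(const float *in, float *out, size_t n_structs)
{
    assert(in != NULL && out != NULL && in != out);
    for (size_t j = 0; j < n_structs; ++j)      /* structure index */
        for (size_t k = 0; k < 4; ++k)          /* field index     */
            out[k * n_structs + j] = in[4 * j + k];
}
```

With `n_structs = 16`, the first output row holds input elements 0, 4, 8, ..., 60, which corresponds to the first register in the list above. A reference like this can be useful for checking an intrinsic version element by element.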

Since the permutation instructions differ significantly between architectures and I would rather not reinvent the wheel, I would be glad if someone could point me to information about this or share their knowledge.

Thank you in advance.

6 Replies

TimP

Black Belt

02-08-2014
03:54 AM

You're probably aware of the IDZ posted article

http://software.intel.com/en-us/articles/memory-layout-transformations

for pretty slides:

for combining tiling with SoA:

http://impact.crhc.illinois.edu/shared/papers/dl_inpar2012_ack.pdf

Diego_Caballero

Beginner

02-08-2014
04:19 AM

Thank you Tim, very useful.

However, I'm looking for something more specific to implementing these data layout transformations with SSE/AVX/KNC intrinsics.

Christopher_H_

Beginner

02-12-2014
02:52 AM

I have not written any KNC code so far, but the principle is the same, and there are equivalent instructions for doing this. The code below performs a 4x4 transpose: it turns AAAA,BBBB,CCCC,DDDD into ABCD,ABCD,ABCD,ABCD and vice versa.

SSE
// A, B, C, D point to the four input rows of 4 floats each.
__m128 a[4],tmp[4];
a[0] = _mm_loadu_ps(A);
a[1] = _mm_loadu_ps(B);
a[2] = _mm_loadu_ps(C);
a[3] = _mm_loadu_ps(D);
tmp[0] = _mm_unpacklo_ps(a[0], a[1]);   // A0 B0 A1 B1
tmp[2] = _mm_unpacklo_ps(a[2], a[3]);   // C0 D0 C1 D1
tmp[1] = _mm_unpackhi_ps(a[0], a[1]);   // A2 B2 A3 B3
tmp[3] = _mm_unpackhi_ps(a[2], a[3]);   // C2 D2 C3 D3
a[0] = _mm_shuffle_ps(tmp[0], tmp[2], _MM_SHUFFLE(1,0,1,0) );   // A0 B0 C0 D0
a[1] = _mm_shuffle_ps(tmp[0], tmp[2], _MM_SHUFFLE(3,2,3,2) );   // A1 B1 C1 D1
a[2] = _mm_shuffle_ps(tmp[1], tmp[3], _MM_SHUFFLE(1,0,1,0) );   // A2 B2 C2 D2
a[3] = _mm_shuffle_ps(tmp[1], tmp[3], _MM_SHUFFLE(3,2,3,2) );   // A3 B3 C3 D3
AVX
// Same sequence on 256-bit registers (A, B, C, D point to 8 floats each).
// unpack and shuffle operate on each 128-bit lane independently, so this
// transposes two separate 4x4 blocks.
__m256 a[4],tmp[4];
a[0] = _mm256_loadu_ps(A);
a[1] = _mm256_loadu_ps(B);
a[2] = _mm256_loadu_ps(C);
a[3] = _mm256_loadu_ps(D);
tmp[0] = _mm256_unpacklo_ps(a[0], a[1]);
tmp[2] = _mm256_unpacklo_ps(a[2], a[3]);
tmp[1] = _mm256_unpackhi_ps(a[0], a[1]);
tmp[3] = _mm256_unpackhi_ps(a[2], a[3]);
a[0] = _mm256_shuffle_ps(tmp[0], tmp[2], _MM_SHUFFLE(1,0,1,0) );
a[1] = _mm256_shuffle_ps(tmp[0], tmp[2], _MM_SHUFFLE(3,2,3,2) );
a[2] = _mm256_shuffle_ps(tmp[1], tmp[3], _MM_SHUFFLE(1,0,1,0) );
a[3] = _mm256_shuffle_ps(tmp[1], tmp[3], _MM_SHUFFLE(3,2,3,2) );
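
As a sketch of how the SSE sequence above could be applied to an actual array of structures: the `Vec4` type and the function name `aos_to_soa_sse` are assumptions made for illustration and are not part of the original post. Each iteration loads four structures, transposes them with the same unpack/shuffle pattern, and stores four x, y, z and w values.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical element type: a structure of 4 floats, as in the question. */
typedef struct { float x, y, z, w; } Vec4;

/* Splits n structures into four separate arrays of n floats each.
 * Assumption for this sketch: n is a multiple of 4 (no remainder loop). */
void aos_to_soa_sse(const Vec4 *in, float *x, float *y, float *z, float *w,
                    size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* Each register holds one whole structure: x y z w. */
        __m128 s0 = _mm_loadu_ps((const float *)&in[i + 0]);
        __m128 s1 = _mm_loadu_ps((const float *)&in[i + 1]);
        __m128 s2 = _mm_loadu_ps((const float *)&in[i + 2]);
        __m128 s3 = _mm_loadu_ps((const float *)&in[i + 3]);

        __m128 t0 = _mm_unpacklo_ps(s0, s1);   /* x0 x1 y0 y1 */
        __m128 t1 = _mm_unpackhi_ps(s0, s1);   /* z0 z1 w0 w1 */
        __m128 t2 = _mm_unpacklo_ps(s2, s3);   /* x2 x3 y2 y3 */
        __m128 t3 = _mm_unpackhi_ps(s2, s3);   /* z2 z3 w2 w3 */

        /* Combine 64-bit halves: low halves give x and z, high halves y and w. */
        _mm_storeu_ps(x + i, _mm_shuffle_ps(t0, t2, _MM_SHUFFLE(1, 0, 1, 0)));
        _mm_storeu_ps(y + i, _mm_shuffle_ps(t0, t2, _MM_SHUFFLE(3, 2, 3, 2)));
        _mm_storeu_ps(z + i, _mm_shuffle_ps(t1, t3, _MM_SHUFFLE(1, 0, 1, 0)));
        _mm_storeu_ps(w + i, _mm_shuffle_ps(t1, t3, _MM_SHUFFLE(3, 2, 3, 2)));
    }
}
```

Unaligned loads and stores are used so the sketch makes no alignment assumptions; with 16-byte aligned buffers the aligned variants could be substituted.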

Diego_Caballero

Beginner

02-12-2014
10:22 AM

Thank you. That's exactly what I was looking for.

Vladimir_Sedach

New Contributor I

02-16-2014
10:02 AM

Diego:

Christopher's AVX solution is not exactly what you're looking for :)

It produces a0, b0, ... instead of a0, a4, b0, ... The following sequence gives the order you asked for:

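__m256 a, b, c, d;      // inputs, loaded from memory (8 floats each)
__m256 r0, r1, r2, r3;

r0 = _mm256_unpacklo_ps(a, b);              // a0 b0 a1 b1 | a4 b4 a5 b5
r1 = _mm256_unpacklo_ps(c, d);              // c0 d0 c1 d1 | c4 d4 c5 d5
r2 = _mm256_permute2f128_ps(r0, r1, 0x20);  // a0 b0 a1 b1 | c0 d0 c1 d1
r3 = _mm256_permute2f128_ps(r0, r1, 0x31);  // a4 b4 a5 b5 | c4 d4 c5 d5
r0 = _mm256_unpackhi_ps(a, b);              // a2 b2 a3 b3 | a6 b6 a7 b7
r1 = _mm256_unpackhi_ps(c, d);              // c2 d2 c3 d3 | c6 d6 c7 d7
a = _mm256_unpacklo_ps(r2, r3);             // a0 a4 b0 b4 | c0 c4 d0 d4
b = _mm256_unpackhi_ps(r2, r3);             // a1 a5 b1 b5 | c1 c5 d1 d5
r2 = _mm256_permute2f128_ps(r0, r1, 0x20);  // a2 b2 a3 b3 | c2 d2 c3 d3
r3 = _mm256_permute2f128_ps(r0, r1, 0x31);  // a6 b6 a7 b7 | c6 d6 c7 d7
c = _mm256_unpacklo_ps(r2, r3);             // a2 a6 b2 b6 | c2 c6 d2 d6
d = _mm256_unpackhi_ps(r2, r3);             // a3 a7 b3 b7 | c3 c7 d3 d7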

This function could be named deinterleave() (https://www.google.com/search?q=deinterleave).

With a, b, c, d loaded from memory, it is surprisingly much faster (~3.5x) than the corresponding four AVX2 gather() calls with index = {0, 4, 8, 12, 16, 20, 24, 28}.
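
For comparison, the gather-based baseline mentioned above might look roughly like this. This is a sketch, not the code that was benchmarked: the function name `aos_to_soa_gather_avx2` and the contiguous output layout are assumptions, and the `target` attribute is a GCC/Clang extension used here so the block builds without extra compiler flags.

```c
#include <immintrin.h>  /* AVX2 intrinsics */

/* Hypothetical gather-based baseline.
 * in  : 8 structures of 4 floats (32 floats, AoS)
 * out : 8 x values, then 8 y, 8 z, 8 w (32 floats, SoA)
 * The target attribute is GCC/Clang specific; with other compilers, remove it
 * and enable AVX2 through the compiler options instead. */
__attribute__((target("avx2")))
void aos_to_soa_gather_avx2(const float *in, float *out)
{
    /* Float offsets of one field across 8 consecutive structures. */
    const __m256i index = _mm256_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28);

    /* Scale 4 = sizeof(float), because the indices count floats, not bytes.
     * Shifting the base pointer by k selects field k of every structure. */
    _mm256_storeu_ps(out +  0, _mm256_i32gather_ps(in + 0, index, 4));
    _mm256_storeu_ps(out +  8, _mm256_i32gather_ps(in + 1, index, 4));
    _mm256_storeu_ps(out + 16, _mm256_i32gather_ps(in + 2, index, 4));
    _mm256_storeu_ps(out + 24, _mm256_i32gather_ps(in + 3, index, 4));
}
```

The function must only be called on a CPU that supports AVX2. Timing it against the unpack/permute sequence on the target machine is the only reliable way to check the reported speed difference, since gather performance varies between microarchitectures.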

Diego_Caballero

Beginner

02-21-2014
02:09 AM

Sorry I've been offline for a while.

Yes, both approaches are interesting for my research.

Thank you!
