Hi,
I'm trying to implement a permutation that turns an AoS (where the structure has 4 floats) into an SoA, using SSE, AVX, AVX2 and KNC, and without using gather operations, to find out whether it is worth it.
For example, using KNC, I would like to use 4 zmm registers:
{A0, A1, ... A15}
{B0, B1, ... B15}
{C0, C1, ... C15}
{D0, D1, ... D15}
to end up having something like:
{A0, A4, A8, A12, B0, B4, B8, B12, C0, C4, C8, C12, D0, D4, D8, D12}
{A1, A5, A9, ...}
{A2, A6, A10, ...}
{A3, A7, A11, ...}
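The target ordering can be pinned down with a plain scalar loop, which is also handy as a reference when checking any intrinsic version. This is only a sketch under assumptions: the function name `aos_to_soa_block` and the parameter `w` (floats per register: 4 for SSE, 8 for AVX, 16 for KNC) are hypothetical, and the four registers are assumed to be loaded from 4*w consecutive floats.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for the permutation described above (hypothetical helper).
 * in:  4*w consecutive floats, i.e. four "registers" of w lanes each
 * out: 4*w floats; output register k holds in[k], in[k+4], in[k+8], ...
 * w is 4 for SSE, 8 for AVX, 16 for KNC. */
void aos_to_soa_block(const float *in, float *out, size_t w)
{
    assert(in != out);                    /* not an in-place transform */
    for (size_t k = 0; k < 4; ++k)        /* field index within the struct */
        for (size_t j = 0; j < w; ++j)    /* lane within output register k */
            out[k * w + j] = in[4 * j + k];
}
```

With w = 16, the first output register is {A0, A4, A8, A12, B0, B4, ...} and the second is {A1, A5, A9, ...}, matching the layout listed above.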
Since the permutation instructions differ significantly across architectures and I would rather not reinvent the wheel, I would be glad if someone could point me to information about this or share their knowledge.
Thank you in advance.
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
You're probably aware of this article posted on IDZ:
http://software.intel.com/en-us/articles/memory-layout-transformations
for pretty slides:
for combining tiling with SoA:
http://impact.crhc.illinois.edu/shared/papers/dl_inpar2012_ack.pdf
Thank you Tim, very useful.
However, I'm looking for something more specific to these data layout transformations using SSE/AVX/KNC intrinsics.
Thank you. That's exactly what I was looking for.
Diego:
Christopher's AVX solution is not exactly what you're looking for :)
It produces a0, b0, ... instead of a0, a4, b0, ...
The sequence below produces the a0, a4, b0, ... ordering:
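// a, b, c, d: four consecutive 8-float loads (a0..a7, b0..b7, c0..c7, d0..d7)
__m256 a, b, c, d;
__m256 r0, r1, r2, r3;
r0 = _mm256_unpacklo_ps(a, b);               // a0 b0 a1 b1 | a4 b4 a5 b5
r1 = _mm256_unpacklo_ps(c, d);               // c0 d0 c1 d1 | c4 d4 c5 d5
r2 = _mm256_permute2f128_ps(r0, r1, 0x20);   // a0 b0 a1 b1 | c0 d0 c1 d1
r3 = _mm256_permute2f128_ps(r0, r1, 0x31);   // a4 b4 a5 b5 | c4 d4 c5 d5
r0 = _mm256_unpackhi_ps(a, b);               // a2 b2 a3 b3 | a6 b6 a7 b7
r1 = _mm256_unpackhi_ps(c, d);               // c2 d2 c3 d3 | c6 d6 c7 d7
a = _mm256_unpacklo_ps(r2, r3);              // a0 a4 b0 b4 | c0 c4 d0 d4
b = _mm256_unpackhi_ps(r2, r3);              // a1 a5 b1 b5 | c1 c5 d1 d5
r2 = _mm256_permute2f128_ps(r0, r1, 0x20);   // a2 b2 a3 b3 | c2 d2 c3 d3
r3 = _mm256_permute2f128_ps(r0, r1, 0x31);   // a6 b6 a7 b7 | c6 d6 c7 d7
c = _mm256_unpacklo_ps(r2, r3);              // a2 a6 b2 b6 | c2 c6 d2 d6
d = _mm256_unpackhi_ps(r2, r3);              // a3 a7 b3 b7 | c3 c7 d3 d7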
This function could be named deinterleave() (https://www.google.com/search?q=deinterleave).
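To try the sequence on an actual buffer, it can be wrapped with unaligned loads and stores. The following is a minimal sketch, not a tuned implementation. It assumes GCC or Clang on x86-64 (the `target("avx")` attribute and `__builtin_cpu_supports` are compiler-specific), and the helper names `deinterleave_avx` and `deinterleave_ref` are hypothetical. A scalar fallback is included so the wrapper still runs on a CPU without AVX.

```c
#include <assert.h>
#include <immintrin.h>

/* in:  32 consecutive floats = 8 structs of 4 floats
 * out: 32 floats, with out[8*k + j] = in[4*j + k] */
__attribute__((target("avx")))
static void deinterleave_avx(const float *in, float *out)
{
    __m256 a = _mm256_loadu_ps(in);
    __m256 b = _mm256_loadu_ps(in + 8);
    __m256 c = _mm256_loadu_ps(in + 16);
    __m256 d = _mm256_loadu_ps(in + 24);
    __m256 r0, r1, r2, r3;

    r0 = _mm256_unpacklo_ps(a, b);
    r1 = _mm256_unpacklo_ps(c, d);
    r2 = _mm256_permute2f128_ps(r0, r1, 0x20);
    r3 = _mm256_permute2f128_ps(r0, r1, 0x31);
    r0 = _mm256_unpackhi_ps(a, b);
    r1 = _mm256_unpackhi_ps(c, d);
    a = _mm256_unpacklo_ps(r2, r3);
    b = _mm256_unpackhi_ps(r2, r3);
    r2 = _mm256_permute2f128_ps(r0, r1, 0x20);
    r3 = _mm256_permute2f128_ps(r0, r1, 0x31);
    c = _mm256_unpacklo_ps(r2, r3);
    d = _mm256_unpackhi_ps(r2, r3);

    _mm256_storeu_ps(out, a);        /* field 0 of the 8 structs */
    _mm256_storeu_ps(out + 8, b);    /* field 1 */
    _mm256_storeu_ps(out + 16, c);   /* field 2 */
    _mm256_storeu_ps(out + 24, d);   /* field 3 */
}

/* Scalar fallback producing the same layout. */
static void deinterleave_ref(const float *in, float *out)
{
    for (int k = 0; k < 4; ++k)
        for (int j = 0; j < 8; ++j)
            out[8 * k + j] = in[4 * j + k];
}

void deinterleave(const float *in, float *out)
{
    assert(in != 0 && out != 0);
    if (__builtin_cpu_supports("avx"))
        deinterleave_avx(in, out);
    else
        deinterleave_ref(in, out);
}
```

Calling `deinterleave(in, out)` on 32 floats holding 0..31 should leave 0, 4, 8, ..., 28 in the first eight outputs, the same elements a gather with the index vector mentioned below would pick.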
With a, b, c, d loaded from memory, it is surprisingly much faster (~3.5x) than the corresponding four AVX2 gather() calls with index = {0, 4, 8, 12, 16, 20, 24, 28}.
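For comparison, the gather-based variant referred to here might look like the sketch below: four `_mm256_i32gather_ps` calls that share one index vector, each starting one float further into the buffer. It assumes GCC or Clang on x86-64 (`target("avx2")` and `__builtin_cpu_supports` are compiler-specific), and the names `gather4_avx2`, `gather4_ref` and `gather4` are hypothetical. It only illustrates the access pattern and makes no timing claim of its own.

```c
#include <assert.h>
#include <immintrin.h>

/* in:  32 consecutive floats = 8 structs of 4 floats
 * out: 32 floats, with out[8*k + j] = in[4*j + k] */
__attribute__((target("avx2")))
static void gather4_avx2(const float *in, float *out)
{
    /* Element offsets of one field across the 8 structs. */
    const __m256i index = _mm256_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28);
    for (int k = 0; k < 4; ++k) {
        /* scale = 4 bytes per float; the base moves by one float per field */
        __m256 row = _mm256_i32gather_ps(in + k, index, 4);
        _mm256_storeu_ps(out + 8 * k, row);
    }
}

/* Scalar fallback producing the same layout. */
static void gather4_ref(const float *in, float *out)
{
    for (int k = 0; k < 4; ++k)
        for (int j = 0; j < 8; ++j)
            out[8 * k + j] = in[4 * j + k];
}

void gather4(const float *in, float *out)
{
    assert(in != 0 && out != 0);
    if (__builtin_cpu_supports("avx2"))
        gather4_avx2(in, out);
    else
        gather4_ref(in, out);
}
```

Both variants read the same 32 floats, so timing them on the same buffer gives a like-for-like comparison on a given machine.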
Sorry I've been offline for a while.
Yes, both approaches are interesting for my research.
Thank you!