What are the values of the other elements? Make sure they won't cause overflow/underflow or similar conditions, as those may trigger interrupts. I'm not sure whether that still applies to Sandy Bridge, since it is supposed to handle these cases in hardware instead of microcode, but there may be exceptions.
Also, have you taken a look at the actual assembly to check for unexpected instructions?
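For example, here is a minimal sketch (a hypothetical helper, not from the original code) that scans the input buffer for values likely to trigger floating-point assists: denormals from underflow, and infinities/NaNs from overflow or invalid operations.

#include <math.h>
#include <stddef.h>
#include <stdio.h>

static void check_values(const float *x, size_t n)
{
    size_t denormals = 0, nonfinite = 0;
    for (size_t i = 0; i < n; ++i) {
        int c = fpclassify(x[i]);
        if (c == FP_SUBNORMAL)
            ++denormals;                    /* underflowed (denormal) values */
        else if (c == FP_INFINITE || c == FP_NAN)
            ++nonfinite;                    /* overflow / invalid results */
    }
    printf("%zu denormals, %zu inf/NaN out of %zu elements\n",
           denormals, nonfinite, n);
}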
One thing I could not understand: the SSE code uses an unaligned load, but the AVX code uses an aligned load, even though the data looks aligned in both cases. For the AVX code I would suggest using 128-bit loads and then vperm2f128 to put the data from the second 128-bit load into the upper lane (see the code below). If you can guarantee that there are no page faults or cache-line splits, then the code above is fine; otherwise, 256-bit loads are more prone to them.
This kind of code has a big load at the beginning, and that load does not change between SSE and AVX. AVX only gives an advantage by processing twice as many elements per instruction (8) compared to SSE (4), so you should expect a gain only from those instructions. It won't be 2x, since the code is memory-bound.
__m256 tmp0 = _mm256_castps128_ps256(_mm_load_ps(src));      /* src = address of the first element; loads the first 4 */
__m256 tmp1 = _mm256_castps128_ps256(_mm_load_ps(src + 4));  /* reading the next 4 elements */
dest = _mm256_insertf128_ps(tmp0, _mm256_castps256_ps128(tmp1), 0x01);
You can do the same thing with vperm2f128 (intrinsic _mm256_permute2f128_ps):
dest = _mm256_permute2f128_ps(tmp0, _mm256_castps128_ps256(_mm_load_ps(src + 4)), 0x20); /* 0x20: lower lane of tmp0 stays in the lower 128 bits, lower lane of the second load goes to the upper 128 bits */
If your code gets faster with this approach, then you are being hit by page faults or cache-line splits.
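For reference, here is a self-contained sketch of the two alternatives wrapped in functions so they can be timed against each other (the helper names are hypothetical; the pointer is assumed 16-byte aligned for the split version and 32-byte aligned for the full version).

#include <immintrin.h>

/* explicit split load: two 128-bit loads plus an insert */
static __m256 load8_split(const float *src)
{
    __m128 lo = _mm_load_ps(src);        /* first 4 elements */
    __m128 hi = _mm_load_ps(src + 4);    /* next 4 elements  */
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 0x01);
}

/* single aligned 256-bit load */
static __m256 load8_full(const float *src)
{
    return _mm256_load_ps(src);
}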
Hint: with the Intel compiler, _mm256_loadu_ps does all of that (the two 128-bit loads plus the insert) for you.
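A minimal sketch of that single-intrinsic form (the helper name is hypothetical):

#include <immintrin.h>

static __m256 load8_unaligned(const float *src)
{
    return _mm256_loadu_ps(src);   /* one unaligned 256-bit load instead of two 128-bit loads + insert */
}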