I have some code vectorized with SSE intrinsics, which executes just slightly faster when compiled with CL. To simplify programming, I made the following class:
[cpp]struct __mVector3_ps { __m128 x, y, z; };[/cpp]
[cpp]inline __mVector3_ps _mm_vec3_ps(__mVector3_ps head, __mVector3_ps tail)
{
    __mVector3_ps result;
    result.x = _mm_sub_ps(head.x, tail.x);
    result.y = _mm_sub_ps(head.y, tail.y);
    result.z = _mm_sub_ps(head.z, tail.z);
    return result;
}[/cpp]
I also defined a few inline functions that take and return a __mVector3_ps struct. According to __alignof(), the structure is aligned on a 16-byte boundary, and thus all its members should be aligned as well.
However, when I use the structure in my code, the Intel compiler generates unaligned load instructions. For example:
[cpp]  1091: # if (LINES_PARRALELISM > 1)
  1092:     __mVector3 r = _mm_vec3_ps(prevPoint[1], Cpos);
000000013FCE7126  movups      xmm3,xmmword ptr [rsp+160h]
000000013FCE712E  movups      xmm12,xmmword ptr [rsp+170h]
000000013FCE7137  movups      xmm13,xmmword ptr [rsp+180h][/cpp]
Both Visual C++ and GCC generate aligned load instructions for this type of situation, and thus, surprisingly, the code generated by Visual C++ is marginally faster.
The issue persists even when I add
[cpp]__declspec(align(__alignof(__m128)))[/cpp]
to the struct declaration. This happens when I pass the entire structure as a parameter, yet when I access its members separately, the loads are aligned.
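The alignment claim above can be checked at compile time and at run time. The following is only a minimal sketch: `Vec3ps`, `is_aligned16`, and `vec3_sub` are hypothetical names standing in for the struct and helper in the post, and passing the struct by const reference is an assumption about one thing worth trying, not a confirmed fix for the movups code generation.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_sub_ps
#include <cstdint>      // std::uintptr_t

// Hypothetical stand-in for the __mVector3_ps struct from the post.
struct Vec3ps {
    __m128 x, y, z;
};

// Compile-time checks: the struct inherits the alignment of __m128,
// and three 16-byte members leave no room for padding.
static_assert(alignof(Vec3ps) == alignof(__m128), "struct should be as aligned as __m128");
static_assert(sizeof(Vec3ps) == 3 * sizeof(__m128), "no padding expected between members");

// Run-time check for a concrete object address.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Same component-wise subtraction as _mm_vec3_ps in the post, but taking
// the arguments by const reference (an assumption, to avoid copying the
// whole struct at the call site).
inline Vec3ps vec3_sub(const Vec3ps& head, const Vec3ps& tail) {
    Vec3ps result;
    result.x = _mm_sub_ps(head.x, tail.x);
    result.y = _mm_sub_ps(head.y, tail.y);
    result.z = _mm_sub_ps(head.z, tail.z);
    return result;
}
```

Whether the compiler then emits movaps or movups still has to be verified in the disassembly; the checks only confirm that aligned loads would be legal.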
I believe Tim is correct: it shouldn't cause any performance degradation on current architectures, but it is a bit of a quandary as to why it's using movups instead of movaps. I've submitted the question to higher authorities (:-) and we'll see if there are any good answers.
Thanks!
Dale
Well Dale, the average performance in my case with movaps (MSVC) is about 5% higher than with movups (ICL), on a Core i7-920 with HT and Turbo Boost enabled, QPI@6.4, and triple-channel DDR3-1600. Of course, once you factor in the standard deviations, the results are not statistically significant, and I didn't think setting up a script to gather thousands of runs would be worthwhile.
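For anyone who does want to collect repeated timings, the comparison of means against standard deviations could be sketched roughly as follows. This is an illustration only: `SampleStats`, `summarize`, and `welch_t` are hypothetical names, and Welch's t statistic is one common choice for comparing two sets of run times, not something used in the measurements above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Summary of one set of timing runs (hypothetical helper type).
struct SampleStats {
    double mean;
    double stddev;   // sample standard deviation (n - 1 in the denominator)
    std::size_t n;
};

inline SampleStats summarize(const std::vector<double>& xs) {
    SampleStats s{0.0, 0.0, xs.size()};
    if (xs.empty()) return s;

    double sum = 0.0;
    for (double x : xs) sum += x;
    s.mean = sum / static_cast<double>(s.n);

    if (s.n < 2) return s;  // stddev is undefined for a single run; left at 0
    double sq = 0.0;
    for (double x : xs) sq += (x - s.mean) * (x - s.mean);
    s.stddev = std::sqrt(sq / static_cast<double>(s.n - 1));
    return s;
}

// Welch's t statistic for two independent samples. The caller must make
// sure at least one sample has non-zero variance, otherwise this divides by zero.
inline double welch_t(const SampleStats& a, const SampleStats& b) {
    const double se = std::sqrt(a.stddev * a.stddev / static_cast<double>(a.n) +
                                b.stddev * b.stddev / static_cast<double>(b.n));
    return (a.mean - b.mean) / se;
}
```

A 5% difference in means only matters if the resulting statistic is large relative to the run-to-run noise; with a handful of runs it usually is not.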
On a side note, I noticed some basic optimization problems that ICL is having, in my short incursion with F32vec4. I have a template function, inline Vector3
Another problem I noticed (this time on Linux) is that ICL doesn't define operators (+, -, +=, etc.) for __m128 types, while GCC does. ICL also defines the __GNUC__ macro, so code that conditionally defines the operators when __GNUC__ is not defined doesn't work with ICL.
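One possible way around the __GNUC__ problem is to test for the Intel compiler as well. The sketch below assumes the classic Intel compiler, which defines __INTEL_COMPILER alongside __GNUC__, and assumes that user-defined operators on __m128 are accepted there, as the post implies. `VEC_HAS_BUILTIN_M128_OPS` is a hypothetical macro name, and the guard should be checked against the compiler versions actually in use.

```cpp
#include <xmmintrin.h>  // __m128, _mm_add_ps, _mm_sub_ps, _mm_mul_ps

// Hypothetical feature macro: 1 when the compiler is expected to provide
// built-in arithmetic operators for __m128 (GCC-style vector extensions).
// The classic Intel compiler defines __GNUC__ too, so it is excluded here.
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
#  define VEC_HAS_BUILTIN_M128_OPS 1
#else
#  define VEC_HAS_BUILTIN_M128_OPS 0
#endif

#if !VEC_HAS_BUILTIN_M128_OPS
// Fallback operators implemented with the SSE intrinsics.
inline __m128 operator+(__m128 a, __m128 b) { return _mm_add_ps(a, b); }
inline __m128 operator-(__m128 a, __m128 b) { return _mm_sub_ps(a, b); }
inline __m128 operator*(__m128 a, __m128 b) { return _mm_mul_ps(a, b); }
inline __m128& operator+=(__m128& a, __m128 b) { a = _mm_add_ps(a, b); return a; }
#endif
```

With this guard, an expression such as `a + b` on two __m128 values uses the built-in operator under GCC and the fallback elsewhere.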
== Alex
That's very helpful feedback, I'll relay that. If you get a chance to provide a test case for the template parameter inlining, I'd love to look at it.
On the operators for __m128's it's true that while we support many gcc extensions, there are a few that we don't. If you like I can submit a feature request for those (if there isn't an existing one).
Thanks!
Dale
Hi Dale,
Sorry for the late reply. I'm a bit short on time lately. I'm not sure if the code below will cause vec3Cross to not be inlined, but I had something extremely similar with the symptoms I described. As I said, I'm a little short on time to test this right now.
[cpp]template <typename T>
struct Vector3
{
    T x, y, z;
};

template <typename T>
inline Vector3<T> vec3Cross(const Vector3<T> index, const Vector3<T> middle)
{
    Vector3<T> result;
    result.x = index.y * middle.z - index.z * middle.y; // 3 FLOPs
    result.y = index.z * middle.x - index.x * middle.z; // 3 FLOPs
    result.z = index.x * middle.y - index.y * middle.x; // 3 FLOPs
    return result;                                      // Total: 9 FLOPs
};

Foo()
{
    Vector3 index, middle, cross;
    // Initialize index and middle ...
    cross = vec3Cross(index, middle);
}[/cpp]
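A self-contained variant that compiles as-is might look like the sketch below. The element type used in the original code is not shown in the post, so plain float is assumed here purely for illustration, and `crossExample` is a hypothetical caller. The arguments are also taken by const reference, which differs from the by-value signature in the post.

```cpp
// Generic 3-component vector; T is float in this sketch (an assumption).
template <typename T>
struct Vector3 {
    T x, y, z;
};

// Cross product of two vectors: 9 FLOPs in total.
template <typename T>
inline Vector3<T> vec3Cross(const Vector3<T>& index, const Vector3<T>& middle) {
    Vector3<T> result;
    result.x = index.y * middle.z - index.z * middle.y;
    result.y = index.z * middle.x - index.x * middle.z;
    result.z = index.x * middle.y - index.y * middle.x;
    return result;
}

// Hypothetical caller: crossing the x and y unit vectors yields the z unit vector.
inline Vector3<float> crossExample() {
    Vector3<float> index  = {1.0f, 0.0f, 0.0f};
    Vector3<float> middle = {0.0f, 1.0f, 0.0f};
    return vec3Cross(index, middle);
}
```

Whether vec3Cross is actually inlined at the call site has to be checked in the generated assembly; this sketch only provides something small enough to inspect.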
Let me know if this helps.
==Alex