Intel compiler generates movups for aligned data

mr_nuke · ‎02-14-2010

I have some code vectorized with SSE intrinsics, which executes just slightly faster when compiled with CL. To simplify programming, I made the following class:

[cpp]struct __mVector3_ps
{
    __m128 x, y, z;
};[/cpp]

[cpp]inline __mVector3_ps _mm_vec3_ps(__mVector3_ps head, __mVector3_ps tail)
{
    __mVector3_ps result;
    result.x = _mm_sub_ps(head.x, tail.x);
    result.y = _mm_sub_ps(head.y, tail.y);
    result.z = _mm_sub_ps(head.z, tail.z);
    return result;
}
[/cpp]

I also defined a few inline functions that take and return a __mVector3_ps struct. According to __alignof(), the structure is aligned on a 16-byte boundary, and thus all its members should be aligned as well.

However, when I use the structure in my code, the Intel compiler generates unaligned load instructions. For example:

[cpp]  1091: #		if (LINES_PARRALELISM > 1)
  1092:                 __mVector3 r = _mm_vec3_ps(prevPoint[1], Cpos);
000000013FCE7126  movups      xmm3,xmmword ptr [rsp+160h] 
000000013FCE712E  movups      xmm12,xmmword ptr [rsp+170h] 
000000013FCE7137  movups      xmm13,xmmword ptr [rsp+180h] [/cpp]

Both Visual C++ and GCC generate aligned load instructions for this type of situation, and thus, surprisingly, the code generated by Visual C++ is marginally faster.

The issue persists even when I add

[cpp]__declspec(align(__alignof(__m128)))[/cpp]

to the struct declaration. This happens when I pass the entire structure as a parameter, yet when I access its members separately, the loads are aligned.
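The alignment claim can be double-checked independently of the generated assembly with a small compile-time and run-time test. This is only a sketch under a few assumptions: the struct is renamed Vec3ps (a hypothetical name, used to avoid the reserved double-underscore prefix), the helper names are made up for illustration, and it relies on C++11 alignof/static_assert rather than the __alignof() extension mentioned above.

```cpp
#include <xmmintrin.h>  // __m128 and the SSE intrinsics
#include <cstddef>      // offsetof
#include <cstdint>      // std::uintptr_t

// Hypothetical stand-in for the __mVector3_ps struct from the post.
struct Vec3ps
{
    __m128 x, y, z;
};

// The struct should inherit the 16-byte alignment of its __m128 members.
static_assert(alignof(__m128) == 16, "__m128 is not 16-byte aligned");
static_assert(alignof(Vec3ps) == alignof(__m128), "struct alignment differs from __m128");

// Every member should start at a multiple of 16 bytes inside the struct.
static_assert(offsetof(Vec3ps, y) % 16 == 0, "y is not at a 16-byte offset");
static_assert(offsetof(Vec3ps, z) % 16 == 0, "z is not at a 16-byte offset");

// Run-time check: true if the address is a multiple of 16.
inline bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Checks that all members of a stack-allocated instance are 16-byte aligned.
inline bool stack_instance_is_aligned()
{
    Vec3ps v;
    v.x = _mm_setzero_ps();
    v.y = _mm_setzero_ps();
    v.z = _mm_setzero_ps();
    return is_aligned16(&v.x) && is_aligned16(&v.y) && is_aligned16(&v.z);
}
```

If these checks pass, the data itself is aligned, and the movups instructions are a code generation choice rather than a consequence of the struct layout.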

TimP · ‎02-14-2010

I submitted what may be a similar issue on premier.intel.com some time ago, so it wasn't on a current version. Of course, the difference in performance between movups and movaps is greatest on CPUs which are no longer in production, and should be eliminated on CPUs introduced in the last year (since Core i7 and Barcelona).

Dale_S_Intel · ‎02-22-2010

I believe Tim is correct: it shouldn't cause any performance degradation on current architectures, but it is a bit of a quandary as to why it's using movups instead of movaps. I've submitted the question to higher authorities (:-) and we'll see if there are any good answers.

Thanks!

Dale

mr_nuke · ‎02-22-2010

Well Dale, the average performance in my case with movaps (MSVC) is about 5% higher than with movups (ICL), on a Core i7-920, HT enabled, Turbo Boost enabled, QPI@6.4, triple-channel DDR3-1600. Of course, when you factor in the standard deviations, the results are not statistically significant, and I didn't think setting up a script to gather thousands of runs would be worthwhile.
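For anyone who does want to quantify this, a rough sketch of the arithmetic follows. It computes the mean and sample standard deviation of a set of run times and then Welch's t statistic for two sets of runs. Welch's test is my choice for illustration, not something measured in this thread, and the names Stats, summarize and welch_t are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean and sample standard deviation of a set of timings.
struct Stats
{
    double mean;
    double stddev;
};

inline Stats summarize(const std::vector<double>& times)
{
    const std::size_t n = times.size();
    Stats s = {0.0, 0.0};
    if (n == 0)
        return s;

    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += times[i];
    s.mean = sum / static_cast<double>(n);

    if (n < 2)
        return s;  // a single run has no spread to report

    double squares = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        squares += (times[i] - s.mean) * (times[i] - s.mean);
    s.stddev = std::sqrt(squares / static_cast<double>(n - 1));
    return s;
}

// Welch's t statistic: difference of the means in units of the combined
// standard error. Assumes both sample counts are nonzero and at least one
// of the two standard deviations is nonzero.
inline double welch_t(const Stats& a, std::size_t na, const Stats& b, std::size_t nb)
{
    const double se = std::sqrt(a.stddev * a.stddev / static_cast<double>(na) +
                                b.stddev * b.stddev / static_cast<double>(nb));
    return (a.mean - b.mean) / se;
}
```

A small absolute value of the statistic (roughly below 2 for a reasonable number of runs) means the difference between the two compilers is within the noise, which matches the situation described above.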

On a side note, in my short incursion with F32vec4 I noticed some basic optimization problems in ICL. I have a template function, inline Vector3<T> Vec3Cross(Vector3<T> index, Vector3<T> middle); When I feed F32vec4 as the template parameter, ICL doesn't inline it, but MSVC and GCC do. I'm too tired right now to pull up the full source.

Another problem I noticed (this time on Linux) is that ICL doesn't define operators (+, -, +=, etc.) for __m128 types, while GCC does. ICL also defines the __GNUC__ macro, so code that conditionally defines the operators only when __GNUC__ is not defined doesn't work with ICL.
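A possible guard for the operator problem is sketched below. It assumes that __INTEL_COMPILER is defined whenever the Intel compiler is in use, even when it also defines __GNUC__, and that the Intel compiler accepts operator overloads on __m128 as MSVC does. The lane0 helper is hypothetical and only exists to inspect a result.

```cpp
#include <xmmintrin.h>  // __m128, _mm_add_ps, _mm_sub_ps, _mm_mul_ps, _mm_cvtss_f32

// GCC provides +, -, *, += for __m128 as built-in vector operations, and
// rejects user-defined overloads for it. Define the operators only when the
// compiler is not GCC-compatible, or when it is the Intel compiler posing as
// GCC (assumption: __INTEL_COMPILER is defined in that case).
#if defined(__INTEL_COMPILER) || !defined(__GNUC__)
inline __m128 operator+(__m128 a, __m128 b) { return _mm_add_ps(a, b); }
inline __m128 operator-(__m128 a, __m128 b) { return _mm_sub_ps(a, b); }
inline __m128 operator*(__m128 a, __m128 b) { return _mm_mul_ps(a, b); }
inline __m128& operator+=(__m128& a, __m128 b) { a = _mm_add_ps(a, b); return a; }
#endif

// Hypothetical helper: returns the lowest of the four floats in v.
inline float lane0(__m128 v)
{
    return _mm_cvtss_f32(v);
}
```

With this guard, an expression such as a + b should compile the same way everywhere: through the built-in vector operators on GCC, and through the overloads elsewhere.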

== Alex

Dale_S_Intel · ‎02-22-2010

That's very helpful feedback, I'll relay that. If you get a chance to provide a test case for the template parameter inlining, I'd love to look at it.

On the operators for __m128's it's true that while we support many gcc extensions, there are a few that we don't. If you like I can submit a feature request for those (if there isn't an existing one).

Thanks!

Dale

mr_nuke · ‎02-24-2010

Hi Dale,

Sorry for the late reply. I'm a bit short on time lately. I'm not sure if the code below will cause vec3Cross to not be inlined, but I had something extremely similar with the symptoms I described. As I said, I'm a little short on time to test this right now.

[cpp]template <typename T>
struct Vector3
{
	T x, y, z;
};

// Cross product of two 3-component vectors; T is the per-component type (e.g. F32vec4).
template <typename T>
inline Vector3<T> vec3Cross(const Vector3<T> index, const Vector3<T> middle)
{
	Vector3<T> result;
	result.x = index.y * middle.z - index.z * middle.y;		// 3 FLOPs
	result.y = index.z * middle.x - index.x * middle.z;		// 3 FLOPs
	result.z = index.x * middle.y - index.y * middle.x;		// 3 FLOPs
	return result;							// Total: 9 FLOPs
}

void Foo()
{
	Vector3<F32vec4> index, middle, cross;
	// Initialize index and middle
	...
	cross = vec3Cross(index, middle);
}[/cpp]

Let me know if this helps.

==Alex