__m128 array becomes unaligned with IC optimization

TaylorIoTKidd · 05-07-2010

I'm sure this question has been asked dozens of times. I just can't seem to figure out how to structure a search query that finds the answer.

PROBLEM: I allocated an array of __m128 (which I expected to be aligned) using new in the constructor. I then used it to perform an operation with _mm_dp_ps() within a function. It worked fine with no optimization. With full optimization under the Intel Compiler, the data became unaligned and all sorts of bad things happened.

QUESTION: Isn't __m128 defined as being 16 byte aligned? Is this alignment not guaranteed with optimization? If so, is this a bug? Or did I just do something silly that I can't see?

(By the way, I got around this problem by dynamically allocating aligned data using "__m128 *sse_result = (__m128*) _mm_malloc(4*sizeof(__m128), 16);")

--
Taylor

CODE SNIPPETS:

Here are snippets of my code:

In the class definition:

__m128 *transMat_sse; //the sepia transformation matrix

In the constructor:

transMat_sse = new __m128[pixelComponentNum];

//Sepia SSE transformation matrix (MS version)
*(transMat_sse+0) = _mm_set_ps(0.393f, 0.769f, 0.189f, 0.0f);
*(transMat_sse+1) = _mm_set_ps(0.349f, 0.686f, 0.168f, 0.0f);
*(transMat_sse+2) = _mm_set_ps(0.272f, 0.534f, 0.131f, 0.0f);
*(transMat_sse+3) = _mm_set_ps( 0.0f, 0.0f, 0.0f, 1.0f);

My compiler arguments: /c /O2 /Ob2 /Oi /Qipo /I "\\include" /I ".\\Workloads" /I ".\\external\\vtune\\include" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_CRT_SECURE_NO_WARNINGS" /D "_MBCS" /EHsc /MD /GS /Gy /fp:fast /Fo"Win32\\Release_Intel/" /W3 /nologo /Zi /Qwd10121 /Qopenmp /QaxSSE4.2 /QxSSE2 /Q_multisrc-

bronxzv · 05-10-2010

I don't know if it's a bug or a (lack of) feature of the Intel compiler, but what I'd advise anyone to do is to use custom allocators (align 16 for SSE, 32 for AVX) and placement new / new[] to ensure the constructors are called. One basic (and slow) allocator may simply use _mm_malloc / _mm_free.

NB: it's what I have done since my early tests with SSE on the Katmai P!!!, and it worked well for porting 20'000+ lines of source code (and was easily ported recently to handle AVX alignment).
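
A minimal sketch of that advice, assuming an x86 compiler that ships <xmmintrin.h>. The helper names alloc_m128_array, free_m128_array and is_aligned_16 are hypothetical, made up for this illustration; only _mm_malloc, _mm_free and _mm_setzero_ps are real intrinsics-header calls. Each element is constructed with placement new on storage that is explicitly 16-byte aligned:

```cpp
#include <xmmintrin.h>  // __m128, _mm_malloc, _mm_free, _mm_setzero_ps
#include <cstddef>
#include <cstdint>
#include <new>          // placement new, std::bad_alloc

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: allocate `count` __m128 values on a 16-byte
// boundary and construct each element in place (zero-filled).
inline __m128* alloc_m128_array(std::size_t count) {
    void* raw = _mm_malloc(count * sizeof(__m128), 16);
    if (raw == nullptr) {
        throw std::bad_alloc();
    }
    __m128* p = static_cast<__m128*>(raw);
    for (std::size_t i = 0; i < count; ++i) {
        // Placement new per element; for a class type this is where
        // the constructor would run.
        ::new (static_cast<void*>(p + i)) __m128(_mm_setzero_ps());
    }
    return p;
}

// Hypothetical helper: memory from _mm_malloc must go back through
// _mm_free, never through delete[] or free().
inline void free_m128_array(__m128* p) {
    _mm_free(p);
}
```

In the constructor from the original post this would replace the new expression, e.g. transMat_sse = alloc_m128_array(pixelComponentNum);, with a matching free_m128_array(transMat_sse); in the destructor. For types with non-trivial destructors, the destructors would also have to be called explicitly before freeing.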

TimP · 05-14-2010

malloc() is not pre-empted by Intel compilers. On Windows, you get the one provided by Microsoft, so you might consider _aligned_malloc() if you're looking for a solution which should be portable across recently supported varieties of Windows.
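
As a rough illustration of that suggestion, here is a sketch of a thin wrapper around _aligned_malloc() / _aligned_free() on Windows. The wrapper names (portable_aligned_alloc, portable_aligned_free, is_aligned_to) are hypothetical, and the posix_memalign() fallback for non-Windows systems is my own assumption, not something mentioned in this thread:

```cpp
#include <cstddef>
#include <cstdint>
#if defined(_WIN32)
#include <malloc.h>   // _aligned_malloc, _aligned_free
#else
#include <stdlib.h>   // posix_memalign, free (assumed fallback)
#endif

// Hypothetical wrapper. `alignment` must be a power of two; for the
// posix_memalign branch it must also be a multiple of sizeof(void*).
// Returns nullptr on failure.
inline void* portable_aligned_alloc(std::size_t size, std::size_t alignment) {
#if defined(_WIN32)
    return _aligned_malloc(size, alignment);  // note: size first, then alignment
#else
    void* p = nullptr;
    if (posix_memalign(&p, alignment, size) != 0) {
        return nullptr;
    }
    return p;
#endif
}

// Hypothetical wrapper: releases memory from portable_aligned_alloc.
inline void portable_aligned_free(void* p) {
#if defined(_WIN32)
    _aligned_free(p);  // memory from _aligned_malloc must not go to free()
#else
    free(p);
#endif
}

// Hypothetical helper: true if p is a multiple of `alignment`.
inline bool is_aligned_to(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

For the array in the original post, the call would look like portable_aligned_alloc(4 * sizeof(__m128), 16), with the result cast to __m128*.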

levicki · 05-17-2010

A long time ago I complained about new[] and malloc() not returning aligned memory for modern (and now intrinsic) vector datatypes.

I find that particular "feature" a weakness of a stagnating language which should have been corrected already.

Alas, nobody dares to confront those dinosaurs from the C++ committee, and these days it seems more important to add dozens of different flavors of threading extensions, thus creating a paradox of choice, instead of giving us one worthy and well-thought-out interface.

OpenMP, TBB, STM, Cilk, CEAN... what do they have in common?

- They are all unfinished attempts at making parallelization easy.
- They are all trying to solve different aspects of parallelism instead of providing a single all-around solution.
- They all make me want to give up on software development because to cover all cases I need to learn all of them. Learning 5 things instead of one means being average in all 5 instead of mastering a single one.

IMO C++ started suffering from "feature creep" instead of fixing some old design oversights.

Sorry about the rant.

TimP · 05-17-2010

Quoting from Harbison & Steele 2nd edition (1987) (just to point out the situation has been thus for quite a while) "functions such as malloc, ....., always return pointers of type char * aligned on a boundary suitable for an object of any type." It's a stretch to imagine this being consistent with 32-bit Windows malloc, even if you restrict "any type" to data types defined in standard C. I would hope to see an indication on this forum soon of what is intended to be done for AVX types.
In principle, new[] ought to respect alignments specified in the definition, e.g. new _m512. Of course, relying on what C++ ought to do is a sure way to unreliability.

TBB, Cilk, and Ct apparently now share a declared intention of their sponsors to be incompatible with OpenMP and a hope that OpenMP will no longer be advocated. My personal belief is that this attitude should provoke a backlash from those who care about multi-language applications (not accepting these various C++ namespaces as satisfying the requirement for multiple languages).

levicki · 06-14-2010

Tim, I can't agree more with you. Well said.