PROBLEM: I allocated an array of __m128 data using new in the constructor, expecting it to be 16-byte aligned. I then used it to perform an operation with _mm_dp_ps() within a function. It worked fine with no optimization. With full optimization in the Intel Compiler, the data became unaligned and all sorts of bad things happened.
QUESTION: Isn't __m128 defined as being 16 byte aligned? Is this alignment not guaranteed with optimization? If so, is this a bug? Or did I just do something silly that I can't see?
(By the way, I got around this problem by dynamically allocating aligned data using "__m128 *sse_result = (__m128*) _mm_malloc(4*sizeof(__m128), 16);")
Here are snippets of my code:
In the class definition:
__m128 *transMat_sse; //the sepia transformation matrix
In the constructor:
transMat_sse = new __m128[pixelComponentNum];
// transformation matrix (MS version)
*(transMat_sse+0) = _mm_set_ps(0.393f, 0.769f, 0.189f, 0.0f);
*(transMat_sse+1) = _mm_set_ps(0.349f, 0.686f, 0.168f, 0.0f);
*(transMat_sse+2) = _mm_set_ps(0.272f, 0.534f, 0.131f, 0.0f);
*(transMat_sse+3) = _mm_set_ps( 0.0f, 0.0f, 0.0f, 1.0f);
My compiler arguments: /c /O2 /Ob2 /Oi /Qipo /I "\\include" /I ".\\Workloads" /I ".\\external\\vtune\\include" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_CRT_SECURE_NO_WARNINGS" /D "_MBCS" /EHsc /MD /GS /Gy /fp:fast /Fo"Win32\\Release_Intel/" /W3 /nologo /Zi /Qwd10121 /Qopenmp /QaxSSE4.2 /QxSSE2 /Q_multisrc-
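For reference, the _mm_malloc workaround mentioned above could be applied to the constructor roughly as follows. This is only a sketch: the class name SepiaFilter, the isAligned16() and matrix() helpers, and the value 4 for pixelComponentNum are assumptions made for illustration; only transMat_sse and the matrix values come from the snippets above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // __m128, _mm_set_ps, _mm_malloc, _mm_free

// Assumed to be 4, matching the four rows written in the constructor.
constexpr std::size_t pixelComponentNum = 4;

// Hypothetical stand-in for the class in the question.
class SepiaFilter {
public:
    SepiaFilter()
        : transMat_sse(static_cast<__m128*>(
              _mm_malloc(pixelComponentNum * sizeof(__m128), 16))) {
        assert(transMat_sse != nullptr);  // allocation must have succeeded
        // transformation matrix (MS version)
        transMat_sse[0] = _mm_set_ps(0.393f, 0.769f, 0.189f, 0.0f);
        transMat_sse[1] = _mm_set_ps(0.349f, 0.686f, 0.168f, 0.0f);
        transMat_sse[2] = _mm_set_ps(0.272f, 0.534f, 0.131f, 0.0f);
        transMat_sse[3] = _mm_set_ps(0.0f, 0.0f, 0.0f, 1.0f);
    }

    // Memory from _mm_malloc must be released with _mm_free, not delete[].
    ~SepiaFilter() { _mm_free(transMat_sse); }

    // Copying would free the same buffer twice, so it is disabled here.
    SepiaFilter(const SepiaFilter&) = delete;
    SepiaFilter& operator=(const SepiaFilter&) = delete;

    // True when the buffer starts on a 16-byte boundary.
    bool isAligned16() const {
        return reinterpret_cast<std::uintptr_t>(transMat_sse) % 16 == 0;
    }

    const __m128* matrix() const { return transMat_sse; }

private:
    __m128* transMat_sse;  // the sepia transformation matrix
};
```

Checking isAligned16() in a debug assert right after construction is a cheap way to catch the problem before an aligned load faults.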
I don't know if it's a bug or a (lack of) feature of the Intel compiler, but what I'd advise anyone to do is to use custom allocators (16-byte alignment for SSE, 32-byte for AVX) and placement new / new to ensure the constructors are called. One basic (and slow) allocator may simply use _mm_malloc / _mm_free.
NB: this is what I have done since my early tests with SSE on Katmai (Pentium III), and it worked well when porting 20,000+ lines of source code (and was easily extended recently to handle AVX alignment).
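A minimal sketch of that idea, assuming an x86 compiler that provides _mm_malloc / _mm_free through <xmmintrin.h>. The names alignedNewArray, alignedDeleteArray and Vec4 are made up for this example and are not from any library; this is the basic (slow) variant, not a tuned allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>          // placement new, std::bad_alloc
#include <xmmintrin.h>  // __m128, _mm_setzero_ps, _mm_malloc, _mm_free

// Hypothetical SSE wrapper type with a non-trivial constructor.
struct Vec4 {
    __m128 v;
    Vec4() : v(_mm_setzero_ps()) {}
};

// Hypothetical helper: allocate `count` objects of T on an
// `alignment`-byte boundary and default-construct each one in place.
template <typename T>
T* alignedNewArray(std::size_t count, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);  // power of two
    void* raw = _mm_malloc(count * sizeof(T), alignment);
    if (raw == nullptr) throw std::bad_alloc();
    T* first = static_cast<T*>(raw);
    std::size_t built = 0;
    try {
        for (; built < count; ++built) {
            new (first + built) T();  // placement new runs the constructor
        }
    } catch (...) {
        // Undo the objects built so far, then release the raw memory.
        while (built > 0) first[--built].~T();
        _mm_free(raw);
        throw;
    }
    return first;
}

// Hypothetical counterpart: destroy in reverse order, then free.
template <typename T>
void alignedDeleteArray(T* first, std::size_t count) {
    if (first == nullptr) return;
    for (std::size_t i = count; i > 0; --i) first[i - 1].~T();
    _mm_free(first);
}
```

Usage would look like Vec4* m = alignedNewArray<Vec4>(4, 16); followed later by alignedDeleteArray(m, 4); passing 32 instead of 16 gives the AVX alignment. The caller has to remember the element count, which is the price of not using new[].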
I find that particular "feature" a weakness of a stagnating language which should have been corrected already.
Alas, nobody dares to confront those dinosaurs on the C++ committee, and these days it seems more important to add dozens of different flavors of threading extensions, thus creating a paradox of choice, instead of giving us one worthy and well-thought-out interface.
OpenMP, TBB, STM, Cilk, CEAN... what do they have in common?
- They are all unfinished attempts of making parallelization easy.
- They are all trying to solve different aspects of parallelism instead of providing a single all-around solution.
- They all make me want to give up on software development because to cover all cases I need to learn all of them. Learning 5 things instead of one means being average in all 5 instead of mastering a single one.
IMO C++ started suffering from "feature creep" instead of fixing some old design oversights.
Sorry about the rant.
In principle, new ought to respect alignments specified in the definition, e.g. new __m512. Of course, relying on what C++ ought to do is a sure way to unreliability.
TBB, Cilk, and Ct apparently now share a declared intention of their sponsors to be incompatible with OpenMP and a hope that OpenMP will no longer be advocated. My personal belief is that this attitude should provoke a backlash from those who care about multi-language applications (not accepting these various C++ namespaces as satisfying the requirement for multiple languages).