emmanuel_attia
Beginner

_mm_load_ps generates VMOVUPS

Hi all,

I've tested the following case with Intel XE Compiler 2011.3 and 2013.4.

I have a question; let's take a very basic SSE function:

[cpp]#include <stdio.h>
#include <xmmintrin.h> // SSE intrinsics

void test1(float * pool)
{
    __m128 v = _mm_load_ps(pool);      // aligned load from pool
    __m128 a = _mm_load_ps(pool + 8);  // aligned load from pool + 20h

    _mm_store_ps(pool + 16, _mm_add_ps(v, a)); // aligned store to pool + 40h

    printf("test1: %g\n", pool[16]);
}[/cpp]

If I compile it without specific flags, I get the expected SSE code: aligned loads (explicit for pool, implicit as a memory operand for pool + 20h) and an aligned store (pool + 40h):

[plain]00E410A3  movaps      xmm0,xmmword ptr [eax]
00E410A6  addps       xmm0,xmmword ptr [eax+20h]
00E410AA  movaps      xmmword ptr [eax+40h],xmm0 [/plain]

If I compile it with AVX, I get an unaligned load for pool, an implicit aligned load for pool + 20h, and an unaligned store for pool + 40h:

[plain]002F10A3  vmovups     ymm0,xmmword ptr [eax]
002F10A7  vaddps      ymm1,ymm0,xmmword ptr [eax+20h]
002F10AC  vmovups     xmmword ptr [eax+40h],xmm1[/plain]

Is this expected? Does this affect performance?

Kind regards

4 Replies
emmanuel_attia
Beginner

When I say "I compile it using AVX", I mean /QxAVX under Windows. My project uses AVX elsewhere, so dropping this flag would mean either emulating AVX instructions with SSE or mixing legacy and VEX-encoded instructions, which is a performance disaster.

Bernard
Black Belt

Look at this post: http://software.intel.com/en-us/forums/topic/278573

emmanuel_attia
Beginner

OK, after benchmarking random-access loads/stores, it seems VMOVUPS (on XMM registers) matches MOVAPS in execution time when the memory is aligned.

Thanks a lot

Bernard
Black Belt

You are welcome.
