icc 10.0 on Linux not generating SIMD instructions when using SSE/2 intrinsics

moleres · ‎05-19-2010

[bash]Hi, I'm using ICC 10.0.025 on a multi-core Linux IA-64 (Itanium 2) platform (running SUSE Linux, kernel 2.6.16).
I've verified, using cpuid, that the processors support MMX/SSE/SSE2. I've created a fairly simple
program (I'm not claiming it's optimal) that uses SSE intrinsics, and when I use objdump -D to
view the disassembly, I do not see any SIMD instructions being used.
The compiler command line is simply:
icc -O3 -o sse sse2.c
The program sse2.c is:

#include <stdio.h>
#include <xmmintrin.h>

#define STRIDE  4
#define SIZE   256 
#define ALIGNED __declspec(align(16))

int main(void)
{
    ALIGNED float dstFrame[SIZE];
    ALIGNED float baseFrame[SIZE];
    ALIGNED float scalar1;
    ALIGNED float scalar2;
    ALIGNED float tmp[STRIDE];
    int i;
    int nLoop = SIZE / STRIDE;
    __m128 scale1, dest1, base1, base2, prod1;

    scalar1 = 23.756;
    scalar2 = 0.0;
    scale1 = _mm_load1_ps(&scalar1);

    /* Step through the frames one __m128 (STRIDE floats) at a time. */
    for (i = 0; i < SIZE; i += STRIDE)
    {
        dest1 = _mm_load_ps(&dstFrame[i]);
        base1 = _mm_load_ps(&baseFrame[i]);
        base2 = _mm_load_ps(&baseFrame[i]);

        scale1 = _mm_mul_ps(scale1, base1);
        dest1 = _mm_sub_ps(dest1, scale1);
        _mm_store_ps(&dstFrame[i], dest1);

        /* Horizontal sum of the four products via a scratch array. */
        prod1 = _mm_mul_ps(dest1, base2);
        _mm_store_ps(tmp, prod1);
        scalar2 += tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }

    printf("scalar2=%f\n", scalar2);
    return 0;
}
Does anyone know why this does not result in SIMD instructions being used?  I can compile
the same program on an x86-64 box and use icc v11 (I can't control which icc version for which platform)
and see SIMD instructions.  I've tried various optimization levels and compiler options with no help.
Thanks for any ideas...

[/bash]

TimP · ‎05-19-2010

The compiler targeting ia64 has to generate native ia64 instructions, presumably including load-pair. I don't know that optimizing every conceivable translation of SSE intrinsics would be a goal of that compiler, but that doesn't appear to be part of your question.

moleres · ‎05-20-2010

I thought the compiler would translate the intrinsics to SIMD instructions, such as "mulps" for the _mm_mul_ps intrinsic, and make use of the xmm registers, not decide that it knows better and ignore the intrinsics (?). If this were straight C, I could see the compiler deciding not to vectorize and use SIMD instructions, but I would think the intrinsics would translate almost directly. I've attached sse2.s for those interested; it is what the compiler produces with "icc -S -O3 sse2.c".

TimP · ‎05-20-2010

The xmm registers and the limited SSE2 hardware support on IA64 exist to help the IA-32 Execution Layer (EL) emulate 32-bit applications. They don't provide performance competitive with native IA64 code, nor, of course, with Intel64 CPUs.
As far as I can see from my limited recollection of IA64 optimization, the compiler seems to do a reasonable job of translating the SSE2 intrinsics to native IA64 software-pipelined (SWP) code in this example.
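
Since the intrinsics only map to real SSE instructions on x86 targets, one way to keep a single source file honest about that is to guard the intrinsic path and fall back to plain C elsewhere. The following is a minimal sketch, not the original poster's program: the function name `scale_sub_dot` is hypothetical, the scale factor is held constant (unlike the posted loop, which updates `scale1` on every pass), and it assumes the compiler predefines `__SSE__` when SSE code generation is enabled, as GCC-compatible compilers do on x86. Unaligned loads and stores are used so that the sketch places no alignment requirement on the caller.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)          /* assumption: predefined on x86 builds with SSE enabled */
#include <xmmintrin.h>
#endif

/* Hypothetical helper: dst[i] -= scale * base[i] for all i,
 * and return the sum of dst[i] * base[i]. n must be a multiple of 4. */
float scale_sub_dot(float *dst, const float *base, float scale, size_t n)
{
    float sum = 0.0f;
    size_t i;

    assert(n % 4 == 0);
#if defined(__SSE__)
    {
        /* x86 path: four floats per iteration in one __m128. */
        __m128 vscale = _mm_set1_ps(scale);
        float tmp[4];
        for (i = 0; i < n; i += 4) {
            __m128 b = _mm_loadu_ps(base + i);
            __m128 d = _mm_sub_ps(_mm_loadu_ps(dst + i), _mm_mul_ps(vscale, b));
            _mm_storeu_ps(dst + i, d);
            _mm_storeu_ps(tmp, _mm_mul_ps(d, b));
            sum += tmp[0] + tmp[1] + tmp[2] + tmp[3];
        }
    }
#else
    /* Any other target (for example IA-64): plain C that the
     * compiler is free to schedule with its native instructions. */
    for (i = 0; i < n; i++) {
        dst[i] -= scale * base[i];
        sum += dst[i] * base[i];
    }
#endif
    return sum;
}
```

Both paths compute the same values for inputs where the arithmetic is exact; in general the two may differ in the last bits because the summation order is not the same.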

Om_S_Intel · ‎05-20-2010

We do not have xmm registers in IA64. We use the general-purpose registers (64 bits in length) to pack data and apply Itanium-specific instructions to manipulate the 8-, 16-, 32- or 64-bit data.
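
To make the idea of packing several small values into one 64-bit general-purpose register concrete, here is a rough portable C emulation of a lane-wise add on four 16-bit values. It is only an illustration under stated assumptions: the helper names `pack4x16` and `add4x16` are hypothetical, and real IA-64 code would use the hardware's dedicated parallel instructions rather than this masking trick.

```c
#include <stdint.h>

/* Pack four 16-bit lanes into one 64-bit word, lane 0 in the low bits. */
uint64_t pack4x16(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
{
    return (uint64_t)a
         | ((uint64_t)b << 16)
         | ((uint64_t)c << 32)
         | ((uint64_t)d << 48);
}

/* Lane-wise wraparound add of four 16-bit values held in 64-bit words.
 * The top bit of each lane is handled separately so that a carry
 * never crosses into the neighbouring lane. */
uint64_t add4x16(uint64_t x, uint64_t y)
{
    const uint64_t H = 0x8000800080008000ULL;  /* top bit of each lane */
    uint64_t low = (x & ~H) + (y & ~H);        /* add the low 15 bits per lane */
    return low ^ ((x ^ y) & H);                /* fold the top bits back in */
}
```

For example, adding the lanes (1, 0xFFFF, 0x8000, 100) to (2, 1, 0x8000, 200) gives (3, 0, 0, 300): the second and third lanes wrap around independently without disturbing their neighbours.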

moleres · ‎05-21-2010

Thanks to all for your replies - that clears things up.
