Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

icc 10.0 on Linux not generating SIMD instructions when using SSE/2 intrinsics

moleres
Beginner
[bash]

Hi, I'm using ICC 10.0.025 on a multi-core Linux IA-64 (Itanium 2) platform (running SUSE Linux, kernel 2.6.16).

I've verified, using cpuid, that the processors support MMX/SSE/SSE2. I've created a fairly simple
program (I'm not claiming it's optimal) that uses SSE intrinsics, and when I view the disassembly
with objdump -D, I do not see any SIMD instructions being used.

The compiler command line is simply:

icc -O3 -o sse sse2.c

The program sse2.c is:

/* The header names were stripped from the post; these are the two the code needs. */
#include <stdio.h>
#include <xmmintrin.h>

#define STRIDE 4
#define SIZE 256
#define ALIGNED __declspec(align(16))

int main(void)
{
    ALIGNED float dstFrame[SIZE];
    ALIGNED float baseFrame[SIZE];
    ALIGNED float scalar1;
    ALIGNED float scalar2;
    ALIGNED float tmp[STRIDE];
    int i;
    int nLoop = SIZE / STRIDE;
    __m128 scale1, dest1, base1, base2, prod1;

    scalar1 = 23.756;
    scalar2 = 0.0;
    scale1 = _mm_load1_ps(&scalar1);

    /* The [i] subscripts below appear to have been eaten by the forum markup. */
    for (i=0; i < nLoop; i+=STRIDE) {
        dest1 = _mm_load_ps(&dstFrame[i]);
        base1 = _mm_load_ps(&baseFrame[i]);
        base2 = _mm_load_ps(&baseFrame[i]);
        scale1 = _mm_mul_ps(scale1, base1);
        dest1 = _mm_sub_ps(dest1, scale1);
        _mm_store_ps(&dstFrame[i], dest1);
        prod1 = _mm_mul_ps(dest1, base2);
        _mm_store_ps(tmp, prod1);
        scalar2 += tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }
    printf("scalar2=%f\n", scalar2);
}

Does anyone know why this does not result in SIMD instructions being used? I can compile
the same program on an x86-64 box with icc v11 (I can't control which icc version is installed
on which platform) and see SIMD instructions. I've tried various optimization levels and
compiler options, to no avail.
Thanks for any ideas...
[/bash]
5 Replies
TimP
Honored Contributor III
A compiler targeting IA-64 has to generate native IA-64 instructions, presumably including load-pair. I don't know that optimizing every conceivable translation of SSE intrinsics would be a goal of that compiler, but that doesn't appear to be part of your question.
moleres
Beginner
I thought the compiler would translate the intrinsics to SIMD instructions like "mulps" for the _mm_mul_ps intrinsic and make use of the xmm registers, not decide that it knows better and ignore the intrinsics (?). If this were straight C, I could see the compiler deciding not to vectorize and use SIMD instructions, but I would think the intrinsics would translate almost directly. For those interested, I've attached sse2.s, which is what the compiler produces with "icc -S -O3 sse2.c".
TimP
Honored Contributor III
The xmm registers and the limited SSE2 hardware support on IA-64 are provided to help the IA-32 Execution Layer (EL) emulate 32-bit applications. They don't provide performance competitive with native IA-64 code, nor, of course, with Intel 64 CPUs.
As far as I can tell from my limited recollection of IA-64 optimization, the compiler does a reasonable job in this example of translating the SSE2 intrinsics to native, software-pipelined (SWP) IA-64 code.
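
To illustrate why intrinsics need not map one-to-one onto SSE opcodes, here is a minimal sketch in C. The function name `scale_sub_dot` and the assumed intent of the loop (subtract a scaled base frame from the destination, then accumulate a dot product) are assumptions, not taken from the original post, and the sketch says nothing about what icc actually emits for IA-64. It uses the SSE intrinsics where the compiler defines `__SSE__` and a plain scalar loop elsewhere. Both paths describe the same arithmetic, which is all that a compiler for a non-x86 target has to preserve.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: dst[i] -= scale * base[i] for all i, and return the
 * sum of dst[i] * base[i] over the updated dst. n must be a multiple of 4. */
float scale_sub_dot(float *dst, const float *base, float scale, size_t n)
{
    float sum = 0.0f;
    size_t i;

    assert(n % 4 == 0);
#if defined(__SSE__)
    /* x86 path: four floats per iteration; unaligned loads and stores so
     * that callers do not have to align their buffers. */
    {
        __m128 vscale = _mm_set1_ps(scale);
        for (i = 0; i < n; i += 4) {
            float tmp[4];
            __m128 b = _mm_loadu_ps(base + i);
            __m128 d = _mm_sub_ps(_mm_loadu_ps(dst + i), _mm_mul_ps(vscale, b));
            _mm_storeu_ps(dst + i, d);
            _mm_storeu_ps(tmp, _mm_mul_ps(d, b));
            sum += tmp[0] + tmp[1] + tmp[2] + tmp[3];
        }
    }
#else
    /* Any other target: the same arithmetic, one element at a time. */
    for (i = 0; i < n; ++i) {
        dst[i] -= scale * base[i];
        sum += dst[i] * base[i];
    }
#endif
    return sum;
}
```

The two paths add the products in a different order, so results can differ in the last bits for general floating-point inputs. For small integer-valued inputs they agree exactly.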
Om_S_Intel
Employee

We do not have xmm registers in IA-64. We use the general-purpose registers (64 bits in length) to pack data and apply Itanium-specific instructions to manipulate the 8-, 16-, 32- or 64-bit data.
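
For readers unfamiliar with packing several small values into one 64-bit register, here is a minimal portable sketch in C. It is not what icc emits and it uses no Itanium instruction. The helper names `pack16`, `lane16` and `padd16` are hypothetical, and the layout of four 16-bit lanes is an assumption chosen for illustration. It emulates, in plain integer arithmetic, the kind of lane-wise add that a packed-integer instruction performs in hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack four 16-bit values into one 64-bit word,
 * with lane 0 in the lowest bits. */
uint64_t pack16(uint16_t l0, uint16_t l1, uint16_t l2, uint16_t l3)
{
    return (uint64_t)l0 | ((uint64_t)l1 << 16) |
           ((uint64_t)l2 << 32) | ((uint64_t)l3 << 48);
}

/* Hypothetical helper: extract one 16-bit lane (0..3) from a packed word. */
uint16_t lane16(uint64_t word, unsigned lane)
{
    assert(lane < 4);
    return (uint16_t)(word >> (16u * lane));
}

/* Hypothetical helper: add the four 16-bit lanes of a and b independently,
 * each wrapping modulo 2^16, with no carry leaking into the next lane. */
uint64_t padd16(uint64_t a, uint64_t b)
{
    const uint64_t top = 0x8000800080008000ULL; /* top bit of every lane */
    /* Add the low 15 bits of each lane; any carry stops at the lane's top bit. */
    uint64_t low = (a & ~top) + (b & ~top);
    /* Fold the original top bits back in with XOR, which cannot carry. */
    return low ^ ((a ^ b) & top);
}
```

For instance, adding the lanes (1, 2, 3, 0xFFFF) to (10, 20, 30, 1) gives (11, 22, 33, 0): the last lane wraps around without disturbing its neighbor.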

moleres
Beginner
Thanks to all for your replies - that clears things up.