Intel® C++ Compiler
Support and discussions for creating C++ code that runs on platforms based on Intel® processors.

icc 10.0 on Linux not generating SIMD instructions when using SSE/2 intrinsics

moleres
Beginner
189 Views
[bash]

Hi, I'm using ICC 10.0.025 on a multi-core Linux IA-64 (Itanium 2) platform (running SUSE Linux, kernel 2.6.16).

I've verified, using cpuid, that the processors support MMX/SSE/SSE2. I've created a fairly simple
program (not claiming that it's optimal) that uses SSE intrinsics, and when I view the disassembly
with objdump -D, I do not see any SIMD instructions.

The compiler command line is simply:

icc -O3 -o sse sse2.c

The program sse2.c is:

#include <stdio.h>
#include <xmmintrin.h>

#define STRIDE 4
#define SIZE 256
#define ALIGNED __declspec(align(16))

int main(void)
{
    ALIGNED float dstFrame[SIZE];
    ALIGNED float baseFrame[SIZE];
    ALIGNED float scalar1;
    ALIGNED float scalar2;
    ALIGNED float tmp[STRIDE];
    int i;
    int nLoop = SIZE / STRIDE;
    __m128 scale1, dest1, base1, base2, prod1;

    scalar1 = 23.756;
    scalar2 = 0.0;
    scale1 = _mm_load1_ps(&scalar1);

    for (i=0; i < nLoop; i+=STRIDE)
    {
        dest1 = _mm_load_ps(&dstFrame[i]);
        base1 = _mm_load_ps(&baseFrame[i]);
        base2 = _mm_load_ps(&baseFrame[i]);
        scale1 = _mm_mul_ps(scale1, base1);
        dest1 = _mm_sub_ps(dest1, scale1);
        _mm_store_ps(&dstFrame[i], dest1);
        prod1 = _mm_mul_ps(dest1, base2);
        _mm_store_ps(tmp, prod1);
        scalar2 += tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }

    printf("scalar2=%f\n", scalar2);
}

Does anyone know why this does not result in SIMD instructions being used? I can compile
the same program on an x86-64 box with icc v11 (I can't control which icc version is installed on which platform)
and see SIMD instructions. I've tried various optimization levels and compiler options, but nothing helped.
Thanks for any ideas...
[/bash]
5 Replies
TimP
Black Belt
The compiler targeting ia64 has to generate native ia64 instructions, presumably including load-pair. I don't know that optimizing every conceivable translation of SSE intrinsics would be a goal of that compiler, but that doesn't appear to be part of your question.
moleres
Beginner
I thought the compiler would translate the intrinsics to SIMD instructions like "mulps" for the _mm_mul_ps intrinsic and make use of the xmm registers, not decide that it knows better and ignore the intrinsics (?). If this were straight C I could see the compiler deciding not to vectorize and use SIMD instructions, but I would think the intrinsics would translate almost directly. For those interested, I've attached sse2.s, which is what the compiler produces with "icc -S -O3 sse2.c".
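For readers following along, here is a minimal sketch of the expectation described in this post. The helper name mul4 is hypothetical and does not come from the thread. On an x86 target where the compiler defines __SSE__ (an assumption about the toolchain), the intrinsic path is the one that usually shows up as a mulps in the disassembly. On any other target the same four multiplications have to be expressed with whatever native instructions exist, which is the situation on IA-64.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, ... */
#endif

/* Hypothetical helper: element-wise product of four floats,
 * out[k] = a[k] * b[k] for k = 0..3. Pointers need not be aligned. */
void mul4(const float *a, const float *b, float *out)
{
#if defined(__SSE__)
    /* SSE path: unaligned loads, one packed multiply, unaligned store. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));
#else
    /* Fallback for targets without SSE: plain scalar code that the
     * compiler is free to lower to its own native instructions. */
    for (size_t k = 0; k < 4; ++k) {
        out[k] = a[k] * b[k];
    }
#endif
}
```

Both paths produce the same four results, so the observable behavior of the program does not depend on whether xmm registers are involved.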
TimP
Black Belt
The xmm registers on IA64 and limited SSE2 hardware support are provided to help the IA-32 Execution Layer (EL) emulate 32-bit applications. They don't provide performance competitive with native IA64, nor, of course, with Intel64 CPUs.
As far as I can see from my limited recollection of IA64 optimization, the compiler seems to do a reasonable job of translating the SSE2 intrinsics to native IA64 SWP code, according to this example.
Om_S_Intel
Employee

We do not have xmm registers in IA64. We use the general-purpose registers (64 bits in length) to pack data and apply Itanium-specific instructions to manipulate the 8-, 16-, 32-, or 64-bit data.
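The packing idea can be illustrated in portable C. This is only a sketch of the concept, not what icc actually emits for IA-64: the names pack2x32 and add2x32 are made up for illustration, and the example uses integer lanes because they are easy to show with plain C operators.

```c
#include <stdint.h>

/* Pack two 32-bit values into one 64-bit word:
 * hi occupies bits 63..32, lo occupies bits 31..0. */
uint64_t pack2x32(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}

/* Lane-wise addition of two packed words. Each 32-bit lane wraps
 * around on its own, so no carry leaks from the low lane into the
 * high lane. */
uint64_t add2x32(uint64_t a, uint64_t b)
{
    uint32_t lo = (uint32_t)a + (uint32_t)b;
    uint32_t hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32);
    return pack2x32(hi, lo);
}
```

A processor with a native packed-add instruction can do the work of add2x32 in a single operation on one 64-bit register, which is why no xmm registers are needed to get data-parallel behavior.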

moleres
Beginner
Thanks to all for your replies - that clears things up.