[bash]Hi,
I'm using ICC 10.0.025 on a multi-core Linux IA-64 (Itanium 2) platform (running SUSE Linux 2.6.16). I've verified, using cpuid, that the processors support MMX/SSE/SSE2. I've created a fairly simple program (I'm not claiming that it's optimal) that uses SSE intrinsics, and when I use objdump -D to view the disassembly, I do not see SIMD instructions being used at all. The compiler command line is simply:
icc -O3 -o sse sse2.c
The program sse2.c is:
#include <stdio.h>      /* printf */
#include <xmmintrin.h>  /* SSE _mm_*_ps intrinsics */

#define STRIDE 4
#define SIZE 256
#define ALIGNED __declspec(align(16))

int main(void)
{
    ALIGNED float dstFrame[SIZE];
    ALIGNED float baseFrame[SIZE];
    ALIGNED float scalar1;
    ALIGNED float scalar2;
    ALIGNED float tmp[STRIDE];
    int i;
    int nLoop = SIZE / STRIDE;
    __m128 scale1, dest1, base1, base2, prod1;

    scalar1 = 23.756;
    scalar2 = 0.0;
    scale1 = _mm_load1_ps(&scalar1);
    for (i=0; i < nLoop; i+=STRIDE)
    {
        dest1 = _mm_load_ps(&dstFrame[i]);
        base1 = _mm_load_ps(&baseFrame[i]);
        base2 = _mm_load_ps(&baseFrame[i]);
        scale1 = _mm_mul_ps(scale1, base1);
        dest1 = _mm_sub_ps(dest1, scale1);
        _mm_store_ps(&dstFrame[i], dest1);
        prod1 = _mm_mul_ps(dest1, base2);
        _mm_store_ps(tmp, prod1);
        scalar2 += tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }
    printf("scalar2=%f\n", scalar2);
}
Does anyone know why this does not result in SIMD instructions being used? I can compile the same program on an x86-64 box with icc v11 (I can't control which icc version is used on which platform) and see SIMD instructions. I've tried various optimization levels and compiler options, but nothing has helped.
Thanks for any ideas...
[/bash]
5 Replies
The compiler targeting ia64 has to generate native ia64 instructions, presumably including load-pair. I don't know that optimizing every conceivable translation of SSE intrinsics would be a goal of that compiler, but that doesn't appear to be part of your question.
I thought the compiler would translate the intrinsics to SIMD instructions, like "mulps" for the _mm_mul_ps intrinsic, and make use of the xmm registers, not decide that it knows better and ignore the intrinsics (?). If this were straight C, I could see the compiler deciding not to vectorize and not to use SIMD instructions, but I would think the intrinsics would translate almost directly. For those interested, I've attached sse2.s, which is what the compiler produces with "icc -S -O3 sse2.c".
The xmm registers and limited SSE2 hardware support on IA64 exist to help the EL application emulate 32-bit applications. They don't provide performance competitive with native IA64 code nor, of course, with Intel 64 CPUs.
As far as I can see from my limited recollection of IA64 optimization, the compiler seems to do a reasonable job of translating the SSE2 intrinsics to native IA64 SWP code, according to this example.
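To make concrete what "translating the intrinsics to native code" has to preserve, here is a minimal scalar sketch of the per-lane arithmetic in the posted loop. It is not what icc emits for IA64. The function names `step4_scalar` and `kernel_scalar` are hypothetical, and the sketch assumes the scale factor is meant to stay constant across iterations (the posted loop overwrites `scale1` on every pass).

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for one 4-wide step (hypothetical helper):
   dst[k] -= scale * base[k], returning the sum of dst[k] * base[k]. */
static float step4_scalar(float *dst, const float *base, float scale)
{
    float sum = 0.0f;
    for (int k = 0; k < 4; ++k) {
        dst[k] -= scale * base[k];  /* lane k of _mm_mul_ps + _mm_sub_ps */
        sum += dst[k] * base[k];    /* lane k of _mm_mul_ps, then the
                                       tmp[0] + ... + tmp[3] reduction */
    }
    return sum;
}

/* Apply the step to n floats; n must be a multiple of 4.
   Assumption: scale is constant, unlike the posted loop. */
float kernel_scalar(float *dst, const float *base, size_t n, float scale)
{
    assert(n % 4 == 0);
    float total = 0.0f;
    for (size_t i = 0; i < n; i += 4)
        total += step4_scalar(dst + i, base + i, scale);
    return total;
}
```

Any correct translation only has to produce these results. Which registers and instructions carry the four lanes is up to the target, so the absence of "mulps" in the IA64 disassembly does not by itself mean the intrinsics were ignored.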
We do not have xmm registers in IA64. We use the general-purpose registers (64 bits in length) to pack data and apply Itanium-specific instructions to manipulate the 8-, 16-, 32- or 64-bit data.
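The idea of packing several small values into one 64-bit register can be illustrated in portable C. The sketch below emulates a lane-wise add of four 16-bit values with ordinary integer operations. It only illustrates the packing concept and does not show the Itanium instructions themselves or the compiler's output; the names `pack4x16`, `lane16` and `add4x16` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 16-bit values into one 64-bit word, lane 0 in the low bits. */
uint64_t pack4x16(uint16_t a0, uint16_t a1, uint16_t a2, uint16_t a3)
{
    return (uint64_t)a0 | ((uint64_t)a1 << 16) |
           ((uint64_t)a2 << 32) | ((uint64_t)a3 << 48);
}

/* Extract lane 0..3 from a packed word. */
uint16_t lane16(uint64_t v, unsigned lane)
{
    assert(lane < 4);
    return (uint16_t)(v >> (16 * lane));
}

/* Lane-wise wrapping add of four 16-bit values held in 64-bit words. */
uint64_t add4x16(uint64_t x, uint64_t y)
{
    const uint64_t HIGH = 0x8000800080008000ULL;  /* top bit of each lane */
    /* Add the low 15 bits of every lane; a carry can reach the lane's
       top bit but cannot spill into the neighbouring lane. */
    uint64_t low = (x & ~HIGH) + (y & ~HIGH);
    /* Fold in the top bits with XOR so no carry leaves the lane. */
    return low ^ ((x ^ y) & HIGH);
}
```

For example, `add4x16(pack4x16(1, 0xFFFF, 0, 100), pack4x16(2, 1, 0, 200))` yields lanes 3, 0, 0 and 300: the second lane wraps around without disturbing the third. A hardware parallel instruction does the same job in one operation instead of the masking shown here.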
Thanks to all for your replies - that clears things up.
