Software Archive
Read-only legacy content
17061 Discussions

SSE intrinsics do not generate efficient code

constantine-vassilev
626 Views
Currently I am testing Parallel Studio's SSE code generation by looking at the assembler code generated using the
Assembly, Machine Code and Source (/FAcs) option.

The code is from Parallel Studio\Composer\Samples\en_US\C++\intrinsic_samples

float dot_product_intrin(float *a, float *b)
{
float total;
int i;
__m128 num1, num2, num3, num4;
num4= _mm_setzero_ps(); //sets sum to zero
for(i=0; i<SIZE; i+=4)
{
num1 = _mm_loadu_ps(a+i); //loads unaligned array a into num1 num1= a[3] a[2] a[1] a[0]
num2 = _mm_loadu_ps(b+i); //loads unaligned array b into num2 num2= b[3] b[2] b[1] b[0]
num3 = _mm_mul_ps(num1, num2); //performs multiplication num3 = a[3]*b[3] a[2]*b[2] a[1]*b[1] a[0]*b[0]
num3 = _mm_hadd_ps(num3, num3); //performs horizontal addition
//num3= a[3]*b[3]+ a[2]*b[2] a[1]*b[1]+a[0]*b[0] a[3]*b[3]+ a[2]*b[2] a[1]*b[1]+a[0]*b[0]
num4 = _mm_add_ps(num4, num3); //performs vertical addition
}

num4= _mm_hadd_ps(num4, num4);
_mm_store_ss(&total,num4);
return total;
}
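As an aside, the sample keeps a `_mm_hadd_ps` inside the loop. A minimal variant (the function name and the SSE1-only reduction are my own, not from the sample) defers the horizontal reduction until after the loop, so the loop body is just two loads, a multiply, and a vertical add:

```c
#include <xmmintrin.h>  /* SSE1 is enough; no hadd needed */

/* Sketch: dot product with the horizontal reduction hoisted out of the
   loop. Assumes n is a multiple of 4, as the sample's SIZE comment says. */
float dot_product_deferred(const float *a, const float *b, int n)
{
    __m128 sum = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);           /* a[i+3] a[i+2] a[i+1] a[i] */
        __m128 vb = _mm_loadu_ps(b + i);           /* b[i+3] b[i+2] b[i+1] b[i] */
        sum = _mm_add_ps(sum, _mm_mul_ps(va, vb)); /* vertical add only */
    }
    /* reduce the four lanes once, with SSE1 shuffles */
    __m128 hi = _mm_movehl_ps(sum, sum);  /* {s2, s3, s2, s3} */
    sum = _mm_add_ps(sum, hi);            /* lane0 = s0+s2, lane1 = s1+s3 */
    hi  = _mm_shuffle_ps(sum, sum, 0x55); /* broadcast lane 1 to all lanes */
    sum = _mm_add_ss(sum, hi);            /* lane0 = (s0+s2)+(s1+s3) */
    float total;
    _mm_store_ss(&total, sum);
    return total;
}
```

With optimization on, the accumulator can stay in a single xmm register for the whole loop; without optimization, the stack spills discussed later in this thread still appear.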

The line of interest is:
num3 = _mm_mul_ps(num1, num2);

From definition:
__m128 num1, num2, num3, num4;

and
num1 = _mm_loadu_ps(a+i); //loads unaligned array a into num1 num1= a[3] a[2] a[1] a[0]
num2 = _mm_loadu_ps(b+i); //loads unaligned array b into num2 num2= b[3] b[2] b[1] b[0]

I expect num1 and num2 to be SSE registers, so that when this line executes:
num3 = _mm_mul_ps(num1, num2);
num1 and num2 are already initialized with the elements of a and b, and
_mm_mul_ps just performs the multiplication.

Analysing the assembler code generated, I see something different.

;;; num3 = _mm_mul_ps(num1, num2); //performs multiplication num3 = a[3]*b[3] a[2]*b[2] a[1]*b[1] a[0]*b[0]
a173c 0f 28 45 30 movaps xmm0, XMMWORD PTR [48+rbp] ;
a1740 0f 28 4d 40 movaps xmm1, XMMWORD PTR [64+rbp] ;
a1744 0f 59 c1 mulps xmm0, xmm1 ;
a1747 0f 29 45 50 movaps XMMWORD PTR [80+rbp], xmm0 ;

As you see, there are two movaps loads into xmm0 and xmm1, meaning the values are loaded into registers just before the multiplication.

Looking at how num1 is initialized:
;;; num1 = _mm_loadu_ps(a+i); //loads unaligned array a into num1 num1= a[3] a[2] a[1] a[0]

there are a lot of conditionals and then this code:

a16c1 48 63 85 10 01 00 00 movsxd rax, DWORD PTR [272+rbp] ;
a16c8 48 8b 95 30 01 00 00 mov rdx, QWORD PTR [304+rbp] ;
a16cf 0f 10 04 82 movups xmm0, XMMWORD PTR [rdx+rax*4] ;
a16d3 0f 29 45 30 movaps XMMWORD PTR [48+rbp], xmm0 ;

looking at num2 initialization:

a1726 48 63 85 10 01 00 00 movsxd rax, DWORD PTR [272+rbp] ;
a172d 48 8b 95 38 01 00 00 mov rdx, QWORD PTR [312+rbp] ;
a1734 0f 10 04 82 movups xmm0, XMMWORD PTR [rdx+rax*4] ;
a1738 0f 29 45 40 movaps XMMWORD PTR [64+rbp], xmm0 ;

As you see, xmm0 is used again and the result is stored back to memory. I was expecting num1 and num2 to already be held
in xmm registers, so that the multiplication could use them directly from, for example, xmm0 and xmm1:

num3 = _mm_mul_ps(num1, num2);

But as you see, that is not the case. I am using a 64-bit computer and 64-bit code generation. When generating 32-bit code the result is similar, but not as many conditionals are generated for _mm_loadu_ps (loading unaligned from memory).

I have the following questions:
1. Is this normal, and do I have to use assembly coding to get the most efficient code?
2. If that is the case, why is the code generated using intrinsics so inefficient?
3. Also, the 64-bit code is much more inefficient - why? If so, what is the performance benefit of using 64-bit code?


By the way, I noticed this only after watching the SSE registers in the debugger, where I saw only xmm0 changing after num1 and num2 were initialized. I was expecting two xmm registers to change. Then I generated the assembler code and analyzed it.

Thanks in advance,
Constantine





0 Kudos
6 Replies
Om_S_Intel
Employee
626 Views
Quoting - thstart
Are you using the /O2 compiler option? If you provide the compilation command line we can take a look at the generated code.
0 Kudos
constantine-vassilev
626 Views

Are you using the /O2 compiler option? If you provide the compilation command line we can take a look at the generated code.
I believe I tried all the optimization options, but the generated code is the same.

For this simple example, could you please generate the best optimized code yourself and post your best result
with the corresponding compiler option?
0 Kudos
Om_S_Intel
Employee
626 Views
I have included the header and defined SIZE.

//

#include <pmmintrin.h>
#define SIZE 12 //assumes size is a multiple of 4 because MMX and SSE
//registers will store 4 elements.

float dot_product_intrin(float * a, float * b)
{
float total;
int i;
__m128 num1, num2, num3, num4;
num4= _mm_setzero_ps(); //sets sum to zero
for(i=0; i<SIZE; i+=4){
num1 = _mm_loadu_ps(a+i); //loads unaligned array a into num1 num1= a[3] a[2] a[1] a[0]
num2 = _mm_loadu_ps(b+i); //loads unaligned array b into num2 num2= b[3] b[2] b[1] b[0]
num3 = _mm_mul_ps(num1, num2); //performs multiplication num3 = a[3]*b[3] a[2]*b[2] a[1]*b[1] a[0]*b[0]
num3 = _mm_hadd_ps(num3, num3); //performs horizontal addition
//num3= a[3]*b[3]+ a[2]*b[2] a[1]*b[1]+a[0]*b[0] a[3]*b[3]+ a[2]*b[2] a[1]*b[1]+a[0]*b[0]
num4 = _mm_add_ps(num4, num3); //performs vertical addition
}

num4= _mm_hadd_ps(num4, num4);
_mm_store_ss(&total,num4);
return total;
}


I used the -O2 option; the generated assembly is given below:

; -- Machine type EFI2
; mark_description "Intel C++ Compiler for applications running on Intel 64, Version 11.1 Beta Build 20090227 %s";
; mark_description "-O2 -c -FAs";
OPTION DOTNAME
_TEXT SEGMENT 'CODE'
; COMDAT dot_product_intrin
TXTST0:
; -- Begin dot_product_intrin
; mark_begin;
ALIGN 16
PUBLIC dot_product_intrin
dot_product_intrin PROC
; parameter 1: rcx
; parameter 2: rdx
.B1.1:: ; Preds .B1.0

;;; {


;;; float total;
;;; int i;
;;; __m128 num1, num2, num3, num4;
;;; num4= _mm_setzero_ps(); //sets sum to zero
;;; for(i=0; i<SIZE; i+=4)
;;; {
;;; num1 = _mm_loadu_ps(a+i); //loads unaligned array a into num1 num1= a[3] a[2] a[1] a[0]
;;; num2 = _mm_loadu_ps(b+i); //loads unaligned array b into num2 num2= b[3] b[2] b[1] b[0]
;;; num3 = _mm_mul_ps(num1, num2); //performs multiplication num3 = a[3]*b[3] a[2]*b[2] a[1]*b[1] a[0]*b[0]

movups xmm5, XMMWORD PTR [rcx] ;15.8
movups xmm0, XMMWORD PTR [rdx] ;15.8
movups xmm2, XMMWORD PTR [16+rcx] ;15.8
movups xmm1, XMMWORD PTR [16+rdx] ;15.8
movups xmm4, XMMWORD PTR [32+rcx] ;15.8
movups xmm3, XMMWORD PTR [32+rdx] ;15.8
mulps xmm5, xmm0 ;15.8
mulps xmm2, xmm1 ;15.8
mulps xmm4, xmm3 ;15.8

;;; num3 = _mm_hadd_ps(num3, num3); //performs horizontal addition

haddps xmm5, xmm5 ;16.8
haddps xmm2, xmm2 ;16.8

;;; //num3= a[3]*b[3]+ a[2]*b[2] a[1]*b[1]+a[0]*b[0] a[3]*b[3]+ a[2]*b[2] a[1]*b[1]+a[0]*b[0]
;;; num4 = _mm_add_ps(num4, num3); //performs vertical addition

addps xmm5, xmm2 ;18.8
haddps xmm4, xmm4 ;16.8
addps xmm5, xmm4 ;18.8

;;; }
;;;
;;; num4= _mm_hadd_ps(num4, num4);

haddps xmm5, xmm5 ;21.7

;;; _mm_store_ss(&total,num4);

; LOE rbx rbp rsi rdi r12 r13 r14 r15 xmm5 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.2:: ; Preds .B1.1

;;; return total;

movaps xmm0, xmm5 ;23.8
ret ;23.8
ALIGN 16
; LOE
.B1.3::
; mark_end;
dot_product_intrin ENDP
;dot_product_intrin ENDS
_TEXT ENDS
_DATA SEGMENT 'DATA'
_DATA ENDS
; -- End dot_product_intrin
_DATA SEGMENT 'DATA'
_DATA ENDS
EXTRN __ImageBase:PROC
EXTRN _fltused:BYTE
END

I also generated the assembly using the /Od compiler option; it is given below:

; -- Machine type EFI2
; mark_description "Intel C++ Compiler for applications running on Intel 64, Version 11.1 Beta Build 20090227 %s";
; mark_description "-Od -c -FAs";
OPTION DOTNAME
_TEXT SEGMENT 'CODE'
TXTST0:
; -- Begin dot_product_intrin
; mark_begin;
ALIGN 2
PUBLIC dot_product_intrin
dot_product_intrin PROC
; parameter 1: rcx
; parameter 2: rdx
.B1.1:: ; Preds .B1.0

;;; {

push rbp ;6.1
sub rsp, 112 ;6.1
lea rbp, QWORD PTR [32+rsp] ;6.1
mov QWORD PTR [96+rbp], rcx ;6.1
mov QWORD PTR [104+rbp], rdx ;6.1

;;; float total;
;;; int i;
;;; __m128 num1, num2, num3, num4;
;;; num4= _mm_setzero_ps(); //sets sum to zero

pxor xmm0, xmm0 ;10.7
movaps XMMWORD PTR [rbp], xmm0 ;10.1

;;; for(i=0; i<SIZE; i+=4)
mov DWORD PTR [64+rbp], 0 ;11.5
mov eax, DWORD PTR [64+rbp] ;11.10
cmp eax, 12 ;11.12
jge .B1.4 ; Prob 50% ;11.12
; LOE
.B1.3:: ; Preds .B1.1 .B1.3

;;; {
;;; num1 = _mm_loadu_ps(a+i); //loads unaligned array a into num1 num1= a[3] a[2] a[1] a[0]

movsxd rax, DWORD PTR [64+rbp] ;13.23
mov rdx, QWORD PTR [96+rbp] ;13.21
movups xmm0, XMMWORD PTR [rdx+rax*4] ;13.21
movaps XMMWORD PTR [16+rbp], xmm0 ;13.1

;;; num2 = _mm_loadu_ps(b+i); //loads unaligned array b into num2 num2= b[3] b[2] b[1] b[0]

movsxd rax, DWORD PTR [64+rbp] ;14.23
mov rdx, QWORD PTR [104+rbp] ;14.21
movups xmm0, XMMWORD PTR [rdx+rax*4] ;14.21
movaps XMMWORD PTR [32+rbp], xmm0 ;14.1

;;; num3 = _mm_mul_ps(num1, num2); //performs multiplication num3 = a[3]*b[3] a[2]*b[2] a[1]*b[1] a[0]*b[0]

movaps xmm0, XMMWORD PTR [16+rbp] ;15.19
movaps xmm1, XMMWORD PTR [32+rbp] ;15.25
mulps xmm0, xmm1 ;15.8
movaps XMMWORD PTR [48+rbp], xmm0 ;15.1

;;; num3 = _mm_hadd_ps(num3, num3); //performs horizontal addition

movaps xmm0, XMMWORD PTR [48+rbp] ;16.20
movaps xmm1, XMMWORD PTR [48+rbp] ;16.26
haddps xmm0, xmm1 ;16.8
movaps XMMWORD PTR [48+rbp], xmm0 ;16.1

;;; //num3= a[3]*b[3]+ a[2]*b[2] a[1]*b[1]+a[0]*b[0] a[3]*b[3]+ a[2]*b[2] a[1]*b[1]+a[0]*b[0]
;;; num4 = _mm_add_ps(num4, num3); //performs vertical addition

movaps xmm0, XMMWORD PTR [rbp] ;18.19
movaps xmm1, XMMWORD PTR [48+rbp] ;18.25
addps xmm0, xmm1 ;18.8
movaps XMMWORD PTR [rbp], xmm0 ;18.1
add DWORD PTR [64+rbp], 4 ;11.18
mov eax, DWORD PTR [64+rbp] ;11.10
cmp eax, 12 ;11.12
jl .B1.3 ; Prob 50% ;11.12
; LOE
.B1.4:: ; Preds .B1.3 .B1.1

;;; }
;;;
;;; num4= _mm_hadd_ps(num4, num4);

movaps xmm0, XMMWORD PTR [rbp] ;21.19
movaps xmm1, XMMWORD PTR [rbp] ;21.25
haddps xmm0, xmm1 ;21.7
movaps XMMWORD PTR [rbp], xmm0 ;21.1

;;; _mm_store_ss(&total,num4);

movaps xmm0, XMMWORD PTR [rbp] ;22.21
movss DWORD PTR [68+rbp], xmm0 ;22.1
; LOE
.B1.5:: ; Preds .B1.4

;;; return total;

movss xmm0, DWORD PTR [68+rbp] ;23.8
lea rsp, QWORD PTR [80+rbp] ;23.8
pop rbp ;23.8
ret ;23.8
ALIGN 2
; LOE
.B1.6::
; mark_end;
dot_product_intrin ENDP
.xdata SEGMENT DWORD READ
$unwind$dot_product_intrin$B1_B5 DD 025030a01H
DD 0d205030aH
DD 05001H
.xdata ENDS
.pdata SEGMENT DWORD READ
$pdata$dot_product_intrin$B1_B5 DD imagerel .B1.1
DD imagerel .B1.6
DD imagerel $unwind$dot_product_intrin$B1_B5
.pdata ENDS
_TEXT ENDS
_DATA SEGMENT 'DATA'
_DATA ENDS
; -- End dot_product_intrin
_DATA SEGMENT 'DATA'
_DATA ENDS
EXTRN __ImageBase:PROC
EXTRN _fltused:BYTE
END

0 Kudos
constantine-vassilev
626 Views
Hi Om Sachan,

Thanks for your response.

Could you please look again at the generated code and tell me if it is optimal?
No, it is not. I have highlighted the code of interest.
1) num3 is calculated and the result is in xmm0.
2) the xmm0 content is moved to memory.
3) the same value is moved back into xmm0 again.

Operation 2) is not needed. Moving back and forth between memory and a
register is a slow operation.

I don't believe the compiler generates optimal code.

Please correct me if I am wrong,
Constantine

------------------------------------------------------------------------------------------------------------------------------------
;;; num3 = _mm_mul_ps(num1, num2); //performs multiplication num3 = a[3]*b[3] a[2]*b[2] a[1]*b[1] a[0]*b[0]

movaps xmm0, XMMWORD PTR [16+rbp] ;15.19
movaps xmm1, XMMWORD PTR [32+rbp] ;15.25

1)->mulps xmm0, xmm1 ;15.8
2)->movaps XMMWORD PTR [48+rbp], xmm0 ;15.1

;;; num3 = _mm_hadd_ps(num3, num3); //performs horizontal addition

3)->movaps xmm0, XMMWORD PTR [48+rbp] ;16.20

movaps xmm1, XMMWORD PTR [48+rbp] ;16.26
haddps xmm0, xmm1 ;16.8
movaps XMMWORD PTR [48+rbp], xmm0 ;16.1
------------------------------------------------------------------------------------------------------------------------------------
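For what it's worth, the round trip marked 2) and 3) is characteristic of unoptimized codegen: every named __m128 local (num1..num4) gets a stack slot, every assignment stores to it, and every use reloads it. A sketch (my own rearrangement, not the sample's code) that folds the loop body into one expression leaves the accumulator as the only named vector value; with -O2 it can stay in a register, though at /Od a compiler may still spill the unnamed intermediates:

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical rewrite: one expression instead of the named temporaries
   num1..num3, so there is nothing to round-trip through their stack
   slots. Assumes n is a multiple of 4. The total matches the sample's
   up to floating-point reassociation. */
float dot_product_fused(const float *a, const float *b, int n)
{
    __m128 sum = _mm_setzero_ps();  /* the only live vector value */
    for (int i = 0; i < n; i += 4)
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(a + i),
                                         _mm_loadu_ps(b + i)));
    /* scalar horizontal reduction, done once after the loop */
    float lanes[4];
    _mm_storeu_ps(lanes, sum);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```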
0 Kudos
constantine-vassilev
626 Views
Quoting - thstart


I need to make a correction: the -O2 option generates optimized code, as you posted it,
only in 64-bit mode. My comment is about the code generated with the -O2 option in 32-bit mode.
0 Kudos
Om_S_Intel
Employee
626 Views

I am getting a different result from yours. Could you please attach the complete assembly generated using the compiler options -c -O2 -FAcs?

Thanks,

Om

0 Kudos
Reply