Inefficient generated code for division of complex numbers by r

styc · ‎11-12-2010

Compile

[fortran]program test

    double complex x
    double precision y

    read *, x, y
    print *, x / y
    print *, x * y

end program
[/fortran]

with ifort 12.0.0.084 (-O3 -xSSE4.2) produces (objdump output)

[plain]    print *, x / y
  402d48:	f2 0f 10 84 24 98 00 	movsd  0x98(%rsp),%xmm0
  402d4f:	00 00 
  402d51:	48 8d 3c 24          	lea    (%rsp),%rdi
  402d55:	0f 29 44 24 70       	movaps %xmm0,0x70(%rsp)
  402d5a:	f2 0f 11 44 24 60    	movsd  %xmm0,0x60(%rsp)
  402d60:	0f 57 c9             	xorps  %xmm1,%xmm1
  402d63:	dd 44 24 60          	fldl   0x60(%rsp)
  402d67:	f2 0f 11 4c 24 60    	movsd  %xmm1,0x60(%rsp)
  402d6d:	d9 c0                	fld    %st(0)
  402d6f:	d8 c9                	fmul   %st(1),%st
  402d71:	4c 8d 44 24 50       	lea    0x50(%rsp),%r8
  402d76:	d9 e8                	fld1   
  402d78:	0f 10 54 24 40       	movups 0x40(%rsp),%xmm2
  402d7d:	dd 44 24 60          	fldl   0x60(%rsp)
  402d81:	d9 c0                	fld    %st(0)
  402d83:	0f 28 da             	movaps %xmm2,%xmm3
  402d86:	d8 c9                	fmul   %st(1),%st
  402d88:	be ff ff ff ff       	mov    $0xffffffff,%esi
  402d8d:	f2 0f 11 54 24 60    	movsd  %xmm2,0x60(%rsp)
  402d93:	48 ba 00 ff 84 83 20 	mov    $0x208384ff00,%rdx
  402d9a:	00 00 00 
  402d9d:	66 0f 15 da          	unpckhpd %xmm2,%xmm3
  402da1:	b9 c0 ce 47 00       	mov    $0x47cec0,%ecx
  402da6:	33 c0                	xor    %eax,%eax
  402da8:	48 c7 04 24 00 00 00 	movq   $0x0,(%rsp)
  402daf:	00 
  402db0:	de c3                	faddp  %st,%st(3)
  402db2:	d9 ca                	fxch   %st(2)
  402db4:	de f9                	fdivrp %st,%st(1)
  402db6:	dd 44 24 60          	fldl   0x60(%rsp)
  402dba:	f2 0f 11 5c 24 60    	movsd  %xmm3,0x60(%rsp)
  402dc0:	d9 c0                	fld    %st(0)
  402dc2:	d8 cc                	fmul   %st(4),%st
  402dc4:	d9 c9                	fxch   %st(1)
  402dc6:	d8 cb                	fmul   %st(3),%st
  402dc8:	dd 44 24 60          	fldl   0x60(%rsp)
  402dcc:	d9 c0                	fld    %st(0)
  402dce:	de cd                	fmulp  %st,%st(5)
  402dd0:	0f 29 94 24 80 00 00 	movaps %xmm2,0x80(%rsp)
  402dd7:	00 
  402dd8:	d9 cc                	fxch   %st(4)
  402dda:	de c2                	faddp  %st,%st(2)
  402ddc:	d9 c9                	fxch   %st(1)
  402dde:	d8 ca                	fmul   %st(2),%st
  402de0:	dd 5c 24 60          	fstpl  0x60(%rsp)
  402de4:	d9 cb                	fxch   %st(3)
  402de6:	de ca                	fmulp  %st,%st(2)
  402de8:	f2 0f 10 64 24 60    	movsd  0x60(%rsp),%xmm4
  402dee:	d9 ca                	fxch   %st(2)
  402df0:	de e9                	fsubrp %st,%st(1)
  402df2:	de c9                	fmulp  %st,%st(1)
  402df4:	dd 5c 24 60          	fstpl  0x60(%rsp)
  402df8:	66 0f 16 64 24 60    	movhpd 0x60(%rsp),%xmm4
  402dfe:	0f 11 64 24 50       	movups %xmm4,0x50(%rsp)
  402e03:	e8 48 2d 01 00       	callq  415b50 
  402e08:	0f 28 94 24 80 00 00 	movaps 0x80(%rsp),%xmm2
  402e0f:	00 
    print *, x * y
  402e10:	48 8d 3c 24          	lea    (%rsp),%rdi
  402e14:	f2 0f 12 44 24 70    	movddup 0x70(%rsp),%xmm0
  402e1a:	4c 8d 44 24 60       	lea    0x60(%rsp),%r8
  402e1f:	66 0f 59 c2          	mulpd  %xmm2,%xmm0
  402e23:	be ff ff ff ff       	mov    $0xffffffff,%esi
  402e28:	48 ba 00 ff 84 83 20 	mov    $0x208384ff00,%rdx
  402e2f:	00 00 00 
  402e32:	b9 c8 ce 47 00       	mov    $0x47cec8,%ecx
  402e37:	33 c0                	xor    %eax,%eax
  402e39:	48 c7 04 24 00 00 00 	movq   $0x0,(%rsp)
  402e40:	00 
  402e41:	0f 11 44 24 60       	movups %xmm0,0x60(%rsp)
  402e46:	e8 05 2d 01 00       	callq  415b50 
[/plain]

Shouldn't the division calculated simply using the divpd instruction similarly to the multiplication?

jimdempseyatthecove · ‎11-13-2010

What size is "double"? REAL(8) or REAL(16)?

The size of double is affected by a command line option.
When REAL(16), internally it uses the FPP TBYTE format (10 bytes) for computations.

Jim Dempsey

TimP · ‎11-13-2010

I agree, it's curious that SSE4.2 option produces such code. But, if you accept the limited range complex arithmetic, -fp-model fast=2 appears to do what you want, and it shouldn't expose the limited range problem in this case. The question would be more interesting if you posed a case with a chance of measurably different performance, preferably one which included operations between complex and complex.

Steven_L_Intel1 · ‎11-14-2010

Jim, I don't know what you mean by "FPP TBYTE" format. In our Fortran, REAL(16) is a 16-byte IEEE-style format that has all computations done in a software library. Tim has the right idea here - the standard code is a bit more cautious regarding range issues.

styc · ‎11-15-2010

The posted code does calculations in double-precision (64-bit) arithmetic. But that is not the issue. The generate code is acceptable if y is a double complex variable. But when it is a double precision variable, the division x/y can be computed as simply as by x/y = (a/y, b/y) assuming x = (a, b), just like the multiplication x*y is computed by x*y = (a*y, b*y). Converting y to double complex type and doing a full complex-complex division is not any safer in terms of range issues.

TimP · ‎11-15-2010

No, the point is that the compiler defaults to a code generation mode which should support general complex arithmetic. If you wish the compiler to switch into the mode it uses with -fast=2 when it is possible to do so without loss of generality (supposing that is your meaning), you could make your case in a post on your premier.intel.com, but it would be worth while to have a case which shows a likely performance benefit.

Inefficient generated code for division of complex numbers by real numbers