Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.
Announcements
FPGA community forums and blogs on community.intel.com are migrating to the new Altera Community and are read-only. For urgent support needs during this transition, please visit the FPGA Design Resources page or contact an Altera Authorized Distributor.
29280 Discussions

Inefficient generated code for division of complex numbers by real numbers

styc
Beginner
957 Views
Compile
[fortran]program test

    double complex x
    double precision y

    read *, x, y
    print *, x / y
    print *, x * y

end program
[/fortran]
with ifort 12.0.0.084 (-O3 -xSSE4.2) produces (objdump output)
[plain]    print *, x / y
  402d48:	f2 0f 10 84 24 98 00 	movsd  0x98(%rsp),%xmm0
  402d4f:	00 00 
  402d51:	48 8d 3c 24          	lea    (%rsp),%rdi
  402d55:	0f 29 44 24 70       	movaps %xmm0,0x70(%rsp)
  402d5a:	f2 0f 11 44 24 60    	movsd  %xmm0,0x60(%rsp)
  402d60:	0f 57 c9             	xorps  %xmm1,%xmm1
  402d63:	dd 44 24 60          	fldl   0x60(%rsp)
  402d67:	f2 0f 11 4c 24 60    	movsd  %xmm1,0x60(%rsp)
  402d6d:	d9 c0                	fld    %st(0)
  402d6f:	d8 c9                	fmul   %st(1),%st
  402d71:	4c 8d 44 24 50       	lea    0x50(%rsp),%r8
  402d76:	d9 e8                	fld1   
  402d78:	0f 10 54 24 40       	movups 0x40(%rsp),%xmm2
  402d7d:	dd 44 24 60          	fldl   0x60(%rsp)
  402d81:	d9 c0                	fld    %st(0)
  402d83:	0f 28 da             	movaps %xmm2,%xmm3
  402d86:	d8 c9                	fmul   %st(1),%st
  402d88:	be ff ff ff ff       	mov    $0xffffffff,%esi
  402d8d:	f2 0f 11 54 24 60    	movsd  %xmm2,0x60(%rsp)
  402d93:	48 ba 00 ff 84 83 20 	mov    $0x208384ff00,%rdx
  402d9a:	00 00 00 
  402d9d:	66 0f 15 da          	unpckhpd %xmm2,%xmm3
  402da1:	b9 c0 ce 47 00       	mov    $0x47cec0,%ecx
  402da6:	33 c0                	xor    %eax,%eax
  402da8:	48 c7 04 24 00 00 00 	movq   $0x0,(%rsp)
  402daf:	00 
  402db0:	de c3                	faddp  %st,%st(3)
  402db2:	d9 ca                	fxch   %st(2)
  402db4:	de f9                	fdivrp %st,%st(1)
  402db6:	dd 44 24 60          	fldl   0x60(%rsp)
  402dba:	f2 0f 11 5c 24 60    	movsd  %xmm3,0x60(%rsp)
  402dc0:	d9 c0                	fld    %st(0)
  402dc2:	d8 cc                	fmul   %st(4),%st
  402dc4:	d9 c9                	fxch   %st(1)
  402dc6:	d8 cb                	fmul   %st(3),%st
  402dc8:	dd 44 24 60          	fldl   0x60(%rsp)
  402dcc:	d9 c0                	fld    %st(0)
  402dce:	de cd                	fmulp  %st,%st(5)
  402dd0:	0f 29 94 24 80 00 00 	movaps %xmm2,0x80(%rsp)
  402dd7:	00 
  402dd8:	d9 cc                	fxch   %st(4)
  402dda:	de c2                	faddp  %st,%st(2)
  402ddc:	d9 c9                	fxch   %st(1)
  402dde:	d8 ca                	fmul   %st(2),%st
  402de0:	dd 5c 24 60          	fstpl  0x60(%rsp)
  402de4:	d9 cb                	fxch   %st(3)
  402de6:	de ca                	fmulp  %st,%st(2)
  402de8:	f2 0f 10 64 24 60    	movsd  0x60(%rsp),%xmm4
  402dee:	d9 ca                	fxch   %st(2)
  402df0:	de e9                	fsubrp %st,%st(1)
  402df2:	de c9                	fmulp  %st,%st(1)
  402df4:	dd 5c 24 60          	fstpl  0x60(%rsp)
  402df8:	66 0f 16 64 24 60    	movhpd 0x60(%rsp),%xmm4
  402dfe:	0f 11 64 24 50       	movups %xmm4,0x50(%rsp)
  402e03:	e8 48 2d 01 00       	callq  415b50 
  402e08:	0f 28 94 24 80 00 00 	movaps 0x80(%rsp),%xmm2
  402e0f:	00 
    print *, x * y
  402e10:	48 8d 3c 24          	lea    (%rsp),%rdi
  402e14:	f2 0f 12 44 24 70    	movddup 0x70(%rsp),%xmm0
  402e1a:	4c 8d 44 24 60       	lea    0x60(%rsp),%r8
  402e1f:	66 0f 59 c2          	mulpd  %xmm2,%xmm0
  402e23:	be ff ff ff ff       	mov    $0xffffffff,%esi
  402e28:	48 ba 00 ff 84 83 20 	mov    $0x208384ff00,%rdx
  402e2f:	00 00 00 
  402e32:	b9 c8 ce 47 00       	mov    $0x47cec8,%ecx
  402e37:	33 c0                	xor    %eax,%eax
  402e39:	48 c7 04 24 00 00 00 	movq   $0x0,(%rsp)
  402e40:	00 
  402e41:	0f 11 44 24 60       	movups %xmm0,0x60(%rsp)
  402e46:	e8 05 2d 01 00       	callq  415b50 
[/plain]
Shouldn't the division calculated simply using the divpd instruction similarly to the multiplication?
0 Kudos
5 Replies
jimdempseyatthecove
Honored Contributor III
957 Views
What size is "double"? REAL(8) or REAL(16)?

The size of double is affected by a command line option.
When REAL(16), internally it uses the FPP TBYTE format (10 bytes) for computations.

Jim Dempsey
0 Kudos
TimP
Honored Contributor III
957 Views
I agree, it's curious that SSE4.2 option produces such code. But, if you accept the limited range complex arithmetic, -fp-model fast=2 appears to do what you want, and it shouldn't expose the limited range problem in this case. The question would be more interesting if you posed a case with a chance of measurably different performance, preferably one which included operations between complex and complex.
0 Kudos
Steven_L_Intel1
Employee
957 Views
Jim, I don't know what you mean by "FPP TBYTE" format. In our Fortran, REAL(16) is a 16-byte IEEE-style format that has all computations done in a software library. Tim has the right idea here - the standard code is a bit more cautious regarding range issues.
0 Kudos
styc
Beginner
957 Views
The posted code does calculations in double-precision (64-bit) arithmetic. But that is not the issue. The generate code is acceptable if y is a double complex variable. But when it is a double precision variable, the division x/y can be computed as simply as by x/y = (a/y, b/y) assuming x = (a, b), just like the multiplication x*y is computed by x*y = (a*y, b*y). Converting y to double complex type and doing a full complex-complex division is not any safer in terms of range issues.
0 Kudos
TimP
Honored Contributor III
957 Views
No, the point is that the compiler defaults to a code generation mode which should support general complex arithmetic. If you wish the compiler to switch into the mode it uses with -fast=2 when it is possible to do so without loss of generality (supposing that is your meaning), you could make your case in a post on your premier.intel.com, but it would be worth while to have a case which shows a likely performance benefit.
0 Kudos
Reply