Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

ICC generates inefficient code for GNU-style inline assembly

hydroxyprolin
Beginner
290 Views
Compiler: Intel icc v11.1.056 (flags -xO -O2 -fasm-blocks)
Platform: Linux Intel64 (Debian 5.0, gcc 4.3.2)

[cpp]// Example #1 (very simple function in pure C):
double test_pure_c (double u)
{
	register double r;

	r = u * u;
	r += u;
	r *= r;
	
	return r;
}
[/cpp]
As expected, the icc compiler generates the optimal code:
[plain]        movaps    %xmm0, %xmm1
        mulsd     %xmm0, %xmm1
        addsd     %xmm1, %xmm0
        mulsd     %xmm0, %xmm0
        ret       
[/plain]
But now we replace the first and last calculation with GNU-style inline assembly:
[cpp]// Example #2:
double  test_asm_sse2 (double u)
{
	register double r;

	// r = u * u;
	__asm__ (
		"	movaps %1, %0\n"
		"	mulsd %0, %0"
		: "=&x" (r) : "x" (u)
	);
	
	r += u;

	// r *= r;
	__asm__ (
		"	mulsd %0, %0"
		: "+x" (r)
	);
	
	return r;
}
[/cpp]

The icc compiler generates this code:
[plain]        movsd     %xmm0, -40(%rsp)
        fldl      -40(%rsp)
        fstl      -24(%rsp)
        movsd     -24(%rsp), %xmm1
        movaps    %xmm1, %xmm2
        mulsd     %xmm2, %xmm2
        movsd     %xmm2, -32(%rsp)
        fldl      -32(%rsp)
        faddp     %st, %st(1)
        fstpl     -8(%rsp)
        movsd     -8(%rsp), %xmm3
        mulsd     %xmm3, %xmm3
        movsd     %xmm3, -16(%rsp)
        movsd     -16(%rsp), %xmm0
        ret       
[/plain]

This is a really stupid mixture of SIMD and x87 floating-point instructions, with unnecessary stores of intermediate values on the stack. GNU gcc has no problem 'fitting' the inline assembly into its own code and generates the optimal code for both examples #1 and #2.


Defining the variable r as 'register double r asm ("xmm1");' results in pure SIMD code, but with some unnecessary 'register shuffling':
[plain]        movsd     %xmm0, %xmm1
        mulsd     %xmm1, %xmm1
        # why the register shuffling below ?
        movaps    %xmm1, %xmm2
        addsd     %xmm0, %xmm2
        movaps    %xmm2, %xmm1
        # the three instructions above can be replaced by
        # a single 'addsd %xmm0, %xmm1'
        mulsd     %xmm1, %xmm1
        movaps    %xmm1, %xmm0
        ret  
[/plain]
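For completeness, here is a self-contained sketch of example #2 with the explicit register binding applied. The filled-in operand lists "(r)" are my assumptions where the original post elided them; treat this as a sketch, not the exact code from the post:

[cpp]// Example #2 with r explicitly bound to %xmm1 (GNU extension).
// The output/input operand variables are filled in here as an
// assumption; the post showed the constraints without them.
double test_asm_sse2_pinned (double u)
{
	register double r asm ("xmm1");

	// r = u * u;
	__asm__ (
		"	movaps %1, %0\n"
		"	mulsd %0, %0"
		: "=&x" (r) : "x" (u)
	);

	r += u;

	// r *= r;
	__asm__ (
		"	mulsd %0, %0"
		: "+x" (r)
	);

	return r;	// computes (u*u + u)^2
}
[/cpp]

Note this only compiles with GNU-style toolchains on x86-64, since it hard-codes an SSE register name.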

The compiler has a similar problem with variables declared using an asm register constraint:

[cpp]// Example #3:
long test_asm_reg_constraint (long c)
{
	register long r asm ("rax");

	r = c * c;
	r += 2;
	r *= r;
	r += c;

	return r;
}
[/cpp]

The icc compiler generates this silly code:

[plain]        movq      %rdi, %rdx
        imulq     %rdi, %rdx
        movq      %rdx, %rax
        lea       2(%rax), %rcx
        movq      %rcx, %rax
        movq      %rax, %rsi
        imulq     %rax, %rsi
        movq      %rsi, %rax
        lea       (%rax,%rdi), %r8
        movq      %r8, %rax
        ret
[/plain]

For example #3, GNU gcc again generates the optimal code both with and without the register constraint (icc only without it).
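For anyone who wants to reproduce this, here is example #3 as a self-contained unit with the arithmetic annotated; the comments and test values are mine, the function body is verbatim from the post. The sanity checks confirm the computed result is correct regardless of how well the compiler allocates registers:

[cpp]// Example #3, verbatim, with the intermediate values annotated.
// Requires a GNU-style toolchain on x86-64 ("rax" is hard-coded).
long test_asm_reg_constraint (long c)
{
	register long r asm ("rax");

	r = c * c;	// c^2
	r += 2;		// c^2 + 2
	r *= r;		// (c^2 + 2)^2
	r += c;		// (c^2 + 2)^2 + c

	return r;
}
[/cpp]

For c = 3 this yields (9 + 2)^2 + 3 = 124, which both compilers compute correctly; only the quality of the generated code differs.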

Conclusion:

The Intel C++ compiler cannot handle the register output of GNU-style inline assembly efficiently. It always uses intermediate storage (the stack, etc.) to save and restore such output, and does not pass the registers on to the surrounding code in a smart way.

GNU gcc can handle inline assembly with register output much better, resulting in faster and smaller code.
2 Replies
Mark_S_Intel1
Employee
Thanks for the problem report and the examples. We will look into this issue and give you an update.

Regards,
--mark
Mark_S_Intel1
Employee
The issue in example #2 has been resolved in Intel C++ Composer XE Update 7, Build 11 Oct 2011. For example #3, could you please explain the reason you are using the register variable feature and its importance to you?

Thanks,
--mark