Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

/Qprec-div- not working?

piet_de_weer
Beginner

Hi,

I'm trying to optimize some C++ code that contains a number of divisions, so I want the compiler to use the RCPSS instruction instead of the much slower DIVSS.

I'm compiling for a Pentium 4 (/QaxW /QxW), with the option "/Qprec-div-" to generate RCPSS instructions.

For a small piece of code, this is the assembly that's generated:

[plain]$B8$13:                         ; Preds $B8$12 $B8$54
cvtsi2ss xmm4, edx ;67.42
movss xmm1, DWORD PTR _2il0floatpacket$4 ;67.42
movaps xmm5, xmm1 ;67.42
divss xmm5, xmm4 ;67.42
mulss xmm5, DWORD PTR _2il0floatpacket$3 ;67.42
mulss xmm0, xmm5 ;67.13
mulss xmm3, xmm5 ;68.13
movss DWORD PTR [esp+20], xmm0 ;67.13
movss DWORD PTR [edi], xmm3 ;68.13
; LOE eax ecx xmm1 xmm2
$B8$14: ; Preds $B8$8 $B8$48 $B8$13
[/plain]

As you can see, there is a DIVSS instruction here, and no RCPSS.

So I tried what would happen if I use /Qprec-div instead, and the generated code is different:

[plain]$B8$13:                         ; Preds $B8$12 $B8$53
cvtsi2ss xmm3, edx ;67.42
movss xmm4, DWORD PTR _2il0floatpacket$3 ;67.25
divss xmm4, xmm3 ;67.42
mulss xmm1, xmm4 ;67.13
movss DWORD PTR [esp+20], xmm1 ;67.13
mulss xmm2, xmm4 ;68.13
movss DWORD PTR [edi], xmm2 ;68.13
; LOE eax ecx xmm0
$B8$14: ; Preds $B8$8 $B8$5 $B8$13
[/plain]

This code is smaller (one multiply and one mov are gone!), which would make sense if the code above contained an RCPSS (because the reciprocal approach needs an extra multiply).

So it seems the code above WAS restructured as if RCPSS were used, and yet DIVSS was emitted. The same thing happens to all the (dozens of) divisions I can find in the assembly code; RCPSS is not used anywhere, not even when I calculate 1.0f / . The total assembly file is about 2% bigger with the /Qprec-div- option.
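To make the expected rewrite concrete, here is a minimal sketch (illustrative only, function name is mine, not compiler output) of the transform that /Qprec-div- licenses: the division is restructured as a multiply by a reciprocal, which is exactly why an RCPSS version would carry one more multiply than a plain DIVSS version.

[cpp]// Illustrative sketch, not actual compiler output: under /Qprec-div-
// the compiler may rewrite x / y as x * (1 / y). The reciprocal is the
// candidate for RCPSS; the multiply is the extra MULSS seen above.
float Quotient(float x, float y)
{
    float recip = 1.0f / y; // would ideally become RCPSS
    return x * recip;       // the extra multiply
}
[/cpp]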

I'm using normal C++ code (no intrinsics, at least not in this piece of code).

Compiler is "C++ 10.0.013 [IA-32]".

Compiler flags: /GL /c /Ox /Og /Ob1 /Oi /Ot /Oy /GT /D "WIN32" /D "_WINDOWS" /D "NDEBUG" /D "_MBCS" /GF /FD /EHsc /MT /Zc:wchar_t /Fo"Release/" /W3 /nologo /Wp64 /Zi /Gd /Ow /Qansi-alias /Qno-alias-args /DSSE2 /fp:fast=2 /Qfp-speculation:fast /Qprec-div- /Qprec-sqrt- /Qftz /QaxW /QxW

(But I've also tried calling icl manually with simplified options: icl Overshoot.cpp /c /Fa".\\Release/" /Fo".\\Release/" /Qprec-div- /QxP, with exactly the same results.)

Am I missing something?

If this is a compiler bug, is there some different code I could write to get RCPSS? I've tried things like the following:

[cpp]#include <xmmintrin.h>  // SSE intrinsics for _mm_rcp_ps and friends

__forceinline float OneDiv(float f)
{
    //return 1.0f / f;
    // Note: _mm_rcp_ps is the packed variant, hence the rcpps below.
    return _mm_cvtss_f32(_mm_rcp_ps(_mm_set_ss(f)));
}
[/cpp]
The code that comes out is close to what I want, but some "dumb" things are happening (some parts are not optimized away):
[plain]        movss     xmm1, xmm0        ;253.34 - this line should not be there
        rcpps     xmm0, xmm1        ;253.34
[/plain]
The rest of the assembly code changes quite a lot, so it's a bit difficult to compare.

Om_S_Intel
Employee
It would help if you could provide a test case that we can compile, so we can review the generated assembly code.
TimP
Honored Contributor III

I have never seen the compiler generate the rcpss instruction. If it did, it would follow it with a Newton iteration step to improve the result to near IEEE divide precision. The purpose of doing so would be to improve throughput in cases where the FPU can then be made available for independent operations; it would certainly not serve to reduce generated code size.
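For concreteness, here is a minimal sketch (mine, not compiler output) of the rcpss-plus-Newton sequence described above; one refinement step x1 = x0 * (2 - d * x0) takes the roughly 12-bit estimate to near single precision.

[cpp]#include <xmmintrin.h>

// One Newton-Raphson refinement of the RCPSS estimate of 1/d:
// x1 = x0 * (2 - d*x0), roughly doubling the ~12 accurate bits.
static inline float RcpNewton(float d)
{
    __m128 vd = _mm_set_ss(d);
    __m128 x0 = _mm_rcp_ss(vd);                    // ~12-bit estimate
    __m128 e  = _mm_sub_ss(_mm_set_ss(2.0f),
                           _mm_mul_ss(vd, x0));    // 2 - d*x0
    return _mm_cvtss_f32(_mm_mul_ss(x0, e));       // refined reciprocal
}
[/cpp]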

The compiler does use rcpps by default in vectorized code. As you say, this was done largely on account of the weak division performance of the original P4 of nine years ago. As it has been several years since CPUs with that characteristic were produced, there isn't much incentive for current compilers to optimize for it.
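As an illustration (my example, not from this thread), this is the kind of loop where the vectorizer can apply rcpps:

[cpp]// A vectorizable division loop: with /Qprec-div- the vectorizer may use
// rcpps (possibly plus a Newton step) in place of the slower divps.
void ScaleAll(float* out, const float* in, float d, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] / d;
}
[/cpp]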

You may be confusing the issue with your forest of sometimes conflicting options, particularly if you don't always spell them the same way. It will probably be difficult to reproduce your issue if you don't give exact compiler versions, source code, and options, and preferably a clear statement of your goal. The current 32-bit compilers do set options that optimize for the old P4 by default (/arch:SSE2).

piet_de_weer
Beginner

Here's the smallest piece of code where I can at least show that no RCPSS is generated - I don't see an extra multiplication here.

[cpp]// CompilerTest.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

int _tmain(int argc, _TCHAR* argv[])
{
    float b = 0.0f;
    for (int a = 0; a <= argc; a++)
    {
        b += (float)a;
    }
    float c = 2.0f / b;
    printf("%f\n", c);

    return 0;
}
[/cpp]

Compiling with /Qprec-div- gives exactly the same output here as with /Qprec-div, both with /QaxW /QxW for Pentium4/SSE2 support.

Output in both cases is (divss on line 110):

[plain];;; {

$LN1:
push ebp
mov ebp, esp
and esp, -64
push edi
sub esp, 60
$LN3:
mov edi, DWORD PTR [ebp+8]
$LN5:
push 3
call ___intel_new_proc_init

$B1$16:
pop ecx
stmxcsr DWORD PTR [esp+16]
or DWORD PTR [esp+16], 32768
ldmxcsr DWORD PTR [esp+16]
$LN7:

;;; float b;
;;; for (int a=0; a<=argc; a++)

test edi, edi
jl $B1$10

$B1$2:
$LN9:
lea edx, DWORD PTR [edi+1]
cmp edx, 4
jl $B1$13

$B1$3:
movdqa xmm2, XMMWORD PTR _2il0floatpacket$1
movdqa xmm1, XMMWORD PTR _2il0floatpacket$2
mov eax, edx
and eax, 3
neg eax
add eax, edx
xor ecx, ecx
pxor xmm0, xmm0

$B1$4:
$LN11:

;;; {
;;; b += (float)a;

cvtdq2ps xmm3, xmm1
$LN13:
addps xmm0, xmm3
paddd xmm1, xmm2
$LN15:
add ecx, 4
cmp ecx, eax
jb $B1$4

$B1$5:
$LN17:

;;; }
;;; float c = 2.0f / b;
;;; printf("%f\n", c);

movaps xmm1, xmm0
movhlps xmm1, xmm0
addps xmm0, xmm1
movaps xmm2, xmm0
shufps xmm2, xmm0, 245
addss xmm0, xmm2

$B1$6:
$LN19:
cmp eax, edx
jae $B1$11

$B1$8:
$LN21:
cvtsi2ss xmm1, eax
$LN23:
add eax, 1
cmp eax, edx
$LN25:
addss xmm0, xmm1
$LN27:
jb $B1$8
jmp $B1$11

$B1$10:
pxor xmm0, xmm0

$B1$11:
$LN29:
movss xmm1, DWORD PTR _2il0floatpacket$3
$LN31:
divss xmm1, xmm0
$LN33:
mov DWORD PTR [esp], OFFSET FLAT: ??_C@_03A@?$CFf?6?$AA@
$LN35:
cvtps2pd xmm0, xmm1
movsd QWORD PTR [esp+4], xmm0
$LN37:
call _printf

$B1$12:
$LN39:

;;;
;;; return 0;

xor eax, eax
add esp, 60
pop edi
mov esp, ebp
pop ebp
ret

$B1$13:
xor eax, eax
pxor xmm0, xmm0
jmp $B1$6
ALIGN 2

[/plain]
Compiler options: /c /O3 /Ot /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /FD /EHsc /ML /Yu"StdAfx.h" /Fp"Release/CompilerTest.pch" /FAs /Fa"Release/" /Fo"Release/" /W3 /nologo /Wp64 /Zi /Gd /Qprec-div- /QaxW /QxW
piet_de_weer
Beginner

If I try to write my own intrinsics to be merged with the code that's generated by the compiler, I'm also getting strange and unexpected results:

[cpp]int _tmain(int argc, _TCHAR* argv[])
{
    float b = 0.0f;
    for (int a = 0; a <= argc; a++)
    {
        b += (float)a;
    }
    float c = OneDiv(b);
    printf("%f\n", c);

    return 0;
}
[/cpp]
For OneDiv:
[cpp]__forceinline float OneDiv(float f)
{
    return 1.0f / f;
}
[/cpp]

I get (only the part of assembly that's different):

[plain]movss     xmm1, DWORD PTR _2il0floatpacket$3
divss xmm1, xmm0
mov DWORD PTR [esp], OFFSET FLAT: ??_C@_03A@?$CFf?6?$AA@
cvtps2pd xmm0, xmm1[/plain]
If I replace OneDiv by:
[cpp]__forceinline float OneDiv(float f)
{
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(f)));
}
[/cpp]
I get:
[plain]mov DWORD PTR [esp], OFFSET FLAT: ??_C@_03A@?$CFf?6?$AA@
rcpss xmm0, xmm0
pxor xmm1, xmm1
movss xmm1, xmm0
cvtps2pd xmm2, xmm1
[/plain]
I don't understand why XMM1 is being XORed to zero if it's overwritten on the next line anyway, or why XMM0 (ss) is copied to XMM1. Is this caused by the _mm_cvtss_f32 - and if so, is there something else I can use so this gets optimized out? (I know I'm doing something really weird here: providing one SSE instruction and just hoping the compiler blends it in with the SSE it generates itself from non-SSE code.)
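One hypothetical variant to try (a sketch, not a known fix; whether the extra moves disappear depends on the compiler version) is returning through _mm_store_ss instead of _mm_cvtss_f32:

[cpp]#include <xmmintrin.h>

// Same reciprocal, but the scalar result is written out with
// _mm_store_ss; some compilers emit a different move sequence for this.
__forceinline float OneDivStore(float f)
{
    float r;
    _mm_store_ss(&r, _mm_rcp_ss(_mm_set_ss(f)));
    return r;
}
[/cpp]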

piet_de_weer
Beginner

tim18: Somehow I missed your reply before making my other 2 posts.

I've done a few measurements, and - based on the __forceinline function that I used - even on my Q9450 (quad core) I'm getting clearly higher speeds when I use RCP instead of DIV. And that is with the 2 extra (unnecessary) instructions.

For my purpose, RCP (without any follow-ups) is precise enough (I'm working on audio processing, and based on a test with my __forceinline function the difference with DIV in the end result is below -90 dB).
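For reference, a small harness (a sketch of mine, not from this thread) that measures the raw rcpss error and expresses it in dB; the raw estimate is only good to about -68 dB (Intel documents a relative error bound of 1.5 * 2^-12), so the -90 dB figure above reflects the end result of the full processing chain:

[cpp]#include <cmath>
#include <cstdio>
#include <xmmintrin.h>

// Worst-case relative error of rcpss over one binade, reported in dB.
int main()
{
    float worst = 0.0f;
    for (float x = 1.0f; x < 2.0f; x += 1e-5f)
    {
        float exact  = 1.0f / x;
        float approx = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
        float rel    = std::fabs(approx - exact) / exact;
        if (rel > worst) worst = rel;
    }
    printf("worst relative error: %g (%.1f dB)\n",
           worst, 20.0f * (float)std::log10(worst));
    return 0;
}
[/cpp]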

Anyway, it's very useful to know that the compiler doesn't use RCP; I'll have to do it myself, then. Any ideas on how to rewrite my __forceinline function so it does NOT generate the extra XOR and MOVSS instructions?

Compiler version: Intel C++ 10.1.013 [IA-32]

Options (minimized version): /c /O3 /Ot /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /FD /EHsc /ML /Yu"StdAfx.h" /Fp"Release/CompilerTest.pch" /FAs /Fa"Release/" /Fo"Release/" /W3 /nologo /Wp64 /Zi /Gd /Qprec-div- /QaxW /QxW

TimP
Honored Contributor III

The pxor instruction is actually intended to enhance performance: it breaks the hardware dependency that arises because the high-order slots of the register are preserved, allowing the renamer to use a fresh shadow register instead. The parallel instruction cvtps2pd is used for the same reason, even though one of the 2 doubles it produces is discarded. On Intel CPUs, these choices cannot reduce performance, as the pxor can execute in parallel with or prior to rcpss, and they might improve performance significantly if the registers were in use just prior to this code.

Of course, here it seems the step through xmm1 could have been omitted, with the rcpps result feeding directly into cvtps2pd. There's a chance that a pre-processor macro would produce cleaner code than the __forceinline.

The renaming issue for simple register moves was recognized back when code was being optimized for Athlon-32 with non-Intel compilers to out-perform the P-III with Intel compilers. The one about cvtps2pd vs. cvtss2sd was recognized more recently; while it could be fixed in hardware (by not preserving the high-order part of the register), the compiler fix could be introduced much more quickly, without waiting for new CPU steppings.

piet_de_weer
Beginner

Ok, after doing some tests it seems that using RCP has a huge effect on some small sample code with a number of DIVs in a row (a short loop followed by 2 divs is more than 50% faster when RCP is used instead of DIV; with only a single DIV the difference is less than 10%). On the code I actually want to use it for, it has almost no effect.

Since it does cause some rounding errors and doesn't give any benefit there, I'm going back to using DIV.

Note that it should still be slightly faster when calculating 1.0f / value, because in that case the RCP version of the code also doesn't need to access memory to read the value 1.0f - which also saves a register.

In case anyone wants to try it on their own code, tim18's macro trick worked. The resulting code is:

[cpp]#define FastDiv(f, g) ((f) * _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(g))))[/cpp]
Call it as FastDiv(1.0f, g) (mind the f! Otherwise you're multiplying by a double, which causes a lot of conversions!) to get just a single RCPSS instruction, or with any value other than 1.0f for 2 instructions (RCPSS and MULSS).
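A quick usage sketch (the wrapper names are mine, for illustration):

[cpp]#include <xmmintrin.h>

#define FastDiv(f, g) ((f) * _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(g))))

float Reciprocal(float g)     { return FastDiv(1.0f, g); } // RCPSS only
float Ratio(float f, float g) { return FastDiv(f, g); }    // RCPSS + MULSS
[/cpp]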
