Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

/Qprec-div- not working?

piet_de_weer
Beginner

Hi,

I'm trying to optimize some C++ code that contains a number of divisions, so I want the compiler to use the RCPSS instruction instead of the much slower DIVSS.

I'm compiling for a Pentium 4 (/QaxW /QxW), with the option "/Qprec-div-" to generate RCPSS instructions.

For a small piece of code, this is the assembly that's generated:

[plain]$B8$13:                         ; Preds $B8$12 $B8$54
cvtsi2ss xmm4, edx ;67.42
movss xmm1, DWORD PTR _2il0floatpacket$4 ;67.42
movaps xmm5, xmm1 ;67.42
divss xmm5, xmm4 ;67.42
mulss xmm5, DWORD PTR _2il0floatpacket$3 ;67.42
mulss xmm0, xmm5 ;67.13
mulss xmm3, xmm5 ;68.13
movss DWORD PTR [esp+20], xmm0 ;67.13
movss DWORD PTR [edi], xmm3 ;68.13
; LOE eax ecx xmm1 xmm2
$B8$14: ; Preds $B8$8 $B8$48 $B8$13
[/plain]

As you can see, there is a DIVSS instruction here, and no RCPSS.

So I tried what would happen if I use /Qprec-div instead, and the generated code is different:

[plain]$B8$13:                         ; Preds $B8$12 $B8$53
cvtsi2ss xmm3, edx ;67.42
movss xmm4, DWORD PTR _2il0floatpacket$3 ;67.25
divss xmm4, xmm3 ;67.42
mulss xmm1, xmm4 ;67.13
movss DWORD PTR [esp+20], xmm1 ;67.13
mulss xmm2, xmm4 ;68.13
movss DWORD PTR [edi], xmm2 ;68.13
; LOE eax ecx xmm0
$B8$14: ; Preds $B8$8 $B8$5 $B8$13
[/plain]

This code is smaller (one multiply and one mov are gone!), which would make sense if the code above contained an RCPSS (because the reciprocal approach needs an extra multiply).

So it seems the code above WAS restructured as if RCPSS were used, and yet DIVSS was emitted. The same thing happens to all the (dozens of) divisions I can find in the assembly code; RCPSS is not used anywhere, not even when I calculate 1.0f / . The total assembly file is about 2% bigger with the /Qprec-div- option.
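To make the expected rewrite concrete, here is a minimal sketch (illustrative only, function name is mine, not compiler output) of the transform that /Qprec-div- licenses: the division is restructured as a multiply by a reciprocal, which is exactly why an RCPSS version would carry one more multiply than a plain DIVSS version.

[cpp]// Illustrative sketch, not actual compiler output: under /Qprec-div-
// the compiler may rewrite x / y as x * (1 / y). The reciprocal is the
// candidate for RCPSS; the multiply is the extra MULSS seen above.
float Quotient(float x, float y)
{
    float recip = 1.0f / y; // would ideally become RCPSS
    return x * recip;       // the extra multiply
}
[/cpp]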

I'm using normal C++ code (no intrinsics, at least not in this piece of code).

Compiler is "C++ 10.0.013 [IA-32]".

Compiler flags: /GL /c /Ox /Og /Ob1 /Oi /Ot /Oy /GT /D "WIN32" /D "_WINDOWS" /D "NDEBUG" /D "_MBCS" /GF /FD /EHsc /MT /Zc:wchar_t /Fo"Release/" /W3 /nologo /Wp64 /Zi /Gd /Ow /Qansi-alias /Qno-alias-args /DSSE2 /fp:fast=2 /Qfp-speculation:fast /Qprec-div- /Qprec-sqrt- /Qftz /QaxW /QxW

(But I've also tried calling icl manually with simplified options: icl Overshoot.cpp /c /Fa".\\Release/" /Fo".\\Release/" /Qprec-div- /QxP, with exactly the same results.)

Am I missing something?

If this is a compiler bug, is there some different code I could write to get RCPSS? I've tried things like the following:

[cpp]#include <xmmintrin.h>  // SSE intrinsics for _mm_rcp_ps and friends

__forceinline float OneDiv(float f)
{
    //return 1.0f / f;
    // Note: _mm_rcp_ps is the packed variant, hence the rcpps below.
    return _mm_cvtss_f32(_mm_rcp_ps(_mm_set_ss(f)));
}
[/cpp]
The code that comes out is close to what I want, but some "dumb" things are happening (some parts are not optimized away):
[plain]        movss     xmm1, xmm0        ;253.34 - this line should not be there
        rcpps     xmm0, xmm1        ;253.34
[/plain]
The rest of the assembly code changes quite a lot, so it's a bit difficult to compare.

Om_S_Intel
Employee
It would help if you could provide a test case that we can compile, so we can review the generated assembly code.
TimP
Honored Contributor III

I have never seen the compiler generate the rcpss instruction. If it did, it would follow it with a Newton iteration step to improve the result to near IEEE divide precision. The purpose of doing so would be to improve throughput in cases where the FPU can then be made available for independent operations; it would certainly not serve to reduce generated code size.
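For concreteness, here is a minimal sketch (mine, not compiler output) of the rcpss-plus-Newton sequence described above; one refinement step x1 = x0 * (2 - d * x0) takes the roughly 12-bit estimate to near single precision.

[cpp]#include <xmmintrin.h>

// One Newton-Raphson refinement of the RCPSS estimate of 1/d:
// x1 = x0 * (2 - d*x0), roughly doubling the ~12 accurate bits.
static inline float RcpNewton(float d)
{
    __m128 vd = _mm_set_ss(d);
    __m128 x0 = _mm_rcp_ss(vd);                    // ~12-bit estimate
    __m128 e  = _mm_sub_ss(_mm_set_ss(2.0f),
                           _mm_mul_ss(vd, x0));    // 2 - d*x0
    return _mm_cvtss_f32(_mm_mul_ss(x0, e));       // refined reciprocal
}
[/cpp]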

The compiler does use rcpps by default in vectorized code. As you say, this was done largely on account of the weak division performance of the original P4 of nine years ago. As it has been several years since CPUs with that characteristic were produced, there isn't much incentive for current compilers to optimize for it.
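As an illustration (my example, not from this thread), this is the kind of loop where the vectorizer can apply rcpps:

[cpp]// A vectorizable division loop: with /Qprec-div- the vectorizer may use
// rcpps (possibly plus a Newton step) in place of the slower divps.
void ScaleAll(float* out, const float* in, float d, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] / d;
}
[/cpp]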

You may be confusing the issue with your forest of sometimes conflicting options, particularly if you don't always spell them the same way. It will probably be difficult to reproduce your issue if you don't give exact compiler versions, source code, and options, and preferably a clear statement of your goal. The current 32-bit compilers do set options that optimize for the old P4 by default (/arch:SSE2).

piet_de_weer
Beginner

Here's the smallest piece of code where I can at least show that no RCPSS is generated - I don't see an extra multiplication here.

[cpp]// CompilerTest.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

int _tmain(int argc, _TCHAR* argv[])
{
    float b = 0.0f;
    for (int a = 0; a <= argc; a++)
    {
        b += (float)a;
    }
    float c = 2.0f / b;
    printf("%f\n", c);

    return 0;
}
[/cpp]

Compiling with /Qprec-div- gives exactly the same output here as with /Qprec-div, both with /QaxW /QxW for Pentium4/SSE2 support.

Output in both cases is (divss on line 110):

[plain];;; {

$LN1:
push ebp
mov ebp, esp
and esp, -64
push edi
sub esp, 60
$LN3:
mov edi, DWORD PTR [ebp+8]
$LN5:
push 3
call ___intel_new_proc_init

$B1$16:
pop ecx
stmxcsr DWORD PTR [esp+16]
or DWORD PTR [esp+16], 32768
ldmxcsr DWORD PTR [esp+16]
$LN7:

;;; float b;
;;; for (int a=0; a<=argc; a++)

test edi, edi
jl $B1$10

$B1$2:
$LN9:
lea edx, DWORD PTR [edi+1]
cmp edx, 4
jl $B1$13

$B1$3:
movdqa xmm2, XMMWORD PTR _2il0floatpacket$1
movdqa xmm1, XMMWORD PTR _2il0floatpacket$2
mov eax, edx
and eax, 3
neg eax
add eax, edx
xor ecx, ecx
pxor xmm0, xmm0

$B1$4:
$LN11:

;;; {
;;; b += (float)a;

cvtdq2ps xmm3, xmm1
$LN13:
addps xmm0, xmm3
paddd xmm1, xmm2
$LN15:
add ecx, 4
cmp ecx, eax
jb $B1$4

$B1$5:
$LN17:

;;; }
;;; float c = 2.0f / b;
;;; printf("%f\n", c);

movaps xmm1, xmm0
movhlps xmm1, xmm0
addps xmm0, xmm1
movaps xmm2, xmm0
shufps xmm2, xmm0, 245
addss xmm0, xmm2

$B1$6:
$LN19:
cmp eax, edx
jae $B1$11

$B1$8:
$LN21:
cvtsi2ss xmm1, eax
$LN23:
add eax, 1
cmp eax, edx
$LN25:
addss xmm0, xmm1
$LN27:
jb $B1$8
jmp $B1$11

$B1$10:
pxor xmm0, xmm0

$B1$11:
$LN29:
movss xmm1, DWORD PTR _2il0floatpacket$3
$LN31:
divss xmm1, xmm0
$LN33:
mov DWORD PTR [esp], OFFSET FLAT: ??_C@_03A@?$CFf?6?$AA@
$LN35:
cvtps2pd xmm0, xmm1
movsd QWORD PTR [esp+4], xmm0
$LN37:
call _printf

$B1$12:
$LN39:

;;;
;;; return 0;

xor eax, eax
add esp, 60
pop edi
mov esp, ebp
pop ebp
ret

$B1$13:
xor eax, eax
pxor xmm0, xmm0
jmp $B1$6
ALIGN 2

[/plain]
Compiler options: /c /O3 /Ot /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /FD /EHsc /ML /Yu"StdAfx.h" /Fp"Release/CompilerTest.pch" /FAs /Fa"Release/" /Fo"Release/" /W3 /nologo /Wp64 /Zi /Gd /Qprec-div- /QaxW /QxW
piet_de_weer
Beginner

If I try to write my own intrinsics to be merged with the code that's generated by the compiler, I'm also getting strange and unexpected results:

[cpp]int _tmain(int argc, _TCHAR* argv[])
{
    float b = 0.0f;
    for (int a = 0; a <= argc; a++)
    {
        b += (float)a;
    }
    float c = OneDiv(b);
    printf("%f\n", c);

    return 0;
}
[/cpp]
For OneDiv:
[cpp]__forceinline float OneDiv(float f)
{
    return 1.0f / f;
}
[/cpp]

I get (only the part of assembly that's different):

[plain]movss     xmm1, DWORD PTR _2il0floatpacket$3
divss xmm1, xmm0
mov DWORD PTR [esp], OFFSET FLAT: ??_C@_03A@?$CFf?6?$AA@
cvtps2pd xmm0, xmm1[/plain]
If I replace OneDiv by:
[cpp]__forceinline float OneDiv(float f)
{
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(f)));
}
[/cpp]
I get:
[plain]mov DWORD PTR [esp], OFFSET FLAT: ??_C@_03A@?$CFf?6?$AA@
rcpss xmm0, xmm0
pxor xmm1, xmm1
movss xmm1, xmm0
cvtps2pd xmm2, xmm1
[/plain]
I don't understand why XMM1 is being XORed to zero if it's overwritten on the next line anyway, or why XMM0 (ss) is copied to XMM1. Is this caused by the _mm_cvtss_f32 - and if so, is there something else I can use so this gets optimized out? (I know I'm doing something really weird here: providing one SSE instruction and just hoping the compiler blends it in with the SSE it generates itself from non-SSE code.)
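One hypothetical variant to try (a sketch, not a known fix; whether the extra moves disappear depends on the compiler version) is returning through _mm_store_ss instead of _mm_cvtss_f32:

[cpp]#include <xmmintrin.h>

// Same reciprocal, but the scalar result is written out with
// _mm_store_ss; some compilers emit a different move sequence for this.
__forceinline float OneDivStore(float f)
{
    float r;
    _mm_store_ss(&r, _mm_rcp_ss(_mm_set_ss(f)));
    return r;
}
[/cpp]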

piet_de_weer
Beginner

tim18: Somehow I missed your reply before making my other 2 posts.

I've done a few measurements, and - based on the __forceinline function that I used - even on my Q9450 (quad core) I'm getting clearly higher speeds when I use RCP instead of DIV. And that is with the 2 extra (unnecessary) instructions.

For my purpose, RCP (without any follow-ups) is precise enough (I'm working on audio processing, and based on a test with my __forceinline function the difference with DIV in the end result is below -90 dB).
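For reference, a small harness (a sketch of mine, not from this thread) that measures the raw rcpss error and expresses it in dB; the raw estimate is only good to about -68 dB (Intel documents a relative error bound of 1.5 * 2^-12), so the -90 dB figure above reflects the end result of the full processing chain:

[cpp]#include <cmath>
#include <cstdio>
#include <xmmintrin.h>

// Worst-case relative error of rcpss over one binade, reported in dB.
int main()
{
    float worst = 0.0f;
    for (float x = 1.0f; x < 2.0f; x += 1e-5f)
    {
        float exact  = 1.0f / x;
        float approx = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
        float rel    = std::fabs(approx - exact) / exact;
        if (rel > worst) worst = rel;
    }
    printf("worst relative error: %g (%.1f dB)\n",
           worst, 20.0f * (float)std::log10(worst));
    return 0;
}
[/cpp]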

Anyway, it's very useful to know that the compiler doesn't use RCP; I'll have to do it myself, then. Any ideas on how to rewrite my __forceinline function so it does NOT generate the extra XOR and MOVSS instructions?

Compiler version: Intel C++ 10.1.013 [IA-32]

Options (minimized version): /c /O3 /Ot /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /FD /EHsc /ML /Yu"StdAfx.h" /Fp"Release/CompilerTest.pch" /FAs /Fa"Release/" /Fo"Release/" /W3 /nologo /Wp64 /Zi /Gd /Qprec-div- /QaxW /QxW

TimP
Honored Contributor III

The pxor instruction is actually intended to enhance performance: it breaks the hardware dependency that arises because the high-order slots of the register are preserved, allowing the renamer to use a fresh shadow register instead. The parallel instruction cvtps2pd is used for the same reason, even though one of the 2 doubles it produces is discarded. On Intel CPUs, these choices cannot reduce performance, as the pxor can execute in parallel with or prior to rcpss, and they might improve performance significantly if the registers were in use just prior to this code.

Of course, here it seems the step through xmm1 could have been omitted, with the rcpps result feeding directly into cvtps2pd. There's a chance that a pre-processor macro would produce cleaner code than the __forceinline.

The renaming issue for simple register moves was recognized back when code was being optimized for Athlon-32 with non-Intel compilers to out-perform the P-III with Intel compilers. The one about cvtps2pd vs. cvtss2sd was recognized more recently; while it could be fixed in hardware (by not preserving the high-order part of the register), the compiler fix could be introduced much more quickly, without waiting for new CPU steppings.

piet_de_weer
Beginner

Ok, after doing some tests it seems that using RCP has a huge effect on some small sample code with a number of DIVs in a row (a short loop followed by 2 divs is more than 50% faster when RCP is used instead of DIV; with only a single DIV the difference is less than 10%). On the code I actually want to use it for, it has almost no effect.

Since it does cause some rounding errors and doesn't give any benefit there, I'm going back to using DIV.

Note that it should still be slightly faster when calculating 1.0f / value, because in that case the RCP version of the code also doesn't need to access memory to read the value 1.0f - which also saves a register.

In case anyone wants to try it on their own code, tim18's macro trick worked. The resulting code is:

[cpp]#define FastDiv(f, g) ((f) * _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(g))))[/cpp]
Call it as FastDiv(1.0f, g) (mind the f! Otherwise you're multiplying by a double, which causes a lot of conversions!) to get just a single RCPSS instruction, or with any value other than 1.0f for 2 instructions (RCPSS and MULSS).
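A quick usage sketch (the wrapper names are mine, for illustration):

[cpp]#include <xmmintrin.h>

#define FastDiv(f, g) ((f) * _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(g))))

float Reciprocal(float g)     { return FastDiv(1.0f, g); } // RCPSS only
float Ratio(float f, float g) { return FastDiv(f, g); }    // RCPSS + MULSS
[/cpp]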
