Re: GCC and FPU

Altera_Forum · ‎10-05-2011

Hello everyone,

I am porting some legacy code that uses double precision floating point. Changing double to single precision or fixed point would require a major rewrite, which I want to avoid. I implemented the needed double precision ops as custom instructions and told gcc about them with -mcustom-*. It works, but gcc generates 3 instructions per op, which is slow. I am thinking of adding FP registers to my FPU. That shouldn't be a problem except for the gcc part; at the moment llvm seems much easier to 'hack' on.

Does anybody have an idea how much work it would take to extend the current nios2 gcc (or llvm) port so that it can describe a custom FPU, and where to start?

Thanks for your help!

Altera_Forum · ‎10-05-2011

I'd read the gcc documentation about how to describe instructions, then look at the existing ports and find one that is easy to copy!

Building the knowledge into gcc itself (rather than trying to directly use custom instructions) is probably easier.

I'd have thought that the 'clocks per operation' would tend to dominate over the absolute number of instructions though - especially if values have to be normalised (eg for add/subtract).

Might be worth using a combinatorial instruction to read from your FP register file - you'll still need 2 instructions to get a 64bit value, but at least you won't have to worry about 'late result' delays.

It is also worth remembering that if the 'writerc' bit is 0, then the 5 bits of C can be used for any purpose - you could use 3 bits to select a register and 2 bits as an opcode extension - st, add, sub, mul ?

Similarly if 'readrb' is zero, the B field could be used to determine how to convert the 32bit Ra to FP.

(I don't know if the cpu does a decode phase stall when readrb (or readra) is zero and the selected register value isn't available.)

In any case, this will reduce the number of custom instruction slots you need.
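To make the C field idea concrete, here is a rough sketch of packing a 3-bit FP register number and a 2-bit opcode extension into the 5 spare bits. The bit layout (register in the low 3 bits, sub-opcode in the top 2) and all the names are my own assumptions for illustration; nothing in the hardware dictates them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sub-opcode values; the thread only suggests the names. */
enum fp_subop { FP_ST = 0, FP_ADD = 1, FP_SUB = 2, FP_MUL = 3 };

/* Pack a 3-bit FP register number and a 2-bit sub-opcode into the
 * 5-bit C field. Assumed layout: bits [2:0] = register, [4:3] = sub-opcode. */
static uint8_t pack_c_field(unsigned fp_reg, enum fp_subop op)
{
    assert(fp_reg < 8);               /* only 3 bits are available */
    return (uint8_t)(((unsigned)op << 3) | fp_reg);
}

/* Recover the FP register number from a packed C field. */
static unsigned c_field_reg(uint8_t c)
{
    return c & 0x7u;
}

/* Recover the sub-opcode from a packed C field. */
static enum fp_subop c_field_op(uint8_t c)
{
    return (enum fp_subop)((c >> 3) & 0x3u);
}
```

With this layout one custom instruction slot covers four operations on eight FP registers, which is where the saving in slots comes from.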

I'd build gcc, add some FP register and instruction definitions, and look at the code!

Altera_Forum · ‎10-05-2011

The three instructions generated for each double precision operator are as follows:

1) Send first operand to custom instruction (usually single cycle)

2) Send second operand to custom instruction and execute (usually many clock cycles), read back half of result

3) Read back 2nd half of the result (usually single cycle)

All three are necessary since the processor uses a 32-bit data path and double precision is 64-bit. The only way I see a register file reducing the amount of communication (and instructions per operator) between the CPU and FPU is if you are performing a bunch of operations that use a common operand. But like DSL said, the number of instructions is fairly negligible compared to the total time required for the operator to complete the calculation (#2 above). At most you will only shave off a clock cycle, if any at all.
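A back-of-the-envelope way to check that argument is to model the three steps as cycle counts. This is only a sketch: it assumes the operand write and the second read-back cost exactly one cycle each (the post says "usually"), and the function names are made up.

```c
#include <assert.h>

/* Assumption: steps 1 and 3 each cost exactly one cycle. */
enum { WRITE_CYCLES = 1, READBACK_CYCLES = 1 };

/* Total cycles for one operator issued as three custom instructions,
 * where exec_cycles is the cost of step 2 (send second operand, execute). */
static unsigned three_insn_cycles(unsigned exec_cycles)
{
    assert(exec_cycles > 0);
    return WRITE_CYCLES + exec_cycles + READBACK_CYCLES;
}

/* Share of the total spent in the two transfer-only instructions,
 * in whole percent (rounded down). */
static unsigned transfer_overhead_percent(unsigned exec_cycles)
{
    unsigned total = three_insn_cycles(exec_cycles);
    return (100u * (WRITE_CYCLES + READBACK_CYCLES)) / total;
}
```

Under these assumptions a long operation (say exec_cycles = 10) spends roughly a sixth of its time on transfers, while a single-cycle operation spends about two thirds, so how negligible the extra instructions are depends heavily on which operators dominate the code.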

Altera_Forum · ‎10-05-2011

Thanks for the explanations and suggestions.

Three instructions per op are not a big issue for e.g. division, which takes 10 cycles, but a compare can be done in one cycle. Anyhow, I think the real issue is register allocation and optimization. My reference hardware is a powerpc with a floating point unit that has 32 internal double precision registers. With it gcc can produce much faster and smaller code, in some of my cases more than 2-3x faster.

Altera_Forum · ‎10-06-2011

Possibly adding a single FP accumulator would be enough - but more regs probably don't add much fpga real estate.

Thinking.... If you don't try to read back part of the result during the second instruction (of the current 3), then the operation can be asynchronous.

Not exactly sure how to resync, but gcc will have support for async FP ops.

Writing an FP unit is probably a project in itself :-)

Altera_Forum · ‎10-06-2011

Looking at the gcc codebase, it seems far from trivial to start fiddling with it. I wouldn't dare (for now) to think about async ops. Llvm looks much better in this regard. There is an unfinished nios2 port at ellcc.org; did anyone check it out?

Here is some sample code generated by gcc for nios2 and powerpc:

double x3(double x)
{
    return x*x*x;
}

-- nios2-elf-gcc -O2 (v3.4.6)

mov r6,r4
mov r7,r5
custom 0,zero,r6,r7 (fwrx)
custom 15,r5,r6,r7 (fmuld)
custom 4,r4,zero,zero (frdy)
custom 0,zero,r4,r5 (fwrx)
custom 15,r5,r6,r7 (fmuld)
custom 4,r4,zero,zero (frdy)
mov r2,r4
mov r3,r5
ret

-- powerpc-eabi-gcc -O2 (v3.3.6)

fmul f0,f1,f1
fmul f0,f0,f1
fmr f1,f0
blr

Altera_Forum · ‎10-06-2011

The most obvious optimisation is for the FP result to be made available as the source for the next operation - ie add a single FP accumulator.

That would remove the first 'frdy' and the second 'fwrx'.
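As a sanity check on the instruction count, here is a small software model of the idea. It is only a sketch: the struct, the `fmuld_acc` operation and the choice of which half `fmuld` returns are my assumptions (the latter guessed from the register usage in the nios2 listing), not a description of the real FPU.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical software model of the FPU state. */
struct fpu_model {
    double x;        /* operand latched by fwrx */
    double acc;      /* last result, i.e. the proposed accumulator */
    unsigned insns;  /* custom instructions issued so far */
};

static double from_halves(uint32_t lo, uint32_t hi)
{
    uint64_t bits = ((uint64_t)hi << 32) | lo;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

static uint32_t low_half(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (uint32_t)bits;
}

static uint32_t high_half(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (uint32_t)(bits >> 32);
}

/* fwrx: latch the first operand. */
static void fwrx(struct fpu_model *f, uint32_t lo, uint32_t hi)
{
    f->x = from_halves(lo, hi);
    f->insns++;
}

/* fmuld: multiply the latched operand by the second one.
 * Assumed to return the high half of the result. */
static uint32_t fmuld(struct fpu_model *f, uint32_t lo, uint32_t hi)
{
    f->acc = f->x * from_halves(lo, hi);
    f->insns++;
    return high_half(f->acc);
}

/* Hypothetical fmuld_acc: multiply the accumulator by the operand. */
static uint32_t fmuld_acc(struct fpu_model *f, uint32_t lo, uint32_t hi)
{
    f->acc = f->acc * from_halves(lo, hi);
    f->insns++;
    return high_half(f->acc);
}

/* frdy: read back the remaining (low) half of the last result. */
static uint32_t frdy(struct fpu_model *f)
{
    f->insns++;
    return low_half(f->acc);
}

/* x*x*x the way the nios2 listing does it: 6 custom instructions. */
static double x3_current(struct fpu_model *f, double x)
{
    uint32_t xl = low_half(x), xh = high_half(x), lo, hi;
    fwrx(f, xl, xh);
    hi = fmuld(f, xl, xh);
    lo = frdy(f);
    fwrx(f, lo, hi);            /* intermediate result goes back in */
    hi = fmuld(f, xl, xh);
    lo = frdy(f);
    return from_halves(lo, hi);
}

/* Same computation with the accumulator: the first frdy and the
 * second fwrx disappear, leaving 4 custom instructions. */
static double x3_with_acc(struct fpu_model *f, double x)
{
    uint32_t xl = low_half(x), xh = high_half(x), lo, hi;
    fwrx(f, xl, xh);
    (void)fmuld(f, xl, xh);     /* result stays in the accumulator */
    hi = fmuld_acc(f, xl, xh);
    lo = frdy(f);
    return from_halves(lo, hi);
}
```

Starting from a zeroed `struct fpu_model`, both functions give the same value, and the `insns` counter ends at 6 for the current sequence and 4 for the accumulator variant.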

This may even be true of the current fpga - but gcc hasn't been told about it.

It might be that the ability to obtain half the result on the operation opcode makes it difficult to describe.

Not sure why gcc's register tracking generated the 4 'mov' instructions either!

To get anything like the ppc code, you'd also have to change the way FP arguments are passed - so they can be passed in FP registers.