Re: NIOS2 HW & DIV Instructions + FPU

Altera_Forum · ‎08-12-2011

Hello, I am in charge of designing a system that performs a lot of complex sin/cos/* operations, and I need them to be fast.

I am using a NIOS2-F at 100 MHz in a Cyclone IV.

I've enabled hardware multiply and divide on the NIOS2-F, and I am using 64K data and instruction caches.

My software is compiled with these flags:

CFLAGS = -Wall  -DNOCRYPT -mhw-div -mhw-mul -mcustom-fpu-cfg=60-1 -mcustom-fpu-cfg=60-2 -O3

I tried adding the custom FPU instruction in QSys, but performance is slower with the custom FPU instruction attached to the NIOS2-F (why?).

Is there anything else I can do? Any suggestions?

Altera_Forum · ‎08-15-2011

Depending on what you are doing you might be able to reduce the amount of maths required!

If you only need limited precision, use a lookup table and linear interpolation (and maybe a single Newton-Raphson step).
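A minimal sketch of the lookup-table-plus-interpolation idea, assuming the argument has already been reduced to [0, pi/2] and that about three decimal digits are enough. The 17-entry table and the name `sin_quarter_lut` are illustrative choices, not taken from any library.

```c
/* sin(k * pi/32) for k = 0..16: one quarter wave. */
static const float sin_table[17] = {
    0.0000000f, 0.0980171f, 0.1950903f, 0.2902847f, 0.3826834f,
    0.4713967f, 0.5555702f, 0.6343933f, 0.7071068f, 0.7730105f,
    0.8314696f, 0.8819213f, 0.9238795f, 0.9569403f, 0.9807853f,
    0.9951847f, 1.0000000f
};

#define LUT_INV_STEP 10.185916f /* 32/pi: multiply instead of divide */
#define LUT_LAST     16

/* Approximate sin(x) for x in [0, pi/2] by linear interpolation. */
float sin_quarter_lut(float x)
{
    float pos, frac;
    int idx;

    if (x <= 0.0f)
        return sin_table[0];

    pos = x * LUT_INV_STEP;      /* position in table units */
    idx = (int)pos;              /* entry at or below x */
    if (idx >= LUT_LAST)
        return sin_table[LUT_LAST];

    frac = pos - (float)idx;     /* fraction between two entries */
    return sin_table[idx] + frac * (sin_table[idx + 1] - sin_table[idx]);
}
```

With this table spacing the worst-case interpolation error is roughly 0.001; a larger table reduces it at the cost of memory.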

If you are generating values in sequence (e.g. tone generation), then use the identity: sin(a+b) = 2sin(a)cos(b) - sin(a-b)
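Setting a = n*d and b = d in that identity gives sin((n+1)d) = 2cos(d)sin(nd) - sin((n-1)d), so each new sample costs one multiply and one subtract. A rough sketch, assuming sin(d) and cos(d) are precomputed once; `sine_sequence` is a hypothetical helper name.

```c
/* Fill out[0..n-1] with sin(0), sin(d), sin(2d), ... using the
 * recurrence s[k] = 2*cos(d)*s[k-1] - s[k-2].
 * sin_d and cos_d are sin(d) and cos(d), computed once by the caller. */
void sine_sequence(float sin_d, float cos_d, float *out, int n)
{
    float k = 2.0f * cos_d;
    int i;

    if (n > 0)
        out[0] = 0.0f;
    if (n > 1)
        out[1] = sin_d;
    for (i = 2; i < n; i++)
        out[i] = k * out[i - 1] - out[i - 2];
}
```

Rounding error accumulates over long runs, so it may be worth re-seeding the two starting values periodically.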

Use fixed-point maths (in integers) rather than floating point.
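As an illustration of what fixed-point in integers can look like, here is a small sketch using a Q16.16 format (16 integer bits, 16 fraction bits). The format and the `fix16_*` names are assumptions for the example; the right format depends on the range and precision your data needs.

```c
#include <stdint.h>

typedef int32_t fix16_t;        /* Q16.16: value = raw / 65536 */
#define FIX16_ONE 65536

/* Convert a small integer to Q16.16. */
fix16_t fix16_from_int(int v)
{
    return (fix16_t)(v * FIX16_ONE);
}

/* Multiply two Q16.16 values. The 64-bit intermediate keeps the full
 * product; shifting right by 16 restores the Q16.16 scale.
 * Note: right-shifting a negative value is implementation-defined in C
 * (gcc performs an arithmetic shift). */
fix16_t fix16_mul(fix16_t a, fix16_t b)
{
    return (fix16_t)(((int64_t)a * (int64_t)b) >> 16);
}
```

For example, 1.5 is stored as 98304 and 2.0 as 131072; their product comes out as 196608, which is 3.0.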

Altera_Forum · ‎08-15-2011

Thanks for your help, but I can't apply these math optimizations :S

Maybe if I change the GCC version I can get a better result? I am using version 3.4.6.

Altera_Forum · ‎08-15-2011

Changing the version of gcc won't make much difference.

It might be worth looking at the generated code (either pass '-S -fverbose-asm' to gcc and look at the generated .s file, or run 'objdump -d' on the object/program file).

Also remember that (IIRC) the FP opcodes only do 'float' not 'double', and you don't want to be converting between float and double either.
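A small sketch of how an accidental float-to-double conversion can creep in, assuming a toolchain where unsuffixed constants are double (the usual C behaviour). The function names are made up for the example.

```c
/* 0.5 is a double constant here, so x is promoted to double, the
 * multiply is done in double (in software if the FPU is float-only),
 * and the result is converted back to float. */
float halve_promoted(float x)
{
    return x * 0.5;
}

/* 0.5f keeps the whole expression in single precision, so a
 * float-only FPU instruction can be used for the multiply. */
float halve_single(float x)
{
    return x * 0.5f;
}
```

The same applies to library calls: sinf() and cosf() take and return float, while sin() and cos() work on double.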

Altera_Forum · ‎08-15-2011

I will take a look into that. DSL, why do I get worse performance when I add the floating point custom instruction to the NIOS2 in QSys?

It doesn't make sense to me. Does the NIOS2 already have those custom instructions built in?

Also, what should I be looking for in the objdump output? (I've never used this command.)

This is part of my meter_run (main function):


    4e0c:       39000017        ldw     r4,0(r7)
    4e10:       31800104        addi    r6,r6,4
    4e14:       10c7ff32        custom  252,r3,r2,r3
    4e18:       1105ff32        custom  252,r2,r2,r4
    4e1c:       40d1ff72        custom  253,r8,r8,r3
    4e20:       4893ffb2        custom  254,r9,r9,r2
    4e24:       39c00104        addi    r7,r7,4
    4e28:       29400904        addi    r5,r5,36
    4e2c:       32bff51e        bne     r6,r10,4e04 <meter_run+0xb4>
    4e30:       73800044        addi    r14,r14,1
    4e34:       5a400015        stw     r9,0(r11)
    4e38:       62000015        stw     r8,0(r12)
    4e3c:       6b400104        addi    r13,r13,4
    4e40:       63000104        addi    r12,r12,4
    4e44:       5ac00104        addi    r11,r11,4
    4e48:       73ffe31e        bne     r14,r15,4dd8 <meter_run+0x88>
    4e4c:       d8c00217        ldw     r3,8(sp)
    4e50:       d1e09117        ldw     r7,-32188(gp)
    4e54:       d2e09017        ldw     r11,-32192(gp)
    4e58:       d9000617        ldw     r4,24(sp)
    4e5c:       da400117        ldw     r9,4(sp)
    4e60:       da800517        ldw     r10,20(sp)
    4e64:       d9400317        ldw     r5,12(sp)
    4e68:       d2209317        ldw     r8,-32180(gp)
    4e6c:       d3209217        ldw     r12,-32184(gp)
    4e70:       1ac5ff32        custom  252,r2,r3,r11
    4e74:       d9800717        ldw     r6,28(sp)
    4e78:       19c7ff32        custom  252,r3,r3,r7
    4e7c:       1a47ff72        custom  253,r3,r3,r9
    4e80:       390fff32        custom  252,r7,r7,r4
    4e84:       1285ff72        custom  253,r2,r2,r10
    4e88:       22c9ff32        custom  252,r4,r4,r11
    4e8c:       1907ffb2        custom  254,r3,r3,r4
    4e90:       11c5ff72        custom  253,r2,r2,r7
    4e94:       2b09ff32        custom  252,r4,r5,r12
    4e98:       2a0bff32        custom  252,r5,r5,r8
    4e9c:       1947ff72        custom  253,r3,r3,r5
    4ea0:       1105ff72        custom  253,r2,r2,r4
    4ea4:       4191ff32        custom  252,r8,r8,r6
    4ea8:       330dff32        custom  252,r6,r6,r12
    4eac:       19a9ffb2        custom  254,r20,r3,r6
    4eb0:       1227ff72        custom  253,r19,r2,r8

I guess it is using custom instructions, right? (Even though I didn't add the custom instruction component in QSys.)

Altera_Forum · ‎08-15-2011

The sine/cosine implementations of the IEEE variant of newlib use lookup tables to compute the result so I wouldn't expect adding the FPU to make any difference at all. If you want a faster implementation you either need to look for software optimizations or implement the sine/cosine in hardware and bolt it up to the CPU as a custom instruction. There are compiler flags you can pass in to tell the tools to use the custom instruction implementation instead of the software library.

Optimizations you can look at are Taylor series, CORDIC, etc. Some may work well in software with the FPU and some would make more sense being offloaded into hardware. There are others you can look at if you want to trade off accuracy or have inputs that are bounded to the point where you can use lookup tables efficiently.

If you are seeing those 'custom' opcodes without a custom instruction then I would think you are having the wrong code linked in. It could be that code is being linked in but never executed but that would surprise me.

Altera_Forum · ‎08-15-2011

--- Quote Start ---

If you are seeing those 'custom' opcodes without a custom instruction then I would think you are having the wrong code linked in. It could be that code is being linked in but never executed but that would surprise me.

--- Quote End ---

Yes, that IS really weird, because I am not using the custom instruction IP in QSys.

However, if I don't use the compiler flags my code runs 70~80% slower.

Altera_Forum · ‎08-15-2011

Well if those instructions are called with no implementation then it'll probably result in one cycle per instruction but the results would be incorrect. Also you are not adding those flags manually are you? In your earlier post I see two usages of "mcustom-fpu-cfg" which shouldn't be happening since you are either using 60-1 (+, -, *) or 60-2 (+, -, *, /) and I'm not sure what happens when both are specified. This flag should have been passed in automatically for you.

Altera_Forum · ‎08-16-2011

I'm not sure which functions are in 'newlib' (libc), but for speed you'd want the sin/cos functions that act on float (not double) and have compiled newlib itself to use the FP custom instructions.

Altera_Forum · ‎08-16-2011

--- Quote Start ---

Well if those instructions are called with no implementation then it'll probably result in one cycle per instruction but the results would be incorrect. Also you are not adding those flags manually are you? In your earlier post I see two usages of "mcustom-fpu-cfg" which shouldn't be happening since you are either using 60-1 (+, -, *) or 60-2 (+, -, *, /) and I'm not sure what happens when both are specified. This flag should have been passed in automatically for you.

--- Quote End ---

Hmm, when both are specified the last one wins, I guess. And yes, I am adding those flags manually. What other way can I do it? I am compiling my application from Linux.

--- Quote Start ---

I'm not sure which functions are in 'newlib' (libc), but for speed you'd want the sin/cos functions that act on float (not double) and have compiled newlib itself to use the FP custom instructions.

--- Quote End ---

How can I do this?

Altera_Forum · ‎08-16-2011

Hmmm..... I've just been looking at the newlib sources (from the gcc 3 build).

There seem to be 2 copies of the trig functions; both seem to use the Taylor series (possibly with slightly modified coefficients to reduce the error).

I'm not sure how well either version will actually compile for nios2, for speed you probably want the constants loaded from the 'small data' segment - and I don't think that will happen with the given code and the altera built compiler. I'd also check that GET_FLOAT_WORD() doesn't involve a memory write-read - if so replace with an inline asm function.
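For comparison when reading the objdump output, here is a portable way to get at the raw bits of a float. This is not newlib's GET_FLOAT_WORD definition (I am assuming that macro does the equivalent job of extracting the 32-bit pattern); `float_bits` is a hypothetical name. Whether the compiler turns it into a plain register move or a store followed by a load is exactly what you would check in the disassembly.

```c
#include <stdint.h>
#include <string.h>

/* Return the IEEE 754 bit pattern of a single-precision float.
 * memcpy avoids aliasing problems; compilers often optimise it away,
 * but that is not guaranteed, so inspect the generated code. */
uint32_t float_bits(float f)
{
    uint32_t w;
    memcpy(&w, &f, sizeof w);
    return w;
}
```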

You might find it worthwhile to get those sources and compile the functions as part of your app; that will make changing them much easier.

Altera_Forum · ‎08-16-2011

If you are manually passing in the flags for your builds make sure you keep them in sync with the hardware implementation (i.e. don't pass in the 60-1 or 60-2 flags if you don't have the FPU added to the Nios II core).

The Taylor series I remember finding was under the POSIX implementation in newlib and it wasn't long enough to give an accurate answer. I ended up hardware accelerating it and since it only used the first 3 terms of the series I would end up with things like sqrt(1) = 0.999999.
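To illustrate how a short series limits accuracy, here is a three-term Taylor approximation of sine. This is only an illustration written for this thread, not the newlib code, and `sin_taylor3` is a made-up name.

```c
/* sin(x) ~= x - x^3/6 + x^5/120, evaluated in nested (Horner) form
 * so it needs only float multiplies and subtracts. */
float sin_taylor3(float x)
{
    float x2 = x * x;
    return x * (1.0f - x2 * (1.0f / 6.0f - x2 * (1.0f / 120.0f)));
}
```

Near zero this is very close to the true value, but at x = pi/2 it returns about 1.0045 instead of 1, so in practice you would reduce the argument to a small range first or add more terms.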

Altera_Forum · ‎08-17-2011

I am using the 60-1 and 60-2 flags without having an FPU added to the Nios II core.

I ran these tests to see if float operations are working:


static float legal;
static float legal2;
   legal = 3.4;
   legal2 = 2;
   log_msg("Multiplying: %f  -- Dividing: %f", legal * legal2,  legal/legal2);
   legal = 4;
   legal2 = 2.3;
   log_msg("Multiplying: %f  -- Dividing: %f", legal * legal2,  legal/legal2);
   legal = 4.5;
   legal2 = 2.3;
   log_msg("Multiplying: %f  -- Dividing: %f", legal * legal2,  legal/legal2);
   legal = 4.3148981934;
   legal2 = 2.313319843;
   log_msg("Multiplying: %f  -- Dividing: %f", legal * legal2,  legal/legal2);
   legal = 4.3148981934;
   legal2 = 2.313319843;

The results were all good. Is my test right? If I remove those flags, my processing gets a lot slower.

Altera_Forum · ‎08-17-2011

Hmmm.... Either:

1) your nios has the fpu instructions.

2) the compiler is optimising out the arithmetic - which it can do for the above code.

3) there are no custom instructions at all, and the cpu takes an 'unknown instruction' trap, and some code emulates them. (I'm not sure this is possible at all)

Try with 'volatile float legal;' which will force the compiler to do the memory accesses.
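A minimal sketch of that suggestion, assuming the goal is just to stop the compiler from folding the arithmetic at compile time. The variable and function names are placeholders.

```c
/* volatile forces a real store and load for each access, so the
 * compiler cannot precompute lhs * rhs even when the caller passes
 * constants. The multiply is therefore executed at run time. */
static volatile float lhs;
static volatile float rhs;

float runtime_product(float a, float b)
{
    lhs = a;
    rhs = b;
    return lhs * rhs;
}
```

If the objdump output of this function shows a 'custom' opcode for the multiply, the result tells you whether the hardware behind that opcode really exists.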

Altera_Forum · ‎08-17-2011

Can the NIOS2 have the FPU instructions by default? I just selected hardware divide/multiply on the NIOS2-F core.

I will test with volatile.

Altera_Forum · ‎08-17-2011

Since your code is working on constants I wouldn't expect the FPU to be used even if one was present since those results can be pre-computed.

The FPU is pretty big compared to the Nios II core so the FPU is not added automatically since many would be annoyed by this (especially the folks who don't need float support or who take the time to perform fixed point math).

The hardware multiply and divide options of the various Nios II cores are for integer data types, not float/double.

Altera_Forum · ‎08-17-2011

Damn. As DSL said, when I changed my variables to volatile, all the multiplication and division results came out as 0.

Altera_Forum · ‎08-17-2011

Well, but when I use volatile int my operations also result in 0 (even when I remove the FPU flags from the Makefile)....

Weird.

Altera_Forum · ‎08-17-2011

At least this is making more sense now.... So with the volatile keyword you were preventing the compiler from performing the floating point operations at compile time so that they would have to be executed at runtime. I would expect 0 to be read back when you attempt to use a custom instruction that is not implemented.

Make sure you regenerate your makefiles for the application and BSP otherwise those flags will still be present when you go to compile. It also appears you were confused by the floating point support. The multiplication and division options in the CPU parameterization are *only* for integer operations. If you want single precision floating point support you need to add the floating point custom instruction to the Nios II core.

Altera_Forum · ‎08-17-2011

I've attached the FPU and now everything works (of course, this was the obvious fix).

Sorry for wasting your time on that =)

I changed the FPU to this one from alterawiki: http://www.alterawiki.com/wiki/configurable_fpu

I am testing with this code:


#include <stdio.h>

int main(void)
{
  float a = 10.;
  float b = 11.;
  float c = a * b;
  printf("c = %f\n", c);
  return 0;
}

And everything works fine.

Altera_Forum · ‎08-17-2011

One other heads up: when you add the FPU, floating point constants are treated as single precision. The 'normal' behavior is to have floating point constants treated as type double. So if you have constants in your code and you want to keep them at double precision, use the 'l' suffix, which ensures the constant is treated as a long double (Nios II doesn't support long doubles, so this will become a double).

So instead of this:

double a = 2.3; // with the FPU enabled you'll end up with a single precision 2.3 type cast over to a double precision 2.3 (truncation could happen if the constant was very large)

Do this:

double a = 2.3l; // no type casting will occur
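To see the size of the effect, the sketch below compares the two spellings explicitly. It uses the 'f' suffix to stand in for what happens when constants are treated as single precision; the function names are illustrative only.

```c
/* 2.3 rounded to single precision first, then widened to double:
 * the low-order bits of the double are already lost. */
double from_single(void)
{
    return 2.3f;
}

/* 2.3 written with the long double suffix ('L' is the same as 'l'),
 * then converted to double: no single-precision rounding happens. */
double from_long_double(void)
{
    return 2.3L;
}
```

The two results differ by about 5e-8, which matters if the value feeds a long chain of double-precision arithmetic.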