Honored Contributor I
1,163 Views

NIOS2 HW & DIV Instructions + FPU

Hello, I am in charge of designing a system that does a lot of complex sin/cos/* operations, and I need them to be fast. 

 

I am using NIOS2-F at 100 MHz in a Cyclone IV. 

I've enabled hardware multiply and divide on the NIOS2-F, and I am using 64K data and instruction caches. 

 

My software is being compiled with these flags 

CFLAGS = -Wall -DNOCRYPT -mhw-div -mhw-mul -mcustom-fpu-cfg=60-1 -mcustom-fpu-cfg=60-2 -O3  

 

I tried to use the custom FPU instruction in QSys, but my performance is slower with the custom FPU instruction attached to the NIOS2-F (why?) 

 

Is there anything else I can do? Any suggestions?
0 Kudos
28 Replies
Highlighted
Honored Contributor I
10 Views

Depending on what you are doing you might be able to reduce the amount of maths required! 

If you only need limited precision, use a lookup table and linear interpolation (and maybe a single Newton-Raphson step). 

If you are generating values in sequence (e.g. tone generation) then use the identity: sin(a+b) = 2sin(a)cos(b) - sin(a-b) 

Use fixed-point maths (in integers) rather than floating point.
0 Kudos
Highlighted
Honored Contributor I
10 Views

Thanks for your help, but I can't use these math optimizations :S 

Maybe if I change the GCC version I can get a better result? I am using version 3.4.6.
0 Kudos
Highlighted
Honored Contributor I
10 Views

Changing the version of gcc won't make much difference. 

It might be worth looking at the generated code (either pass '-S -fverbose-asm' to gcc and look at the generated .s file, or run 'objdump -d' on the object/program file). 

Also remember that (IIRC) the FP opcodes only do 'float' not 'double', and you don't want to be converting between float and double either.
0 Kudos
Highlighted
Honored Contributor I
10 Views

I will take a look into that. DSL, why do I get worse performance when I add the floating-point custom instruction to the NIOS2 in QSys? 

That doesn't make sense to me. Does the NIOS2 already have those custom instructions built in? 

 

Also, what should I be looking for in the objdump output? (I've never used this command.) 

 

This is a part of my meter_run (main function) 

 

4e0c: 39000017 ldw r4,0(r7)
4e10: 31800104 addi r6,r6,4
4e14: 10c7ff32 custom 252,r3,r2,r3
4e18: 1105ff32 custom 252,r2,r2,r4
4e1c: 40d1ff72 custom 253,r8,r8,r3
4e20: 4893ffb2 custom 254,r9,r9,r2
4e24: 39c00104 addi r7,r7,4
4e28: 29400904 addi r5,r5,36
4e2c: 32bff51e bne r6,r10,4e04 <meter_run+0xb4>
4e30: 73800044 addi r14,r14,1
4e34: 5a400015 stw r9,0(r11)
4e38: 62000015 stw r8,0(r12)
4e3c: 6b400104 addi r13,r13,4
4e40: 63000104 addi r12,r12,4
4e44: 5ac00104 addi r11,r11,4
4e48: 73ffe31e bne r14,r15,4dd8 <meter_run+0x88>
4e4c: d8c00217 ldw r3,8(sp)
4e50: d1e09117 ldw r7,-32188(gp)
4e54: d2e09017 ldw r11,-32192(gp)
4e58: d9000617 ldw r4,24(sp)
4e5c: da400117 ldw r9,4(sp)
4e60: da800517 ldw r10,20(sp)
4e64: d9400317 ldw r5,12(sp)
4e68: d2209317 ldw r8,-32180(gp)
4e6c: d3209217 ldw r12,-32184(gp)
4e70: 1ac5ff32 custom 252,r2,r3,r11
4e74: d9800717 ldw r6,28(sp)
4e78: 19c7ff32 custom 252,r3,r3,r7
4e7c: 1a47ff72 custom 253,r3,r3,r9
4e80: 390fff32 custom 252,r7,r7,r4
4e84: 1285ff72 custom 253,r2,r2,r10
4e88: 22c9ff32 custom 252,r4,r4,r11
4e8c: 1907ffb2 custom 254,r3,r3,r4
4e90: 11c5ff72 custom 253,r2,r2,r7
4e94: 2b09ff32 custom 252,r4,r5,r12
4e98: 2a0bff32 custom 252,r5,r5,r8
4e9c: 1947ff72 custom 253,r3,r3,r5
4ea0: 1105ff72 custom 253,r2,r2,r4
4ea4: 4191ff32 custom 252,r8,r8,r6
4ea8: 330dff32 custom 252,r6,r6,r12
4eac: 19a9ffb2 custom 254,r20,r3,r6
4eb0: 1227ff72 custom 253,r19,r2,r8

 

I guess it is using custom instructions, right? (Even though I didn't add the custom instruction component in QSys.)
0 Kudos
Highlighted
Honored Contributor I
10 Views

The sine/cosine implementations of the IEEE variant of newlib use lookup tables to compute the result so I wouldn't expect adding the FPU to make any difference at all. If you want a faster implementation you either need to look for software optimizations or implement the sine/cosine in hardware and bolt it up to the CPU as a custom instruction. There are compiler flags you can pass in to tell the tools to use the custom instruction implementation instead of the software library. 

 

Optimizations you can look at are Taylor series, CORDIC, etc. Some may work well in software with the FPU and some would make more sense being offloaded into hardware. There are others you can look at if you want to trade off accuracy or have inputs that are bounded to the point where you can use lookup tables efficiently. 

 

If you are seeing those 'custom' opcodes without a custom instruction then I would think you are having the wrong code linked in. It could be that code is being linked in but never executed but that would surprise me.
0 Kudos
Highlighted
Honored Contributor I
10 Views

 

--- Quote Start ---  

 

If you are seeing those 'custom' opcodes without a custom instruction then I would think you are having the wrong code linked in. It could be that code is being linked in but never executed but that would surprise me. 

--- Quote End ---  

 

Yes, that is really weird, because I am not using the custom instruction IP in QSys. 

However, if I don't use the compiler flags my code runs 70~80% slower.
0 Kudos
Highlighted
Honored Contributor I
10 Views

Well if those instructions are called with no implementation then it'll probably result in one cycle per instruction but the results would be incorrect. Also you are not adding those flags manually are you? In your earlier post I see two usages of "mcustom-fpu-cfg" which shouldn't be happening since you are either using 60-1 (+, -, *) or 60-2 (+, -, *, /) and I'm not sure what happens when both are specified. This flag should have been passed in automatically for you.

0 Kudos
Highlighted
Honored Contributor I
10 Views

I'm not sure which functions are in 'newlib' (libc), but for speed you'd want the sin/cos functions that act on float (not double) and have compiled newlib itself to use the FP custom instructions.

0 Kudos
Highlighted
Honored Contributor I
10 Views

 

--- Quote Start ---  

Well if those instructions are called with no implementation then it'll probably result in one cycle per instruction but the results would be incorrect. Also you are not adding those flags manually are you? In your earlier post I see two usages of "mcustom-fpu-cfg" which shouldn't be happening since you are either using 60-1 (+, -, *) or 60-2 (+, -, *, /) and I'm not sure what happens when both are specified. This flag should have been passed in automatically for you. 

--- Quote End ---  

 

 

Hmm, when both are specified I guess the last one takes effect. And yes, I am adding those flags manually; how else can I do it? I am compiling my application on Linux. 

 

 

--- Quote Start ---  

I'm not sure which functions are in 'newlib' (libc), but for speed you'd want the sin/cos functions that act on float (not double) and have compiled newlib itself to use the FP custom instructions. 

--- Quote End ---  

 

How can I do this?
0 Kudos
Highlighted
Honored Contributor I
10 Views

Hmmm..... I've just been looking at the newlib sources (from the gcc 3 build). 

There seem to be 2 copies of the trig functions; both seem to use the Taylor series (possibly with slightly modified coefficients to reduce the error). 

 

I'm not sure how well either version will actually compile for Nios II. For speed you probably want the constants loaded from the 'small data' segment, and I don't think that will happen with the given code and the Altera-built compiler. I'd also check that GET_FLOAT_WORD() doesn't involve a memory write-read; if it does, replace it with an inline asm function. 

 

You might find it worth while getting those sources and compiling the functions as part of your app - that will make changing them much easier.
0 Kudos
Highlighted
Honored Contributor I
10 Views

If you are manually passing in the flags for your builds make sure you keep them in sync with the hardware implementation (i.e. don't pass in the 60-1 or 60-2 flags if you don't have the FPU added to the Nios II core). 

 

The Taylor series I remember finding was under the POSIX implementation in newlib and it wasn't long enough to give an accurate answer. I ended up hardware accelerating it and since it only used the first 3 terms of the series I would end up with things like sqrt(1) = 0.999999.
0 Kudos
Highlighted
Honored Contributor I
10 Views

I am using the 60-1 and 60-2 flags without having a FPU added to the Nios II core. 

 

I ran these tests to see whether float operations are working: 

static float legal;
static float legal2;

legal = 3.4;
legal2 = 2;
log_msg("Multiplicando: %f -- Dividindo: %f", legal * legal2, legal/legal2);

legal = 4;
legal2 = 2.3;
log_msg("Multiplicando: %f -- Dividindo: %f", legal * legal2, legal/legal2);

legal = 4.5;
legal2 = 2.3;
log_msg("Multiplicando: %f -- Dividindo: %f", legal * legal2, legal/legal2);

legal = 4.3148981934;
legal2 = 2.313319843;
log_msg("Multiplicando: %f -- Dividindo: %f", legal * legal2, legal/legal2);

legal = 4.3148981934;
legal2 = 2.313319843;

The results were all good. Is my test right? If I remove those flags, my processing performance drops a lot.
0 Kudos
Highlighted
Honored Contributor I
10 Views

Hmmm.... Either: 

1) your nios has the fpu instructions. 

2) the compiler is optimising out the arithmetic - which it can do for the above code. 

3) there are no custom instructions at all, and the cpu takes an 'unknown instruction' trap, and some code emulates them. (I'm not sure this is possible at all) 

 

Try with 'volatile float legal;' which will force the compiler to do the memory accesses.
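
Something along these lines, reusing the variable names from the test above (the two wrapper functions are just for illustration). Because the operands are volatile, the compiler has to load them at run time and cannot fold the multiply and divide into constants:

```c
/* volatile: every read must really happen, so the arithmetic below
 * cannot be pre-computed by the compiler at -O3. */
static volatile float legal  = 3.4f;
static volatile float legal2 = 2.0f;

/* Multiply executed at run time. */
float runtime_product(void)
{
    return legal * legal2;
}

/* Divide executed at run time. */
float runtime_quotient(void)
{
    return legal / legal2;
}
```

If the printed results are still correct with this version, the arithmetic is really being executed by something, whether that is hardware or the software float library.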
0 Kudos
Highlighted
Honored Contributor I
10 Views

Can the NIOS2 have the FPU instructions by default? I just selected hardware divide/multiply on the NIOS2-F core. 

 

I will test with volatile.
0 Kudos
Highlighted
Honored Contributor I
10 Views

Since your code is working on constants I wouldn't expect the FPU to be used even if one was present since those results can be pre-computed. 

 

The FPU is pretty big compared to the Nios II core so the FPU is not added automatically since many would be annoyed by this (especially the folks who don't need float support or who take the time to perform fixed point math). 

 

The hardware multiply and divide options of the various Nios II cores are for integer data types, not float/double.
0 Kudos
Highlighted
Honored Contributor I
10 Views

Damn, as DSL said: when I changed my variables to volatile, all the values from multiplication and division came out as 0.

0 Kudos
Highlighted
Honored Contributor I
10 Views

Well, but when I use volatile int my operations also result in 0 (even when I remove the FPU flags from the Makefile)... 

Weird.
0 Kudos
Highlighted
Honored Contributor I
10 Views

At least this is making more sense now.... So with the volatile keyword you were preventing the compiler from performing the floating point operations at compile time so that they would have to be executed at runtime. I would expect 0 to be read back when you attempt to use a custom instruction that is not implemented. 

 

Make sure you regenerate your makefiles for the application and BSP otherwise those flags will still be present when you go to compile. It also appears you were confused by the floating point support. The multiplication and division options in the CPU parameterization are *only* for integer operations. If you want single precision floating point support you need to add the floating point custom instruction to the Nios II core.
0 Kudos
Highlighted
Honored Contributor I
10 Views

I've attached the FPU and now everything works (of course, this was the obvious fix). 

Sorry for wasting your time on that =) 

 

 

I changed the FPU to this one at alterawiki: http://www.alterawiki.com/wiki/configurable_fpu 

 

I am testing with this code 

#include <stdio.h>

int main()
{
    float a = 10.;
    float b = 11.;
    float c = a * b;
    printf("c = %f\n", c);
    return 0;
}

 

And everything works fine
0 Kudos