Nios® V/II Embedded Design Suite (EDS)
Support for Embedded Development Tools, Processors (SoCs and Nios® V/II processor), Embedded Development Suites (EDSs), Boot and Configuration, Operating Systems, C and C++

FP Custom Instructions - exp() won't get any faster

Altera_Forum
Honored Contributor II
1,470 Views

Hi everybody. 

I'm working on a design using Nios II/f that heavily relies on floating point math. 

In Qsys I added the Floating Point Hardware component and connected it to the custom instruction master of the Nios II.

Apparently everything works. 

 

My design is a neural network, so basically I have one loop of multiply/add operations and one loop of exponential evaluations.

The first loop gets faster (about 10x); the second one doesn't change at all.

 

 

I tried to make a benchmark code, something like: 

 

    Performance Start
    float a = 10;
    for (i = 0; i < 1000; i++)
        a = exp(a);
    Performance End

 

And run it with and without the following pragmas: 

    #pragma no_custom_fadds
    #pragma no_custom_fsubs
    #pragma no_custom_fmuls
    #pragma no_custom_fdivs

 

The performance (about 6 Mcycles, if I remember correctly) doesn't change.

 

Hardware divide is enabled both in the Floating Point Hardware module and in the Nios II/f. 

 

My doubt is this: since I'm using Qsys and not SOPC Builder, I couldn't find the "custom instructions" tab in the Nios II module where you specify the use of the FP hardware. I assumed that in Qsys, once the module is connected, the FP hardware is always used unless the de-activating pragmas are declared.

Maybe I was wrong here? 

 

These are my includes: 

 

    #include <stdio.h>
    #include "io.h"
    #include <sys/alt_alarm.h>
    #include <altera_avalon_performance_counter.h>
    #include "system.h"
    #include <math.h>
    #include <float.h>

 

Thanks for the help! :)
12 Replies
Altera_Forum

I suspect that exp() is using double, not float. Try expf() instead. 

You'll need to make sure the library code is built with the correct compiler options.
Altera_Forum

Yes, I tried expf() too. Overall performance improves, but I still don't see any difference with and without the pragmas.

 

About the compiler options... are you talking about the BSP settings? I'm using the settings from the "hello world" template. 

Any particular option I should check?
Altera_Forum

You need to make sure the float instructions are used when the BSP is built. 

It might be worth looking at the symbol table for the final image - you probably don't want the software float math functions in your image.

(You probably don't want the double ones either, but printf() might pull those in.) 

 

You might also decide that you only need a limited accuracy exp() function - and write one that is faster but less accurate.
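As a rough sketch of what such a reduced-accuracy function could look like (this is one common approach, not the code anyone in this thread used, and fast_expf is a made-up name): split x into k*ln(2) + r, evaluate a short polynomial for exp(r), and scale by 2^k. The polynomial needs only float multiplies and adds. With a degree-4 Taylor polynomial the relative error should be on the order of 1e-4, and there is no handling of overflow, underflow or NaN.

```c
#include <math.h>

/* Approximate exp(x) in single precision.
 * Assumes x is finite and small enough that the result fits in a
 * float (roughly -87 < x < 88); no range checks are performed. */
float fast_expf(float x)
{
    const float inv_ln2 = 1.44269504f;   /* 1 / ln(2) */
    const float ln2     = 0.693147181f;  /* ln(2)     */

    /* k = round(x / ln 2), so that |r| <= ln(2)/2 below. */
    float t = x * inv_ln2;
    int   k = (int)(t + (t >= 0.0f ? 0.5f : -0.5f));
    float r = x - (float)k * ln2;

    /* Degree-4 Taylor polynomial for exp(r), in Horner form. */
    float p = 1.0f + r * (1.0f + r * (0.5f + r * (0.16666667f
                   + r * 0.041666667f)));

    /* exp(x) = 2^k * exp(r) */
    return ldexpf(p, k);
}
```

Dropping or adding polynomial terms trades accuracy against the number of multiply/add operations per call.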
Altera_Forum

In the BSP editor I've searched for some reference on the FP custom functions. 

 

Under Advanced -> Hal -> Make  

 

fpu_present 

hardware_divide_present 

hardware_fp_cust_inst_divider_present 

hardware_fp_cust_inst_no_divider_present 

hardware_multiplier_present 

 

 

I tried to turn these options On and Off but with no effect.  

I'm not quite following you on the symbol table. Where can I find it? 

 

About the exp() accuracy, since I need to compute a much more complex function (a TanSig) I'll surely build a custom instruction using a LUT.  

Still I'd like to be sure to have implemented the FP Hardware correctly ;)
Altera_Forum

You might need to just make sure that the BSP is all recompiled. Somewhere there is a box for its compiler options (IMHO it should default to -O3, but it doesn't). I think you have to pass the custom instruction numbers for the fp instructions on the compiler command line - so they probably have to go in there. 

 

Alternatively, find the code for expf() and compile it as part of your program.
Altera_Forum

I have the same problem with division in the Floating Point Hardware, and I don't know why either. :cry:

Hardware divide is enabled both in the Floating Point Hardware module and in the Nios II/f. I'm sure the divider is present in the Floating Point Hardware, because I compared the total logic elements with division turned on and off:

Turn on: 9724 LEs

Turn off: 5041 LEs

I'm sure the Floating Point Hardware is being used, because it works for float multiplication.

Has anyone resolved this? Can you help me?
Altera_Forum

1) Check that you are doing 'float' division, not 'double' division (would call divdf3). 

2) Check that the relevant file(s) have been compiled to use the custom instruction, not calling divsf3.
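A short sketch of the difference in point 1 (the function names are hypothetical): when either operand is a double, C's usual arithmetic conversions turn the whole division into a double division, which the single-precision hardware cannot perform. An unsuffixed literal such as 0.1 also has type double and can trigger the same promotion.

```c
/* Both operands are float: a single-precision division, which the
 * compiler implements with the FP divide custom instruction if it has
 * been told about it, or with a call to __divsf3 otherwise. */
float div_float(float num, float den)
{
    return num / den;
}

/* One operand is double: num is converted to double, the division is
 * done in double precision (__divdf3 in software), and the result is
 * converted back to float on return. */
float div_mixed(float num, double den)
{
    return (float)(num / den);
}
```

Both functions return the same value for simple inputs; the difference only shows up in the generated code and in the cycle count.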
Altera_Forum

I couldn't find divsf3 in my Nios II application or BSP. Can you show me where it is?

Altera_Forum

Check the generated code (objdump -d) and/or the linker map file (might be generated by default). 

I might have got the function name slightly wrong!
Altera_Forum

Hi! I resolved it. The FP hardware takes float inputs and produces float outputs, but the variables in my project were integers. Thank you very much!
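Assuming the fix was that the operands were declared as int in the C code (the post does not show the actual source), the effect can be sketched like this, with hypothetical function names: dividing two ints is an integer division, so the FP divider is never involved, and the fractional part is discarded.

```c
/* Integer operands: integer division, result truncated toward zero.
 * No floating-point instruction is generated for this. */
int ratio_int(int num, int den)
{
    return num / den;
}

/* Converting the operands to float first makes this a
 * single-precision division, which the FP hardware can handle. */
float ratio_float(int num, int den)
{
    return (float)num / (float)den;
}
```

For example, ratio_int(7, 2) gives 3, while ratio_float(7, 2) gives 3.5.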

Altera_Forum

I'm a beginner with custom instructions, and the only document I have is the Nios II Custom Instruction User Guide. It's not enough. Do you have any more documents or tutorials on it?

Altera_Forum

I've not seen any 'useful' docs ... 

The only ones I found just explained how to tick the boxes to enable the FP instructions. 

I've written a couple of combinatorial custom instructions. What I realised is that the Nios doesn't really have an instruction 'decoder', just a great big mux that selects the required result. The A and B register values are read for every instruction, and even tightly coupled data memory is read every clock - the value is discarded unless the instruction is a memory read for the required address range.

So a combinatorial instruction creates a result value every clock that can be based on all 32 bits of the instruction word and the 32-bit values read from the register file.