
Accelerating a C Function with User Logic

Altera_Forum

Hey. I've implemented an algorithm in C on a Nios II. It's running too slowly, as expected, because one function is called many times. I want to accelerate that function but can't decide which route to take. The function takes an array, a structure and some floating-point variables as input, and returns a floating-point value. The problem is that another user function is called inside that function. To move it to user logic, would I have to convert that function to HDL as well? Also, I definitely can't use a simple custom instruction, so would I have to use a custom component, or is there another method that would work better in this situation? I'm using Quartus 13.0 along with the Nios II EDS. My target board is a DE1 for now.

 

Ammar
10 Replies
Altera_Forum

Maybe you can break the function up into several pieces - so that the Nios CPU can execute the second 'user function' between the blocks of VHDL.

If the FP operations dominate, trying to do them in VHDL may just eat up FPGA real estate - without really making things significantly faster.

 

The other question is 'how much too slow?'. You may be able to gain enough by ensuring there are no unnecessary memory accesses, cycle stalls (e.g. a late result after a memory read or a custom instruction), cache misses (use tightly coupled memory), or incorrectly predicted branches (disable dynamic branch prediction and ensure the static prediction is correct).
Altera_Forum

Thanks for the reply. I was able to decrease the time taken considerably by altering the algorithm to use integer arithmetic instead of FP operations where possible. I've given the performance counter results for the whole algorithm before and after the change. By 'too slow' I meant that I'm trying to reduce the img_detection time as much as possible. I have already met my requirements, so to speak, but I would like to reduce the time taken by the algorithm as far as possible so I can try to use it in a real-time application.
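A minimal sketch of the kind of float-to-integer change described here, for readers who have not done it before. The thread does not show the actual code, so the Q16.16 format, the `q16_t` type and all helper names below are assumptions for illustration only.

```c
#include <stdint.h>

/* Hypothetical Q16.16 fixed-point type: 16 integer bits, 16 fractional bits. */
typedef int32_t q16_t;

#define Q16_ONE ((q16_t)1 << 16)

/* Convert a small integer to Q16.16 (caller keeps x within +/-32767). */
static inline q16_t q16_from_int(int32_t x)
{
    return (q16_t)(x * Q16_ONE);
}

/* Multiply two Q16.16 values. The 64-bit intermediate avoids overflow.
 * Right-shifting a negative value is implementation-defined in C;
 * GCC performs an arithmetic shift, which is what is wanted here. */
static inline q16_t q16_mul(q16_t a, q16_t b)
{
    return (q16_t)(((int64_t)a * (int64_t)b) >> 16);
}

/* Compare a weighted sum against a threshold using integer math only.
 * Returns 1 if sum(w[i] * x[i]) > thresh, else 0. */
static inline int above_threshold(const q16_t *w, const q16_t *x, int n,
                                  q16_t thresh)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += ((int64_t)w[i] * (int64_t)x[i]) >> 16;
    }
    return acc > (int64_t)thresh;
}
```

On a Nios II without a hardware FPU, every float operation becomes a library call, so replacing a comparison like this with integer multiplies and shifts is usually where the bulk of the saving comes from.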

 

I'm learning how the Nios II works by using the online training that Altera offers, but my main confusion is about how to use user logic to speed up an application. The bottleneck is the tree_detection function because of the number of times it's called, but I can't decide whether I should learn how to implement a custom instruction or a custom component. How does one decide which to employ, and is it more beneficial to convert the whole function to a custom component or to convert parts of that function into custom instructions? I apologize if these are stupid questions and thank you for your suggestions.

 

Just for reference, img_detection calls window_detection multiple times, which in turn calls tree_detection.

 

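after:

+----------------+------------+---------------+-------------+
| Section        | Time (sec) | Time (clocks) | Occurrences |
+----------------+------------+---------------+-------------+
| tree_detection |    5.53454 |     276727241 |      363046 |
+----------------+------------+---------------+-------------+
| img_detection  |    6.84809 |     342404652 |           1 |
+----------------+------------+---------------+-------------+

before:

+----------------+------------+---------------+-------------+
| img_detection  |   17.43497 |     871748352 |           1 |
+----------------+------------+---------------+-------------+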
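To put the counter figures in perspective, the arithmetic can be sketched as below. The numbers are copied from the report; the macro and function names are made up for this example. The clock and time columns imply a clock of about 50 MHz (276727241 clocks / 5.53454 s). The results work out to roughly 762 clocks per tree_detection call, tree_detection taking about 81% of img_detection, a speedup of about 2.5x from the integer change, and at most about 5.2x more even if tree_detection cost nothing at all.

```c
/* Counts copied from the performance counter report in this post. */
#define TREE_CLOCKS        276727241.0  /* tree_detection, after the change */
#define TREE_CALLS         363046.0     /* tree_detection occurrences       */
#define IMG_CLOCKS_AFTER   342404652.0  /* img_detection, after the change  */
#define IMG_CLOCKS_BEFORE  871748352.0  /* img_detection, before the change */

/* Average clocks spent in one tree_detection call. */
static inline double cycles_per_tree_call(void)
{
    return TREE_CLOCKS / TREE_CALLS;
}

/* Fraction of img_detection time spent inside tree_detection. */
static inline double tree_fraction(void)
{
    return TREE_CLOCKS / IMG_CLOCKS_AFTER;
}

/* Speedup already obtained by switching to integer arithmetic. */
static inline double speedup_from_integer_math(void)
{
    return IMG_CLOCKS_BEFORE / IMG_CLOCKS_AFTER;
}

/* Upper bound on further speedup if tree_detection took zero time. */
static inline double best_case_further_speedup(void)
{
    return IMG_CLOCKS_AFTER / (IMG_CLOCKS_AFTER - TREE_CLOCKS);
}
```

The last figure is the useful one when deciding how much hardware effort is justified: accelerating tree_detection alone cannot get img_detection much below about 1.3 seconds at this clock rate.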
Altera_Forum

A custom instruction only has two 32-bit inputs (plus the opcode word) and can generate one 32-bit result (although it can access other FPGA resources). Typically they are synchronous, but you could arrange to do things asynchronously.

A custom component would have to be accessed via the Avalon bus, so getting data to/from Nios registers is slower (the Nios CPU always stalls for the duration of an Avalon cycle). It is probably more appropriate if you need to access other FPGA resources - especially large memory blocks.

I've only used combinatorial custom instructions - mainly to speed up CRC16 and byteswapping.
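For anyone wondering what calling a combinatorial custom instruction looks like from C, here is a minimal sketch using a CRC16 byte update. The polynomial (CRC-16/CCITT, 0x1021), the opcode index `CRC16_CI_N` and the `USE_CRC16_CI` switch are assumptions for illustration, not details of the actual design. `__builtin_custom_inii` is the Nios II GCC built-in for a custom instruction with two integer inputs and an integer result. The software model doubles as a fallback and as a reference for checking the hardware.

```c
#include <stdint.h>

/* Bit-serial software model of one CRC-16/CCITT byte update (poly 0x1021).
 * The polynomial is an assumption; the post does not say which CRC16. */
static inline uint16_t crc16_update_sw(uint16_t crc, uint8_t data)
{
    crc ^= (uint16_t)((uint16_t)data << 8);
    for (int bit = 0; bit < 8; bit++) {
        if (crc & 0x8000u) {
            crc = (uint16_t)((crc << 1) ^ 0x1021u);
        } else {
            crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

#if defined(__nios2__) && defined(USE_CRC16_CI)
/* Hypothetical opcode extension index assigned to the instruction. */
#define CRC16_CI_N 0
/* Two 32-bit register inputs in, one 32-bit result out. */
static inline uint16_t crc16_update(uint16_t crc, uint8_t data)
{
    return (uint16_t)__builtin_custom_inii(CRC16_CI_N, (int)crc, (int)data);
}
#else
/* Host build, or hardware not enabled: use the software model. */
static inline uint16_t crc16_update(uint16_t crc, uint8_t data)
{
    return crc16_update_sw(crc, data);
}
#endif

/* CRC over a buffer, one custom instruction (or model call) per byte. */
static inline uint16_t crc16(const uint8_t *buf, unsigned len, uint16_t init)
{
    uint16_t crc = init;
    for (unsigned i = 0; i < len; i++) {
        crc = crc16_update(crc, buf[i]);
    }
    return crc;
}
```

The eight-iteration inner loop collapses into a handful of XOR gates in hardware, which is why this sort of operation suits a single-cycle combinatorial instruction.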
Altera_Forum

The rule of thumb I have used is to use custom instructions if the operands and results are going to be in NIOS registers anyway (because other software is going to create or reference them), and to use Avalon-MM or Avalon-ST based components if the operands and results are going to/from memory or other components with no other software processing. 

 

You can also use custom instructions as a prototyping tool since it's a lot easier to use the Eclipse debugger and trap inputs/outputs from your buggy logic than it is to do the same with SignalTap. The issue with staying with custom instructions long term is that although your core logic may complete execution in 1 clock cycle, the NIOS itself may take many clock cycles to issue the instruction (especially if there are any load/store operations surrounding it). 

 

For an image processing algorithm, you can develop your kernel with custom instructions and then later migrate to a non-software based component; probably aiming for structure similar to the components in the Altera Video and Image Processing Suite (Avalon-ST for image data, Avalon-MM for control registers).
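As a rough idea of what the software side of an Avalon-MM control interface could look like, here is a sketch. The register map, bit definitions and function name are entirely hypothetical; a real component defines its own. The register block is passed in as a pointer so the same code can be exercised on a host against a plain struct.

```c
#include <stdint.h>

/* Hypothetical register map of an accelerator's Avalon-MM control slave. */
typedef struct {
    volatile uint32_t src_addr;  /* address of the input data in memory */
    volatile uint32_t length;    /* number of bytes to process          */
    volatile uint32_t control;   /* bit 0: start                        */
    volatile uint32_t status;    /* bit 0: done                         */
    volatile uint32_t result;    /* result of the last run              */
} accel_regs_t;

#define ACCEL_CTRL_START 0x1u
#define ACCEL_STAT_DONE  0x1u

/* Program the component, start it, and poll for completion.
 * Returns 0 and stores the result on success, -1 if max_polls is exceeded. */
static inline int accel_run(accel_regs_t *regs, uint32_t src, uint32_t len,
                            uint32_t max_polls, uint32_t *out)
{
    regs->src_addr = src;
    regs->length = len;
    regs->control = ACCEL_CTRL_START;

    for (uint32_t i = 0; i < max_polls; i++) {
        if (regs->status & ACCEL_STAT_DONE) {
            *out = regs->result;
            return 0;
        }
    }
    return -1;
}
```

On real hardware with a data cache, register accesses need to bypass the cache; the HAL's IORD/IOWR macros in io.h are the usual way to do that, rather than a plain pointer as in this host-friendly sketch. The per-access Avalon cost matters little here because the CPU only touches a few control registers per image, while the pixel data moves without software involvement.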
Altera_Forum

After reading the information given by both of you, I also read over the custom instruction manual and then implemented a simple custom instruction as a test. It's only a simple combinatorial instruction that handles addition and bit shifting but took almost half the clock cycles that my original implementation did. I was surprised that it was so straightforward! I'm going to try to see how much of a performance increase I can achieve with custom instructions for now and will then look into custom components since I think the criteria given by both of you may be applicable.  

 

Ted, I would ideally like to do as you suggested since a non-software based component would be superior to the implementation I have now. However I'm not sure I could take on such a task with my current skill set. It's definitely something I will try in the future though. 

 

Thank you for all the advice and putting up with my questions. This part of the forum is definitely more active than the UP part.
Altera_Forum

Custom instructions are probably a good way to teach software engineers a bit of VHDL. 

You also get some insight into the way the Nios works by finding your custom instruction logic in the RTL viewer.

I have a strong suspicion (not verified) that the 'readra' and 'readrb' bits are ignored.

All other instructions (except call/jmp) will stall if the A register has been written, and stall on the B register if the bottom two bits of the opcode differ (or are the same - I've forgotten which). In many cases the A/B fields are coded as zero in order to avoid the stall.

In any case, if readrb is zero the B field can be used for any purpose - I use it to select between 16-bit and 32-bit byteswap.
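For reference, the fields being discussed can be sketched as a small encoder/decoder. The layout below is my reading of the Nios II instruction set reference for the custom instruction (A, B, C as 5-bit fields, then the readra, readrb and writerc bits, an 8-bit N, and opcode 0x32); check it against the handbook before relying on it. The struct and function names are invented for this example.

```c
#include <stdint.h>

/* Fields of a Nios II custom instruction word (layout assumed as
 * A[31:27] B[26:22] C[21:17] readra[16] readrb[15] writerc[14]
 * N[13:6] OP[5:0]; verify against the instruction set reference). */
typedef struct {
    unsigned a, b, c;                  /* 5-bit register/selector fields */
    unsigned readra, readrb, writerc;  /* 1-bit flags                    */
    unsigned n;                        /* 8-bit custom instruction index */
} custom_insn_t;

#define NIOS2_OP_CUSTOM 0x32u

/* Pack the fields into a 32-bit instruction word. */
static inline uint32_t custom_encode(custom_insn_t f)
{
    return ((uint32_t)(f.a & 0x1Fu) << 27) |
           ((uint32_t)(f.b & 0x1Fu) << 22) |
           ((uint32_t)(f.c & 0x1Fu) << 17) |
           ((uint32_t)(f.readra & 1u) << 16) |
           ((uint32_t)(f.readrb & 1u) << 15) |
           ((uint32_t)(f.writerc & 1u) << 14) |
           ((uint32_t)(f.n & 0xFFu) << 6) |
           NIOS2_OP_CUSTOM;
}

/* Unpack a 32-bit instruction word into its fields. */
static inline custom_insn_t custom_decode(uint32_t w)
{
    custom_insn_t f;
    f.a = (w >> 27) & 0x1Fu;
    f.b = (w >> 22) & 0x1Fu;
    f.c = (w >> 17) & 0x1Fu;
    f.readra = (w >> 16) & 1u;
    f.readrb = (w >> 15) & 1u;
    f.writerc = (w >> 14) & 1u;
    f.n = (w >> 6) & 0xFFu;
    return f;
}
```

With readrb clear, the five B bits are not a register number any more, so the logic can treat them as a small immediate, for example 0 for a 16-bit swap and 1 for a 32-bit swap.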
Altera_Forum

Well, so far I have only used combinatorial instructions and a multicycle instruction for a megacore function, so I haven't had to deal with the read bits. I was wondering though, why are only 2 inputs available (excluding external connections)? I know that there is also the option of an 8-bit input n, but I don't understand the reason for limiting the number of input values.

Altera_Forum

Just think about how the cpu pipeline works. 

AFAICT the clock sequence is basically: 

1) Read instruction word. 

2) Use A and B fields to read two register values from dual ported M9K. 

3) Combinatorial ALU results feed into a big mux, result selected by the opcode.

4) Writeback any result to specified register. 

(there is probably an additional clock in there somewhere - mainly for Avalon accesses)

 

This means that an instruction only has the instruction word and the values of the A and B registers available, and can write to a single register. 

 

For combinatorial ALU operations there is a store-to-load forwarder so that the result from one instruction is available to the next without going through the register file. 

For other instructions (e.g. memory read) the value has to go through the register file, so the pipeline stalls (re-executes clock 2) until the needed data is valid.

For custom instructions this stall is documented as being controlled by the readra/readrb bits.
Altera_Forum

I see. Thanks once again for all your help.

Altera_Forum

I should have mentioned that a lot of instructions have the A and B fields encoded as zero - this ensures they don't stall on writes to unwanted registers.
