Nios® V/II Embedded Design Suite (EDS)
Support for Embedded Development Tools, Processors (SoCs and Nios® V/II processor), Embedded Development Suites (EDSs), Boot and Configuration, Operating Systems, C and C++

How much overhead using NIOS?

Altera_Forum
Honored Contributor II

Just starting out in the FPGA programming process. I’ve tested the BeMicro FPGA MCU Evaluation Board with Nios II and have run some of the samples. Now that I have an idea of how it all fits together, it has led me to some questions about the project I’m working on. My ultimate goal is to run a small mathematical function as quickly as possible and check the output against some parameters. No LEDs, no push buttons, no user interaction. Just going for raw speed. I think a close example would be computing a SHA512/MD5/etc. hash for a given input.

 

I’ve managed to create a prototype by taking one of the sample projects, adding my C code to the build, and running it under NIOS II.

 

What I’m wondering is: how much overhead is there in using NIOS II to run C, versus programming the FPGA directly?

 

Can I build the math routine into the FPGA and call it from NIOS without adding much overhead to the underlying math? I understand NIOS gives me a lot of simplicity, but at what expense in speed?

 

I think the larger application can live on a PC and call into the board for the computations. Given that the functions will be relatively simple, is there a recommended way to interface a computer with FPGAs to get the maximum speed for a specific computation? As I get more comfortable with the project, I anticipate the PC-based program controlling multiple FPGAs (maybe multiple FPGAs per board and multiple boards), distributing work and collecting results. Is a NIOS-based approach a good idea for this, or should I be taking a completely different route for what is effectively offloading a single math function for maximum performance?

 

 

 

Thanks.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

I think a close example would be computing a SHA512/MD5/etc. hash for a given input.  

 

--- Quote End ---  

For an application where there is a large volume of data, having a NIOS processor perform calculations on the data would result in pretty poor performance. What you want is a streaming data processing system, where the NIOS processor is used for control. 

 

There is an example somewhere on the Altera web site showing a checksum accelerator. This looks like it might be it: 

 

http://www.altera.com/support/examples/nios2/exm-checksum-acc.html 

 

You will want something similar to this for the processing part of your design. The next thing you need to analyze is: what I/O bandwidth do you require to get data on and off the FPGA? I suspect a PCIe-based board might be your best bet. You can determine the maximum bandwidth to the PCIe board, and then determine whether a Cyclone device or a Stratix device is needed to process that bandwidth. If that bandwidth is below what you were hoping for, then that tells you how many boards you will need, or whether you need to consider a higher-bandwidth interface (e.g., more PCIe lanes per board).
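
As a quick back-of-envelope sketch of that bandwidth math (in C, though it is really just arithmetic): PCIe Gen1 signals at 2.5 GT/s per lane with 8b/10b encoding, which works out to about 250 MB/s of raw bandwidth per lane per direction. The 80% protocol efficiency and the 800 MB/s application target below are illustrative assumptions, not measurements:

#include <stdio.h>

int main(void)
{
    const double raw_mb_per_lane = 250.0;  /* PCIe Gen1, 8b/10b encoding   */
    const double protocol_eff    = 0.80;   /* assumed TLP/DLLP overhead    */
    const double target_mb_s     = 800.0;  /* assumed application target   */

    double usable_per_lane = raw_mb_per_lane * protocol_eff;
    int lanes = (int)(target_mb_s / usable_per_lane + 0.999); /* round up  */

    printf("~%.0f MB/s usable per lane -> need a x%d link\n",
           usable_per_lane, lanes);
    return 0;
}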

 

Cheers, 

Dave
Altera_Forum
Honored Contributor II

NIOS isn't a particularly fast processor compared to, say, an ARM Cortex-M4 at 150 MHz, something like a Cortex-A8/A9, most modern DSP chips, or pretty much any X86 CPU. It is convenient for SoC designs that need some programmability and can live with performance in the 20-100 MIPS range, give or take. Where a NIOS-based design can really shine in terms of performance, however, is the ability to create custom FPGA-logic hardware/algorithms and interface them tightly to the NIOS. There are two ways to do that. The first is to create a soft FPGA 'peripheral' that is mapped into the NIOS address space, and which you then use like any other off-CPU peripheral via memory-mapped I/O. The second is to use the custom instruction mechanism offered by NIOS to couple a hardware calculation engine more tightly into NIOS execution, with the black box acting as the execution engine for a special user-defined instruction. There are recent threads you should look at concerning the benefits and trade-offs of custom instructions versus generic memory-mapped peripherals.
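
To make the first option concrete, here is a minimal sketch of driving a memory-mapped accelerator from NIOS C code. The component name HASH_ACCEL_0_BASE and the three-register layout are hypothetical; in a real system the base-address macro comes from the system.h generated for your SOPC Builder/Qsys system:

#include "system.h"   /* generated base-address macros             */
#include "io.h"       /* IORD/IOWR access macros (bypass the cache) */

#define ACCEL_BASE     HASH_ACCEL_0_BASE  /* hypothetical component */
#define REG_OPERAND_A  0                  /* word offsets, made up  */
#define REG_OPERAND_B  1
#define REG_RESULT     2

static unsigned int accel_f(unsigned int a, unsigned int b)
{
    IOWR(ACCEL_BASE, REG_OPERAND_A, a);   /* write the operands     */
    IOWR(ACCEL_BASE, REG_OPERAND_B, b);
    return IORD(ACCEL_BASE, REG_RESULT);  /* read back y = f(a, b)  */
}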

 

Some algorithms are structured such that they don't parallelize well either in software or in hardware, so there will be severe serial-execution performance limits whether you execute them in software via a sequence of assembly instructions, or in hardware via some specially crafted state machine or synchronous logic plus look-up-table implementation. Some algorithms simply need so many registers, memory blocks, and so on that they aren't practical to implement "in hardware" other than via some kind of sequenced state machine, similar to a CPU, which achieves efficiency by using RAM/FLASH for much of the program and data storage once register and ALU resources are exhausted.

 

If you have an algorithm whose performance is limited by serial paths that cannot be parallelized, but you need to calculate that algorithm independently on many distinct inputs, you may be able to calculate y1 = f(x1); y2 = f(x2); y3 = f(x3) ... yn = f(xn) in parallel, and thereby achieve a factor-of-N speedup even for a serial algorithm, provided you have the parallel FPGA memory/register/ALU resources available to do that.
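
A minimal C sketch of that fan-out, assuming four identical engine instances each exposed at its own base address (the ENGINE_n_BASE macros, register offsets, and done bit below are all hypothetical):

#include "system.h"   /* hypothetical ENGINE_n_BASE macros */
#include "io.h"

#define NUM_ENGINES 4
#define REG_X     0   /* write: input operand x        */
#define REG_CTRL  1   /* write 1 to start              */
#define REG_STAT  2   /* read: bit 0 set when done     */
#define REG_Y     3   /* read: result y = f(x)         */

static const unsigned int engine_base[NUM_ENGINES] = {
    ENGINE_0_BASE, ENGINE_1_BASE, ENGINE_2_BASE, ENGINE_3_BASE
};

void run_parallel(const unsigned int *x, unsigned int *y, int n)
{
    int i, e, batch;

    for (i = 0; i < n; i += NUM_ENGINES) {
        batch = (n - i < NUM_ENGINES) ? (n - i) : NUM_ENGINES;

        for (e = 0; e < batch; e++) {          /* kick off a batch  */
            IOWR(engine_base[e], REG_X, x[i + e]);
            IOWR(engine_base[e], REG_CTRL, 1);
        }
        for (e = 0; e < batch; e++) {          /* collect results   */
            while (!(IORD(engine_base[e], REG_STAT) & 1))
                ;                              /* poll the done bit */
            y[i + e] = IORD(engine_base[e], REG_Y);
        }
    }
}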

 

If you can efficiently parallelize the algorithm within a single invocation y = f(x), then there is a possible high-performance scenario in which you can calculate the result very quickly, depending only on having enough FPGA resources operating at high enough speeds to accomplish the parallel calculation, until you become serial-execution limited or resource/timing limited.

 

There is a maximum rate at which NIOS can execute a given instruction such as a simple y = f(x), loading a new X and storing a new Y result on each iteration. If you can't usefully get your algorithm to execute faster than that limiting rate even with a fully hardware-implemented solution, then using NIOS as an engine to feed data into your algorithm, store the results, and possibly help calculate the non-performance-critical parts may be beneficial. If, on the other hand, you really would be held back by the NIOS and it is not performing essential functions for you, it is probably best to use a pure custom hardware / state machine implementation.

 

You should understand how to efficiently partition your algorithm into hardware logic, state machines, and memory/register resources in order to make an informed decision about how best to accelerate its implementation. You can't neglect the fetching of input data and the storage of output data, since the block RAM or DRAM you might use will also be a throughput-limiting factor in high-performance algorithms. First you probably ought to implement the algorithm in C on X86/SSE (possibly also in MATLAB/SCILAB), learn about its algorithmic complexity and bottlenecks via analysis, then construct a Verilog implementation that fits the resources available in your target FPGA, trading resources against speed intelligently depending on how performance-critical each piece of the algorithm is. Once that works, you should easily be able to tie the "black box" in as either a custom instruction/peripheral or a fully independent hardware engine that doesn't use NIOS.

 

If performance is ALL you care about, and the FPGA is just a means to that end, you might be disappointed with the performance-versus-cost of FPGA solutions compared with what you can get from X86 or GPGPU implementations. For many classes of problems, the dedicated-architecture but software-programmable silicon of the latter will exceed the performance of FPGA solutions, especially given factors like RAM bus bandwidth/speed, the quantity of memory available, and gigahertz clock speeds (i.e. high MIPS for algorithms whose core loops run in L1/L2 cache). FPGAs will generally win where there is no efficient mapping of a given core ALU operation onto CPU/GPU instructions or fast cache/register-based look-up tables. FPGAs are also good at narrow data widths, e.g. operating on 1-bit, 2-bit, or 4-bit data. For 32-bit, 64-bit, floating-point, and similar data types, mainstream CPUs tend to be pretty highly optimized compared with what you can synthesize on a medium-sized FPGA.
Altera_Forum
Honored Contributor II

Great responses. Thanks. Reading them carefully and will probably be back with a few more clarifying questions.

Altera_Forum
Honored Contributor II

Following up... 

 

 

@dwh, that sample is a great starting point for the next step in my learning. Thanks. The recommended platform is the NIOS II Embedded Evaluation Kit. I don’t mind spending the money on the kit, but it looks like it has several features (an LCD, for example) that I really don’t need right now. Is there a less feature-rich, more focused product I could use to implement this example?

 

@af1010, is the ‘custom instruction mechanism’ the C-to-Hardware (C2H) code optimization that Altera produces? For the implementation testing, I’ve already got the routine working in multiple high-level languages (C, Java, C#) just to test the performance of various systems. It looks like MATLAB and then Verilog are the next places to go. There should not be any floating-point calculations, and there should be a low amount of data transfer relative to the algorithmic work applied to the data, so I am really hoping the FPGA will be the platform to target for good speed.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

 

@dwh, that sample is a great starting point for the next step in my learning. Thanks. The recommended platform is the NIOS II Embedded Evaluation Kit. I don’t mind spending the money on the kit, but it looks like it has several features (an LCD, for example) that I really don’t need right now. Is there a less feature-rich, more focused product I could use to implement this example?

 

--- Quote End ---  

Any kit with a Cyclone/Arria/Stratix device will do; these all support NIOS processors. A MAX II kit will not, as those devices have no on-chip RAM. 

 

Most of the basic NIOS examples do not need to use pins on the FPGA other than JTAG, so porting to any kit is simple; e.g., a design with a NIOS II, JTAG-UART, JTAG-to-Avalon master (for system console), on-chip RAM, and an accelerator does not require any FPGA pins at all. If your board has LEDs, then you can add a PIO and toggle them: first using system console via JTAG (from your development PC), and then using the NIOS II processor.
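
For the NIOS side of that LED toggle, here is a minimal sketch. PIO_LED_BASE is a hypothetical name; use whatever macro your generated system.h defines for the PIO component. usleep() is provided by the NIOS II HAL:

#include <unistd.h>                  /* usleep()                      */
#include "system.h"                  /* generated base-address macros */
#include "altera_avalon_pio_regs.h"  /* PIO register access macros    */

int main(void)
{
    unsigned int leds = 0;

    for (;;) {
        IOWR_ALTERA_AVALON_PIO_DATA(PIO_LED_BASE, leds);
        leds ^= 0x1;     /* toggle bit 0      */
        usleep(500000);  /* wait about 0.5 s  */
    }
    return 0;
}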

 

If you have trouble, just ask. 

 

Cheers, 

Dave
Altera_Forum
Honored Contributor II

C2H is something different: it basically facilitates using C code as an HDL, to ease porting existing C algorithms to FPGA-resident implementations. For an algorithm that is well understood, an FPGA architecture that is well understood, and a developer who knows how to work efficiently in VHDL or Verilog, there may be no great need for C2H, since it is just trying to accomplish the mapping of algorithm (as defined by existing code) -> HDL/FPGA implementation, exactly what you'd do manually when hand-porting the algorithm to an HDL. You may benefit from C2H if you have a very optimized C implementation that is difficult to port to an HDL, but it isn't the custom instruction mechanism for NIOS that I was referring to.

 

NIOS custom instructions are just a mechanism that takes a set of reserved NIOS opcodes and links their execution to a user-defined 'execution unit'. When such an opcode is encountered in the code, the custom logic is triggered; it may read input data from the NIOS and produce output data for the NIOS as a result of its execution. It behaves just like a native instruction such as ADD Src1, Src2, Dest, which adds two source operands and stores the result in a destination location, except that instead of 'ADD' it performs whatever function you choose to implement.
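
From the software side, the NIOS GCC toolchain exposes these through __builtin_custom_* intrinsics, and system.h wraps the one for your component in a generated ALT_CI_<NAME> macro. A minimal sketch, assuming a two-operand integer instruction; the selector value 0 and the macro name MY_F here are made up:

#include "system.h"

/* Two 32-bit operands in, one 32-bit result out, selector N = 0.
 * "inii" = int result, int selector, int operand, int operand.  */
#define MY_F(a, b)  __builtin_custom_inii(0, (a), (b))

unsigned int y_from_ci(unsigned int x1, unsigned int x2)
{
    /* Executes like a native ALU instruction: one opcode, two
     * source registers, one destination register.              */
    return MY_F(x1, x2);
}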

 

Since you say your algorithm is math-dense relative to memory load/store data transfers, you may be able to get it working pretty quickly using X86 C/SSE assembler and register/L1-cache resources, or using a GPU with OpenCL/CUDA and heavy use of cache/local/global memory rather than general GDDR/host RAM. If that is the case and you can achieve giga-ops/second execution via an efficient CPU or GPU implementation of the algorithm, that sets a pretty significant bar for an FPGA to exceed before it processes faster than, say, a $400 PC with a $200 GPU added, considering that most mid-range FPGAs cost similar money for far less raw compute capacity. There are certainly many things a good FPGA can do faster than a mid-range CPU/GPU, particularly if you need a hardware I/O interface to a data acquisition system as part of the design, but for a purely computational problem I'd look at GPGPU / X86 SSE / ASM before going FPGA.

 

MATLAB/SCILAB is handy for exploring mathematical algorithm implementations (e.g. linear algebra / matrix math / FFT, etc.) that might be harder to prototype in C or assembly; you then switch to C/ASM once you've decided how to implement the algorithm efficiently in a lower-level but faster language. If your C is already efficient relative to the achievable bound given the algorithmic complexity, you may not need MATLAB/SCILAB at all (though they can be fast in their own right when they can use GPGPU, LAPACK, or similar efficient execution engines for the core calculations).

 

This is some of the NIOS material I'd suggest looking at, though first I'd look at the FPGAs in general (their clock rates, memory/register density, etc.) and figure out with napkin/ballpark calculations whether, in the best case, the hardware can even run your algorithm fast enough, compared with other options, to make a Cyclone / Arria / Stratix implementation worthwhile versus X86 / GPU / whatever.

ftp://ftp.altera.com/outgoing/download/support/ip/processors/nios2/niosii_docs_11_0.zip 

http://www.altera.com/literature/ds/ds_nios2_perf.pdf 

http://www.altera.com/devices/processor/nios2/benefits/performance/ni2-high-performance.html 

http://www.altera.com/devices/processor/nios2/cores/fast/ni2-fast-core.html 

 

If so, moving to synthesizable Verilog would be a good next step, and there are good PC-based simulators you can use before you need to run on an FPGA.

Altera_Forum
Honored Contributor II

If your processing time is long (more than 10 instructions), then I would build a standard Avalon MM slave rather than using the custom instruction interface. The latter is hard to do right, and if you don't get it right you will pull down the fmax of your processor.

 

If you put your accelerator in its own clock domain then you can run it at higher frequencies for better throughput. 

 

If you're using MATLAB, have you considered using the DSP Builder Advanced Blockset to turn your design into hardware?
Altera_Forum
Honored Contributor II

Again, thanks for all of the advice. Taking it all in and working through some of the example projects to get a better understanding. I'm sure there will be more questions soon.

Altera_Forum
Honored Contributor II

No matter what hardware-based approach you take, there will always be some overhead, but with a bit of planning and an algorithm that pipelines well you can hide it. If your algorithm has few or no data dependencies, you can typically pipeline it, which gives you potentially high throughput at the expense of computation latency. To overcome the latency you just send data at the engine as fast as possible; if your input data can be placed in a memory buffer, then sending it with a DMA engine (or building the master directly into your data engine) is the most efficient way to do this.
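
For the DMA route, here is a minimal sketch using the NIOS II HAL DMA API, so the processor is not copying words one at a time. "/dev/dma_0" is whatever device name your system's DMA component gets, and the busy-wait on the completion flag is just for illustration:

#include <sys/alt_dma.h>   /* NIOS II HAL DMA API */

static volatile int tx_done = 0;

static void dma_done(void *handle)   /* runs when the transfer ends */
{
    tx_done = 1;
}

int send_block(const void *buf, int len)
{
    alt_dma_txchan tx = alt_dma_txchan_open("/dev/dma_0");
    if (tx == NULL)
        return -1;                   /* no such DMA channel         */

    tx_done = 0;
    if (alt_dma_txchan_send(tx, buf, len, dma_done, NULL) < 0)
        return -1;                   /* transfer could not start    */

    while (!tx_done)
        ;                            /* wait for completion callback */
    return 0;
}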

 

Some of what I just mentioned is covered in this document: http://www.altera.com/literature/hb/nios2/edh_ed5v1_03.pdf
Altera_Forum
Honored Contributor II

@BadOmen, thanks. That's a great document with a lot of good information in it. I'm working my way through the "FPGA Designer Curriculum". I'm sure that will answer many questions... and cause many more.
