How can I speed up the execution of my program?

Altera_Forum · ‎02-11-2005

Hello!

How can I speed up the execution of my program on Nios II?

I'm trying to process video data before sending it to the monitor, but the CPU does it very slowly. I want to compute video sprites and an alpha channel.

The system clock and SDRAM clock are both 100 MHz. The Nios CPU is the /f version with icache = 2K and dcache = 1K. I'm using an EP1C6Q240C6.

Program memory and video memory are on different SDRAM controllers (2 video chips and 1 program chip). Data is transferred to the video controller by DMA.

Has anyone had the same problem before, or does anyone have any ideas?

Thanks.

Altera_Forum · ‎02-12-2005

For example, this operation executes very slowly:

unsigned char *mem = ...;   // program memory (16-bit data bus)
unsigned char *vmem = ...;  // video memory (8-bit data bus)

for (int i = 600; --i; )
    for (int j = 800; --j; )
        if (*mem != 0xF7) {
            *--vmem = *mem;
            mem--;
        }

I need to run this loop about 25-30 times per second, but in practice it only runs 5-10 times per second.

How can I speed up this block?

Altera_Forum · ‎02-13-2005

Hi Camelot,

You may not know that your SDRAM reads are probably taking 11 or more clocks *per read*. Even from on-chip SRAM it is no better than 5 clocks.

Writes to RAM/SDRAM are fast, so do whatever you can to avoid reads, but don't worry about writes.
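One way to act on "avoid reads" in software is to fetch the source 32 bits at a time, so that a single memory read supplies four pixels. The following is only a sketch under assumptions that are not in the original posts: the source buffer is word-aligned, the pixel count is a multiple of 4, pixel 0 sits in the low byte of each word, and the copy runs forwards. The name `blit_words` is hypothetical. Unlike the posted loop, it advances through both buffers for every pixel. It has not been measured on a Nios II, and the shifts it uses have their own cost on a Cyclone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEY 0xF7u  /* "transparent" value taken from the original post */

/* Copy n pixels from src to dst, skipping pixels equal to KEY.
 * Assumptions: n is a multiple of 4, and pixel b of each word is stored
 * in bits 8*b .. 8*b+7 (the layout a little-endian CPU sees). */
void blit_words(uint8_t *dst, const uint32_t *src, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        uint32_t w = src[i / 4];       /* one read fetches four pixels */
        for (unsigned b = 0; b < 4; b++) {
            uint8_t px = (uint8_t)w;   /* lowest byte is the next pixel */
            if (px != KEY)
                dst[i + b] = px;       /* writes are the cheap direction */
            w >>= 8;                   /* move the next pixel into place */
        }
    }
}
```

Whether this wins depends on how the read savings compare with the cost of the shifts, so it is worth checking the generated code in the objdump file before relying on it.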

Our solution was to create a Custom Instruction that bypasses the Avalon bus and all of its registered states. So now we read image data coming in from an external FIFO in 2 clocks per word, do our operations on it, and then store the results in SDRAM buffers. These buffers are later DMA'd out of the USB port. (DMA is always fast, but not too useful if you need to operate on the data.)

The other killer for us was a heavy reliance on bit-shifting registers to manipulate image data. It turns out that on the Cyclone these shifts take *one clock per bit*! To solve that we upgraded to the Stratix I. (The lowest-end Stratix I barely squeezed into our BOM budget.)

I think there are other issues in play as well, but our image processing code runs much, much better on the Stratix Nios II than on the Cyclone Nios II.

Another thing: by using the /s core instead of the /f core, you avoid the one clock spent checking the data cache on reads.

In your example, if you really only need to check for 0xF7 at, say, the end of a row, you could DMA each row minus one and then do a normal read for the tag.

Hope this helps.

Ken

Altera_Forum · ‎02-13-2005

Many thanks.

Altera_Forum · ‎02-14-2005

Is that chunk of code modular (as in, can it be moved to hardware)?

If so, you could make a custom peripheral that you send values to and have it cycle through memory doing the compares. I see this being 4 counters, a comparator, and some sync logic. Once you find that you have to modify memory

--- Quote Start ---

*--vmem = *mem;

mem--;

--- Quote End ---

you can either fire those pointers back to the Nios and let it do the transfer, or have your hardware do it for you.

If that sounds like overkill, you could try optimizing that bottleneck in software. The loop body never uses the 800x600 row and column indices, so the two nested loops can be merged into a single loop of 600 times 800 iterations. However, I don't think any optimization will help enough to hit your goal of 25-30 fps (assembly might, but you would need to look at the objdump file to make that decision).
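A minimal sketch of the single-loop form, for illustration only. It assumes the intent is a colour-keyed copy in which both pointers advance for every pixel (the posted loop moves them only when the pixel is not 0xF7), and it walks forwards rather than backwards. The name `blit_flat` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { WIDTH = 800, HEIGHT = 600, KEY = 0xF7 };

/* Copy count pixels from mem to vmem in one flat loop, leaving the
 * destination untouched wherever the source pixel equals KEY. */
void blit_flat(uint8_t *vmem, const uint8_t *mem, size_t count)
{
    assert(vmem != NULL && mem != NULL);
    const uint8_t *end = mem + count;   /* one end pointer, no i/j counters */
    while (mem != end) {
        uint8_t px = *mem++;            /* exactly one read per pixel */
        if (px != KEY)
            *vmem = px;
        vmem++;
    }
}
```

A full frame would be `blit_flat(vmem, mem, (size_t)WIDTH * HEIGHT)`. This removes the inner-loop bookkeeping but still performs one read per pixel, so it does not address the read latency discussed earlier in the thread.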

Altera_Forum · ‎02-14-2005

I have to agree with BadOmen on this one. The nature of the inner loop makes this achievable in a page or two of Verilog code, at most. If you wanted more control it could be parameterized quite easily with control registers that are driven by Nios.

The peripheral would look like this:

- An Avalon slave port, with register set that the CPU can access. The first loop would require a bit of setup on the part of the CPU: loading in the source & destination buffer base addresses, number of times through the loop, what to do the logic operation against, and kicking off the transfer with a control register. Subsequent transfers (assuming that none of these startup parameters changed) would be done by again telling the control register to start -- very low overhead on the CPU's part. This leaves it free to run other threads to sustain the system.

- An Avalon master port, read only, that reads the source buffer (address & read outputs, readdata input)

- An Avalon master port, write only, that writes to the destination buffer (address, write, and writedata outputs)

- Control registers as described above in the Avalon slave, selected with read/write & address

- Counters to increment the pointers

- A comparator to look at the readdata and check it against your coefficient, etc.
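From the CPU side, the peripheral outlined above might be driven roughly as follows. This is a sketch, not a real Altera API: the register names, their order, the `BLIT_CTRL_START` bit, and the functions `blit_setup`, `blit_start` and `blit_model_run` are all hypothetical. The last of these is a plain C stand-in for the hardware so the example can run on a PC, and it treats the base addresses as offsets into a byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register set behind the Avalon slave port. */
typedef struct {
    volatile uint32_t src_base;  /* source buffer base address */
    volatile uint32_t dst_base;  /* destination buffer base address */
    volatile uint32_t count;     /* number of transfers to perform */
    volatile uint32_t key;       /* value the comparator checks against */
    volatile uint32_t control;   /* bit 0 = start (assumed), cleared when done */
} blit_regs_t;

#define BLIT_CTRL_START 0x1u

/* One-time setup: only needed again if a parameter changes. */
void blit_setup(blit_regs_t *r, uint32_t src, uint32_t dst,
                uint32_t count, uint8_t key)
{
    r->src_base = src;
    r->dst_base = dst;
    r->count = count;
    r->key = key;
}

/* Per-frame cost on the CPU: a single register write. */
void blit_start(blit_regs_t *r)
{
    r->control = BLIT_CTRL_START;
}

/* Software model of the hardware: read master, comparator, write master. */
void blit_model_run(blit_regs_t *r, uint8_t *memory)
{
    if (!(r->control & BLIT_CTRL_START))
        return;
    for (uint32_t i = 0; i < r->count; i++) {
        uint8_t px = memory[r->src_base + i];   /* read-only master */
        if (px != (uint8_t)r->key)              /* comparator */
            memory[r->dst_base + i] = px;       /* write-only master */
    }
    r->control = 0;                             /* signal completion */
}
```

After the first `blit_setup`, each later transfer needs only `blit_start`, which is the low CPU overhead the post describes.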

I think that this is something we (Altera) have to do a better job of hammering on: we're in an FPGA! Our largest value proposition is that we aren't limited to an instruction set and processor architecture to solve a problem. If we take the processor-centric view of this, we are left with increasing clock speed, memory cost, etc. to increase the bandwidth of the transfer (just like you'd do in a non-FPGA-based processor system).

However, by spending a small number of LEs (my guess for a peripheral like this, including Avalon logic that would be generated with it: about 150 LEs) this inner loop would fly along at one transfer per clock, including the comparison to see whether the transfer could occur (assuming that your program & video memories can sustain that throughput).
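A back-of-the-envelope check of these numbers, using only figures quoted in this thread: an 800x600 frame, a 100 MHz clock, roughly 11 clocks per SDRAM read, and one transfer per clock for the peripheral. The function names are hypothetical, and the model ignores SDRAM refresh, bus arbitration and everything the CPU does besides the read.

```c
#include <assert.h>
#include <stdint.h>

enum { WIDTH = 800, HEIGHT = 600 };
#define CLOCK_HZ 100000000u   /* 100 MHz system clock from the first post */

/* Upper bound on frames per second if every pixel costs
 * clocks_per_pixel clocks (integer division rounds down). */
uint32_t max_fps(uint32_t clocks_per_pixel)
{
    uint32_t clocks_per_frame = (uint32_t)WIDTH * HEIGHT * clocks_per_pixel;
    return CLOCK_HZ / clocks_per_frame;
}

/* Whole clocks available per pixel at a target frame rate. */
uint32_t clock_budget_per_pixel(uint32_t fps)
{
    return CLOCK_HZ / ((uint32_t)WIDTH * HEIGHT * fps);
}
```

With 11 clocks per pixel, `max_fps(11)` gives 18, so the reads alone rule out 25-30 fps. `clock_budget_per_pixel(30)` gives 6 clocks, and at one transfer per clock `max_fps(1)` gives 208, which leaves plenty of headroom if the memories can sustain it.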

Here are some resources that may be of interest if you think this is worth studying more. The first two are excellent articles written by a colleague of mine that demonstrate things such as CRC calculation in C code and converting that to hardware. The last one I wrote; it is kind of dated now (from the early Nios I days) but still covers the fundamentals:

http://www.embedded.com/showarticle.jhtml?articleid=17500157&pgno=1

http://www.embedded.com/showarticle.jhtml?articleid=12800116

http://www.altera.com/education/events/northamerica/sdr_forum_2002/pldf097.pdf (look for the checksum portion of the article)

In this last one, look for the Nios vs. custom peripheral comparison at the end. Nios II actually turned out to be much faster at this algorithm than Nios I (which the article is based on), as it is so math-intensive and the 32-bit instruction set helps with that. But the custom hardware (again, the Verilog equivalent of the C code) still performs at 10x the Nios II speed at the same clock speed:

http://www.altera.com/literature/wp/wp_qrd.pdf

Altera_Forum · ‎02-15-2005

We solved that problem much more easily :)

We just mask the data in the video SDRAMs with DQM.

When DATA_OUT in the SDRAM matches some value, DQM masks that packet of data :)

And we get a hardware sprite with very little additional hardware.

Altera_Forum · ‎02-15-2005

Very elegant solution Alex!!!!