Honored Contributor I
757 Views

Achieving higher performance?

Hello, I want to make my design run uClinux as fast as possible. My software currently takes 40 ms per run and needs to take 8 ms. 

 

I am using Linux with an MMU, a 50 MHz clock, 32 KB data and instruction caches, and 1 KB of tightly coupled memory. 

In make menuconfig I've selected ENABLE MUL INSTRUCTION (is mulx better?) 

 

What can I do to get better performance? Any tips? 

 

Thanks
21 Replies
Honored Contributor I
10 Views

You can run the Nios II much faster than 50 MHz. We are running one in a Cyclone III at 96 MHz with no problems and have had it running as high as 168 MHz. 

 

That would be the first change. 

 

The other is to figure out which computation takes the longest and, if you can, add a hardware engine that performs that specific function. 

 

For example, if you are spending most of your time computing an FFT, add a hardware FFT, pass it the data, and read back the results. 

 

This requires a software/hardware interface and usually won't be plug and play, so you'll have to know how to work in Verilog or VHDL, but it can give you significant improvements. 
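
On the software side, such an interface often comes down to a few memory-mapped registers. The sketch below is only an illustration: the register layout, the bit meanings, and the names `accel_regs_t` and `accel_run` are all hypothetical, not a real Altera core. Under Linux with an MMU the register block would also have to be mapped into the process first (for example through a driver), which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register layout for a custom accelerator. The fields and
   bit meanings are assumptions for illustration only. */
typedef struct {
    volatile uint32_t ctrl;     /* write bit 0 to start a computation */
    volatile uint32_t status;   /* bit 0 is set when the result is ready */
    volatile uint32_t data_in;  /* input word */
    volatile uint32_t result;   /* output word */
} accel_regs_t;

#define ACCEL_CTRL_START  0x1u
#define ACCEL_STATUS_DONE 0x1u

/* Hand one word to the engine, then poll until it reports completion.
   Returns 0 on success, -1 if it did not finish within max_polls reads. */
int accel_run(accel_regs_t *regs, uint32_t input, uint32_t max_polls,
              uint32_t *out)
{
    assert(regs != NULL && out != NULL);

    regs->data_in = input;
    regs->ctrl = ACCEL_CTRL_START;

    for (uint32_t i = 0; i < max_polls; i++) {
        if (regs->status & ACCEL_STATUS_DONE) {
            *out = regs->result;
            return 0;
        }
    }
    return -1;
}
```

The `volatile` qualifiers keep the compiler from caching or reordering away the register accesses. A real design would more likely stream whole buffers (for example by DMA) than pass single words.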

 

Pete
Honored Contributor I
10 Views

Make sure you have compiled everything with -O2 or -O3. 

Look at the generated code for your critical loops and check that the compiler has made a reasonable job of it and isn't spilling locals to the stack all the time. If it is, change the C to give it a better chance.
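
As a rough illustration of "changing the C", here are two versions of the same loop. The function names are made up for this sketch, and whether the first version really produces worse code depends on the compiler and flags, so compare the assembly (for example from GCC's `-S` output) rather than taking it on trust.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulates through a pointer. Because *sum has the same type as the
   input elements, the compiler may have to assume it aliases them and
   store the running total back to memory on every iteration. */
void dot_through_pointer(const float *a, const float *b, size_t n, float *sum)
{
    assert(a != NULL && b != NULL && sum != NULL);
    *sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        *sum += a[i] * b[i];
}

/* Accumulates in a local variable, which the compiler is free to keep in
   a register for the whole loop. */
float dot_local(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```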
Honored Contributor I
10 Views

How can I compile with -O2 or -O3 using MMU Linux DSL? 

 

Also, I have an SDRAM in my design (DE2-70 board), and it runs at 100 MHz while my Nios II runs at 50 MHz. 

If I try to raise my Nios II clock, I get either a "failed to verify" error, an Oops with "kernel not syncing", or no output at all in nios2-terminal. 

 

I also tried adding a floating-point unit in Qsys and enabling hardware divide, but it didn't make a big difference. 

 

My code does mainly floating-point operations and some I/O.
Honored Contributor I
10 Views

I've tried the -O3 and -O2 flags and the result still isn't good (these flags are set only in my application Makefile). 

 

To be honest, the only thing that made a difference was increasing the data and instruction caches from 32 KB to 64 KB. 

 

I really think I am doing something wrong when compiling my Linux image.
Honored Contributor I
10 Views

I'd have thought you'd be able to run at 100 MHz. Maybe something is wrong with the clock sources when you try to do that. 

Getting the clocking right is tricky (I'm no expert). ISTR some magic -3.3 ns value and also using the memory clock as the master clock... 

If increasing the cache sizes makes a significant difference, then you must be thrashing the caches - probably worth determining whether it is the data or instruction cache. 

Floating point will be slow. If you've got the FPGA real estate, the floating-point custom instructions will help float (but not double) operations. 
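
The "float but not double" point matters in C because arithmetic is easily promoted to double without anyone noticing. A small sketch, with hypothetical function names:

```c
#include <assert.h>
#include <stddef.h>

/* Stays in single precision: every literal carries the f suffix. */
void scale_single(float *buf, size_t n)
{
    assert(buf != NULL);
    for (size_t i = 0; i < n; i++)
        buf[i] = buf[i] * 0.5f + 1.0f;
}

/* Same arithmetic, but 0.5 and 1.0 are double literals, so each element
   is converted to double, computed in double, and converted back. */
void scale_promoted(float *buf, size_t n)
{
    assert(buf != NULL);
    for (size_t i = 0; i < n; i++)
        buf[i] = buf[i] * 0.5 + 1.0;
}
```

Both functions give the same results here, but only the first can map directly onto single-precision hardware. The same applies to the math library: `sqrtf()` and `sinf()` work on float, while `sqrt()` and `sin()` work on double. Recent GCC versions can flag these promotions with `-Wdouble-promotion`; whether the toolchain in use supports it is an assumption.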

One option is to convert your floating point to fixed point and then use integer operations. For that to work well you'll really want the mulx instructions (which seem to be available only with DSP multipliers), and maybe a custom instruction to extract the required 32 bits from the 64-bit product. 

Thinking aloud: would a 32x32 full-adder array perform a multiply in a single clock (throwing gates at it)?
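
A minimal sketch of what that fixed-point conversion could look like in C, assuming a Q16.16 format (16 integer bits, 16 fractional bits). The format and the names `q16_16`, `q16_from_int` and `q16_mul` are choices made for this example, not something from the thread.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t q16_16;                 /* hypothetical Q16.16 type */
#define Q16_ONE ((q16_16)1 << 16)       /* the value 1.0 in Q16.16 */

/* Convert a small integer to Q16.16 (valid for -32768..32767). */
static inline q16_16 q16_from_int(int32_t v)
{
    assert(v >= -32768 && v <= 32767);
    return (q16_16)(v * Q16_ONE);
}

/* Multiply two Q16.16 values: form the full 64-bit product, then keep the
   middle 32 bits. The right shift of a negative value is arithmetic on
   GCC, which this sketch relies on. */
q16_16 q16_mul(q16_16 a, q16_16 b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return (q16_16)(product >> 16);
}
```

The 64-bit product is where a high-half multiply such as mulx would come in: without it the compiler has to build the upper 32 bits from several narrower multiplies. The result is truncated rather than rounded, and overflow is not checked.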
Honored Contributor I
10 Views

My clock configuration in Qsys is correct. The SDRAM clock is shifted -67 degrees from my sys_clock, and the SDRAM is running at a supported speed (I checked the datasheet). 

 

I made this SDC file: 

create_clock -period 20.000 -name clkin_50
derive_pll_clocks
derive_clock_uncertainty
Honored Contributor I
10 Views

I took a look and my Fmax is only 76 MHz on a Cyclone IV (DE2-115). 

This is bad. Maybe it is because of my design size? 

1,549,000 memory bits / 20kLE? 

 

I have two processors, a PHY, and a lot of on-chip memory.
Honored Contributor I
10 Views

I removed everything related to the second processor and my Fmax went from 80 to 115 MHz. 

 

I used the custom-instruction compiler flags from the Altera Wiki and my run time went from 26 ms to 7 ms. Really awesome.
Honored Contributor I
10 Views

I'm surprised the 2nd processor made that much difference to fmax. Maybe your device was getting full so some long tracks were being used.

Honored Contributor I
10 Views

Yeah, I am surprised too. 

How can I use a pipeline bridge to improve my Fmax even more?
Honored Contributor I
10 Views

Possibly you had problems with the internal memory regions, using a few less M9K blocks might help. 

 

I think a pipeline bridge (or even a non-pipeline bridge) can be used to: 

1) reduce the number of complex slave arbiters by putting multiple slaves behind the bridge. 

2) allow you to run some slaves at a lower clock frequency. 

Both will increase the time taken to access the bridged slaves. 

 

It ought to be possible to use the timing tools to work out the critical path. But I'm a software engineer... :-)
Honored Contributor I
10 Views

Yes, I agree with you. 

How can I use pipeline bridges? 

I am also a software engineer :P
Honored Contributor I
10 Views

If you wish to run at 100 MHz, set the clock constraint in the .SDC to 10 ns and then synthesize the design again. 

If it fails to meet that constraint, one option is to change the synthesis and fitting settings to optimize for speed and try again. 
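
The 10 ns figure is simply the period of a 100 MHz clock. As a rough sketch of the arithmetic (the helper names are made up, and real slack comes from the timing analyzer, not from Fmax alone):

```c
#include <assert.h>

/* Clock period in nanoseconds for a frequency given in MHz. */
double period_ns(double freq_mhz)
{
    assert(freq_mhz > 0.0);
    return 1000.0 / freq_mhz;
}

/* Rough margin against a target clock: the required period minus the
   period implied by the reported Fmax. Negative means the design is
   too slow for the target. */
double margin_ns(double target_mhz, double fmax_mhz)
{
    return period_ns(target_mhz) - period_ns(fmax_mhz);
}
```

With the numbers from this thread, a 50 MHz clock corresponds to the 20 ns already in the SDC file, and an Fmax of 76 MHz is about 3.2 ns short of a 100 MHz target, while 115 MHz leaves about 1.3 ns to spare.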

 

If it still fails, you need to track down the critical paths. 

If you find the critical path is related to a slave, you can consider using a dual clock bridge to run that slave at a lower clock frequency. 

 

If you find the critical path is related to the Avalon fabric itself, then: 

a) do you have unnecessary master-slave connections in your bus? 

b) consider using pipeline or dual clock bridges to reduce the complexity. 

 

In order to use bridges, you need to place a bridge in your SOPC system. 

The bridge will have a master and a slave port. 

Connect your masters to the bridge's slave port. 

Connect the bridge's master port to the slaves you want to place under the bridge.
Honored Contributor I
10 Views

Thanks for your reply rbugalho!!

Honored Contributor I
10 Views

Oh, important: 

You need to adjust not only the clock period in the .SDC; you also need to make sure the clock frequency settings are correct in any components that have them (e.g., PLLs, the SDRAM controller, etc.).
Honored Contributor I
10 Views

Rbugalho, can I connect the PLL, SDRAM, etc. under a pipeline bridge? Just one bridge for all my components?

Honored Contributor I
10 Views

Yes. 

It may not be the best performing option, though. 

But you'll have to work that out by trial and error.
Honored Contributor I
10 Views

Hmm, Linux stopped working after I inserted a pipeline bridge, but I will do other tests to see how far I can go. 

 

And when is it recommended to use a pipeline bridge?
Honored Contributor I
10 Views

I couldn't make Linux work with either a pipeline bridge or a clock crossing bridge. 

 

According to the Quartus II report, my Fmax rose from 110 to 115 MHz with a clock crossing bridge between my CPU and SDRAM on one side and my timers/UART/sysid/PIOs on the other. Nothing significant.