Altera_Forum
Honored Contributor I
3,004 Views

Nios II/f performance on Cyclone IV E

Hi, 

 

I am trying to evaluate Nios II/f on a dev. board fitted with 

EP4CE55F23C7. I've ported the Cyclone III Dhrystone example from Altera (Fast Nios 

II Hardware Design Example) to my board. Design specs are: 

 

* Nios II/f, 4 Kbytes i-cache, 2 Kbytes d-cache 

* System clock: 140MHz 

* JTAG debug module: Yes 

* On-chip RAM: 64 Kbytes 

* JTAG UART: 1 

* Timer: 1 

 

I've kept the settings, clock, build flags etc. the same as in the original example. When running the Dhrystone example I get:

 

... 

Microseconds for one run through Dhrystone: 6.5 

Dhrystones per Second: 153506.9 

VAX MIPS rating = 87.369 

 

This is way less than what the attached readme.txt states: 

...This system achieves over 150 DMIPS on the Cyclone III FPGA Development Kit, 

and can achieve over 190 DMIPS when targeting the fastest speed grade of a  

CycloneIII device... 

 

According to ds_nios2_perf.pdf, the MIPS/MHz ratio of Cyclone IV GX is the same as Cyclone III LS and a bit higher than Cyclone III. Even though MIPS != DMIPS, I can't imagine such a difference in DMIPS/MHz ratio between Cyclone IV and III. Or am I missing some difference in Nios II between Cyclone IV GX and E?
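As a cross-check on the figures in this post, the sketch below converts the reported Dhrystones per second to DMIPS using the conventional divisor of 1757 (the VAX 11/780 score) and normalizes by clock frequency. The function names are illustrative only, and the 140 MHz clock for the reference design is an assumption based on the poster keeping the original clock settings.

```python
# Dhrystones/s scored by the VAX 11/780, the machine defined as 1 DMIPS.
DHRYSTONES_PER_VAX_MIPS = 1757.0


def dmips(dhrystones_per_second):
    """Convert a raw Dhrystone score to DMIPS (VAX MIPS rating)."""
    return dhrystones_per_second / DHRYSTONES_PER_VAX_MIPS


def dmips_per_mhz(dhrystones_per_second, clock_mhz):
    """Normalize the DMIPS rating by the CPU clock in MHz."""
    return dmips(dhrystones_per_second) / clock_mhz


if __name__ == "__main__":
    measured = 153506.9   # Dhrystones per second, from the output above
    clock_mhz = 140.0     # system clock from the design specs

    print(f"{dmips(measured):.3f} DMIPS")                         # 87.369 DMIPS
    print(f"{dmips_per_mhz(measured, clock_mhz):.3f} DMIPS/MHz")  # 0.624 DMIPS/MHz

    # Assumption: the readme's "over 150 DMIPS" was measured at the same 140 MHz.
    print(f"{150.0 / clock_mhz:.2f} DMIPS/MHz implied by the readme")  # 1.07
```

On these numbers the measured system delivers roughly 0.62 DMIPS/MHz against about 1.07 DMIPS/MHz implied by the readme, so the gap is in work done per clock rather than in clock speed.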

 

Can somebody shed some light on this? 

 

Thanks for your help.
11 Replies
Altera_Forum
Honored Contributor I
52 Views

I'd guess at either code optimisation level or cache thrashing. 

The Altera figure might be for tightly coupled memory (no cache line reads at all). 

The other possibility is the type of multiplier used - Altera won't let you throw large numbers of logic gates into a fast multiplier!
Altera_Forum
Honored Contributor I
52 Views

SOPC system layout is exactly the same as their (Altera) CycloneIII design 

example which (they say) achieves over 150 DMIPS clocked at 140MHz. 

Dhrystone app has a set of build scripts carefully specifying how to (or rather 

not to) optimize the code. The HW multiplier is built with embedded multipliers, 

all code runs from internal RAM (dual port - one to instruction and other to 

data master). 

 

I've also tested running the same code from SDRAM - 41 DMIPS at 100MHz. 

I get the same result on another design on which I run uClinux with included 

dhrystone (different compiler, different program - but same dhrystone 2.1 

codebase). 

 

There is an old post about NiosII/f dhrystone on Cyclone EP1C20F400C7. 

The author got 35 DMIPS at 50 MHz and Cyclone I doesn't even have 

embedded multipliers. I get 31 DMIPS at 50 MHz...
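The runs quoted so far use different clocks, so they are easier to compare once normalized to DMIPS/MHz. A small sketch using only the numbers reported in this thread; the labels and the helper name `normalized` are illustrative:

```python
# (DMIPS, clock in MHz) for each run mentioned in the thread so far.
RUNS = {
    "on-chip RAM, 140 MHz": (87.369, 140.0),
    "SDRAM, 100 MHz": (41.0, 100.0),
    "poster's run, 50 MHz": (31.0, 50.0),
    "old EP1C20F400C7 post, 50 MHz": (35.0, 50.0),
}


def normalized(runs):
    """Return DMIPS/MHz for every (DMIPS, MHz) pair in `runs`."""
    return {label: score / mhz for label, (score, mhz) in runs.items()}


if __name__ == "__main__":
    for label, ratio in normalized(RUNS).items():
        print(f"{label:32s} {ratio:.2f} DMIPS/MHz")
```

The on-chip RAM runs at 140 MHz and 50 MHz both come out near 0.62 DMIPS/MHz, which suggests the shortfall does not depend on clock frequency.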
Altera_Forum
Honored Contributor I
52 Views

You might have a clock crossing adapter in the memory path. Make sure the Nios II core and the memory are in the same clock domain; otherwise each memory read will take around 9 clock cycles, which will have a huge impact on the Dhrystone benchmark. Also make sure the timer resolution is the same as in the original design: if it's set too fine, the CPU will get blasted with interrupts and take longer to complete the benchmark.
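To get a feel for the size of these two effects, here is a rough back-of-the-envelope model. Apart from the 9-cycle read latency and the 140 MHz clock, every number in it is a hypothetical placeholder (the fraction of instructions that read memory, the base CPI, the interrupt handler cost, the tick rates), and it ignores cache hits entirely, so treat it as a sketch rather than a prediction.

```python
def read_latency_slowdown(reads_per_instr, read_latency_cycles, base_cpi=1.0):
    """Slowdown factor when each memory read costs `read_latency_cycles`
    instead of a single cycle. Assumes no reads are served from cache."""
    extra_cycles = reads_per_instr * (read_latency_cycles - 1)
    return (base_cpi + extra_cycles) / base_cpi


def timer_overhead_fraction(tick_hz, isr_cycles, clock_hz):
    """Fraction of CPU cycles spent servicing periodic timer interrupts."""
    return tick_hz * isr_cycles / clock_hz


if __name__ == "__main__":
    # Hypothetical: one instruction in four reads memory, base CPI of 1.
    print(f"slowdown from 9-cycle reads: {read_latency_slowdown(0.25, 9):.1f}x")

    # Hypothetical: 500 cycles per timer interrupt on a 140 MHz CPU.
    for tick_hz in (1_000, 100_000):
        share = timer_overhead_fraction(tick_hz, 500, 140e6)
        print(f"{tick_hz:>7} ticks/s -> {share:.1%} of CPU time")
```

Under these assumptions a coarse tick costs well under 1% of CPU time, while a very fine one can take a large share of it, and uncached multi-cycle reads can multiply run time several times over.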

Altera_Forum
Honored Contributor I
52 Views

Thanks for the info. The problem seems to be in the compiler: the Dhrystone test compiled with gcc 4.1.2 achieves 0.62 DMIPS/MHz, and with gcc 3.4.6 it achieves 1.04 DMIPS/MHz. It doesn't matter which version was used to compile the BSP, only which one compiled the Dhrystone code itself. 

I've used the same makefile / compiler flags, but changing e.g. optimization level doesn't 

yield much difference with 4.1.2. I didn't investigate this any further, but obviously it's 

something to keep in mind.
Altera_Forum
Honored Contributor I
52 Views

I did try compiling my code with gcc4. My static analysis showed the code to be both larger and slower than that generated by gcc3. I didn't look to see why. 

(That code is rather carefully crafted to avoid anything on-stack.) 

That would have been after I'd applied my changes to the gcc4 code generator and reversed some of Altera's changes. 

 

I'm also surprised that changing the optimisation level had little/no effect. The difference between -O0 and -O2 should be massive.
Altera_Forum
Honored Contributor I
52 Views

gcc 4.1.2 (DMIPS/MHz): -O0 0.37, -O1 0.59, -O2 0.62, -O3 0.664, -O6 0.665 

gcc 3.4.6 (DMIPS/MHz): -O0 0.41, -O1 0.83, -O2 0.91, -O3 and -O6 1.04 

 

gcc 4.1.2 from nios2-linux-20100621.tar produces results similar to the 4.1.2 version provided with Quartus II v11.
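The per-level figures above can be turned into a gcc 3.4.6 over gcc 4.1.2 ratio to show where the gap opens up. A small sketch using only the numbers in this post; the table layout and the helper name `gcc3_advantage` are illustrative:

```python
# DMIPS/MHz as reported above, keyed by optimization level.
RESULTS = {
    "-O0": {"gcc 3.4.6": 0.41, "gcc 4.1.2": 0.37},
    "-O1": {"gcc 3.4.6": 0.83, "gcc 4.1.2": 0.59},
    "-O2": {"gcc 3.4.6": 0.91, "gcc 4.1.2": 0.62},
    "-O3": {"gcc 3.4.6": 1.04, "gcc 4.1.2": 0.664},
    "-O6": {"gcc 3.4.6": 1.04, "gcc 4.1.2": 0.665},
}


def gcc3_advantage(level):
    """Ratio of the gcc 3.4.6 score to the gcc 4.1.2 score at one level."""
    row = RESULTS[level]
    return row["gcc 3.4.6"] / row["gcc 4.1.2"]


if __name__ == "__main__":
    for level in RESULTS:
        print(f"{level}: gcc 3.4.6 is {gcc3_advantage(level):.2f}x gcc 4.1.2")
```

The two compilers are close at -O0 and diverge as soon as optimization is enabled, which points at the optimizer or code generator rather than the unoptimized baseline.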
Altera_Forum
Honored Contributor I
52 Views

Try the following changes to the gcc sources; the second may be relevant. 

 

gcc/config/nios2/nios2.h may need some changes undone/fixed:  

 

1) Change the value of JUMP_TABLES_IN_TEXT_SECTION from "1" to "flag_pic".  

 

2) Remove "!SYMBOL_REF_EXTERNAL_P(RTX) &&" from the definition of SYMBOL_REF_IN_NIOS2_SMALL_DATA_P(RTX).  

 

From http://www.alterawiki.com/wiki/crossgcc. See also http://www.alterawiki.com/wiki/gcc_patches, which may improve the code for both versions of gcc.
Altera_Forum
Honored Contributor I
52 Views

Thanks, I'll try to find time to test this. For now I'll stick with gcc3.

Altera_Forum
Honored Contributor I
52 Views

Dear Izidor,  

How do I compile a C program for Nios II with -O2? 

Could you show me some brief steps?
Altera_Forum
Honored Contributor I
52 Views

The application makefile created by the Nios II Software Build Tools should contain APP_CFLAGS_OPTIMIZATION; set it to -O2, or to anything from -O3 to -O6. BSP optimization flags can be changed by passing --set hal.make.bsp_cflags_optimization -O2 to nios2-bsp. Check the Nios II Software Build Tools documentation for more. 

 

On a side note, you might want to check the newly released Sourcery gcc 4.7.3. I've found it generates much better (faster) code in some cases than the old gcc from Altera. But beware: it has some serious issues with custom floating point instructions.
Altera_Forum
Honored Contributor I
52 Views

Thanks a lot Izidor,  

I'll try to find the doc you mentioned.